How big data unveiled Shakespeare’s co-author

(And what it means for us)


According to ‘big data’, Shakespeare may not have authored his work alone. Recent findings have revealed a second literary fingerprint in the Bard’s work, giving us a glimpse into what the future of anonymity could look like for the rest of us.

For close to two centuries, scholars have speculated that Shakespeare did not write alone and that some of his major works were co-authored. Big data has now settled much of that speculation.

Big data is the collection of data from various sources, both traditional and digital, and is key to the ongoing discovery and analysis of a subject. When examined, data sets can reveal particular patterns and trends. Unbeknownst to many, we produce this ‘big data’ every day at high velocity and volume – from the moment we enter life milestones into Facebook to when we sign up for loyalty cards.

Through the use of big data, 23 international scholars were able to perform text analysis of Shakespeare’s work, that of his rival, Christopher Marlowe, and that of fellow authors of the time. The results were computerised data sets of words, phrases and unique idiosyncrasies that were then cross-referenced with Shakespeare’s plays to determine authorship.
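
To make the idea concrete, here is a minimal sketch of one common way to compare writing styles: counting how often each author uses a handful of everyday "function words" and measuring how similar the resulting profiles are. This is not the scholars' actual method. The word list, the function names and the similarity measure are all assumptions chosen for illustration, and real studies use far larger texts and far more markers.

```python
import math
import re
from collections import Counter

# Hypothetical marker list: a few common function words (an assumption,
# not the list used by the Shakespeare researchers).
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "with", "but", "not", "for"]


def profile(text):
    """Return the relative frequency of each marker word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency profiles (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(disputed_text, known_samples):
    """Score a disputed text against each known author's sample.

    known_samples maps an author name to a text known to be theirs.
    Returns the best-scoring name and the full score table.
    """
    target = profile(disputed_text)
    scores = {
        name: cosine_similarity(target, profile(sample))
        for name, sample in known_samples.items()
    }
    return max(scores, key=scores.get), scores
```

Calling `closest_author(disputed_scene, {"Shakespeare": ..., "Marlowe": ...})` with real play texts would return whichever author's word habits sit closest to the disputed passage. A high score suggests a stylistic resemblance; on its own it does not prove authorship.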

The tests revealed what has been, until this year, hearsay – Christopher Marlowe, Shakespeare’s fierce rival and arguably only creative equal during the Elizabethan era, co-authored all three of the Henry VI plays.

For the scholars who made the discovery, the findings did more than unveil the truth behind authorship. Big data provided key insight into one of the literary world’s most complicated rivalries, proving that both authors didn’t just influence each other – they collaborated closely and to extents still unknown.

According to the complete data sets, 17 of Shakespeare’s 44 works most likely had input from other authors. Marlowe’s unique word choices, phrasing and sentence structure proved to be, as researcher Gabriel Egan of Leicester’s De Montfort University put it, “pretty unmistakable” evidence of his collaboration with Shakespeare – finally closing the lid on one of the literary world’s longest-running controversies.

For us, the revelation is a little more ominous. In 2012, researcher and data scientist Arvind Narayanan, along with colleagues at Stanford and the University of California, published a study about an algorithm designed to unmask anonymous internet users by analysing their writing styles and word choices. In the same way researchers were able to cross-reference Christopher Marlowe’s work with Shakespeare’s, anonymous writing was cross-referenced with content that users had published under their own names – with the experiment effectively matching 100,000 authors and internet users.

Six years earlier, Narayanan explored anonymised Netflix customer information made publicly available following a contest designed to gather enough data to create an improved movie-ranking algorithm. Some 100 million movie ratings made by 480,000 Netflix customers were released to the public, with names replaced by random, unique numbers in a bid to protect user privacy.

Narayanan and his research partner, Vitaly Shmatikov, were able to unmask the identities of anonymous Netflix users by taking the anonymised movie ratings, along with timestamps that showed when customers had submitted them, and comparing this data against non-anonymised movie ratings posted on IMDb. Their research resulted in a privacy lawsuit filed against Netflix and the cancellation of plans for a second contest.
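
The comparison described above can be sketched in a few lines. What follows is a simplified illustration and not the researchers' published algorithm: it simply counts how many of a public reviewer's ratings line up, by film, score and approximate date, with each anonymised record. The function names, the three-day tolerance and the minimum-match threshold are all assumptions made for the example.

```python
from datetime import date


def overlap_score(anon_ratings, public_ratings, max_days=3):
    """Count films rated the same way, within max_days of each other.

    Both arguments map a film title to a (rating, date) pair.
    """
    score = 0
    for film, (rating, day) in public_ratings.items():
        if film in anon_ratings:
            anon_rating, anon_day = anon_ratings[film]
            if anon_rating == rating and abs((anon_day - day).days) <= max_days:
                score += 1
    return score


def link(anon_db, public_ratings, min_score=2):
    """Return the anonymised ID that best matches a public profile.

    anon_db maps an anonymised ID to that customer's ratings.
    Returns None when no record reaches min_score matches.
    """
    if not anon_db:
        return None
    scores = {
        uid: overlap_score(ratings, public_ratings)
        for uid, ratings in anon_db.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None
```

Even this crude version shows why replacing names with random numbers offers little protection: a few ratings with dates attached can be enough to single out one record among many.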

Despite Narayanan and Shmatikov’s successful attempt to ‘de-anonymise’ this data, there are those who maintain that not all is grim in the world of data analysis. They argue that advancements in data analytics have the potential to unlock rich opportunities to use de-identified datasets in ways never before possible – including economic and social benefits on a scale yet to be fully realised.

As Joseph Lorenzo Hall, chief technologist at the Center for Democracy & Technology, argued: “The big problem is public release of data sets that have been poorly anonymised and sharing between private parties, data sets that they consider to not contain personal information, when they definitely contain some sort of persistent identifier that could be trivially associated with an individual.”

With innovations in data analysis developing rapidly, companies need to be more diligent about the security measures they take to protect their customers’ data. Heidi Wachs, an expert in the field of privacy, agrees, arguing that this is where data minimisation plays a part in protecting anonymity: “In any given data set, were all of the data elements necessary to accomplish a specific goal? Or was data just being collected because it could, or people were willing to offer it?”

The case of Shakespeare highlights the capacity innovations in data analysis have to unearth history’s greatest unknowns. As for the future? Predictions include the development of autonomous agents such as virtual personal assistants and ‘smart’ advisors, companies fine-tuning how they interpret data to drive revenue, and further conversations around the privacy of data sets.

As for consumers and brands, we’re facing an unprecedented development in the world of data, bringing with it new opportunities that will absolutely test our limits and challenge what we thought was possible.

Who can really say how deep the rabbit hole goes?
