Why Literature is the Ultimate Big-Data Challenge

  • 2017-03-30
  • The Economist

Number-crunching literary criticism was the butt of an academic in-joke in “Arcadia” (1993), Tom Stoppard’s cerebral play. Bernard Nightingale, a foppish poetry don, scoffs at a colleague who used a computer program to attribute an anonymous story to D.H. Lawrence. To Bernard’s “inexpressible joy”, he found that “on the same statistical basis, there was a ninety percent chance that Lawrence also wrote the ‘Just William’ books and much of the previous day’s Brighton and Hove Argus”. The “maths mob” skewered in Mr Stoppard’s play no longer seems so ridiculous; with the publication of the “New Oxford Shakespeare”, they have shaped the debate about authorship in Elizabethan England.

This new edition of the Complete Works made headlines last October when it identified 17 of Shakespeare’s 44 plays as collaborations (by comparison, the 1986 edition named only eight). The most thrilling new name on the contents page is that of Christopher Marlowe; his inclusion seems to give credence to authorship theories previously dismissed as conspiracies. What has really raised eyebrows, though, is the technique used to identify Marlowe’s hand: not traditional editorial insight, but computational analysis. So how do today’s data linguists figure out who wrote what, without confusing authorship and influence? And, more importantly, why does it matter?