Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

By Victor Lenthe | Published: October 29, 2011

An unexpected point of linguistic similarity between detective fiction and Shakespearean comedy recently led me to consider some of the theoretical implications of tools like DocuScope, which frequently identify textual similarities that remain invisible in the normal process of reading.

A Linguistic Approach to Suspense Plot

Playing around with a corpus of prose, we discovered that the linguistic specs associated with narrative plot are surprisingly unique. Principal Component Analysis performed on the linguistic features counted by DocuScope suggested the following relationship between the items in the corpus:

I interpreted the two strongest axes of differentiation seen in the graph (PC 1 and PC 2) as (1) narrative, and (2) plot. The two poles of the narrative axis are Wuthering Heights (most narrative) and The Communist Manifesto (least narrative). The plot axis is slightly more complicated. But on the narrative side of the spectrum, plot-driven mysteries like “The Speckled Band” and The Canterville Ghost score high on plot, while the least plotted narrative is Samuel Richardson’s Clarissa (9 vols.). For now, I won’t speculate about why Newton’s Optics scores so astronomically high on plot. It is enough that when dealing with narrative, PC 2 predicts plot.
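For readers who have not met Principal Component Analysis before, here is a minimal sketch of the idea in plain Python. It is not the procedure we ran: the four "texts" and their three feature rates are invented for illustration, the function names are my own, and power iteration is simply one convenient way to find the first two components of a small table.

```python
import math

def center(rows):
    """Subtract each column's mean so every feature is centred on zero."""
    count = len(rows)
    means = [sum(column) / count for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    count, dims = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (count - 1) for j in range(dims)]
        for i in range(dims)
    ]

def top_component(matrix, iterations=500):
    """Power iteration: dominant eigenvector of a symmetric matrix, plus its eigenvalue."""
    dims = len(matrix)
    start = [float(i + 1) for i in range(dims)]  # arbitrary asymmetric starting vector
    length = math.sqrt(sum(value * value for value in start))
    vector = [value / length for value in start]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(value * value for value in product))
        if norm == 0.0:
            break
        vector = [value / norm for value in product]
    eigenvalue = sum(
        vector[i] * matrix[i][j] * vector[j] for i in range(dims) for j in range(dims)
    )
    return vector, eigenvalue

def pca(rows, n_components=2):
    """Return the leading components and each row's score on them."""
    centered = center(rows)
    matrix = covariance(centered)
    components = []
    for _ in range(n_components):
        vector, eigenvalue = top_component(matrix)
        components.append(vector)
        # Deflation: remove the component just found so the next one can surface.
        matrix = [
            [matrix[i][j] - eigenvalue * vector[i] * vector[j] for j in range(len(vector))]
            for i in range(len(vector))
        ]
    scores = [
        [sum(value * weight for value, weight in zip(row, component)) for component in components]
        for row in centered
    ]
    return components, scores

# Invented feature rates: one row per hypothetical text, one column per counted feature.
rows = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [3.0, 1.0, 0.0],
    [1.0, 3.0, 2.0],
]
components, scores = pca(rows)
for index, (pc1, pc2) in enumerate(scores, start=1):
    print(f"text {index}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")
```

Each text ends up with a position on PC 1 and PC 2, which is what a graph like the one described above plots. Deciding that an axis means "narrative" or "plot" is still an act of interpretation.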

The fact that something as qualitative and amorphous as plot has a quantitative analogue leads to several questions about the meaning of the data tools like DocuScope turn up.

Linguistic Plot without Actual Plot

Because linguistic plot is quantifiable, it allows us to look for passages where plot is present to a relative degree. Given a large enough sample, it is more than likely that some relatively plotted passages will occur in texts that are not plotted in any normal sense. This would at minimum raise questions about how to handle genre boundaries in digital literary research.

Our relative-emplotment test (done in TextViewer) yielded intuitive results when performed on the dozen or so stories in The Adventures of Sherlock Holmes: the passages exhibiting the strongest examples of linguistic plot generally narrated moments of discovery, and moved the actual plot forward in significant ways. Often, these passages showed Holmes and Watson bursting into locked rooms and finding bodies.


When we performed the same test on the Shakespeare corpus, something intriguing happened. The passages identified by TextViewer as exhibiting linguistic plot look very different from the corresponding passages in Sherlock Holmes. There were no dead bodies, no broken-down doors, and no exciting discoveries. Nonetheless, the ‘plotted’ Shakespeare scenes were remarkably consistent with each other. Perhaps most significant in the context of their genre, these scenes had a strong tendency to show characters putting on performances for other characters. Additionally, in a factor that is fascinating even though it is probably a red herring, the ‘plotted’ Shakespeare scenes had an equally strong tendency to involve fairies.


The consistent nature of the ‘plotted’ Shakespeare scenes suggests that the linguistic specs associated with plot when they occur in Sherlock Holmes may have different, but equally specific, effects in other genres. The next step would be to find a meaningful correspondence between the two seemingly disparate literary devices that accompany linguistic plot – detectives bursting into rooms to solve murders, and plays within plays involving fairies. I have some hunches about this. But in many ways the more important question is what is at stake in using DocuScope to identify such unexpected points of overlap.

Enough measurable links between seemingly unlike texts could suggest an invisible web of cognates, which share an underlying structure despite their different appearances and literary classifications. Accordingly, we might hypothesize that reading involves selective ignorance of semantic similarities that could otherwise lead to the socially deviant perception that A Midsummer Night’s Dream resembles a Sherlock Holmes mystery.

The question, then, is this: if the act of reading consists in part of ignoring unfruitful similarities, then what happens when these similarities nonetheless become apparent to us? Looking back at the corpus graph, we begin to see all sorts of possibilities, many of which would be enough to make us literary outcasts if voiced in the wrong company. Could Newton’s Optics contain the most exciting suspense plot no one has ever noticed? Could Martin Luther be secretly more sentimental than Clarissa?

Estranging Capacities of Digital Cognates

I have been using the term ‘cognate’ to describe the relationship between linguistically similar but otherwise dissimilar texts. These correspondences will only be meaningful if we can connect them in a plausible way to our readerly understanding of the texts or genres in question. In the case of detective fiction and Shakespearean comedy, this remains to be seen. But our current lack of an explanation does not mean we should feel shy about pursuing the cognates computers direct us to. My analogy is the pop-culture ritual of watching The Wizard of Oz, starting the Pink Floyd album Dark Side of the Moon on the third roar of the MGM lion. The movie and the record sync up in a plausible pattern, prompting the audience to grasp a connection between the cognate genres of children’s movies and psychedelic rock.

If digital methods routinely direct our attention to patterns we would never notice in the normal process of reading, then we can expect them to turn up a large number of such cognates. If we want to understand the results these tools are turning up, we should develop a terminology and start thinking about implications – not just for the few correspondences we can explain, but also for the vast number we cannot explain, at least right now.

Posted in Counting Other Things, Quant Theory, Shakespeare | 2 Comments

Why the Difference? Accounting for Variation between the Folio and Globe Editions of Shakespeare’s Plays

By Jason Whitt | Published: October 21, 2011

To what extent is modern text analysis software capable of dealing with historical data? This is a perennial question asked by those working with digitized historical texts who wish to see how an analysis of such texts can be facilitated by cutting-edge technologies. No doubt the best way to answer the question is to test this software with two versions of the same text, where one version is older than, and noticeably different from, the other.

Enter the Folio and Globe editions of Shakespeare’s plays. The latter was published in 1867 and contains modernized spelling throughout, whereas the former was published in 1623 and maintains the original spelling of Shakespeare’s Early Modern English. Using DocuScope for text analysis and JMP for statistical visualizations, the following dendrogram was created:


The texts highlighted in red are from the Folio edition, whereas the texts highlighted in blue come from the Globe edition. One would expect all of Shakespeare’s Folio plays to cluster with their Globe complement here. Much Ado About Nothing is Much Ado About Nothing, after all, regardless of which edition it appears in. But for the most part, this neat pairing off is not what happens: instead, most of the Folio plays are grouping with other Folio plays, and the same is true for the Globe plays. Only a few plays are actually grouping with themselves at the top of the dendrogram. Methinks we have a problem.
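The pairing we expected can be expressed as a simple check: for each play, is its nearest neighbour the same play in the other edition? The sketch below is not the JMP dendrogram. It is a cruder nearest-neighbour test, and the two-number "profiles" are invented solely to show the pattern described above; real DocuScope output would have many more dimensions.

```python
import math

def nearest_neighbour(name, profiles):
    """Return the key of the profile closest (Euclidean distance) to `name`."""
    target = profiles[name]
    others = (other for other in profiles if other != name)
    return min(others, key=lambda other: math.dist(target, profiles[other]))

# Invented profiles keyed by (play, edition); the values are not real tag rates.
profiles = {
    ("Much Ado", "Folio"): (4.0, 1.0),
    ("Much Ado", "Globe"): (6.0, 1.2),
    ("Hamlet", "Folio"): (4.2, 2.5),
    ("Hamlet", "Globe"): (6.1, 2.6),
}

for play, edition in profiles:
    neighbour = nearest_neighbour((play, edition), profiles)
    same_play = neighbour[0] == play
    print(f"{play} ({edition}) -> {neighbour[0]} ({neighbour[1]}), same play: {same_play}")
```

With these made-up numbers every play sits closest to another play from its own edition, which is the troubling pattern the dendrogram shows.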

Upon closer inspection, I found that 13,667 items were tagged by DocuScope in the Globe edition of Much Ado, but only 11,382 items were tagged in the Folio edition of the same play: a 16.7% difference. An inspection of eleven other Shakespeare plays provides us with an overall mean difference of 17.8%, too large a gap to be acceptable when it comes to tagging accuracy.
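The 16.7% figure is the Folio shortfall measured against the Globe count, which a few lines of Python confirm. The function name is mine, and only the two Much Ado counts come from the analysis above.

```python
def shortfall_percent(globe_tags, folio_tags):
    """Percentage of Globe-tagged items that go untagged in the Folio text."""
    return 100 * (globe_tags - folio_tags) / globe_tags

much_ado = shortfall_percent(13667, 11382)
print(f"Much Ado: {much_ado:.1f}% fewer items tagged in the Folio")  # 16.7%
```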

But why the disparity? Maybe a closer look at DocuScope can give us an idea.

First the Folio version of the opening scene in Much Ado About Nothing (with the “Interior Thought” and “Public Values” clusters turned on):


And the Globe version of the same scene with the same clusters turned on:


One need not read far to discover what’s (not) being tagged in the older, Folio edition of Much Ado: Learne versus learn. It appears the orthographic rendering of the unstressed final –e is causing DocuScope to overlook this word altogether. We find the same problem later on with indeede/indeed, kindnesse/kindness, helpe/help, and kinde/kind. Another common problem is Early Modern use of u, which is rendered v in modern orthography: deseru’d vs. deserved, seruice vs. service, and ouerflow vs. overflow. There are also a few punctuation issues causing problems: the use of apostrophe (as we see in deseru’d) and the use of | (con | flict vs. conflict), which probably results from some sort of scanning or other computer error. In other plays, the hyphen was also found to be a possible cause of DocuScope overlooking certain items (ouer-charg’d vs. overcharged).

Although the overall number of DocuScope omissions on a Folio play is rather large, the actual number of error types is quite small. This gives us hope that, with a bit of modification, it may well be possible to train DocuScope to read non-modern(ized) texts.
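Because the error types are so few, they can be sketched as a handful of rewrite rules checked against a modern word list. The following is only an illustration of that idea, not a description of how DocuScope works or how it might actually be modified. The word list contains just the examples mentioned above, and the function names are hypothetical.

```python
import re

# Tiny illustrative lexicon; a real attempt would need a full modern word list.
MODERN_WORDS = {
    "learn", "indeed", "kindness", "help", "kind",
    "deserved", "service", "overflow", "conflict", "overcharged",
}

def spelling_candidates(token):
    """Generate possible modern spellings for one Folio token."""
    base = re.sub(r"\s*\|\s*", "", token.lower())           # drop stray | from scanning
    forms = {base}
    forms |= {re.sub(r"['’]d$", "ed", form) for form in forms}  # deseru'd -> deserued
    forms |= {form.replace("-", "") for form in forms}          # ouer-charged -> ouercharged
    forms |= {                                                   # try v for each single u
        form[:i] + "v" + form[i + 1:]
        for form in forms
        for i, char in enumerate(form)
        if char == "u"
    }
    forms |= {form[:-1] for form in forms if form.endswith("e") and len(form) > 2}  # final -e
    return forms

def normalize(token):
    """Return a modern spelling if one candidate is in the lexicon, else the token unchanged."""
    matches = sorted(spelling_candidates(token) & MODERN_WORDS)
    return matches[0] if matches else token

for folio_form in ["Learne", "kindnesse", "deseru’d", "seruice", "con | flict", "ouer-charg’d"]:
    print(folio_form, "->", normalize(folio_form))
```

Checking candidates against a lexicon, rather than rewriting blindly, is what keeps the final -e rule from turning service into servic.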


Posted in Uncategorized | 4 Comments

The comic ‘I’ and the tragic ‘we’?

By Jonathan Hope | Published: July 29, 2011

In our Shakespeare Quarterly paper, we used DocuScope to come up with a description of Shakespeare’s comic language which centres on the rapid exchange of singular pronouns: I/you and my/your. We claimed there that Shakespearean comedies typically involve people arguing about things, striving to arrive at a ‘we’ of agreement, but not being able to until the final scene. Here’s what we said in more detail (we’re discussing Twelfth Night):

The quick trading of I/you and my/your strings in Comic dialogue suggests a world in which predicates are attached to subjects from two, and only two, points of view. This is not a universe of one; nor is it a crowd. It is not surprising that Comic plotting, built as it is on sexual pairings, would favor this type of bivalent, perspectival tagging of action by speakers. But there is something else going on here. Olivia is trying to make something happen in this exchange. She says, “do not extort thy reasons from this clause,” and earlier, “I would you were as I would have you be!” (3.1/1392, 1381). The “thy” and “you” are important because the speaker is trying to create or assert a particular interpretation of how these two individuals relate to one another (and the words exchanged between them). The essential drama in this situation is the asymmetry of desire that obtains between the two characters, an asymmetry that keeps Viola from assenting to Olivia’s advances. That resistance is actually what forces Olivia to make these statements that are rich with I/you and me/my, since she uses these words as anchors for a broader interpretation that does not yet obtain. She really wants to say we. And Cesario doesn’t, so they remain in I/you dialogue…

Shakespeare writes Comedies in which characters, sometimes quite perversely, find the wrong way to the ones they love. Often it is chance or an onstage helper who sorts this out. Shakespeare is actually quite reserved when it comes to showing love as naturally progressing through its obstacles unassisted. But given that in the initial stages of courtship Shakespearean lovers almost never meet and join in a perfectly symmetrical way—they don’t start out as stones set in an arch, leaning perfectly on a keystone—we should expect this asymmetry to show itself in the language. Where does it show up? It appears when a resistant individual, a “you,” prevents another “I” from arriving at an interpretation of a relationship that might be referred to as a “we” before others. Let’s call this the “resistant-you” hypothesis. Linguistically, the effect manifests itself in the assertion of the self (“FirstPerson”) and the rejection of suggested mental and emotional realities (“DenyDisclaim”).

We’ve been finding that high frequencies of first person pronouns, and other features associated with rapid dialogue, are characteristic of most types of Early Modern comedy. But what of the implied correlative to this? If comedies are the genre of ‘I’, are tragedies the genre of ‘we’?

A quick way to test this is to use Martin Mueller et al.’s excellent Wordhoard tool to run a log likelihood vocabulary test on Shakespeare’s comedies and tragedies. This type of test takes an analysis corpus (in this case Shakespeare’s comedies), and compares it to a reference corpus (Shakespeare’s tragedies). The output flags those words that are either more or less frequent in the analysis corpus than we would expect, given the frequencies found in the reference corpus.

The results in this case are as follows:



What we are interested in here is the list of lemmas in column 1: ‘she’, ‘I’, ‘master’, ‘a’, ‘sir’ etc; and the symbol in column 3 ‘Relative use’ – which tells us if the frequency is greater (+) or less (-) than expected. (Column 4 gives the log likelihood value, and a number of asterisks indicating degree of statistical significance, but all the results we are looking at here are highly significant, so we can ignore this.)

Behold: pronouns used more in the comedies than the tragedies are the singular ‘she’, ‘I’, ‘you’ (let’s assume these are mainly singular uses) – these are all marked + in column 3. Now look at the results for the plural pronouns ‘our’, ‘we’, ‘they’: all marked -, and so lowered in the comedies/raised in the tragedies.

This is a very strong finding (especially considering how frequent pronouns are), and it invites further exploration of the dialogic nature of comedy in comparison with the communal nature of tragedy.
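For anyone curious about what a log likelihood comparison computes, here is a minimal sketch of one common two-corpus formulation. I have not checked it against Wordhoard’s own implementation, so treat the formula as an assumption about the general technique. The counts in the usage lines are invented and are not Shakespeare frequencies.

```python
import math

def log_likelihood(freq_analysis, size_analysis, freq_reference, size_reference):
    """Compare one word's frequency in an analysis corpus against a reference corpus.

    Returns the log likelihood score and a relative-use sign: "+" if the word is
    relatively more frequent in the analysis corpus, "-" if less, "=" if equal.
    """
    total_freq = freq_analysis + freq_reference
    total_size = size_analysis + size_reference
    # Frequencies we would expect if the word were spread evenly across both corpora.
    expected_analysis = size_analysis * total_freq / total_size
    expected_reference = size_reference * total_freq / total_size

    score = 0.0
    for observed, expected in (
        (freq_analysis, expected_analysis),
        (freq_reference, expected_reference),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    score *= 2

    rate_analysis = freq_analysis / size_analysis
    rate_reference = freq_reference / size_reference
    if rate_analysis > rate_reference:
        sign = "+"
    elif rate_analysis < rate_reference:
        sign = "-"
    else:
        sign = "="
    return score, sign

# Invented counts: a word occurring 30 times per 1,000 words in the analysis
# corpus and 10 times per 1,000 words in the reference corpus.
score, sign = log_likelihood(30, 1000, 10, 1000)
print(f"relative use: {sign}, log likelihood: {score:.2f}")
```

The larger the score, the less plausible it is that the difference in frequency arose by chance. The sign tells us which direction the difference runs, which is what column 3 of the table reports.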

Posted in Early Modern Drama, Shakespeare | Tagged comedies, I, log likelihood, pronouns, tragedies, Twelfth Night, we, Wordhoard, you | 2 Comments

Phylogenetic inference

By Jonathan Hope | Published: May 12, 2011

Image by Greg McInerny and Stefanie Posavec – textual shifts between editions of Darwin’s Origin of Species (used by kind permission of the artist – see bottom of post for further details).


In advance of starting up some big experiments on the texts being made available by TCP, we’ve been discussing the models developed in mathematics/biology for tracing influence.

This began with a conversation with David Krakauer from the Santa Fe Institute about our work. He works in mathematics and evolutionary biology, and has collaborated a bit with Franco Moretti. We told him about our attempts to group texts by genre and then trace their linguistic predecessors and descendants. He suggested this was similar to the problem of phylogenetic inference in biology.

The problem as we currently understand it: we identify a group of texts within a population based on manifest traits known to humans; we then want to account for the development of these traits among items understood to be earlier in the sequence; we link these traits to sentence level linguistic items; we track the traits via these items.

We are thinking about this, since this is going to be one of our big intellectual problems once we add time to the analysis (so far, we’ve been looking at populations, e.g. Shakespeare’s plays, in pretty much the same time slot).

Here are some starter references (though they are certainly not entry-level in all cases!):


In our work to date, we have tried to think carefully about the philosophical and methodological implications of what we are doing, rather than simply focus on the (admittedly often attractive and very interesting) results – so it is important for us to consider the implications of taking on models from other fields (especially when those other fields are better developed than ours).

Jonathan Hope has done some work on biology and linguistic history in the past. And one thing that might make the application of biological models difficult is the different natures of inheritance and ‘traits’ in language and biology. In biology, traits (or the genes that produce them) have to be passed down in a closed, continuous way.  We get our genes from our parents, not some random person we bump into on the tube – and it’s impossible for us to naturally acquire a gene, however useful it might be, from another species.

None of this holds for language though. If we’re writing a ‘history’, we’ll want to borrow some traits from other histories – but we don’t have to take everything from other histories, and we can take traits from pretty much any histories we happen to have read: old, recent, famous, unknown. So the status of *generic* traits is very different to *genetic* ones.

In addition, we are not confined to our own linguistic *species*. If we want, we can introduce traits from a completely different species to produce something new. In language, if you want a bat, you can cross rats with sparrows. In biology, you have to wait for one to evolve.

Once we are thinking about genres developing over time, it will be easy for us to assume a biological model of linear generations and influence. It’s a useful way of thinking, and the statistical techniques are powerful, but we’ll need to remember that we aren’t looking at exactly the same kind of process.

A further consideration is the power of the inferencing techniques that have been developed in biology. It is very tempting to want to throw these at our newly available textual data – but one very significant thing to have emerged from our work is the importance of having understandable observations.

If a statistical black box tells you some fact, that is not as interesting or important as being able to understand where a particular thing comes from and how it got there. If some fancy inference algorithm tells you there’s a pattern, it isn’t that helpful unless you can comprehend it, since an incomprehensible or inexplicable pattern might just be an artefact of the process or analysis.

With biology, the models are better known and more trusted, so an incomprehensible pattern is easier to accept: it could more safely be taken as an indication of a real effect we just haven’t understood yet.

Ultimately, our interest is in building tools to help people understand the complexity in texts. We are less interested in having machines sort it out automatically (indeed, we are probably sceptical that this is really possible). Although, there is also a need for tools to help people sort out what the machines figure out…



Jonathan Hope, ‘Rats, bats, sparrows and dogs: biology, linguistics and the nature of Standard English’ in The Development of Standard English 1300-1800, Laura Wright (ed.), pp. 49-56, (Cambridge University Press: 2000) ISBN 0-521-77114-5

Stefanie Posavec and Greg McInerny: The (En)tangled Word Bank project (originated at Microsoft Research, Cambridge)


Posted in Uncategorized | Tagged biology, David Krakauer, Franco Moretti, mathematics, phylogenetic inference, Stefanie Posavec | Comments closed

The Ancestral Text

By Michael Witmore | Published: May 9, 2011

Rosamond Purcell, "The Book, the Land"

In this post I want to understand the consequences of “massive addressability” for “philosophies of access”–philosophies which assert that all beings exist only as correlates of our own consciousness. The term “philosophy of access” is used by members of the Speculative Realist school: it seems to have been coined largely as a means of rejecting everything the term names. Members of this school dismiss the idea that any speculative analysis of the nature of beings can be replaced by an apparently more basic inquiry into how we gain access to the world, an access obtained either through language or consciousness. The major turn to “access” occurs with Kant, but the move is continued in an explicitly linguistic register by Heidegger, Wittgenstein, Derrida, and a range of post-structuralists.

One reason for jettisoning the priority of access, according to Ray Brassier, is that it violates “the basic materialist requirement that being, though perfectly intelligible, remain irreducible to thought.” As will become clear below, I am sympathetic to this materialist requirement, and more broadly to the Speculative Realist project of dethroning language as our one and only mode of access to the world. (There are plenty of ways of appreciating the power and complexity of language without making it the wellspring of Being, as some interpreters of Heidegger have insisted.) Our quantitative work with texts adds an unexpected twist to these debates: as objects of massive and variable address, we grasp things about texts in precisely the ways usually reserved for non-linguistic entities. When taken as objects of quantitative description, texts possess qualities that–at some point in the future–could be said to have existed in the present, regardless of our knowledge of them. There is thus a temporal asymmetry surrounding quantitative statements about texts: if one accepts the initial choices about what gets counted, such statements can be “true” now even if they can only be produced and recognized later. Does this asymmetry, then, mean that language itself, “though perfectly intelligible, remain[s] irreducible to thought?” Do iterative methods allow us to satisfy Brassier’s materialist requirement in the realm of language itself?

Let us begin with the question of addressability and access. The research described on this blog involves the creation of digitized corpora of texts and the mathematical description of elements within those corpora. These descriptions obtain at varying degrees of abstraction (nouns describing sensible objects, past forms of verbs with an auxiliary, etc.). If we say that we know something quantitatively about a given corpus, then, we are saying that we know it on the basis of a set of relations among elements that we have provisionally decided to treat as countable unities. Our work is willfully abstract in the sense that, at crucial moments of the analysis, we foreground relations as such, relations that will then be reunited with experience. When I say that objects of the following kind – “Shakespearean texts identified as comedies in the First Folio” – contain more of this type of thing–first and second person singular pronouns–than objects of a different kind (Shakespeare’s tragedies, histories), I am making a claim about a relation between groups and what they contain. These groupings and the types of things that we use to sort them are provisional unities: the circle we draw around a subset of texts in a population could be drawn another way if we had chosen to count other things. And so, we must recognize several reasons why claims about these relations might always be revised.

Every decision about what to count offers a caricature of the corpus and the modes of access this corpus allows. A caricature is essentially a narrowing of address: it allows us to make contact with an object in some of the ways Graham Harman has described in his work on vicarious causation. One can argue, for example, that the unity “Shakespeare’s Folio comedies” is really a subset of a larger grouping, or that the group can itself be subdivided into smaller groups. Similarly, one might say that the individual plays in a given group aren’t really discrete entities and so cannot be accurately counted in or out of that group. There are certain words that Hamlet may or may not contain, for example, because print variants and multiple sources have made Hamlet a leaky unity. (Accommodating such leaky unities is one of the major challenges of digital text curation.) Finally, I could argue that addressing these texts on the level of grammar–counting first and second person singular pronouns–is just one of many modes of address. Perhaps we will discover that these pronouns are fundamentally linked to semantic patterns that we haven’t yet decided to study, but should. All of these alternatives demonstrate the provisional nature of any decision to count and categorize things: such decisions are interpretive, which is why iterative criticism is not going to put humanities professors out of business. But such counting decisions are not–and this point is crucial–simply another metaphoric reduction of the world. PCA, cluster analysis and the other techniques we use are clearly inhuman in the number of comparisons they are able to make. The detour through mathematics is a detour away from consciousness, even if that detour produces findings that ultimately converge with consciousness (i.e., groupings produced by human reading).

Once the counting decisions are made, our claims to know something in a statistical sense about texts boil down to a claim that a particular set of relations pertains among entities in the corpus. Indeed, considered mathematically, the things we call texts, genres, or styles simply are such sets of relations–the mathematical reduction being one of many possible caricatures. But counting is a very interesting caricature: it yields what is there now–a real set of relations–but is nevertheless impossible to contemplate at present. Once claims about texts become mathematical descriptions of relations, such statements possess what the philosopher Quentin Meillassoux calls ancestrality, a quality he associates primarily with statements about the natural world. Criticizing the ascendance of what he calls the Kantian dogma of correlationism—the assumption that everything which can be said “to be” exists only as correlate of consciousness—Meillassoux argues that the idealist or critical turn in Continental philosophy has impoverished our ability to think about anything that exceeds the correlation between mind and world. This “Great Outdoors,” he goes on to suggest, is a preserve that an explicitly speculative philosophy must now rediscover, one which Meillassoux believes becomes available to us through mathematics. So, for example, Meillassoux would agree with the statement, “the earth existed 4.5 billion years ago,” precisely because it can be formulated mathematically using measured decay rates of radioactive isotopes. The statement itself may be ideal, but the reality it points to is not. What places The Great Outdoors out of doors, then, is its indifference to our existence or presence as an observer. Indeed, for Meillassoux, it is only those things which are “mathematically conceivable” that exceed the post-Kantian idealist correlation. For Meillassoux,

all those aspects of the object that can be formulated in mathematical terms can be meaningfully conceived as properties of the object in itself.

Clearly such a statement is a goad for those who place mind or natural language at the center of philosophy. But the statement is also a philosophical rallying cry: be curious about objects or entities that do not reference human correlates! I find this maxim appealing in the wake of the “language is everything” strain of contemporary theory, which is itself a caricature of the work of Wittgenstein, Derrida and others. Such exaggerations have been damaging to those of us working in the humanities, not least because they suggest that our colleagues in the sciences do nothing but work with words. By making language everything–and, not accidentally, making literary studies the gatekeeper of all disciplines–this line of thought amounts to a new kind of species narcissism. Meillassoux and others are finding ways to not talk about language all the time, which seems like a good thing to me.

But would Meillassoux, Harman and other Speculative Realists consider texts to be part of The Great Outdoors? Wouldn’t they have to? After all, statements about groupings in the corpus can be true now even when there is no human being to recognize that truth as a correlate of thought. Precisely because texts are susceptible to address and analysis on a potentially infinite variety of levels, we can be confident that a future scholar will find a way of counting things that turns up a new-but-as-yet-unrecognized grouping. Human reading turned up such a thing when scholars in the late nineteenth century “discovered” the genre of Shakespeare’s Late Romances. (Hope and I have, moreover, re-described this grouping statistically.) As our future mathematical sleuth might do a century from now, nineteenth-century scholars were arguing that Romance was already a real feature of the Shakespearean corpus, albeit one that no one had yet recognized. They had, in effect, picked out a new object by emphasizing a new set of relations among elements in a collection of words. Couldn’t we expect another genre to emerge from this sort of analysis–a Genre X, let’s say–given sufficient time and resources? Would we accept such a genre if derived through iterative means?

I can imagine a day, 100 years from now, when we have different dictionaries that address the text on levels we have not thought to explore at present. What if someone creates a dictionary that…