Shakespeare’s mythic vocabulary – and his invisible grammar

By Jonathan Hope | Published: February 14, 2012

Universities in the UK are under pressure to demonstrate the ‘impact’ of their research. In many ways, this is fair enough: public taxes account for the vast majority of UK University income, so it is reasonable for the public to expect academics to attempt to communicate with them about their work.

University press offices have become more pro-active in seeking out stories to present to the media as a way of raising the profile of institutions. Recently, the Strathclyde press office contacted me after they read one of my papers on Strathclyde’s internal research database: they wanted to do a press release to see if any outlets would follow up on the story.

The paper they’d read was a survey article I’d written for an Open University course reader. My article reported recent papers by Hugh Craig and Ward Elliott & Robert Valenza, which demolish some common myths about Shakespeare’s vocabulary (its size and originality – and see Holger Syme on this too) – and went on to argue that Shakespeare’s originality might lie in his grammar, rather than in the words he does not make up.

Indeed they did want to pick up on the story, though I’d have preferred the article to have been a bit clearer, and not to have had a headline that was linguistic nonsense. The Huffington Post did a bit better.

One particularly galling aspect of the stories: the articles failed to attribute the work on Shakespeare’s vocabulary to Craig or Elliott and Valenza, so it might have looked as though I was taking credit for other people’s work.

Looking back, I don’t think I explained my ideas very well either to Strathclyde’s press office, or to the Daily Telegraph when they rang – hence the rather confused reports. But I was extremely careful to attribute the work to those who had done it – even to the point of sending my original text to the journalist I talked to, and pointing him to the relevant footnote. I did not expect a news story to contain full academic references of course – but a clearly written story could easily have mentioned the originators of the work.

A minor episode, but it also made me think that there is a fundamental problem with trying to explain complex linguistic issues in the daily press – even if you use Newcastle United’s greatest goalscorers to illustrate the statistics. They want a clear story: you want to get the nuances across. Luckily, this blog allows me to make the full text of my article available (click through twice for a pdf of my article):

Shakespeare and the English Language

 

Jonathan Hope, Strathclyde University, Glasgow, February 2012

Posted in Counting Other Things, Early Modern Drama, Shakespeare | Tagged Alan Shearer, Daily Telegraph, Holger Syme, Huffington Post, Hugh Craig, Jackie Milburn, Malcolm Macdonald, Newcastle United, Robert Valenza, shakespeare's grammar, Shakespeare's vocabulary, Ward Elliott | 1 Comment

The very strange language of A Midsummer Night’s Dream

By Jonathan Hope | Published: February 6, 2012

I just got back from a fun and very educative trip to Shakespeare’s Globe in London, hosted by Dr Farah Karim-Cooper, who is director of research there.

The Globe stages an annual production aimed at schools (45,000 free tickets have been distributed over the past five years), and this year’s play is A Midsummer Night’s Dream. I was invited down to discuss the language of the play with the cast and crew as they begin rehearsals.

This was a fascinating opportunity for me to test our visualisation tools and analysis on a non-academic audience – and the discussions I had with the actors opened my eyes to applications of the tools we haven’t considered before. They also came up with a series of sharp observations about the language of the play in response to the linguistic analysis.

I began with a tool developed by Martin Mueller’s team at Northwestern University: Wordhoard, as a way of getting a quick overview of the lexical patterns in the play, and introducing people to thinking statistically about language.

Here’s the wordcloud Wordhoard generates for a loglikelihood analysis of MSND compared with the whole Shakespeare corpus:

 

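[Image: Wordhoard wordcloud for the loglikelihood analysis of MSND compared with the whole Shakespeare corpus]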

Loglikelihood takes the frequencies of words in one text (in this case MSND) and compares them with the frequencies of words in a comparison, or reference, sample (in this case, the whole Shakespeare corpus). It identifies the words that are used significantly more or less frequently in the analysis text than would be expected given the frequencies found in the comparison sample. In the wordcloud, the size of a word indicates how strongly its frequency departs from the expected. Words in black appear more frequently than we would expect, and words in grey appear less frequently.
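For readers who want to see the arithmetic, here is a minimal sketch of one common formulation of the loglikelihood (G2) score used in corpus linguistics. I am assuming, not asserting, that this matches what Wordhoard computes internally; the function name and the word counts are hypothetical and chosen only for illustration.

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Compare one word's frequency in an analysis text (a) and a reference sample (b).

    Returns (g2, direction): g2 grows as the observed counts depart from the
    counts expected if both samples used the word at the same rate.
    """
    combined = count_a + count_b
    # Expected counts if the word were spread evenly across both samples.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)

    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    g2 *= 2

    # Compare the two rates by cross-multiplying, which avoids division.
    if count_a * total_b > count_b * total_a:
        direction = "over"    # more frequent than expected (black in the wordcloud)
    elif count_a * total_b < count_b * total_a:
        direction = "under"   # less frequent than expected (grey in the wordcloud)
    else:
        direction = "same"
    return g2, direction

# Hypothetical counts: a word used 50 times in a 1,000-word text
# and 100 times in a 10,000-word reference sample.
score, direction = log_likelihood(50, 1000, 100, 10000)
print(round(score, 2), direction)
```

The score only says how far a word departs from expectation; the direction has to be read separately, which is why the wordcloud needs both size and colour.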

As is generally the case with loglikelihood tests, the words showing the most powerful effects here are nouns associated with significant plot elements: ‘fairy’, ‘wall’, ‘moon’, ‘lion’ etc. If you’ve read the play, it is not hard to explain why these words are used in MSND more than in the rest of Shakespeare – and you really don’t need a computer, or complex statistics, to tell you that. To paraphrase Basil Fawlty, so far, so bleeding obvious.

Where loglikelihood results normally get more interesting – or puzzling – is in results for function words (pronouns, auxiliary verbs, prepositions, conjunctions) and in those words that are significantly less frequent than you’d expect.

Here we can see some surprising results: why does Shakespeare use ‘through’ far more frequently in this play than elsewhere? Why are the masculine pronouns ‘he’ and ‘his’ used less frequently? (And is this linked to the low use of ‘lord’?) Why is ‘it’ rare in the play? And ‘they’ and ‘who’ and ‘of’?

At this stage we started to look at our results from Docuscope for the play, visualised using Anupam Basu’s LATtice.

 

[Image: LATtice heatmap of the folio plays compared to each other]

 

The heatmap shows all of the folio plays compared to each other: the darker a square is, the more similar the plays are linguistically. The diagonal of black squares running from bottom left to top right marks the points in the map where plays are ‘compared’ to themselves: the black indicates identity. Plays are arranged up the left hand side of the square in ascending chronological order from Comedy of Errors at the bottom to Henry VIII at the top – the sequence then repeats across the top from left to right – so the black square at the bottom left is Comedy of Errors compared to itself, while the black square at the top right is Henry VIII.

One of the first things we noticed when Anupam produced this heatmap was the two plays which stand out as being unlike almost all of the others, producing four distinct light lines which divide the square of the map almost into nine equal smaller squares:

 

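[Image: LATtice heatmap with Merry Wives of Windsor outlined in blue and A Midsummer Night’s Dream in yellow]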

These two anomalous plays are Merry Wives of Windsor (here outlined in blue) and A Midsummer Night’s Dream (yellow). It is not so surprising to find Wives standing out, given the frequent critical observation that this play is generically and linguistically unusual for Shakespeare: but A Midsummer Night’s Dream is a result we certainly would not have predicted.

This visualisation of difference certainly caught the actors’ attention, and they immediately focussed in on the very white square about 2/3 of the way along the MSND line (here picked out in yellow):

 

[Image: LATtice heatmap with the very white square on the MSND line picked out in yellow]

So which play is MSND even less like than all of the others? A tragedy? A history? Again, the answer is not one we’d have guessed: Measure for Measure.

This is a good example of how a visualisation can alert you to a surprising finding. We would never have intuited that MSND was anomalous linguistically without this heatmap. It is also a good example of how visualisations should send you back to the data: we now need to investigate the language of MSND to explain what it is that Shakespeare does, or does not do, in this play that makes it stand out so clearly. The visualisation is striking – and it allowed the cast members to identify an interesting problem very quickly – but the visualisation doesn’t give us an explanation for the result. For that we need to dig a bit deeper.

One of the most useful features of LATtice is the bottom right window, which identifies the LATs that account for the most distance between two texts:

 

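[Image: LATtice bottom right window showing the LATs that account for the most distance between the two texts]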

This is a very quick way of finding out what is going on – and here the results point us to two LATs which are much more frequent in MSND than Measure for Measure: SenseObject and SenseProperty. SenseObject picks up concrete nouns, while SenseProperty codes for adjectives describing their properties. A quick trip to the LATtice box plot screen (on the left of these windows):

[Images: LATtice box plot screens]

confirms that MSND (red dots) is right at the top end of the Shakespeare canon for these LATs (another surprise, since we’ve got used to thinking of these LATs as characteristic of History), while Measure for Measure (blue dots) has the lowest rates in Shakespeare for these LATs.

So Docuscope findings suggest that MSND is a play concerned with concrete objects and their descriptions – another counter-intuitive finding given the associations most of us have with the supposed ethereal, fairy, dream-like atmosphere of the play. Cast members were fascinated by this and its possible implications for how they should use props – and someone also pointed out that many of the names in the play are concrete nouns (Quince, Bottom, Flute, Snout, Peaseblossom, Cobweb, Mote and so on) – what is the effect on the audience of this constant linguistic wash of ‘things’?

Here is a screenshot from Docuscope with SenseObject and SenseProperty tokens underlined in yellow. Reading these tokens in context, you realise that many of these concrete objects and qualities, in this section at least, are fictional in the world of the play. A wall is evoked – but it is one in a play, represented by a man. Despite the frequency of SenseObject in this play, we should be wary of assuming that this implies the straightforward evocation of a concrete reality (try clicking if you need to enlarge):

 

[Image: Docuscope screenshot with SenseObject and SenseProperty tokens underlined in yellow]

Also raised in MSND are LATs to do with locating and describing space: Motions and SpaceRelation (as suggested by our loglikelihood finding for ‘through’?). So accompanying the focus on things is a focus on describing location and movement – perhaps, someone suggested, because the characters are often so unsure of their location? (In the following screenshot, Motions and SpaceRelation tokens are underlined in yellow.)

 

[Image: Docuscope screenshot with Motions and spatial-relation tokens underlined in yellow]

 

Moving on, we also looked at those LATs that are relatively absent from MSND – and here the findings were very interesting indeed. We have seen that MSND does not pattern like a comedy – and the main reason for this is that it lacks the highly interactive language we expect in Shakespearean comedy: DirectAddress and Question are lowered. So too are PersonPronoun (which picks up third person pronouns, and matches our loglikelihood finding for ‘he’ and ‘his’), and FirstPerson – indeed, all types of pronoun are less frequent in the play than is normal for Shakespeare. At this point one of the actors suggested that the lack of pronouns might be because full names are used constantly – she’d noticed in rehearsal how often she was using characters’ names – and we wondered if this was because the play’s characters are so frequently uncertain of their own, and others’ identity.

Also lowered in the play is PersonProperty, the LAT which picks up familial roles (‘father’, ‘mother’, ‘sister’ etc) and social ones (job titles) – if you add this to the lowered rate of pronouns, then a rather strange social world starts to emerge, one lacking the normal points of orientation (and the play is also low on CommonAuthority, which picks up appeals to external structures of social authority – the law, God, and so on).

The visualisation, and Docuscope screens, provoked a discussion I found fascinating: we agreed that the action of the play seems to exist in an eternal present. There seems to be little sense of future or past (appropriately for a dream) – and this ties in with the relative absence of LATs coding for past tense and looking back. As the LATtice heatmap first indicated, MSND is unlike any of the recognised Shakespearean genres – but digging into the data shows that it is unlike them in different ways:

  • It is unlike comedy in its lack of features associated with verbal interaction
  • It is unlike tragedy in its lack of first person forms (though it is perhaps more like tragedy than any other genre)
  • It is unlike history in its lack of CommonAuthority

Waiting for my train back to Glasgow (at the excellent Euston Tap bar near Euston Station), I tried to summarize our findings in four tweets (read them from the bottom, up!):

 

[Image: the four summary tweets]

 

I’ll try to keep in touch with the actors as they rehearse the play – this was a lesson for me in using the tools to spark an investigation into Shakespeare’s language, and I can now see that we could adapt these tools to various educational settings (including schools and rehearsal rooms!).

Jonathan Hope February 2012

Posted in Early Modern Drama, Shakespeare, Uncategorized | Tagged A Midsummer Night's Dream, Docuscope, Euston Tap, LATtice, Shakespeare's Globe, Wordhoard | 3 Comments

What did Stanley Fish count, and when did he start counting it?

By Michael Witmore | Published: January 27, 2012

We have been observing the reaction to Stanley Fish’s critique of the Digital Humanities with great interest. Here is the full text of our comment, which could only be partially displayed in the New York Times comment window.

You know you’ve come up in the world if you’re being needled by Stanley Fish in The New York Times. Having done our share of work in the data mines, we believe Fish is right to insist that nothing in a text becomes evidence unless you have an interpretation which makes that evidence count. No amount of digital tabulation will substitute for a coherent, defensible reading.

As traditionally trained humanities scholars who use computers to study Shakespeare’s genres, we have pointed out repeatedly that nothing in literary studies will be settled by an algorithm or visualization, however seductively colorful. We have also argued that any pattern found through an iterative, computer-assisted analysis is meaningless without a larger interpretive framework in which to view it. It is the job of literary critics and historians to provide those interpretations, something they do by returning to the text and re-reading it with fresh eyes.

The job of digital tools is to draw our attention to evidence impossible or hard to see during normal reading, prompting us to ask new questions about our texts. This ability to redirect attention and pose new questions is the strong suit of certain kinds of digital humanities research. Indeed, we believe the addition of a digital prosthetic to our insistently human reading complements the skills of close textual analysis that are the staple of literary training. Not everyone in the so-called Digital Humanities community would agree with this position, but we believe the old and new techniques are entirely compatible.

What does it matter why Stanley Fish started minding his ps and bs in Milton? The point is that he has produced a plausible interpretation of Milton’s work based on evidence that fits his larger claim. The fact that an algorithm (“count ps and bs”) has directed his attention to something he hadn’t noticed doesn’t make the resulting pattern gibberish. You bet there are interesting patterns that show up in Milton when you mind his ps and bs. They existed before you counted them, and they exist after. However he found it, Fish has used that patterning to produce an interesting argument about the role of sound in Milton’s prose. And he has the evidence to back this argument up. In the end, he’s doing what most literary critics do in their work: create an interpretation that builds meaningfully on evidence in the text. Is there really any other way?

Yours sincerely,

Jonathan Hope, Strathclyde University

Michael Witmore, Folger Shakespeare Library

You can view a sample of our work here.

Posted in Quant Theory | Tagged Digital Humanities, Stanley Fish | 3 Comments

Visualizing Linguistic Variation with LATtice

By Anupam Basu | Published: November 29, 2011


The transformation of literary texts into “data” – frequency counts, probability distributions, vectors – can often seem reductive to scholars trained to read closely, with an eye on the subtleties and slipperiness of language. But digital analysis, in its massive scale and its sheer inhuman capacity of repetitive computation, can register complex patterns and nuances that might be beyond even the most perceptive and industrious human reader. To detect and interpret these patterns, to tease them out from the quagmire of numbers without sacrificing the range and the richness of the data that a text analysis tool might accumulate can be a challenging task. A program like DocuScope can easily take an entire corpus of texts and sort every word and phrase into groups of rhetorical features. It produces a set of numbers for each text in the corpus, representing the relative frequency counts for 101 “language action types” or LATs. Taken together, the numbers form a 101 dimensional vector that represents the rhetorical richness of a text, a literary “genetic signature” as it were.

Once we have this data, however, how can we use it to compare texts, to explore how they are similar and how they differ? How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts. But it is precisely this high dimensionality that accounts for the richness of the data that DocuScope produces, so it is important to be able to preserve it and to make comparisons at the level of individual LATs.

LATtice addresses this problem by producing multiple visualizations in tandem to allow us to explore the same underlying LAT data from as many perspectives and in as much detail as possible. It reads a data file from DocuScope and draws up a grid, or a heatmap representing “similarity” or “difference” between texts. The heatmap, based on the Euclidean distance between vectors, is drawn up based on a color coding scheme where darker shades represent texts that are “closer” or more similar and lighter shades represent texts further apart or less similar according to DocuScope’s LAT counts. If there are N texts in the corpus, LATtice draws up an “N x N” grid where the distance of each text from every other text is represented. Of course, this table is symmetrical around the diagonal and the diagonal itself represents the intersection of each text with itself (a text is perfectly similar to itself, so the bottom right “difference” panel shows no bars for these cases). Moving the mouse around the heatmap allows one to quickly explore the LAT distribution for each text-pair.
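The grid described above can be sketched in a few lines. This is only an illustration of the idea, not LATtice's own code: the function names, the three-value vectors (real DocuScope vectors have 101 values) and the scaling of distances to shades are all assumptions made for the example.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two LAT frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Build the N x N grid: the distance of each text from every other text."""
    n = len(vectors)
    return [[euclidean_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

def to_shades(matrix):
    """Scale distances to 0..1, where 0 is darkest (identical) and 1 is lightest.

    Dividing by the largest distance is an assumed colour scale.
    """
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[d / largest for d in row] for row in matrix]

# Hypothetical relative frequencies for three texts on three LATs.
texts = {
    "Text A": [3.0, 1.0, 0.5],
    "Text B": [3.0, 1.0, 0.5],
    "Text C": [0.0, 5.0, 0.5],
}
names = list(texts)
grid = distance_matrix([texts[name] for name in names])
shades = to_shades(grid)

for name, row in zip(names, shades):
    print(name, [round(s, 2) for s in row])
```

Because the distance from A to B equals the distance from B to A, the grid is symmetrical around the diagonal, and every diagonal cell is zero.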


While the main grid can reveal interesting relationships between texts, it hides the underlying factors that account for differences or similarities, the linguistic richness that DocuScope counts and categorizes so meticulously. However, LATtice provides multiple, overlapping, visualizations to help us explore the relationship between any two texts in the corpus at the level of individual LATs. Any text-pair on the grid can be “locked” by clicking on it, allowing the user to move to the LATs to explore them in more detail. The top right panel shows how LATs from both the texts relate to each other. The text on the X axis of the heatmap is represented in red and the one on the Y axis is represented in blue in the histogram for side by side comparison. All the other panels follow this red-blue color coding for the text-pair. The bottom panel displays only the LATs whose counts are most dissimilar. These are the LATs we will want to focus on in most cases as they account most for the “difference” between the texts in DocuScope’s analysis. If a bar in this panel is red it signifies that for this LAT, the text on the X axis (our ‘red’ text) had a higher relative frequency count while a blue bar signals that the Y axis text (our ‘blue’ text) had a higher count for a particular LAT. This panel lets us quickly explore exactly on what aspects texts differ from each other. Finally, LATtice also produces a scatterplot as a very quick way of looking at “similarity” between texts. It plots LAT readings of the two texts against each other and color codes the dots to indicate which text has a higher relative frequency for a particular LAT (grey dots indicate that both LATs have the same value). The “spread” of dots gives a rough indication of difference or similarity between texts: a larger spread indicates dissimilar texts and dots clustering around the diagonal indicate very similar texts.
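As a rough sketch of what the bottom panel does, the following ranks LATs by how far apart two texts are on each one and records which text has the higher count. The ranking rule (absolute difference in relative frequency) is my assumption for illustration; the function name and the numbers are hypothetical.

```python
def top_differences(red, blue, lat_names, k=3):
    """Return the k LATs on which the 'red' and 'blue' texts differ most.

    Each result is (LAT name, size of the gap, colour of the bar), where the
    colour names the text with the higher relative frequency.
    """
    diffs = []
    for name, r, b in zip(lat_names, red, blue):
        gap = r - b
        if gap > 0:
            colour = "red"     # X axis text has the higher count
        elif gap < 0:
            colour = "blue"    # Y axis text has the higher count
        else:
            colour = "equal"   # no difference, so no bar
        diffs.append((name, abs(gap), colour))
    # Largest gaps first.
    diffs.sort(key=lambda item: item[1], reverse=True)
    return diffs[:k]

# Hypothetical relative frequencies for two texts on four LATs.
lat_names = ["SenseObject", "SenseProperty", "DirectAddress", "Question"]
red_text = [4.0, 2.5, 1.0, 0.5]
blue_text = [1.0, 1.0, 3.0, 0.5]

for name, gap, colour in top_differences(red_text, blue_text, lat_names, k=2):
    print(name, gap, colour)
```

Keeping the sign of each gap as a colour is what lets the panel show both how much two texts differ on a LAT and in which direction.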

You can try LATtice out with two sample data-sets by clicking on the links below. The first is drawn from the plays of Shakespeare which are in this case arranged in rough chronological order. As Hope and Witmore’s work has demonstrated, the possibilities opened up by applying DocuScope to the Shakespeare corpus are rich and hopefully exploring the relationship between individual plays on the grid will produce new insights and new lines of inquiry. The second data-set is experimental – it tries to use DocuScope not to compare multiple texts but to explore a single text – Milton’s Paradise Lost – in detail. It might give us insights about how digital techniques can be applied on smaller scales with well-curated texts to complement literary close-reading. The poem was divided into sections based on the speakers (God, Satan, Angels, Devils, Adam, Eve) and the places being described (Heaven, Hell, Paradise). These chunks were then divided into roughly three hundred line sections. As an example, we might notice straightaway that speakers and place descriptions seem to have very distinct characteristics. Speeches are broadly similar to each other as are place descriptions. This is not unexpected, but what accounts for these similarities and differences? Exploring the LATs helps us approach this question with a finer lens. Paradise, for example, is full of “sense objects” while Godly and angelic speech does not refer to them as often. Does Adam refer to “authority” more when he speaks to Eve? Does Satan’s defiance leave a linguistic trace that distinguishes him from unfallen angels? Hopefully LATtice will help us explore and answer such questions and let us bring DocuScope’s data closer to the nuances of literary reading.

Try LATtice with the Shakespeare data-set.


Try LATtice with the Paradise Lost data-set.

Finally, a few technical notes: The above links should load LATtice with the appropriate data-sets. Of course, you will need to have Java installed on your machine and to have applets enabled in your browser. You can also download LATtice and the sample data-sets, along with detailed instructions, as stand-alone applications for the following platforms:

  • Mac OS X
  • Linux/Unix
  • Windows

There are a few advantages to doing this. First, the standalone version offers an additional visualization panel which represents the distribution of LATs as box-and-whisker plots and shows where the text-pair’s frequency counts stand relative to the rest of the corpus. Secondly, the standalone application can make use of the entire screen, which can be a great advantage for larger and higher resolution monitors.

Posted in Uncategorized | Tagged counting_things, Docuscope, LATs | 2 Comments

Tokens of Impersonation in Dekker’s City Comedies

By Mattie Burkert | Published: November 19, 2011

In sixteenth- and seventeenth-century England, the relationship between clothing and identity was complex. As Ann Rosalind Jones and Peter Stallybrass have shown, the fact that clothing circulated as currency among different owners implicitly called into question its supposed correspondence with the wearer’s social and financial status. Stephen Orgel has explored how issues surrounding clothing and identity played out on the Elizabethan and Jacobean stage, a place where clothing was understood at once as the defining token of identity and as disguise, where audiences entered into the fiction that a dress could temporarily transform a lower-class boy into a noble woman. The possibility that appearance might not match reality was problematic for early modern audiences, however, because the English credit culture that emerged in this period depended on people’s ability to assess one another’s presentations of honesty and trustworthiness. By challenging the assumed correspondence between social performance and identity, cross-dressing figures like Moll Cutpurse in Dekker and Middleton’s The Roaring Girl (1611) suggest the fallibility of a system in which a person’s economic status is inferred from his or her appearance.

I wondered whether The Roaring Girl’s concern with the instability of credit might be visible at the linguistic level. In Witmore and Hope’s “very large dendrogram” (see Figure 9 here), three plays group tightly with The Roaring Girl: Westward Ho (Dekker and Webster, 1604), Northward Ho (Dekker and Webster, 1605), and The Honest Whore, Part 2 (Dekker, performed 1605 and published 1630). Based on where they cluster in the dendrogram, it is clear that these texts are not merely linked by authorship, genre, or time period. I hypothesized that these four plays might all share The Roaring Girl’s concern with disguise and credit, and that this concern would be one of the factors linking them together stylistically. Still, much of early modern drama, especially city comedy, is concerned with the economics of identity. Assuming that these plays’ treatment of credit and disguise contributes to their linkage, what is uniquely similar about them that pushes the plays together?

To answer this question, I performed Principal Component Analysis (PCA) on 130 plays performed between 1601 and 1621 and found a component that united the plays on The Roaring Girl twig. As it turns out, the cocktail of linguistic factors that joins these four plays includes the categories Docuscope labels “Person Properties” and “Sense Objects.” The component also discriminates against Positive and Negative Standards, Abstract Concepts, and Negativity.
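To give a sense of what a component is, here is a small sketch that finds the first principal component of a table of LAT frequencies. It uses power iteration on the covariance matrix, a simple stand-in chosen so that the example needs no external library; it is not necessarily how the analysis above was computed. The function name, the three LAT columns and all the numbers are hypothetical.

```python
import math

def first_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component.

    rows: one list of LAT frequencies per play.
    loadings: one weight per LAT; scores: one value per play.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance between every pair of LATs.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeated multiplication pulls the vector towards the
    # direction of greatest variance. An uneven start vector makes it unlikely
    # (though not impossible) to begin exactly at right angles to that direction.
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]

    # The sign of a component is arbitrary; fix it so the first loading is positive.
    if vec[0] < 0:
        vec = [-x for x in vec]
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centred]
    return vec, scores

# Hypothetical frequencies: columns are PersonProperty, SenseObject, AbstractConcept.
plays = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
]
loadings, scores = first_component(plays)
print([round(x, 2) for x in loadings])
print([round(x, 2) for x in scores])
```

In this toy table the first two plays score together on a component that loads positively on the first two columns and negatively on the third, which is the kind of pattern described above: a component that favours some LATs and discriminates against others.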

The passage from the four plays that is most exemplary of this component comes from Westward Ho. Words underlined in purple are Person Properties, while bright yellow indicates Sense Objects:

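[Image: passage from Westward Ho with Person Properties underlined in purple and Sense Objects in bright yellow]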

In this scene, the bawd Birdlime tries to protect the identity of one of her clients, Tenterhook, from another who has entered her house. Tenterhook hides in a closet with the prostitute, Luce, and covers her eyes. She tries to identify him by the feel of his hands and what he wears on them. In guessing, she reveals the names of all her clients, thereby contradicting the bawd’s claim that whores practice a kind of doctor-patient confidentiality. The most frequent elements in this scene
