Visualization to connect the dots

Using Overview without DocumentCloud, and how to import text from a CSV

by Jonathan Stray on 12/18/2012

Overviewproject.org can now import a document set from a CSV file. This is a standard, simple format for tabular data that many programs can read and write. For example, Excel can save a spreadsheet as a CSV file. This allows use of Overview without uploading the documents to DocumentCloud, and makes it much easier to import data from sources such as Twitter.

There are two basic steps to loading a CSV file into Overview: get your data in the correct format, and upload it. You may also need to extract text from a set of PDFs.

Getting your data into the correct format

A CSV file is simply a list of “comma-separated values,” organized into rows and columns, like a spreadsheet or a table. The file starts with a list of the column names, separated by commas. This is followed by each row of data, one row per line, with the values for each column again separated by commas. Overview only requires one column, which must be named “text.” Here is an example file:

text
This is the content of the first document.
And here is the text of document the second
Document three talks about quick brown foxes.
.
.

If the text of a document spans multiple lines, or itself contains commas, then it needs to be quoted. Quotes inside a quoted document must be “escaped” by turning them into double quotes. This is all standard CSV stuff, and any program or library that writes CSVs should do it automatically.

text
"This document is really long and crosses multiple lines and contains
commas, which is why it is quoted."
"This is the second document. I'd like to say ""Hi!"" to everyone
to show how to put quotes inside a quoted document. The text of
this document can cross as many lines as needed, or even contain
blank lines like this:

The second document ends with this final quote."
The third document fits all on one line so no quotes needed.
"The fourth document has a comma in it, so it's quoted too."
.
.
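
If you are generating the file from a script, a CSV library will handle the quoting for you. Here is a minimal sketch using Python's standard csv module; the function name write_overview_csv and the sample documents are made up for illustration, and only the “text” column name comes from Overview's format.

```python
import csv
import io

# Sample documents (invented for this sketch): one spans two lines and
# contains commas, one contains quotes, one needs no quoting at all.
documents = [
    "This document is really long and crosses multiple lines and contains\n"
    "commas, which is why it is quoted.",
    'I\'d like to say "Hi!" to everyone.',
    "The third document fits all on one line so no quotes needed.",
]


def write_overview_csv(texts, out):
    """Write one document per row under a single 'text' column."""
    writer = csv.writer(out)  # default quoting: only quote fields that need it
    writer.writerow(["text"])
    for text in texts:
        writer.writerow([text])


buffer = io.StringIO()
write_overview_csv(documents, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the CSV back recovers the original documents, line breaks and all.
rows = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file instead of an in-memory buffer, open it with newline="" as the csv module documentation recommends, so line breaks inside documents are not altered.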

And that’s it. Overview will display the text in the viewer pane when you click on each document. If you want to display something else for the document, you can add a “url” column which tells Overview to load a particular web page when you view that document. For security reasons, this must be an https URL. Here’s an example using tweets:

text,url
"New deploy today -- cleaner clustering, better handling of larger document sets. Anyone got a pile of PDFs they want to look at? Try it!",https://twitter.com/overviewproject/status/281075194557259777
"""“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents"" ~@jackgillum on need for document mining software.",https://twitter.com/overviewproject/status/264450385928929280
.
.

You can also add a unique ID column, simply named “id”. Overview will read it and associate each ID with its document, and exported tags (coming soon) will refer to documents by these IDs.
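
Putting the three columns together, a file could be built along these lines. This is a sketch under a few assumptions: the helper names build_rows and rows_to_csv are hypothetical, sequential numbers are used as IDs purely for illustration, and the column order is assumed not to matter.

```python
import csv
import io
from urllib.parse import urlparse


def build_rows(records):
    """Turn dicts with 'text' and an optional 'url' into rows with id, text, url."""
    rows = []
    for number, record in enumerate(records, start=1):
        url = record.get("url", "")
        # The url column must hold https URLs, so reject anything else early.
        if url and urlparse(url).scheme != "https":
            raise ValueError(f"not an https URL: {url}")
        # Sequential numbers as ids are an assumption made for this sketch.
        rows.append({"id": str(number), "text": record["text"], "url": url})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "text", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tweets = [
    {
        "text": "New deploy today -- cleaner clustering, better handling of larger document sets.",
        "url": "https://twitter.com/overviewproject/status/281075194557259777",
    },
    {"text": "A document with no web page to show."},
]

csv_text = rows_to_csv(build_rows(tweets))
print(csv_text)
```

The first tweet contains a comma, so the writer quotes it automatically. The second row simply leaves the url column empty.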

Uploading your CSV file to Overview

First, select the upload option from the main document set list page:


Then choose a file. Overview will show a preview and do some basic checks to ensure that the format is OK. It should look like this:


You can also tell Overview what character encoding the file uses. Try changing this if you see funny square characters in the preview, or accents aren’t displaying right. Then hit upload, and away we go. You can use Overview as usual on the document set.
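
If you would rather fix the encoding before uploading, you can convert the file yourself. The sketch below assumes the file was saved as Windows-1252 (a common default for spreadsheet exports, but only a guess for any particular file) and that UTF-8 is one of the encodings Overview offers. The function name reencode_to_utf8 is hypothetical.

```python
def reencode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes using the encoding they were really saved in, then re-encode as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


# A tiny CSV as it would look on disk if saved as Windows-1252 (assumed example).
original = "text\ncafé au lait\n".encode("cp1252")

# Reading it with the wrong encoding damages the accented character,
# which is the kind of problem that shows up in the upload preview.
wrong = original.decode("utf-8", errors="replace")

# Converting with the right source encoding preserves it.
fixed = reencode_to_utf8(original, "cp1252")
right = fixed.decode("utf-8")

print(repr(wrong))
print(repr(right))
```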

Creating a CSV for Overview from a collection of PDFs

Overview does not currently support viewing a collection of PDFs without DocumentCloud in an integrated way. However, there is a workaround, based on a tool from the prototype version. You will need some familiarity with the command line to do this. First install Git and download the prototype, then use the loadpdf script to extract the text from a folder full of PDF files and create a CSV suitable for uploading into Overview. This process is described in the documentation for the prototype.

Unfortunately you will not be able to view the original PDFs within Overview without putting them on a web server somewhere and then modifying the URL column to point to the location of each document. We’re working on it.
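
To give a sense of what the CSV-building step involves, here is a rough sketch. It is not the loadpdf script from the prototype. It assumes the text has already been extracted from each PDF into a .txt file by some other tool, and the function name text_files_to_csv, the sample file names, and the choice of the file name as the “id” are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def text_files_to_csv(folder, csv_path):
    """Write every .txt file in folder as one row (id, text) of a CSV file."""
    paths = sorted(Path(folder).glob("*.txt"))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])
        for path in paths:
            # Using the file name (without extension) as the id is an assumption.
            writer.writerow([path.stem, path.read_text(encoding="utf-8")])
    return len(paths)


# Demonstration with two throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "letter-001.txt").write_text(
        "Dear Sir,\nPlease find the attached request.", encoding="utf-8"
    )
    (folder / "letter-002.txt").write_text("A short second letter.", encoding="utf-8")

    out = folder / "documents.csv"
    count = text_files_to_csv(folder, out)

    with open(out, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(count, [row["id"] for row in rows])
```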

 

 

Document mining shows Paul Ryan relying on the programs he criticizes

by Jonathan Stray on 11/02/2012

One of the jobs of a journalist is to check the record. When Congressman Paul Ryan became a vice-presidential candidate, Associated Press reporter Jack Gillum decided to examine the candidate through his own words. Hundreds of Freedom of Information requests and 9,000 pages later, Gillum wrote a story showing that Ryan has asked for money from many of the same Federal programs he has criticized as wasteful, including stimulus money and funding for alternative fuels.

This would have been much more difficult without special software for journalism. In this case Gillum relied on two tools: DocumentCloud to upload, OCR, and search the documents, and Overview to automatically sort the documents into topics and visualize the contents. Both projects are previous Knight News Challenge winners.

But first Gillum had to get the documents. As a member of Congress, Ryan isn’t subject to the Freedom of Information Act. Instead, Gillum went to every federal agency — whose files are covered under FOIA — for copies of letters or emails that might identify Ryan’s favored causes, names of any constituents who sought favors, and more.

Bit by bit, the documents arrived — on paper. The stack grew over weeks, eventually piling up two feet high on Gillum’s desk. Then he scanned the pages and loaded them into the AP’s internal installation of DocumentCloud. The software converts the scanned pages to searchable text, but there were still 9000 pages of material.

That’s where Overview came in. Developed in house at the Associated Press, this open-source visualization tool processes the full text of each document and clusters similar documents together, producing a visualization that graphically shows the contents of the complete document set.

“I used Overview to take these 9000 pages of documents, and knowing there was probably going to be a lot of garbage or extra attachments, to separate the chaff from the wheat,” said Gillum. Much of Ryan’s correspondence is standard congressional work, communicating with constituents about their particular problems and issues. “I could figure out where are the letters from voters, and to put these documents in groups. So if someone’s complaining about the FCC, and there are 200 pages about that, we can put that aside.”

DocumentCloud supports key word search, but search won’t always tell the full story. First, much of the material was of such low quality, such as copies of faxed letters, that the OCR process that converts a scanned image into searchable text often produced incorrect results. This means that a literal search will miss documents. Second, searching will not help you find stories that you don’t know you are looking for, a problem that gets worse as the number of documents grows. You need something like a table of contents to avoid that problem, which is what Overview provides.

In this case, Overview was able to group letters signed by Ryan, by recognizing certain standardized language in the header and footer, even when that text was sometimes garbled by the OCR process. “It found a cluster of the documents that Ryan had written over his signature,” said Gillum.

Tools like DocumentCloud and Overview are rapidly becoming essential as reporters are forced to deal with ever increasing amounts of information. It is not unusual for a single request for government files to produce thousands or even tens of thousands of pages of material, far too much to read exhaustively.

“Using these sorts of tools is essential as we go forward, looking at big document sets, to provide readers with some insight into how government works,” said Gillum.

“I’m not going to sit out on the newsroom floor and sort pages into stacks of documents,” he said.

VIDEO: document mining with Overview

by Jonathan Stray on 10/31/2012

With the release of the new, web-only version of Overview that runs in your browser, we thought it was time to make a little video showing how to use it.

If that doesn’t answer your questions, see also the help page, and the FAQ.

How to Use Overview to Explore A Document Set

by Jonathan Stray on 09/20/2012

It takes just a few minutes to start exploring your documents in Overview. Overview depends on DocumentCloud to store, OCR, and publish documents, so you will need a DocumentCloud account (here’s how to get one.)

1. Batch upload your documents to a DocumentCloud project
Log in to your DocumentCloud account. Create a project to store all of your files, using the “New Project” button. Then select “New Documents.” Now here’s the trick to batch uploads: when the file dialog box opens, you can select all of the documents in a folder simultaneously by clicking on the first, then shift-clicking on the last (or pressing Control-A on Windows, or Command-A on Mac). You can keep the documents private if you like.


2. Log into Overview and import your project
Go to overviewproject.org and log in, or create an account. Select “import your project from DocumentCloud” and enter your DocumentCloud username and password when prompted. Your DocumentCloud projects will appear. Select the project that you want to explore, and get a coffee while Overview imports and analyzes it.


3. Explore the documents in the tree view

Overview’s main screen is divided into four parts: the topic tree, the tag list, the document list, and the document viewer.


The topic tree view displays your documents sorted into the topics and sub-topics that Overview has automatically created for your documents. The big node at the top contains all documents. It splits into several smaller nodes below, each of which contains  documents on similar topics. The nodes are different sizes, because sometimes Overview finds many documents on a similar topic, while in other cases a document is so unique that Overview puts it into a node all by itself.

You can pan the tree left and right by dragging with the mouse, or moving the scroll bar. You can zoom into the tree by using the mouse wheel, two fingers on the trackpad, or dragging one end of the scroll bar. Nodes which have a small ⊕ in the center can be expanded to show children, while ⊖ hides children.

Each node is labelled by the top keywords from the documents in that node. These words tell you the topic of the node. The children of a node contain, collectively, all of the documents in the parent, but broken down into more specialized topics.

When you select a node, the documents in it appear in the document list. Each document in the list is represented by a list of keywords specific to that document. Clicking on a document on the list loads it in the document viewer.

4. Tag interesting documents
As you explore the topic tree, you’ll run across individual documents or entire nodes you want to remember. Enter a descriptive tag in the “new tag” field and press “tag.” The currently selected documents will be tagged, and a little tag color swatch will appear next to them in the document list.


Once you’ve created a tag, you can add the currently selected documents to it at any time by pressing the + button that appears when your mouse is over the tag name.  (To tag an entire node at once, select the node and then press the +/- button.) Or press – to remove the tag.

Clicking on a tag name selects that tag, highlighting the tagged documents in the tree and loading them into the document list.

5. Work your way through the tree
When you have a lot of documents, it pays to be systematic. We recommend working your way through the nodes in the tree from left to right — biggest topics to smallest topics. Select a node, then view a few of the documents in it to see if you understand what they have in common. If there seems to be more than one important topic in the documents in that node, try opening up the child nodes instead, until you find a node where all of the documents are more or less the same. Then, tag that node with a descriptive label.

As you proceed, you may find documents in the same topic in different nodes. Overview doesn’t know what story you are working on, so it can’t always guess how the documents should be arranged. You can apply a tag to any combination of nodes and documents to create a set that is meaningful to you.

You may also discover that the documents in a node are irrelevant to your story, in which case you can tag them with “read” and simply move on. Part of the power of Overview is being able to decide not to look at an entire topic.

When you’ve finished this process, you’ll have a neatly categorized tree, and a set of tags corresponding to all the interesting topics in your documents.

6. Ask for help!
Questions? Bugs? Something you’d like to see in a future version? Contact us!

How I Used Overview to Report on 8,000 Police Department Emails

by Jonathan Stray on 09/19/2012

Guest post by Jarrel Wade, Tulsa World. Originally at PBS IdeaLab

In May, I published a story which described how the Tulsa Police Department in Oklahoma purchased millions of dollars of under-powered and under-tested computer hardware, resulting in a multitude of problems.

Emails showed complaints from the field in which officers were unable to get basic police information about dangerous calls when they were en route to scenes, or network dead spots around town that officers were completely avoiding.

But leading into April, I had no idea how I was going to read all these emails by myself.

Three weeks away from receiving the documents, I called my city records official for an update and was told my request had expanded several times over. I would be receiving about 8,000 emails from the city of Tulsa based on a server keyword search regarding a technology purchase for the city’s police department. By far the largest open records request I’ve ever made, it took a four-month city legal review and would end up being its own line-item on the police department’s budget, the chief later told me.

Searching the Internet and IRE website for help on reviewing thousands of emails, I came across DocumentCloud and Overview. The Overview developers had just made a pre-beta version available for testing. I had been prepared to spend months of my spare time reading email after email, opening PDF after PDF as long as I could hold out on my editors without writing a story. Overview was the perfect find for a tech-savvy reporter — installing a staircase to the top of the mountain.

The Next Step
After some difficulty cleaning the documents (the emails came in Outlook format, which became a pain to convert to clean PDFs), the next step was figuring out how to make Overview work for my documents. My first impression after loading the emails was, “OK, now what?”

I found that Overview works differently for every document set. For emails, I think Overview works best with a completely random selection — say all emails for a department in a given month. Lots of emails would be meaningless, spam or pictures of cats, so Overview can be used to easily dismiss the majority. Given a set of emails based on a keyword search, the problem is more difficult because most of the emails will be at least somewhat relevant.

In this case, Overview was most useful as an organizational tool. I could look at an email, make a note, and easily have it grouped with other similar emails through tagging.

I started with a branch of Overview’s document tree and started clicking, glancing, noting and tagging. Right off the bat, I found that Overview had grouped together all of the similarly formatted “service desk” requests. There were hundreds if not thousands of those, so I was able to tag them by the dozens without a second thought, while focusing on the more meaty emails.


The next thing Overview did for me was to generally group email chains together. Much of my document set was taken up with emails that were replies or mass messages that were duplicates of other emails. Those were easy because I could find the most complete version, annotate it, and write off the rest.

Several hundred tags into my documents, I realized that simply tagging an email into a single sub-group was not good enough. Overview allows each document to have several different tags, I found out. Until then, I had been tagging emails by the sender’s department — city legal, city IT, police IT, purchasing, police administration, etc. From that point forward, I also tagged each email according to whether or not it was sent from one of my main players, and a separate tag for whether it was “important.” That allowed me to look later at all my quotable or crucial emails.

A good strategy for tagging your documents is important. I recommend having a tag for important, crucial, quotable or relevant documents — whatever you wish to call it.

The End Game
Once all the emails were tagged, the important ones were annotated, and I felt like I had a good “overview” of the document set, it was time for the end game. This is where Overview developers are hard at work — Overview already facilitates digesting and reviewing thousands of emails, but how does it handle the one email that’s different from the rest, because it’s the only one discussing officers accepting payments for travel from the vendor of an item you spent millions of public dollars on?


This email only has two people in it from any of the other emails and almost none of my keywords, but it became crucial to my story.

Overview is not far away from being the one-stop, mass-document-review source, but it’s not intended to do all the work for the reporter, I found (and its developers will agree, I’m sure.) I still had to go through all of the small branches in the document tree, looking for the unique, unmatched emails that Overview couldn’t pair with other documents, in case I had overlooked something.

Despite the final effort, Overview was still crucial to my end game as I was able to review and find documents far more easily than if I were to search for a given email through a keyword search. All I had to do to review my work in Overview was select my “Important” tag and scan through the few hundred emails that I deemed important.

Another interesting part here was that I would remember emails that I had deemed irrelevant at first read, but now seemed relevant because of another supplemental email. I could easily find the original email by pulling up the tag and looking through that bank of emails in minutes. Keyword searches for one email out of 8,000 just don’t compare to the organization Overview provides.

In the end, I’m guessing it would have taken four reporters — splitting up emails into stacks of a few thousand — to do the work I did in two weeks. Furthermore, they’d have to do it full-time and compare notes at the end. Overview, with the help of DocumentCloud, allowed me to have all my documents annotated in one place. Additionally, it invaluably allowed me to save my work, move to another story on my beat, and then start up again without losing momentum.

Finally, the work the Overview developers did to add to this program was impressive and very helpful. Every bit of feedback I gave led to immediate changes, which tells me the Overview team needs more feedback from a wider audience. It’s a wide-open program with tons of potential, in addition to many basic features that are practical now to any level of reporter with the gall to request thousands of documents.

How Overview turns Documents into Pictures

by Jonathan Stray on 06/04/2012

Overview produces intricate visualizations of large document sets — beautiful, but what do they mean? These visualizations are saying something about the documents, which you can interpret if you know a little about how they’re plotted.

There are two visualizations in the current prototype version of Overview, and both are based on document clustering.


The first is the items plot, which grew out of the proof-of-concept system we presented a year ago. Every document is a dot. Similar documents get pulled together to form visible groups, that is, clusters. All the dots start grey, but become colored as you apply tags while exploring the document set. You can click on individual documents to view them, or select a whole region of documents to see what they have in common.


Overview also has a “tree” view. Documents are still organized into clusters, but each “node” in the tree is an entire cluster, not just a single document. Also, the clusters are hierarchical, meaning that the larger clusters (higher up the tree) contain all the documents within their child clusters (lower down the tree.) The bottom of the window displays the top words and two-word phrases from the selected nodes. In this case, the selected node contains press releases discussing oil industry subsidies.

The tree view and the items plot show the same thing, just in different ways. When you select documents in one view, the same documents are selected in the other. They’re two different ways of looking at the same set of clusters: hierarchically categorized, or laid out visually.

Extracting Key Words
All of Overview’s clustering depends on grouping similar documents together, but what does that mean? Conceivably, two documents might be “similar” because they were written by the same person, talk about the same event, or came from the same place. There are as many potential categorization schemes as there are stories.

But Overview doesn’t know any of this. Instead, it breaks down documents by words and short phrases. It starts by counting how many times each word appears in each document. Frequent words are “key” words. But the language processing also discounts words which appear in too many documents. This gets rid of common English words like “the” and “is,” but also suppresses words which are very common in your specific set of documents. If most documents from a set of police reports contain the word “crime,” Overview will mostly ignore that word.
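
The weighting described above resembles the standard TF-IDF scheme (term frequency times inverse document frequency). The sketch below shows that textbook version on three invented police reports. The exact formula Overview uses is not given here, so treat the function keyword_weights and its logarithmic discount as assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_weights(documents):
    """Weight each word by its count in a document, discounted by how many documents contain it."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    total = len(documents)

    # Number of documents each word appears in.
    doc_freq = Counter()
    for count in counts:
        doc_freq.update(count.keys())

    # A word found in every document gets log(1) = 0, so it is ignored entirely.
    return [
        {word: tf * math.log(total / doc_freq[word]) for word, tf in count.items()}
        for count in counts
    ]


# Invented sample reports: every one mentions "crime".
reports = [
    "The crime report describes a burglary at the warehouse",
    "The crime report describes a stolen car, a car chase, and the car recovered",
    "The crime report describes vandalism at the school",
]

weights = keyword_weights(reports)
top_word = max(weights[1], key=weights[1].get)
print(top_word, weights[1]["crime"])
```

In the second report, “car” comes out as the strongest key word, while “crime” and “the” get a weight of zero because they appear in every document.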

Two documents are similar if they have overlapping sets of key words. A cluster is a set of documents that are all pretty similar to one another, and less similar to all other documents. This sounds insanely naive; after all, this word counting process throws away pretty much all of the syntactic information in the text, including word order. It can’t differentiate between “police hit protesters” and “protesters hit police.” But it can group together all the documents that talk about police and protesters, and that by itself is useful enough. In fact, variations of this basic algorithm, called the vector space model, are used by every search engine.
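
A small sketch makes the point about word order concrete. It uses plain word counts and cosine similarity, a common similarity measure in the vector space model. Whether Overview uses exactly this measure is an assumption, and the function names are invented for the example.

```python
import math
from collections import Counter


def word_counts(text):
    """Represent a document as a bag of words: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same words in a different order: the bags of words are identical.
same_words = cosine_similarity(
    word_counts("police hit protesters"),
    word_counts("protesters hit police"),
)

# No words in common at all.
no_overlap = cosine_similarity(
    word_counts("police hit protesters"),
    word_counts("city council approves budget"),
)

print(same_words, no_overlap)
```

The first pair scores as essentially identical (a similarity of 1, up to floating-point rounding) even though the sentences mean opposite things, while the second pair scores zero.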

Where do those documents go?
This simple, word-based technique determines where each document is placed in the visualizations. In the Items plot, Overview tries to place similar documents close together. Collections of documents with similar words (and two-word phrases) naturally form groups, or clusters. Clusters with similar topics tend to be nearby. But that’s the extent of the process; the exact angle or position of each document and cluster doesn’t mean anything at all. In fact, it depends somewhat on the initial, random position of the documents, and every time you run Overview you will get a slightly different visualization — the same clusters will show up, but possibly in different places.

The tree view finds not only clusters but sub-clusters. In the example above, the yellow branch of the tree contains the key words “fees, credit card, consumers, airlines.” The left sub-branch has key words “fees, credit card, consumers” and the right branch has key words “fees, airlines, surcharges.” One sub-cluster is about credit card fees, while the other is about airline fees. They’ve been grouped into one larger cluster because they both contain many occurrences of the word “fees”.
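
One simple way to get clusters and sub-clusters, shown here purely as an illustration and not necessarily how Overview builds its tree, is to link documents whose key words overlap by at least some threshold and then raise the threshold to split a cluster apart. The documents, key word sets, and function names below are invented, and Jaccard overlap stands in for whatever similarity measure is actually used.

```python
def jaccard(a, b):
    """Share of key words two documents have in common (0 to 1)."""
    return len(a & b) / len(a | b)


def clusters_at(docs, threshold):
    """Group documents that are linked, directly or through others, by similarity >= threshold."""
    remaining = set(docs)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster = {start}
        frontier = [start]
        while frontier:
            current = frontier.pop()
            linked = {
                other for other in remaining
                if jaccard(docs[current], docs[other]) >= threshold
            }
            remaining -= linked
            cluster |= linked
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Invented documents, each reduced to a set of key words.
docs = {
    "card-1": {"fees", "credit", "card", "consumers"},
    "card-2": {"fees", "credit", "card", "banks"},
    "air-1": {"fees", "airlines", "surcharges"},
    "air-2": {"fees", "airlines", "baggage"},
}

loose = clusters_at(docs, 0.1)   # a weak link through "fees" is enough
strict = clusters_at(docs, 0.4)  # only strongly overlapping documents stay together

print(loose)
print(strict)
```

At the loose threshold all four documents form one cluster, held together by the shared word “fees.” At the stricter threshold that cluster splits into a credit card group and an airline group, which mirrors the parent and child nodes described above.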

For more information, see the discussion of our WikiLeaks visualization. Or if you’re really into all the gory details — including how Overview creates these visualizations efficiently, even for large document sets — we’ve recently released a technical report in collaboration with the University of British Columbia.

With a little practice and experimentation, you can learn how to read Overview’s visualizations effectively. If you want to use Overview for your own work, it’s important to get a feel for what these visualizations can tell you, and what they can’t.

VIDEO: Document Mining with the Overview Prototype

by Jonathan Stray on 03/16/2012

We’ve now completed one major story with Overview, and released the prototype system. But what, exactly, does Overview do? If you’re into buzzwords, you could say that Overview does semantic visualization and hierarchical clustering of large document sets, of course! Or, you could just watch this video of Overview in action on two document sets: 1,500 press releases scraped from the web, and 4,500 pages of declassified documents from the U.S. Department of State.

We’ll be making more videos in the future to help folks install and use Overview. Meanwhile, if you’d like to try Overview now, please see the installation instructions.

Getting Started with the Overview Prototype

by Jonathan Stray on 02/25/2012

Note: these instructions apply to the original prototype version. We highly recommend the new web version at this time — no installation required!

You can be up and running with the Overview prototype, browsing through the sample document sets, in just a few minutes.

Getting ready
First you will need Git to download the program and sample files. If you’re not used to Git, this might be a bit of a pain now, as opposed to
