April 5th, 2013 Humanities Initiative: New Methods in Digital Research


On Friday, April 5, 2013, OLC attended the Humanities Initiative’s event, New Methods in Digital Research, featuring Daniel Cohen, Assistant Professor at George Mason University.


“The DPLA should act as a generative platform for undefined future uses,” Daniel Cohen quoted the DPLA working group. “They’re basically saying, ‘Look, let’s just create something that people will use maximally.’ You can’t just have data and look at it, you need to analyze it to create some sort of visualization,” he said.

“We set up a site where people could come and upload pictures and stories,” Cohen said. He had created the September 11 Digital Archive. “It was up before Flickr or blogs,” he added. “It has many uses and people have interest in this accessible medium. When Roy and I worked together, we looked at the sequence logs and found out people used the search functions. People did a lot of searches for teen slang. This is an undefined usage. Linguists came by and took these accounts and wrote about them when we had intended it for historians. We then thought we should keep a research interface, but provide .XML and full text to be more flexible. I think this is a basic lesson on how we make our research material more available,” he said.

“We did another project on Hurricane Katrina,” Cohen said. “We learned a lot from the 9/11 archive. This time, we added a geo-location tag. We’re really trying to add geocode and control neighborhood names. We have about 30,000 objects in the archive now. How do we move the archive to a research-oriented role? So we take this collection of stories and, well, it’s on OpenRefine and GitHub. There’s software that runs in your browser and loads up your text. It’s tabular data you’re working with. We asked for as much information as possible to get concise research analysis. This way, you can get good refined data. The point of this is to quickly visualize your metadata or the data itself. OpenRefine works on facets and we can visualize each facet. It’s like Excel and you can do quick eyeballing to see if data is good or not when you visualize it. If you don’t use programs like OpenRefine, you’re going to have tons of data you have to sift through and it’s time-consuming,” Cohen said. With OpenRefine, users can easily edit and clean up data and find anomalies before they actually start working with it.
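As a rough illustration of the facet idea Cohen describes, the Python sketch below counts how many rows share each value in one column of tabular data, which is approximately what a text facet displays. It is not OpenRefine's own code, and the records, the `neighborhood` field, and both function names are made up for the example.

```python
from collections import Counter

# Hypothetical records standing in for archive metadata; not real data.
records = [
    {"id": 1, "neighborhood": "Lower Ninth Ward"},
    {"id": 2, "neighborhood": "lower ninth ward"},
    {"id": 3, "neighborhood": "Gentilly"},
    {"id": 4, "neighborhood": "Lower Ninth Ward "},
    {"id": 5, "neighborhood": ""},
]


def facet(rows, field):
    """Count rows per distinct raw value of `field`, like a text facet."""
    return Counter(row.get(field, "") for row in rows)


def normalized_facet(rows, field):
    """Count again after trimming whitespace and lowercasing,
    so near-duplicate spellings fall into the same bucket."""
    return Counter(row.get(field, "").strip().lower() for row in rows)


raw = facet(records, "neighborhood")
clean = normalized_facet(records, "neighborhood")

# Print each bucket with its row count, most common first.
for value, count in clean.most_common():
    print(repr(value), count)
```

Comparing the raw and normalized counts is the "quick eyeballing" step: inconsistent spellings and blank values show up as small, suspicious buckets.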

“We got a grant from Google to work on 1.6 million books from the Victorian era,” Cohen said. “With the caveats of working on these projects, we were able to extract data from titles of books to figure out what’s important and what’s not using the Google NGram site.” Google’s Ngram Viewer takes a corpus of books and visualizes it: it counts the words in those books so that users can see when a word first appeared and how its usage trended over time.
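To make the idea of a word trend concrete, here is a deliberately simplified Python sketch that computes, for each year, the share of book titles containing a given word. This is not how Google's Ngram Viewer is implemented, and the titles, years, and the `word_share_by_year` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical (year, title) pairs; not drawn from the actual corpus.
titles = [
    (1840, "A History of Science"),
    (1840, "Sermons on Faith"),
    (1860, "Science and Faith"),
    (1860, "The Science of Language"),
]


def word_share_by_year(titles, word):
    """Return {year: fraction of that year's titles containing `word`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    target = word.lower()
    for year, title in titles:
        totals[year] += 1
        # Split on whitespace so only whole words match.
        if target in title.lower().split():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


trend = word_share_by_year(titles, "science")
print(trend)
```

Plotting the resulting dictionary against the years gives a small-scale version of the trend line a reader would see on a word-frequency chart.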

“Once we get down to tabular data and arrays, we can work with the data,” Cohen said. “We’re not looking at charts and words, but actually close readings. You can see the evolution of thought and phrases in a simple and easy way from Google.”

Regarding metadata, Cohen said that he was really interested in geocoding because it makes data transparent. “Using Google Maps, I took images and text that we had and used freeform text to Google API to visualize the data. That led to further visualization to show what people were watching on television. I pulled data straight from the corpus and extracted what people were viewing and combined zip codes with coordinates to convert it into a .KML file and threw it into Google Earth. You start with raw data and download it into a text document, then convert that into a .KML for Google Earth,” he reiterated.
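The last step Cohen mentions, turning zip codes and coordinates into a KML file, can be sketched in a few lines of Python. This is only an assumption about what such a conversion might look like, not his actual script: the zip-to-coordinate table holds approximate values, and the rows and the `rows_to_kml` function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical zip-code lookup as (latitude, longitude); approximate values only.
ZIP_COORDS = {
    "10001": (40.75, -73.99),
    "22030": (38.85, -77.31),
}

# Hypothetical extracted rows: (zip code, what the person reported watching).
rows = [("10001", "CNN"), ("22030", "NBC"), ("99999", "ABC")]


def rows_to_kml(rows, zip_coords):
    """Build a KML document with one Placemark per row whose zip code is known."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for zip_code, label in rows:
        if zip_code not in zip_coords:
            continue  # skip rows that cannot be placed on the map
        lat, lon = zip_coords[zip_code]
        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = label
        point = ET.SubElement(placemark, "Point")
        # KML expects longitude first, then latitude.
        ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


kml_text = rows_to_kml(rows, ZIP_COORDS)
print(kml_text)
```

Saving `kml_text` to a file with a `.kml` extension produces something Google Earth can open; rows with unknown zip codes are dropped here rather than guessed.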

“Something that I’ve been working with is called CAESAR [CAndidatE Search And Rank], a text-mining tool to do sentiment analysis of text. You take a corpus that has a bunch of words that might be associated with certain emotions. Using that, you can extract data of the frequency of the emotions that have been written about,” Cohen said.
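The general approach Cohen outlines, matching text against a list of emotion words and tallying the results, might look something like the Python sketch below. It does not reproduce CAESAR, whose internals are not described here; the tiny lexicon, the sample sentence, and the `emotion_counts` function are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon mapping words to emotions; a real one would be far larger.
EMOTION_LEXICON = {
    "afraid": "fear",
    "scared": "fear",
    "angry": "anger",
    "furious": "anger",
    "sad": "sadness",
    "grief": "sadness",
}


def emotion_counts(text, lexicon=EMOTION_LEXICON):
    """Count how often each emotion's words occur in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(lexicon[w] for w in words if w in lexicon)


sample = "I was scared and angry. Everyone around me was afraid."
counts = emotion_counts(sample)
print(counts)
```

A word list this simple ignores negation and context ("not afraid" still counts as fear), which is one reason such counts are better read as a starting point for close reading than as a conclusion.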

Cohen talked at length about Old Bailey, an online archive of court cases from the 18th century onward. “Folks who designed Old Bailey,” Cohen said, “they maximized the flexibility of processing the data. We had our international group who performed a bunch of experimental research and analysis on the website: Mathematica, data warehousing, NCD, Zotero and Voyeur. We could get very full visualizations like gender crime and prosecutions. We could pull out data like rioting and make parallels to moments in history. Extensions like Paper Machines help visualize data that you mine using Zotero, but I don’t think a chart is going to replace an essay. We’re always going to need a way to contextualize data. Of course, visualizations are great ways to help text,” Cohen said.