December 12th, 2012 Open Analytics NYC: Using Open Source Technologies for Social Network Analysis

On Wednesday, December 12th, 2012, OLC attended Open Analytics NYC: Using Open Source Technologies for Social Network Analysis, featuring Craig Vitter, Product Evangelist for IKANOW, and Jason Capehart, Data Scientist for Research and Development Ventures, New York Times. Vitter spoke about strategies and methods for extracting and harvesting data from social media sources. Capehart talked about the Cascade project, "which allows for precise analysis of the structures which underlie sharing activity on the web."

Craig Vitter gave a talk about building effective networks on social analysis. "How can you use Twitter as analysis?" Vitter asked. "Social media, from an intelligence perspective, is data transformed to meet an operational need no matter what methodology is used." He described four iterative processes: collect data, store data, analyze data and distribute data.

Social media intelligence is a combination of the best and worst of human intelligence (HUMINT: real-life interaction), open-source intelligence (OSINT: Twitter and Facebook) and signals intelligence (SIGINT: private intelligence). "The goal of social media analysis is to provide value to organizations," Vitter said. "We want to turn data into intelligence using the operational lens to ensure cyclical data."

Vitter outlined some common misconceptions about social media. "Social media is not a panacea," he said. "Not everyone uses it, and if they do, they use it unevenly. User behavior also changes based on situations. Just because people can talk about anything doesn't mean they talk about everything all the time." Vitter also said, "A lot of people tend to analyze the 'What' instead of 'Why.'" The important thing is often not what people are saying, but why they are saying it. "It's not the message," he said. "It's why they are sending the message."

People also tend to use the wrong analysis tool. The tools that are used to analyze social media "rarely help dig into the 'Why.'" Tools like word clouds, sentiment metrics and information in aggregate are highly unreliable and misleading. "The dangers of disintegration are that the analytical value of the information is nothing."

An analytic framework needs to be constructed to capture, report and analyze data. "It gives information on what to measure, what the data is saying and what should be done based on that data," Vitter said. On choosing a platform, Vitter said that because social media and the ways it is used are relatively new and evolving rapidly, "static approaches to social media are flawed from the outset, that no one metric or set of metrics will always let you know what is happening," and suggested that organizations "buy something that is flexible."

Vitter presented a case study in which he and his team analyzed the game Elder Scrolls Online through Twitter. "The problem was, how do brand managers use social media to track public attitudes towards the product? How do you harvest this data?" Vitter used Twitter due to its excellent analytical potential. "There's enormous value if you know what you are looking for. It uses an open API, which makes it easy to do analysis on. There are limitations, however, due to its 140 character limit and limited history."

Data capture on Twitter was broken down into Who (Twitter handle), Where (geolocation), What (hashtags, keywords, URLs) and When (timestamp). "This way, we can extract natural language using natural tools." Data reporting was focused on graph analysis: "This way we can tell who is Tweeting who, who is the key influencer. If you're a marketing person, you want to have information on who is the most listened to," he said. Data analysis, in turn, needs to be rooted in operational need. "The creation of hypothesis on operation testing and experimentation should be too," Vitter noted. He talked about hashtags, calling them "way too generic," as the message conveyed wasn't being condensed down. "It undermines tracking and understanding," he said.
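The who/where/what/when breakdown and the "who is Tweeting who" graph can be illustrated with a short sketch. This is not the tooling Vitter's team used; the tweet dictionaries, field names and the mention-count influence score below are all assumptions made for illustration, using only the Python standard library.

```python
import re
from collections import Counter

# Hypothetical patterns for pulling the "what" out of raw tweet text.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")

def capture(tweet):
    """Reduce a tweet-like dict (assumed shape) to who / where / what / when."""
    text = tweet["text"]
    return {
        "who": tweet["user"],
        "where": tweet.get("geo"),
        "what": {
            "hashtags": [h.lower() for h in HASHTAG.findall(text)],
            "mentions": [m.lower() for m in MENTION.findall(text)],
            "urls": URL.findall(text),
        },
        "when": tweet["created_at"],
    }

def mention_counts(records):
    """Count how often each handle is mentioned by someone else.

    This in-degree count is a crude stand-in for "who is the most listened to".
    """
    counts = Counter()
    for record in records:
        for handle in record["what"]["mentions"]:
            if handle != record["who"].lower():
                counts[handle] += 1
    return counts

# Made-up sample data, not real tweets.
sample = [
    {"user": "alice", "text": "Can't wait for #ESO @bob http://example.com/a",
     "geo": None, "created_at": "2012-12-12T18:00:00Z"},
    {"user": "carol", "text": "RT @bob: #ESO beta news",
     "geo": (40.7, -74.0), "created_at": "2012-12-12T18:01:00Z"},
    {"user": "bob", "text": "Thanks @alice!",
     "geo": None, "created_at": "2012-12-12T18:02:00Z"},
]

records = [capture(t) for t in sample]
influence = mention_counts(records)
print(influence.most_common(3))
```

Counting mentions is only one possible influence measure; a fuller graph analysis would weight retweets, replies and follower reach differently.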

Regarding sentiment metrics, Vitter explained that they are a poor choice for analyzing Twitter data because each Tweet offers so little text to work with. "Larger text sources offer potential value with sentiment analysis that Tweets can't offer."

"Shape the conversation," Vitter said. "Create and promote hashtags that help shape the conversation and make it easier to analyze the Twitter stream." On segmenting data, he suggested sorting it by username, hashtag, keyword and geographical location. "All of these are things that allow you to get higher quality analysis on the data," Vitter said.
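Segmenting a harvested stream along those four dimensions might look roughly like the following sketch. The record layout and the helper names (segment, by_user, by_hashtag, by_keyword, by_place) are hypothetical and chosen only to show the idea of cutting the data down to a manageable size.

```python
from collections import defaultdict

def segment(tweets, key):
    """Group tweets into buckets; key(tweet) returns a list of labels."""
    buckets = defaultdict(list)
    for tweet in tweets:
        for label in key(tweet):
            buckets[label].append(tweet)
    return dict(buckets)

def by_user(tweet):
    return [tweet["user"]]

def by_hashtag(tweet):
    # Tweets without hashtags go into a catch-all bucket.
    return tweet["hashtags"] or ["(none)"]

def by_place(tweet):
    return [tweet["place"] or "(unknown)"]

def by_keyword(keywords):
    """Build a key function matching whole lowercase words in the text."""
    def key(tweet):
        words = set(tweet["text"].lower().split())
        return [k for k in keywords if k in words]
    return key

# Made-up sample records in an assumed, already-parsed shape.
tweets = [
    {"user": "alice", "text": "the beta looks great",
     "hashtags": ["eso"], "place": "NYC"},
    {"user": "bob", "text": "beta signup is broken",
     "hashtags": ["eso", "beta"], "place": "NYC"},
    {"user": "alice", "text": "price seems high",
     "hashtags": [], "place": None},
]

users = segment(tweets, by_user)
tags = segment(tweets, by_hashtag)
places = segment(tweets, by_place)
topics = segment(tweets, by_keyword(["beta", "price"]))
print({label: len(group) for label, group in topics.items()})
```

Each bucket can then be read in context, which keeps the analyst close to the source material rather than to an aggregate number.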

Vitter summed up the lessons learned from analyzing Tweets: "Don't try to drink from the fire hose," he said. "Sometimes, less is more. Don't try to harvest everything. Don't use metrics you can't tie to actions, 'Why' as opposed to 'What.' Don't use visualizations or reports that strip the data from its context. It's useless that way. Do segment data. Do bring the data to a manageable size. Look for 'Why.' Return to your source material and explore alternative explanations. Remember to always consider the ultimate goal."

Jason Capehart of the New York Times presented Cascade, a real-time big data social data science platform. It incorporates visualizations with data and analysis.

"Cascade is software created by the NYTimes," Capehart said. "On Twitter, there's a lot of people who talk and interact with content all the time. The question is: how do we display this? We developed Cascade to look at everything on Twitter that interacts with the NYTimes articles. We can see events that happen by minute, who is ReTweeting the Tweets, and we can link connections between Tweets and tie all users together."
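The idea of linking Tweets to the Tweets they were shared from, and of viewing activity by minute, can be sketched as a small tree-building exercise. This is not Cascade's implementation, which the talk did not detail; the event fields (id, user, parent, time) and the function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

def build_cascade(events):
    """Link each share to the share it was retweeted from.

    Returns (children, roots): a parent-id -> [child-id] map and the ids of
    original tweets, which have no parent.
    """
    children = defaultdict(list)
    roots = []
    for event in sorted(events, key=lambda e: e["time"]):
        if event["parent"] is None:
            roots.append(event["id"])
        else:
            children[event["parent"]].append(event["id"])
    return children, roots

def depth(children, node):
    """Number of tweets on the longest retweet chain starting at node."""
    return 1 + max((depth(children, c) for c in children.get(node, [])), default=0)

def per_minute(events):
    """Count share events in each one-minute bucket."""
    counts = defaultdict(int)
    for event in events:
        counts[event["time"].replace(second=0, microsecond=0)] += 1
    return dict(counts)

# Made-up sharing events around a single article link.
events = [
    {"id": "t1", "user": "nytimes", "parent": None,
     "time": datetime(2012, 12, 12, 18, 0, 5)},
    {"id": "t2", "user": "alice", "parent": "t1",
     "time": datetime(2012, 12, 12, 18, 0, 40)},
    {"id": "t3", "user": "bob", "parent": "t2",
     "time": datetime(2012, 12, 12, 18, 1, 10)},
    {"id": "t4", "user": "carol", "parent": "t1",
     "time": datetime(2012, 12, 12, 18, 1, 30)},
]

children, roots = build_cascade(events)
activity = per_minute(events)
print(roots, depth(children, roots[0]), activity)
```

In this toy data the original Tweet has two direct shares and one second-level share, which is the kind of offshoot structure the visualization is described as showing.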

Cascade can rank influence vs story and can look at a story and see what is happening with it. "You can view the visual in 3D and it looks like a radar. We focused on legibility, demonstrated speed of conversation and speed of offshoots of the conversation." Cascade is built on WebGL and stretches the space between points to use the GPU. It is not, however, fully real time: there is a delay of approximately 10 to 15 seconds.

"Scaling is an important factor in Cascade," Capehart said. "The ability to scale up and down depending on resources is pretty important." Cascade is a work-in-progress and Capehart hopes to have it released as soon as possible.