Big Data Warehousing Meetup: Exploring Big Data Analytic Techniques with Datameer

On Monday, February 11, 2013, OLC attended the Big Data Warehousing (BDW) Meetup event, Exploring Big Data Analytic Techniques with Datameer. The speakers were Joe Caserta (President and Founder of Caserta Concepts), Elliott Cordo (Principal Consultant at Caserta Concepts) and Adam Gugliciello (Solutions Engineer at Datameer).

"Big data is a complex, rapidly changing landscape," Joe Caserta said. "At Caserta Concepts, all we do is big data." Caserta Concepts serves the financial industry as well as education, retail/ecommerce, healthcare and many others. "We come up with the best solutions in the best way possible," he said.

Elliott Cordo presented Analyzing Data: Pig & Hive in a Hadoop Ecosystem. "Why do we need MapReduce?" Cordo asked. "What do we need these languages for? Well, they [Pig and Hive] are powerful languages," he said. "It's there to solve interesting and complex computational problems at scale. We're writing low-level language coding to perform low-level code, so we use Pig and Hive to increase productivity."

A distributed programming framework allows for a "divide and conquer" system, in which a master node divides the work into smaller, digestible chunks and distributes the split data to worker nodes. This step is called Map. The results from the worker nodes are then collected and combined into an answer. This step is called Reduce.
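
For readers who want to see the divide-and-conquer idea concretely, here is a minimal single-process sketch in Python. It is not Hadoop code and was not shown at the meetup; the function names (`split_into_chunks`, `map_chunk`, `shuffle`, `reduce_counts`) and the word-count task are assumptions chosen for illustration. On a real cluster the map calls would run in parallel on separate worker nodes.

```python
from collections import defaultdict

def split_into_chunks(records, n_workers):
    """Master step: divide the input into one chunk per (hypothetical) worker."""
    return [records[i::n_workers] for i in range(n_workers)]

def map_chunk(chunk):
    """Map step: a worker emits a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_outputs):
    """Group the intermediate pairs by key so each key is reduced exactly once."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce step: combine each key's values into a single answer."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines, n_workers=3):
    chunks = split_into_chunks(lines, n_workers)
    # Simulated here with a plain loop; a cluster would run these concurrently.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))

lines = ["big data is big", "data at scale", "pig and hive"]
counts = word_count(lines)
```

The result does not depend on how many chunks the input is split into, which is what lets the master spread the work across as many workers as are available.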

Cordo talked about Hive, Hadoop's data warehouse, and explained that HiveQL offers a SQL-like interface that lets the user define and analyze abstract, relational database-like structures on top of non-relational or relational data. But, Cordo said, "Hive is not a database. There's no optimization like relational databases have. It's meant to be an analysis tool for problems within Hadoop." Cordo explained that the SQL is translated into MapReduce jobs and that even simple queries can take up to a minute or longer.
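
To make the SQL-to-MapReduce translation less abstract, the sketch below shows, in plain Python, roughly how a simple aggregate query could be expressed as a map phase and a reduce phase. It is a conceptual illustration only, not Hive's actual query planner; the `trades` table, its columns and the query in the comment are hypothetical.

```python
from collections import defaultdict

# Hypothetical query being illustrated:
#   SELECT symbol, SUM(quantity) FROM trades GROUP BY symbol
trades = [
    {"symbol": "AAA", "quantity": 100},
    {"symbol": "BBB", "quantity": 50},
    {"symbol": "AAA", "quantity": 25},
]

def map_phase(rows):
    """Emit the GROUP BY column as the key and the summed column as the value."""
    for row in rows:
        yield row["symbol"], row["quantity"]

def reduce_phase(pairs):
    """Apply the aggregate (SUM) to all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

result = reduce_phase(map_phase(trades))
```

Even a one-line query becomes a full batch job in this model, which is consistent with Cordo's point that simple Hive queries can take a minute or more.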

"Pig is not an abstraction," Cordo said. "It's a high-level language. Pig is a powerful high-level programming language and it's not an ad hoc tool, it's meant to process data." Cordo admitted that there is a small learning curve to fully utilize Pig, but once learned, it is "excellent for data transformation."

"Hive is helpful for ETL [Extract, Transform, Load] and it directly leverages SQL expertise," Cordo said. "Pig is great for ETL and it's SQL-like, but different in many ways."

Adam Gugliciello of Datameer presented next, first giving a brief definition of what Hadoop actually is. "Hadoop is a file system and a processing system," Gugliciello said. "Datameer is a self-service analytics for big data and Hadoop works as a disruptive response." Gugliciello explained that there are several advantages to using Datameer: economics, flexibility and scalability. It also allows for rapid adoption and has been adopted by Yahoo! and Facebook, among other early adopters. It is also data-driven, and other Fortune 500 companies have rapidly deployed Datameer.

The value proposition of Datameer is that it makes big data analytics accessible to business users. It can be focused on institutional risk, for example, to identify "departmental asset movement, craft an early-warning system for future detection and aggregate sources and correlate." Datameer is also built for extreme scale and performance and for seamless integration with all data sources, and it has a low cost of ownership. "Datameer was founded by Hadoop and enterprise software veterans," Gugliciello said.

Datameer brings capabilities for three roles: administrator, analyst and decision maker. It also has a built-in map reducer to control optimizations and remove reliance on other languages. The UI also allows for data visualization.

Regarding data quality, Datameer enables "data stewards to run ad hoc analysis and dashboards and drives iterative processes to refine data." Gugliciello said, "We're moving to a post-ETL world. The problem is that the transition is slow. Data warehouse remains static and business intelligence is a barrier. All three of these play a brittle role in structure." Now, raw data is loaded quickly, without indexing. "You can define the structure when you ask for it. This is real control over files [dynamic, Hadoop] and self-service," he said. Datameer also has drag-and-drop spreadsheets.
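
The idea of defining the structure only when you ask for it is often called schema-on-read. The Python sketch below illustrates the general concept under stated assumptions: it is not Datameer's API, and the sample lines, the `read_with_schema` helper and the column names are all hypothetical.

```python
import csv
import io

# Raw lines are stored as-is: no parsing, typing or indexing at load time.
RAW_LINES = [
    "2013-02-11,AAA,100,12.50",
    "2013-02-11,BBB,50,7.25",
    "2013-02-12,AAA,25,12.75",
]

def read_with_schema(raw_lines, schema):
    """Apply a list of (column name, type) pairs to raw text at read time."""
    reader = csv.reader(io.StringIO("\n".join(raw_lines)))
    for fields in reader:
        # zip stops at the shorter sequence, so a narrower schema
        # simply ignores the trailing fields it does not name.
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

# One analyst asks for typed trade records...
trade_schema = [("date", str), ("symbol", str), ("quantity", int), ("price", float)]
rows = list(read_with_schema(RAW_LINES, trade_schema))
total_quantity = sum(row["quantity"] for row in rows)

# ...while another reads the same raw lines with a narrower structure.
symbols_only = [("date", str), ("symbol", str)]
symbols = {row["symbol"] for row in read_with_schema(RAW_LINES, symbols_only)}
```

Because nothing is fixed at load time, the same raw files can serve different questions without a new ETL pass, which is the contrast with a static data warehouse that Gugliciello was drawing.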

Datameer offers seamless integration of structured, semi-structured and unstructured data, with 25+ connectors and a connector plug-in API. It is also a powerful analytical tool, with an interactive spreadsheet UI, a visualization plug-in API, and infographics and dashboards on the business end of the platform.

Datameer is not open source, and it uses its own servers and its own memory. "Datameer is the conductor," Gugliciello said. Regarding security, Datameer monitors for external intrusions and can detect suspicious traffic.