April 22nd, 2013 BDW: Intro to NoSQL Databases


On Monday, April 22, 2013, OLC attended Big Data Warehousing Meetup’s Intro to NoSQL Databases event held at New Work City. The event featured Joe Caserta, CEO of Caserta Concepts; Elliot Cordo, Principal Consultant at Caserta Concepts; and Mike O’Brian of 10gen. They introduced NoSQL databases and then focused on the syntax and usage patterns of a new aggregation system in MongoDB. “The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging and finding minima or maxima while grouping keys in a collection and complementing MongoDB’s built-in Map/Reduce capabilities.”



http://www.casertaconcepts.com/                                http://www.10gen.com/

Joe Caserta introduced NoSQL to the audience that gathered for the BDW Meetup, which the Meetup website featured as the most popular event of the night.

“We used to have ERP [Enterprise Resource Planning], Finance and Legacy transferred to ETL [Extract, Transform, Load] and do a traditional EDW [Enterprise Data Warehouse], then put a traditional BI [Business Intelligence] over it. It’s successful for the most part, but we’re getting a lot of unstructured data right now,” Joe Caserta said. As examples of unstructured data, Caserta named data from Twitter, Facebook and other social media outlets.

“The old-fashioned way is now impossible to process the information. We use Big Data Clusters, but it’s all batch oriented.” He pointed to the high latency incurred when unstructured data is processed on big data clusters. His solution for lowering that latency is a NoSQL database serving as a low-latency warehouse. “We put it on top of our big data clusters to do low latency analytics correctly,” Caserta said. He asked whether there will still be a need for old-fashioned BI. “It’s questionable,” he said.

Elliot Cordo also introduced NoSQL, but went further to describe the functions of specific databases.

“Does NoSQL mean that there’s no more SQL? Absolutely not,” Cordo said. “Relational databases still have their place. They’re flexible and we use them for general purposes. And because it’s old, it has a rich query syntax and not to mention that we’re familiar with it. However, there are some interesting alternatives from analytic databases.” Cordo listed columnar or key-value databases, document databases and graph databases. He reminded the audience that many NoSQL databases have SQL-like interfaces.

“Not all data is efficiently stored in a relational database. There’s sparse data, data with a lot of variation and relationships, and it’s funny how relational databases are not great at relations,” Cordo said. On performance, he described relational databases as feature-rich but burdened with overhead that many workloads do not need. On scaling, he noted that most relational databases scale vertically, which limits how large they can grow. “Sharding is an awkward manual process. Most NoSQL stores scale horizontally,” Cordo said.

Cordo also spoke about object impedance mismatch. “Relational databases rarely look the way our applications want them to. So much time is spent assembling and reassembling relational data. NoSQL databases have a simple query language. It does have limited support for Joins, aggregation and secondary indexes, but only because NoSQL databases were born to be high performance.”

“So NoSQL as data warehouses? It’s not as flexible as relational databases for ad hoc questions, but secondary indexes do provide some flexibility. The lack of Joins, however, required denormalization. There are also materialized views, where Joins and aggregates can be implemented using Map/Reduce,” Cordo explained. “NoSQL can be a good fit for certain applications like ones with high volume and low latency analytic environments. Queries are largely known and can be precomputed in-stream or in batch using Map/Reduce.”

From here, Cordo outlined the main types of NoSQL databases. First, he described columnar databases, implemented by platforms like Cassandra and HBase. A columnar database uses column families, which are equivalent to a table in an RDBMS [Relational Database Management System]. Its primary unit of storage is a column, and columns are stored contiguously. Columnar databases use both skinny rows and wide rows, the latter being used rarely. Document databases are implemented by platforms like MongoDB and CouchDB. Their collections are equivalent to a table in an RDBMS, and their unit of storage is a document. “It’s pretty fluid,” Cordo said about document databases. “It doesn’t matter if you create new fields.” Graph databases are implemented by platforms like Neo4j and Titan; relationships are front and center, and nodes can have properties of their own.
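As a rough illustration of the three models Cordo described, the following Python sketch shows how the same kind of data might be shaped in each. It uses plain in-memory structures only, not any database's API, and every name and value in it (`page_views_by_user`, `users`, `nodes`, `edges`, `followers_of`) is a hypothetical chosen for the example.

```python
# Column-family style (Cassandra/HBase-like): row key -> {column name: value}.
# A "wide row" keeps many columns under one key, here one column per timestamp.
page_views_by_user = {
    "user:42": {
        "2013-04-22T18:00": "/home",
        "2013-04-22T18:05": "/meetups",
    },
}

# Document style (MongoDB/CouchDB-like): a collection holds documents,
# and documents in the same collection need not share the same fields.
users = [
    {"_id": 1, "name": "Ann"},
    {"_id": 2, "name": "Bob", "twitter": "@bob"},  # extra field, no schema change
]

# Graph style (property graph): nodes carry properties, and relationships
# are first-class records connecting them.
nodes = {
    1: {"label": "Person", "name": "Ann"},
    2: {"label": "Person", "name": "Bob"},
}
edges = [
    (1, "FOLLOWS", 2, {"since": 2012}),  # (source, type, target, properties)
]


def followers_of(node_id):
    """Return the ids of nodes that have a FOLLOWS relationship to node_id."""
    return [src for src, rel, dst, _props in edges
            if rel == "FOLLOWS" and dst == node_id]


print(followers_of(2))
```

The point of the sketch is only the shape of the data: lookups by row key in the first case, self-contained records in the second, and traversal of relationships in the third.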

To close out his portion of the presentation, Cordo described a composite analytical environment. He reminded the audience that they should “choose the right tool for the right job,” and not be afraid to mix technologies to get the job done.

Mike O’Brian talked about MongoDB’s aggregation framework.

“MongoDB falls into the document-oriented database category. It’s designed to fit a ‘sweet spot’ between features and performance. Each record in the database is stored as a JSON-style document. They’re schemaless and it provides rich queries,” O’Brian said. He noted that MongoDB is capable of scaling horizontally. The new MongoDB (version 2.2) added an aggregation framework on top of the Map/Reduce from previous versions.

“In a SQL-based RDBMS, you usually use Join, Group By, AVG, Count, Sum, First, Last and so on, but with MongoDB version 2.2, you don’t have to write Map/Reduce for those anymore,” O’Brian said. He briefly explained what Map/Reduce is. “Map/Reduce is a very powerful tool, but it’s usually overkill to use it. A lot of users rely on it for simple aggregating tasks. With Map/Reduce, it’s easy to screw up JavaScript, and debugging that is really time consuming. You’re basically writing more code for simple tasks,” he said.
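To make the “more code for simple tasks” point concrete, here is a minimal Python sketch of a Map/Reduce-style average, the kind of result a single SQL `GROUP BY` with `AVG` would produce. It runs in memory and is not MongoDB's Map/Reduce API, which takes JavaScript functions; the `orders` data and the names `map_fn`, `reduce_fn`, `finalize_fn` and `run_map_reduce` are assumptions made up for the illustration.

```python
from collections import defaultdict

# Hypothetical sample data, not taken from the talk.
orders = [
    {"customer": "ann", "amount": 10.0},
    {"customer": "bob", "amount": 4.0},
    {"customer": "ann", "amount": 30.0},
]


def map_fn(doc):
    # Emit (key, value) pairs; the value carries a partial total and count.
    yield doc["customer"], {"total": doc["amount"], "count": 1}


def reduce_fn(key, values):
    # Merge partial values for one key. Totals and counts are summed
    # separately so that merged results could themselves be merged again.
    return {
        "total": sum(v["total"] for v in values),
        "count": sum(v["count"] for v in values),
    }


def finalize_fn(key, value):
    # Only at the very end is the average computed.
    return value["total"] / value["count"]


def run_map_reduce(docs):
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: finalize_fn(key, reduce_fn(key, values))
            for key, values in grouped.items()}


averages = run_map_reduce(orders)
print(averages)
```

Three user-written functions are needed for one average per customer, which is the overhead O’Brian was describing.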

With the aggregation framework, MongoDB users describe aggregations declaratively, which means there’s no need to write JavaScript code. The framework is implemented in C++ and can be extended with new operations.

“It works by collecting data and putting it into a pipeline. It transforms it, and the end result is an array of pipeline output,” O’Brian explained. A pipeline starts with an input, which is then passed through a series of operations. Each operation transforms its input, and the result of the last operation is the output.
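The pipeline idea can be sketched in a few lines of Python. This is an in-memory imitation for illustration only, not MongoDB's implementation: the sample `docs` and the helpers `match`, `group_sum`, `sort_by` and `run_pipeline` are hypothetical. The `mongo_style_pipeline` list shows how the same stages would look as stage documents using the `$match`, `$group` and `$sort` operators; it is included only as data and is not executed here.

```python
from collections import defaultdict

# Hypothetical input documents, not taken from the talk.
docs = [
    {"city": "NYC", "state": "NY", "pop": 8000},
    {"city": "Buffalo", "state": "NY", "pop": 260},
    {"city": "Newark", "state": "NJ", "pop": 280},
    {"city": "Trenton", "state": "NJ", "pop": 85},
    {"city": "Tiny", "state": "NJ", "pop": 5},
]


def match(predicate):
    # Stage that keeps only the documents satisfying the predicate.
    def stage(items):
        return [d for d in items if predicate(d)]
    return stage


def group_sum(key_field, value_field, out_field):
    # Stage that groups by key_field and sums value_field per group.
    def stage(items):
        totals = defaultdict(int)
        for d in items:
            totals[d[key_field]] += d[value_field]
        return [{"_id": k, out_field: v} for k, v in totals.items()]
    return stage


def sort_by(field, descending=False):
    # Stage that orders documents by one field.
    def stage(items):
        return sorted(items, key=lambda d: d[field], reverse=descending)
    return stage


def run_pipeline(items, stages):
    # Feed the output of each stage into the next one.
    for stage in stages:
        items = stage(items)
    return items


result = run_pipeline(docs, [
    match(lambda d: d["pop"] >= 50),
    group_sum("state", "pop", "total_pop"),
    sort_by("total_pop", descending=True),
])
print(result)

# The same stages expressed as aggregation stage documents (data only).
mongo_style_pipeline = [
    {"$match": {"pop": {"$gte": 50}}},
    {"$group": {"_id": "$state", "total_pop": {"$sum": "$pop"}}},
    {"$sort": {"total_pop": -1}},
]
```

Each stage only sees what the previous stage produced, which is why the order of the stages matters.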

O’Brian then gave usage notes for the aggregation framework. “In BSON [Binary JSON], order matters, so computed fields always show up after regular fields. We use $ in front of field names to distinguish fields from string-literal expressions and we use $match, $sort, $limit first in the pipeline if possible because it makes queries simpler and less wasteful. When performing cumulative operations with $group, be aware of memory usage. There’s a limit of 16MB for output, so use $project to discard unneeded values,” he said.
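Why filtering early is “less wasteful” can be seen in a small Python simulation. It is a sketch under stated assumptions, not a measurement of MongoDB: the `events` data and the stage functions are hypothetical, and the only cost counted is how many documents each stage has to read.

```python
# Hypothetical events; "payload" stands in for a large field nobody needs later.
events = [
    {"user": f"u{i % 3}", "kind": "click" if i % 4 == 0 else "view",
     "n": i, "payload": "x" * 100}
    for i in range(12)
]


def match_clicks(items):
    # Imitates a filtering stage: keep only click events.
    return [d for d in items if d["kind"] == "click"]


def sort_desc(items):
    # Imitates a sorting stage: highest n first.
    return sorted(items, key=lambda d: d["n"], reverse=True)


def limit_two(items):
    # Imitates a limiting stage: keep the first two documents.
    return items[:2]


def project_small(items):
    # Imitates a projection: discard fields that are not needed in the output.
    return [{"user": d["user"], "n": d["n"]} for d in items]


def run(items, stages):
    # Run the stages in order and count the documents each stage reads.
    handled = 0
    for stage in stages:
        handled += len(items)
        items = stage(items)
    return items, handled


filter_first, cost_first = run(
    events, [match_clicks, sort_desc, limit_two, project_small])
filter_late, cost_late = run(
    events, [sort_desc, match_clicks, limit_two, project_small])

print(filter_first == filter_late)  # same answer either way
print(cost_first, cost_late)        # fewer documents read when filtering first
```

Both orderings return the same two documents, but sorting before filtering forces the sort to read every event instead of only the matching ones. The final projection also drops the bulky `payload` field, which is how unneeded values are kept out of the output.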

When comparing aggregation to Map/Reduce, O’Brian said that the aggregation framework is geared towards counting and accumulating. But if the user needs something “exotic,” Map/Reduce fits the bill. If users need more than 16MB of output, Map/Reduce is the answer, because it isn’t limited to 16MB of output. O’Brian added that JavaScript in Map/Reduce is not limited to any fixed set of expressions.