May 6th, 2013 New York Hadoop User Group - Deep Dive into Cloudera Impala

On Monday, May 6, 2013, OLC attended the New York Hadoop User Group’s event held at AppNexus. The event, Deep Dive into Cloudera Impala, featured Marcel Kornacker, a tech lead at Cloudera for new product development. Kornacker is also the creator of the Cloudera Impala project. He graduated from UC Berkeley with a PhD in databases and joined Google in 2003, where he worked on storage infrastructure projects, becoming the tech lead for the distributed query engine component of Google’s F1 project.

The Cloudera Impala project is paving the path for the next generation of Hadoop: fast SQL queries and the flexibility of Hadoop clusters come together to form the aptly named Impala. The new codebase is open source, and its operations are fast enough for interactive jobs.

With this, Marcel Kornacker kicked off the event, briefly outlining what he was going to cover: the user view of Impala, its performance and internals, and its use.

“The goal of Impala was to produce a general-purpose SQL engine that works for both analytical and transactional single-row workloads,” Kornacker said. “It supports queries that take from milliseconds to hours, and it runs directly in Hadoop. Impala reads widely used Hadoop file formats and delivers high performance.” He noted that Impala’s engine is written in C++ rather than Java (though, as he later clarified, Java is still used in parts of Impala). It also performs runtime code generation and is a completely new engine that does not use MapReduce.

Kornacker talked about the user view of Impala, which runs as a distributed service in the cluster. “Impala runs as a distributed service in the cluster. The user submits a query via ODBC/JDBC and that query is distributed to the nodes with relevant data. If any node fails, the query fails. Impala uses Hive’s metadata interface and connects to Hive’s metastore,” he said. Referring to Impala’s compatibility with Hadoop file formats, Kornacker listed the formats it supports: the Parquet columnar format, sequence files and RCFile with Snappy/gzip compression, Avro data files, and uncompressed/LZO-compressed text files.

Impala’s SQL support is patterned after Hive’s version of SQL. “It’s essentially SQL-92, minus correlated subqueries. For Impala, there’s no cross product and there’s limited DDL support,” Kornacker said. Regarding its limitations, Impala has no custom UDFs and nothing beyond SQL; it “only has hash joins, and the joined table has to fit in the aggregate memory of all executing nodes.”
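The memory constraint Kornacker mentions follows from how a hash join works: the joined (build-side) table is hashed entirely in memory before the other side is streamed past it. A minimal sketch, with hypothetical table and column names:

```python
# Minimal in-memory hash join sketch. The build side (right table) is
# materialized entirely in a hash table, which is why Impala requires the
# joined table to fit in the aggregate memory of the executing nodes.
def hash_join(left_rows, right_rows, key):
    # Build phase: hash the right table in memory, keyed on the join column.
    table = {}
    for row in right_rows:
        table.setdefault(row[key], []).append(row)
    # Probe phase: stream the left table and emit matching combined rows.
    joined = []
    for row in left_rows:
        for match in table.get(row[key], []):
            joined.append({**row, **match})
    return joined

orders = [{"id": 1, "cust": 10}, {"id": 2, "cust": 20}]
customers = [{"cust": 10, "name": "a"}]
print(hash_join(orders, customers, "cust"))
# → [{'id': 1, 'cust': 10, 'name': 'a'}]
```

Only order 1 survives the join, since no customer row matches key 20.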

For HBase, Impala uses Hive’s mapping of an HBase table into a metastore table. “Predicates on rowkey columns are mapped into start/stop rows. Predicates on other columns are mapped into SingleColumnValueFilters,” Kornacker said. As for HBase limitations, there are no nested-loop joins, and all data is stored as text.
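The rowkey-predicate mapping can be illustrated with a small sketch (the function and predicate representation are hypothetical; Impala does this translation internally when planning an HBase scan):

```python
# Sketch of turning rowkey range predicates into HBase start/stop rows.
# Each predicate is an (operator, value) pair on the rowkey column.
def rowkey_scan_bounds(predicates):
    start, stop = None, None
    for op, value in predicates:
        if op in (">=", ">"):
            # Tighten the lower bound to the largest lower limit seen.
            start = value if start is None else max(start, value)
        elif op in ("<=", "<"):
            # Tighten the upper bound to the smallest upper limit seen.
            stop = value if stop is None else min(stop, value)
        elif op == "=":
            start, stop = value, value
    return start, stop

print(rowkey_scan_bounds([(">=", "row100"), ("<", "row200")]))
# → ('row100', 'row200')
```

The resulting bounds let the scan touch only the relevant region of the table instead of reading every row.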

Kornacker presented an Impala single-user performance example, in which 20 queries extracted from TPC-DS were analyzed over several months. The dataset covered five years of data, roughly one terabyte. “We found that compared to Hive, Impala speeds up over time,” Kornacker said. “For multi-user performance, we did the same thing with the same dataset and workload, and we found that response times increase gradually. We can conclude that it’s substantially faster than Hive.”

The Impala architecture consists of two binaries: impalad and statestored. “The Impala daemon [impalad] runs as N instances and handles client requests and all internal requests related to query execution. Query execution starts with a request arriving via ODBC/JDBC; the planner turns the request into collections of plan fragments, and the coordinator initiates execution on remote impalad nodes,” Kornacker said.
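The coordinator’s role described above is essentially scatter-gather: fan fragments out to the nodes holding the data, then merge their partial results. A toy sketch (node names and the per-node data are hypothetical; real fragments execute C++ operators over HDFS blocks):

```python
# Toy scatter-gather sketch of a coordinator fanning work out to impalad
# nodes and merging partial results, here for a simple MAX aggregation.
from concurrent.futures import ThreadPoolExecutor

NODE_DATA = {  # pretend each node holds a local slice of the table
    "node1": [3, 1],
    "node2": [7],
    "node3": [5, 2],
}

def run_fragment(node):
    # Each node executes its plan fragment over its local data.
    return max(NODE_DATA[node])

def coordinator(nodes):
    # Scatter fragments to all relevant nodes in parallel...
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_fragment, nodes))
    # ...then gather and merge the partial aggregates.
    return max(partials)

print(coordinator(["node1", "node2", "node3"]))  # → 7
```

This also shows why a single node failure fails the whole query: every fragment’s result is needed for the final merge.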

Impala’s query planning is a two-phase process. “A single-node plan uses a left-deep tree of plan operators; the operator tree is then partitioned into plan fragments for parallel execution. The goal is to stream data and parallelize operations. The single-node plan constructs a plan tree from operators such as Scan, HashJoin, HashAggregation, Union, TopN and Exchange,” Kornacker said. The second phase produces a distributed plan, where the goal is to maximize scan locality and minimize data movement.
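The first phase, the left-deep join tree, can be sketched with a tiny plan-node structure (class and table names are hypothetical; the real planner is part of Impala’s Java frontend):

```python
# Sketch of phase one of planning: building a left-deep tree of plan
# operators, where each new table joins against the tree built so far.
class PlanNode:
    def __init__(self, op, children=()):
        self.op, self.children = op, list(children)

    def __repr__(self):
        if not self.children:
            return self.op
        return f"{self.op}({', '.join(map(repr, self.children))})"

def left_deep_join(tables):
    plan = PlanNode(f"Scan[{tables[0]}]")
    for t in tables[1:]:
        # The existing tree is always the left child: a left-deep tree.
        plan = PlanNode("HashJoin", [plan, PlanNode(f"Scan[{t}]")])
    return plan

print(left_deep_join(["t1", "t2", "t3"]))
# → HashJoin(HashJoin(Scan[t1], Scan[t2]), Scan[t3])
```

Phase two would then cut this tree at data-movement boundaries, inserting Exchange operators between the resulting fragments.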

Metadata is cached in Impala: no synchronous metastore API calls are made during query execution, and impalad instances read metadata from the metastore at startup. “Our goal going forward is to use the statestore to distribute metadata. We also plan to support HCatalog for metadata,” Kornacker said. “Impala is written in C++, but just on the execution side, with performance in mind. It uses Java as well.”

Impala’s execution engine generates code with a minimal number of branches, optimized to run on modern pipelined CPUs; this relies heavily on runtime code generation.
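Impala does its code generation with LLVM in C++; as a rough Python analogue of the idea, a query predicate can be compiled once into a specialized function, instead of re-interpreting the operator (and branching on it) for every row. The function and names below are purely illustrative:

```python
# Toy illustration of runtime code generation: specialize a predicate
# into generated code once per query, so the per-row path has no
# operator-dispatch branches left in it.
def compile_predicate(column, op, value):
    src = f"def pred(row): return row[{column!r}] {op} {value!r}"
    namespace = {}
    exec(src, namespace)  # generate and compile the specialized function
    return namespace["pred"]

rows = [{"x": 1}, {"x": 5}, {"x": 9}]
pred = compile_predicate("x", ">", 4)  # compiled once, before execution
print([r for r in rows if pred(r)])    # → [{'x': 5}, {'x': 9}]
```

The interpretive version would test “is the operator `>`? `<`? `=`?” on every row; the generated version has that decision baked in.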

The statestore is a central repository of system state; the name service (cluster membership) is one example of what it holds. All of its data can be reconstructed from the rest of the system, so it is considered soft state.

“We’re asked a lot why we don’t use ZooKeeper. Well, it’s because it’s not a good pub-sub system and it has a lot of features a statestore doesn’t need, so we created our own from the ground up,” Kornacker said.
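The pub-sub pattern the statestore is built around can be sketched in a few lines (class and topic names are hypothetical; the real statestore pushes updates to subscribers over the network):

```python
# Minimal pub-sub sketch of a statestore: subscribers register interest
# in a topic, and every published update is pushed to them. The state is
# "soft": if the statestore dies, it can be rebuilt from the cluster.
class Statestore:
    def __init__(self):
        self.topics = {}       # topic -> latest published state
        self.subscribers = {}  # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, state):
        self.topics[topic] = state
        for callback in self.subscribers.get(topic, []):
            callback(state)  # push the update to each subscriber

seen = []
ss = Statestore()
ss.subscribe("membership", seen.append)
ss.publish("membership", {"node1", "node2"})
print(seen)  # → [{'node1', 'node2'}]
```

A subscriber here might be an impalad learning which peers are alive, i.e. the name-service topic mentioned above.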

Kornacker also compared Impala to Dremel: columnar storage for data with nested structures, with distributed, scalable aggregation on top. “Dremel borrowed very heavily from parallel databases,” Kornacker said. “It was necessary for us to create columnar storage, so we looked at creating something else. We were soon joined by Twitter to create a new columnar format: Parquet. Putting Impala on Parquet does the things that Dremel was doing in 2010,” he added.

“Parquet is a columnar format for popular serialization formats. It was jointly developed by Twitter and Cloudera, and it’s open-source code on GitHub. Parquet fully shreds nested data, and data is stored in native types. It contains support for index pages and has pluggable compression codecs,” Kornacker said.
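The core idea of a columnar format can be shown with a toy shredder that turns row-oriented records into one array per column (flat rows only; Parquet’s actual shredding also handles arbitrarily nested data, and writes real encoded pages rather than Python lists):

```python
# Toy sketch of columnar layout: shred rows into per-column arrays, so a
# scan that needs one column reads only that column's values.
def to_columns(rows):
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(to_columns(rows))  # → {'id': [1, 2], 'name': ['a', 'b']}
```

Storing each column contiguously is also what makes per-column compression codecs effective, since values of one type and distribution sit next to each other.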

He then compared Impala to Hive, which uses MapReduce as its execution engine and consequently has high-latency, low-throughput queries. “Hive uses a fault-tolerance model based on MapReduce and has extensive layering, which imposes high runtime overhead. But with Impala, we went the opposite way. There’s direct process-to-process communication and there’s no fault tolerance,” Kornacker said.

To cap off the talk, Kornacker outlined the future of Impala. “We want Impala to support more SQL, like UDFs, SQL authorization and DDL. We want window functions and support for more structured data types. Look for Impala to have improved HBase support with composite keys, complex types in columns, and a better, more optimized runtime, with straggler handling, join order optimization, improved cache management and data collocation for improved join performance all ironed out,” he said.