March 27th, 2013 Scale Warriors of NYC Event: How to Compare NoSQL Databases: Determining True Performance and Recoverability
On Wednesday, March 27, 2013, OLC attended Scale Warriors of NYC's event, How to Compare NoSQL Databases: Determining True Performance and Recoverability, featuring William Keung, Senior Software Engineer at adMarketplace, who talked about choosing the best NoSQL engine and avoiding vendor lock-in with a small NoSQL abstraction layer; and Ben Engber, CEO and founder of Thumbtack Technology, who talked about how to run tuned benchmarks across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, and Cassandra).

William Keung presented How to Choose the Best NoSQL Engine: Avoiding Vendor Lock-in with a Smart NoSQL Abstraction. "We're a leading search syndication network," Keung said, introducing his company, adMarketplace. "We have 250 major partners with 100 million user profiles. We have 24/7 service uptime and process five to 10 thousand ads per second. Our search request latency is less than 10ms," he said. Keung talked about the driving forces behind choosing a NoSQL engine. "For easier migrations, we swap out implementations like legacy spymemcached. With driver abstractions, we get faster development cycles. With persistence vs. pre-caching, we alleviate downtime. With performance and scalability, we conserve and optimize resources. Transparency means easier debugging in production," he said.
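The abstraction layer Keung described can be sketched roughly as follows. This is a minimal, hypothetical illustration, not adMarketplace's code: the `ProfileStore` interface and class names are invented to show how backends (memcached-style caches, MongoDB, etc.) could be swapped without touching application code.

```python
from abc import ABC, abstractmethod

class ProfileStore(ABC):
    """Tiny key/value interface so storage backends can be swapped
    without changing application code (names are illustrative)."""

    @abstractmethod
    def get(self, key: str):
        ...

    @abstractmethod
    def put(self, key: str, value: dict) -> None:
        ...

class InMemoryStore(ProfileStore):
    """Stand-in for a memcached-style pre-cache; a MongoDB-backed
    class would implement the same two methods."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

# Application code depends only on the interface, never on a driver.
store: ProfileStore = InMemoryStore()
store.put("user:42", {"segments": ["auto", "travel"]})
```

Swapping the legacy spymemcached path for MongoDB then becomes a change to which `ProfileStore` subclass is constructed, which is the easier-migration benefit Keung cited.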
"The challenges we faced were dependencies shared by multiple systems like design [key generator], querying, legacy maintenance, a hybrid of technologies [legacy memcached, spymemcached, MongoDB], code upgrades (which are difficult) and testing, because a lot of data is stale," Keung said. "The easiest solution is internal update. We track user profiles and the data usually ranges from five to 10 gigabytes. It's a large data set. It was originally MongoDB," he said.
"Here are the things you want to look out for. Thread safety, connection timeouts, read timeouts. If you can't get a read, just drop it and move on," Keung advised. He expressed the same opinion regarding failovers.
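Keung's "just drop it and move on" advice for read timeouts can be sketched like this. The 10 ms budget echoes the latency figure he quoted; the helper and its names are illustrative, not his implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

READ_TIMEOUT_S = 0.010  # roughly the 10 ms request budget from the talk

def read_with_budget(fetch, key, pool):
    """Attempt a profile read; on timeout, return None instead of
    blocking the request thread."""
    future = pool.submit(fetch, key)
    try:
        return future.result(timeout=READ_TIMEOUT_S)
    except FutureTimeout:
        return None  # drop the read and keep serving the request

pool = ThreadPoolExecutor(max_workers=4)

def slow_fetch(key):
    time.sleep(0.1)  # simulate a stalled backend
    return {"key": key}

result = read_with_budget(slow_fetch, "user:42", pool)  # None: budget blown
```

The same pattern applies to failover: a node that stops answering within budget is treated as absent rather than waited on.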
Keung's testing methodology broke down into four parts: unit and integration testing [mock testing and overriding private data], comprehensive logging, replaying captured requests ("These are pure core HTTP requests") and statistics and metrics. Keung suggested that developers think about abstractions for the business and data layers when verifying live data.
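The replay-captured-requests step might look like this in outline. The log format and the stub handler are invented for illustration; a real replay would re-issue the logged HTTP requests against a test deployment.

```python
# Captured traffic: (method, path, observed status) tuples pulled from logs.
captured = [
    ("GET", "/search?q=shoes", 200),
    ("GET", "/search?q=flights", 200),
]

def replay(requests, handler):
    """Re-issue captured requests against a handler and collect any
    responses that diverge from what was originally observed."""
    mismatches = []
    for method, path, expected_status in requests:
        status = handler(method, path)
        if status != expected_status:
            mismatches.append((method, path, expected_status, status))
    return mismatches

# Stub handler that always succeeds; real code would call the service.
clean_run = replay(captured, lambda method, path: 200)
```

An empty mismatch list means the new backend answers the recorded traffic the same way the old one did.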
Ben Engber took the stage to present his "quick NoSQL comparison."
"One of the first things we get asked is, 'What NoSQL framework should we use?' so we wanted a basic foundation to launch into," Engber said. "This presentation compares databases. We get a lot of interest in NoSQL databases. Discussions get confusing, like discussions of the CAP theorem. Our goal is to present business-use-case answers. There's a lot of bad data out there and vague information. We're here to straighten that out. So why use NoSQL at all? Well, if you want to support a lot of information, if you want a distributed data tier, if you want simpler failover and turnover and if you want rapid application development, we recommend NoSQL," Engber said.
"The plan was to start simple. Take big databases and do simple workload tests. We used a standard client—Yahoo! Cloud Serving Benchmark—and tested something new but legit, like a solid-state performer, then moved on. We learned that running a database is easy, but running it correctly is tough. We ran into memory sizing and problem sizing and more. Eviction and elections—these databases work very differently," Engber said. "We came up with a new plan: choose a new database and create standard baseline schemes. We separated them into two categories: fast and reliable. Fast wasn't the same as available. It was asynchronous replication to nodes and it was very fast. Reliable wasn't the same as consistent. It was synchronized and it was pretty fast," he said.
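A YCSB-style run has the same shape as this toy sketch: load a data set, then drive a mixed read/update workload and measure throughput. The dict-backed store stands in for a real database client; the record counts and 95/5 read/update mix are arbitrary stand-ins, not Thumbtack's parameters.

```python
import random
import time

def run_workload(store, n_records=10_000, n_ops=50_000, read_fraction=0.95):
    """Load records, then run a mixed read/update workload and
    return the observed throughput in operations per second."""
    for i in range(n_records):
        store[f"user{i}"] = {"field0": "x" * 100}   # load phase
    start = time.perf_counter()
    for _ in range(n_ops):
        key = f"user{random.randrange(n_records)}"
        if random.random() < read_fraction:
            _ = store[key]                          # read
        else:
            store[key] = {"field0": "y" * 100}      # update
    return n_ops / (time.perf_counter() - start)

ops_per_sec = run_workload({})
```

YCSB does essentially this against pluggable database bindings, which is why it serves as a common yardstick across Couchbase, Aerospike, MongoDB and Cassandra.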
"Our performance tests were provisioned according to best practices and reality," Engber said. "We installed each database on a four-node cluster. Then we loaded a large data set to the disk [SSD]. We measured the read rate and transactions per second," he said. "Don't use Couchbase because it didn't even work on asynchronous RAM," he warned.
Engber outlined his general conclusions based on his experiments. "Aerospike is an SSD-optimized database," he said. "Asynchronous K/V stores can do a large amount of traffic, but that says nothing about secondary indexes," he added.
"Looking at failure scenarios, we lined up three throughputs at 50, 75 and 100 percent. We had three failure types: graceful, kill -9 and split-brain. At 75 percent load on the asynchronous RAM data set, a perfect system would show no impact and an imperfect system would show a stall. Cassandra had a spike. Couchbase had traffic and MongoDB had a small latency increase. Aerospike generally had none," Engber said. "At 100 percent, all the databases showed the behavior of losing a node. That means heavier traffic and higher latency," he said. "You shouldn't be running your software at 100 percent capacity anyway," Engber said.
"At 75 percent load on the asynchronous SSD data set, Couchbase didn't even load," Engber said. "Cassandra and MongoDB do not relocate data and simply fail and have downtime. Aerospike tries to catch up and handle the data," he said. Regarding failover downtimes, Engber described them as "chaotic, no pattern to see. The best thing about all of the databases was that when they failed, they all started back up very quickly," he said.
The main failover takeaways Engber had were: "For a 'fast' scenario, these systems function as advertised: downtime is low and the performance impact is not dramatic," he said. "For 'reliable,' make sure you have a replication factor of three," he said. "Include replication lag in your capacity planning. And for both, understand your potential data loss," he warned. "And of course, we have limitations," Engber said. "We only have finite resources. SSDs are not a fair synchronous failure test for Cassandra and MongoDB," he said. "Our next steps are to do it on a larger cluster, measure data loss, and test more than K/V stores while getting other databases involved."
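The replication-factor-of-three advice has a direct capacity-planning consequence, which can be sketched with some made-up numbers (the node sizes below are illustrative, not from the talk):

```python
def usable_capacity_gb(node_gb, n_nodes, replication_factor=3, failed_nodes=1):
    """Unique data a cluster can hold: raw space on the nodes that
    survive failures, divided by the copies kept of each record."""
    return (n_nodes - failed_nodes) * node_gb / replication_factor

# Four 200 GB nodes with replication factor 3, planning for one failure:
# (4 - 1) * 200 / 3 = 200 GB of unique data, plus headroom for
# replication lag while a recovering node catches up.
capacity = usable_capacity_gb(200, 4)
```

The point is that a replication factor of three, combined with surviving a node loss, leaves only a fraction of raw disk available for unique data, which is why replication lag and headroom belong in the capacity plan.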