Benchmarking Impala on Kudu vs Parquet

Why Apache Kudu

Apache Kudu is a recent addition to Cloudera's CDH distribution, open sourced and fully supported by Cloudera with an enterprise subscription. Created by Cloudera and HBase veterans, and getting a lot of traction and press lately, Kudu is worth considering for your next Big Data architecture platform. According to its creators, Apache Kudu finally makes mutable data (yay for inserts/updates/deletes!) practical in a Hadoop cluster, while providing fast random lookups and excellent analytical performance (BI-like queries with joins, predicates and aggregations). It fits nicely between Apache HBase and Apache Impala with Parquet, and greatly simplifies architectures for near real-time use cases. If you are scared of a Lambda architecture that requires a complex system of batch and streaming processes, you will like Kudu. Head over to the Apache Kudu site for great documentation and examples, or download Kudu's virtual machine and try it yourself with the provided examples (using Apache Spark, Apache Impala, or the Kudu client libraries for Java, C++ or Python).

Use Case

Our use case for Kudu is the ever-popular Data Lake architecture. It is not enough these days to build a batch-oriented Data Lake that is updated a few times a day. Many modern analytical projects (predictive alerts, anomaly detection, real-time dashboards, etc.) rely on data streamed in near real-time from various source systems. The common architecture for building such systems used to employ a combination of Apache HBase as the storage engine for "fast" data and HDFS as the storage for historical, partitioned "slow" data. HBase is great for random look-ups and fast inserts/updates/deletes, while HDFS and tools like Hive or Impala are great for analytical SQL queries. However, until recently, it has not been possible to do both at the same time. Check out these great presentations:

Moving Beyond Lambda Architectures with Apache Kudu

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data

Proceed with caution

The Kudu team is very honest, keeping a running list of known issues and limitations.
I was especially concerned about the lack of support for decimal and date/time data types, but was assured it was on their radar. The workaround for now is to store precise decimal values as strings and date/time values as integers, and then use conversion functions in Impala / Spark.
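
For example, when querying such a table from Impala, you convert the stored values back on the fly. This is only a sketch of the workaround with hypothetical table and column names (admit_dt_tm stored as a Unix epoch BIGINT, charge_amount stored as a STRING):

    -- convert epoch integers and decimal-as-string columns back to usable types
    SELECT
      from_unixtime(admit_dt_tm)            AS admit_time,     -- epoch seconds -> readable timestamp string
      CAST(charge_amount AS DECIMAL(12,2))  AS charge_amount   -- string -> exact decimal
    FROM encounters_kudu
    WHERE from_unixtime(admit_dt_tm) >= '2017-01-01';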

Update 5/2018: Timestamp data type is supported as of Kudu 1.5 and Decimal data type is supported as of Kudu 1.7.

The bigger issue is Kudu's limited scalability, which comes down to how much data you can store in Kudu per cluster node and per cluster. In our case, our nodes are very powerful, high-density, CPU/RAM-heavy machines, and we are limited by Kudu, not by the hardware.

Security and encryption are two concerns, and something the Kudu team takes seriously. Right now, though, only system-wide security is supported for Kudu API calls, but one can still use Apache Sentry with Impala to secure Kudu tables for end users. Encryption is not supported directly (like HDFS transparent encryption), but Kudu has been tested to work with common disk encryption software.

Kudu does not use YARN, so multitenancy is something I still need to research.

None of these limitations were showstoppers in our use case and we decided to move forward with the installation and testing.

Environment

Our test cluster has 6 nodes, running Cloudera Enterprise 5.11.1 (see the versions of all the products here).

Apache Kudu 1.3.0-cdh5.11.1 was the most recent version provided with the CM parcel. Although Kudu 1.5 was already out at that time, we decided to use Kudu 1.3, which was included with the official CDH release.

Each node has 2 x 22-core Intel Xeon E5-2699 v4 CPUs (88 hyper-threaded cores), 256GB of DDR4-2400 RAM and 12 x 8TB 7,200 RPM SAS HDDs. Nodes are connected by a fast 40Gb InfiniBand network. Most of the disk space cannot be used by Kudu due to the limitations above. Disks are shared with HDFS, as Kudu lives happily on your existing Hadoop cluster. Memory is obviously shared by Kudu, Impala, Spark/YARN and the rest of the zoo.

Data

25 tables in total were ingested with Sqoop into HDFS and stored as Parquet files. External tables were created on top of these files using Hive. The HDFS data was not partitioned, as partitioning did not make sense with our highly volatile source system. These 25 tables represent core clinical patient data stored in an EHR system (Cerner). One table has 6B rows (clinical events), three tables have ~1B rows each, and the rest are much smaller (500K to 100M rows). Most of the tables are pretty wide, with 70-100 columns on average.

This is a very good set of data that models core patient interactions in a hospital (inpatient encounters) and medical records history. A lot of good reports and dashboards can be built from this set of tables, including clinical, financial and patient safety performance measures. Some examples are the CMS Sepsis 3-hour bundle, readmission ratio, length of stay and service line analysis, charges, CPOE compliance and so forth.

Benchmarks

If you are looking for a scientific, well-controlled experiment (and likely biased by a vendor), this is not one of those. My goal was to see if Kudu would be a good technology for us to build a near real-time Data Lake while retaining the amazing analytical performance of Impala. Therefore, here are my expectations in order of importance:

  • Must have great query performance, close to Impala with Parquet on HDFS. Typical queries involve 5-10 table joins and filters. I have not tested Spark SQL but Kudu integrates nicely with Spark and I was able to follow Spark and Kudu examples using Jupyter Notebook and PySpark.
  • Must have the ability to update/insert data in near real-time without maintaining two different storage systems. I am not expecting high throughput or concurrency, but I do expect consistency and reliability. Honestly, our users would be thrilled with 30-60 second latency, but my goal is to support 1-5 second latency. A lot of them still call dashboards that are updated once a day "real-time", as they are used to getting data once a month. We can do better than that!
  • Nice to have fast ingest throughput, but it does not need to be as fast as HBase.
  • Nice to have fast random look-ups, but not expecting HBase performance. 1-2 seconds to look up an individual record in a 1B-row table is more than adequate.

Please keep these expectations in mind as they were the main drivers for this evaluation.

1. Data ingestion

There are a few ways to ingest data into Kudu (see the docs) - I opted to do CTAS (Create Table As Select) from Impala as one of the fastest and recommended ways. I did encounter a few challenges I wanted to share.

The first challenge was that it took forever to ingest a 23M-row table with 200 columns into Kudu (4 hash partitions by primary key). Precisely, it took a whopping 58 minutes, which translates to roughly 6,600 rows per second. I could not believe Kudu was that slow, and we double-checked the installation and configuration docs. Unfortunately, we had trusted the defaults and, as I found out on the Kudu Slack channel (thanks, Will Berkeley!), there are two parameters that need to be tweaked. Specifically:

memory_limit_hard_bytes controls the total amount of memory the Kudu daemon is allowed to use.

maintenance_manager_num_threads controls the number of maintenance threads; the recommended setting is 1/3 of the number of disks used for Kudu.

The defaults in the CDH Kudu parcel were quite terrible - Kudu was limited to 1GB of memory and was only using 1 maintenance thread. We set the latter to 4 (12 drives / 3) and the former to 0 (dynamic allocation). CM did not want to accept 0 for memory_limit_hard_bytes, so we had to use a CM safety valve to override it. Once that was done and Kudu restarted, my first 23M-row table finished in 240 seconds (~95K rows per second) - much better! For comparison, the same CTAS from Impala to Impala Parquet took only 60 seconds.

A word about partitioning strategy: make sure to understand the Kudu schema design doc, as it is crucial to pick the right method based on your data. In my case, I kept a single partition for smaller tables (less than 1GB of total size on disk), and for larger tables I used hash partitioning by primary key. I came up with a simple formula to estimate the number of partitions based on the table size in the source system, and tried to keep tablets around 1GB in size, as recommended in the documentation. This was my second challenge, as my three largest tables had to be capped at 120 partitions (6 nodes x 20 pre-replication):

Maximum number of tablets per table for each tablet server is 60, post-replication, at table-creation time.

That resulted in an average tablet size of 5-8GB, which exceeds the recommended size for a tablet.
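
To make the partitioning discussion concrete, here is roughly what such a CTAS from an existing Impala/Parquet table into a hash-partitioned Kudu table looks like. This is only a sketch with hypothetical table and column names; check the Impala documentation for the exact Kudu DDL syntax supported by your version:

    -- CTAS from an Impala/Parquet table into Kudu, hash-partitioned by primary key;
    -- the primary key column(s) must come first in the select list and contain no nulls
    CREATE TABLE clinical_event_kudu
    PRIMARY KEY (clinical_event_id)
    PARTITION BY HASH (clinical_event_id) PARTITIONS 120
    STORED AS KUDU
    AS SELECT * FROM clinical_event_parquet;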

Finally, my third challenge was with data type conversions needed for Kudu. Kudu 1.3 does not like Impala's (or Hive's) timestamp and decimal types.

Update 5/2018: Timestamp data type is supported as of Kudu 1.5 and Decimal data type is supported as of Kudu 1.7.

I came up with a script that generated SQL CTAS snippets to convert timestamps to Unix epoch integers and decimals to strings - painful, but it worked. I had to remember to convert these columns back to a human-readable format. I was not willing to lose the precision of decimals, as the float data type is not a good thing when you store patient lab results. For example, a glucose level of 39.99999999 mg/dl is not quite the same thing as 40.0 mg/dl. Float also does weird things to whole numbers, and all of a sudden numbers like 103 look like 103.0000000399 (I am making this up, but you get the point). Impala will not let you use functions that expect numbers on strings: a harmless aggregation like AVG(length_of_stay_min) will error out if the length_of_stay_min column values are stored as strings, so you must convert them to a number explicitly first. The good news is the Kudu team is actively working on Decimal and Timestamp support, so things will get easier.
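
The generated select lists looked roughly like this (a sketch of the conversions only, with hypothetical column names):

    -- inside the generated CTAS: timestamp -> epoch seconds, decimal -> string
    SELECT
      clinical_event_id,
      CAST(unix_timestamp(event_end_dt_tm) AS BIGINT) AS event_end_dt_tm,
      CAST(result_val AS STRING)                      AS result_val
    FROM clinical_event_parquet;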

Here is throughput for CTAS from Impala to Kudu:

And for comparison, here is the time for a few tables to execute CTAS from one Impala table on HDFS to another vs. CTAS from Impala to Kudu:

2. Data modification (Insert/Update/Delete)

Unfortunately, I have not done any real benchmarking here, just a few random tests. I tried to insert or delete one row from a 500M-row table and a 6B-row table, and I also tried to insert/delete a few thousand rows using predicates - it was pretty fast for my needs, taking between 0.2 and 0.4 seconds. As I mentioned above, high-latency updates were not a big deal for my use case. You may find the posts below very helpful if you are looking for this type of benchmark:

Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

Benchmarking and Improving Kudu Insert Performance with YCSB

I was thrilled that I could insert or update rows and ... (drum roll) I did not have to refresh Impala metadata to see the new data in my tables. This is a huge deal, really. No manual compactions or periodic data dumps from HBase to Impala. One of the things we took for granted with an RDBMS is finally possible on a Hadoop cluster.

In our "traditional" Hive/Impala world, that 6B-row table would have to be re-processed every time we needed to update a row. That process took 3-4 hours on our 6-node cluster and required some MapReduce tuning.

The sub-second time I've witnessed is a game changer as we can now build a near real-time Data Lake. Again, while this was possible with HBase, HBase could not be used for BI-type queries with a bunch of joins and filters.
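
For reference, the single-row and predicate-based modifications I tested are just plain SQL statements issued through Impala against the Kudu table (a sketch with hypothetical table and column names):

    -- single-row insert, update and delete against a Kudu table, via Impala
    INSERT INTO clinical_event_kudu (clinical_event_id, event_cd, result_val)
    VALUES (900000001, 703501, '40.0');
    UPDATE clinical_event_kudu SET result_val = '41.2' WHERE clinical_event_id = 900000001;
    DELETE FROM clinical_event_kudu WHERE clinical_event_id = 900000001;

    -- predicate-based delete of a few thousand rows
    DELETE FROM clinical_event_kudu WHERE event_cd = 703501 AND event_end_dt_tm < 1199145600;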

3. Random look-ups

This was another thing I did not really care about that much, as long as it did not take minutes. The first blog post above does a very good job measuring this, and they found Kudu good enough even compared to HBase. I just did a few random look-ups by primary key on my largest, 6B-row table and got results back in 1.2-1.5 seconds on average. I tried the same with a 6B-row table stored in Impala on Parquet, and my average time was 380 seconds. This is of course because my table is not partitioned by PK, and Impala has to scan the entire table (and all of its HDFS Parquet files) just to find one row. Pretty impressive given my use case, but this is probably where HBase would get you into 0.1-second territory.

Look-ups by non-PK columns were also quite fast compared with native Impala queries, thanks to Kudu's columnar storage (the Kudu storage format borrows ideas from Parquet). It is interesting to note that columns are physically stored in separate files per tablet.

4. Queries

This was the most interesting test. At this point, I was happy I could ingest and update data quickly enough, and it was time to make sure Kudu would be a feasible alternative to Impala's native storage engine (which uses HDFS and normally Parquet format). A set of seven SQL queries was chosen and run against Impala on HDFS and Impala on Kudu.

We noticed that the first time a query was executed, it was about 2 times slower than the subsequent runs.

When I posted a question about this on the user group mailing list, Cloudera's Jean-Daniel Cryans explained that this could be due to the OS page cache. The OS caches data read from disk in its page cache the first time it is accessed. Queries that hit tables that have not been read before cause data to be read from disk, while tables that are already loaded into the page cache are much faster because they are served from RAM. We observed the same thing with native Impala queries. We ran a few tests, restarting Impala and/or Kudu, and could reproduce this by resetting the page cache. For this reason, we measured separately the execution time of the very first run after a page cache reset, and the execution time of 3 consecutive runs after the first one. Tests were repeated 3 times, recycling the cache for each loop.

Despite statistics being computed for all the tables, we were impacted by IMPALA-5547 (Improve join cardinality estimation with a more robust FK/PK detection) and had to use a straight_join hint for most of our queries. Otherwise, Impala would scan the smaller tables first and broadcast the largest one, which caused queries to take forever to finish.
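
The workaround is to place the hint right after SELECT and order the tables in the FROM clause yourself, largest table first. A simplified sketch with hypothetical table and column names:

    -- force Impala to join tables in the order they are listed,
    -- instead of relying on its own cardinality estimates
    SELECT STRAIGHT_JOIN ce.clinical_event_id, e.encntr_type_cd, p.name_full
    FROM clinical_event_kudu ce                          -- largest table first
      JOIN encounter_kudu e ON e.encntr_id = ce.encntr_id
      JOIN person_kudu p    ON p.person_id = e.person_id
    WHERE ce.event_cd = 703501;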

Take a look at the chart below. While Impala on HDFS still rocks, Kudu is very close. In fact, on the longest running query, Q5, Kudu beat Impala:

The next chart illustrates the difference in runtime between a "cold" run (clean OS page cache) and a "warm" run (the table has been accessed previously and is loaded into the page cache). Note that for the longer running queries, Q4 and Q5, the time difference is not as significant as for the faster queries Q1, Q2 and Q3, which finished in half the time of the cold run. The same effect is observed for Impala:

5. Compression

Kudu supports different encodings and compression codecs per column; it is not possible to set compression for an entire table. While we did not want to tune encodings, we wanted to see how compression impacts ingest time and query performance. We decided to focus on the Snappy and LZ4 codecs.
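
Because compression is a per-column attribute, it is specified in the column definitions when the Kudu table is created through Impala. The sketch below uses hypothetical names; support for the ENCODING and COMPRESSION column attributes depends on your Impala version, and the numeric columns keep Kudu's default bitshuffle encoding, uncompressed:

    CREATE TABLE clinical_event_lz4 (
      clinical_event_id BIGINT,
      event_cd          BIGINT,                  -- numeric column, default bitshuffle encoding, left uncompressed
      result_val        STRING COMPRESSION LZ4,  -- compress string columns only
      event_tag         STRING COMPRESSION LZ4,
      PRIMARY KEY (clinical_event_id)
    )
    PARTITION BY HASH (clinical_event_id) PARTITIONS 120
    STORED AS KUDU;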

Our data did not compress very well, probably because our tables are highly normalized and mostly contain numeric data. Per the Kudu documentation, numeric columns use bitshuffle encoding by default, so we did not apply compression to them, as it is not recommended.

Below, you can see pre-replication size comparisons between Kudu and HDFS Parquet with Snappy - Kudu is certainly doing a better job. Only the 5 largest tables are shown:

Average compression ratio across 25 tables was 1.46 for LZ4 and 1.47 for Snappy:

On average, it took 1.7 times less time to ingest data into LZ4-compressed Kudu tables than into uncompressed tables (1.7 times for Snappy as well) - the chart below shows load times for the 4 larger tables with the maxed-out number of tablets on our cluster:

To see how compression affects query execution time, look at the chart in the previous section. Overall, the difference was not significant. I think we will compress our largest tables but keep smaller tables uncompressed. Your mileage may vary.

6. Data Egress

The reality is that data eventually needs to leave the Hadoop cluster. We connected QlikView, Alteryx and Informatica PowerCenter and measured the throughput of loading data into these tools. To do that, a simple SELECT * FROM table was executed and the results were stored in a flat file on remote machines connected via a 10G Ethernet network.

We did have an issue with extremely slow throughput in Informatica PowerCenter, but by setting Buffer Block Size = 100MB and DTM memory = Auto, we were able to get results comparable to QlikView and Alteryx.

Table 2 has 170 columns while the other 2 tables are pretty narrow - I guess this is why it is slower. Note how much better Kudu is doing compared to native Impala:

Conclusion

There are really no good alternatives to Kudu among storage engines in the Hadoop ecosystem that achieve great analytical query performance and, at the same time, allow you to change data in near real-time. While one may argue that Cloudera and Kudu are moving further away from the original HDFS and MapReduce-based Hadoop design, there is a need for a better tool to support mutable data in that ecosystem. The Kudu documentation states that Kudu's intent is to complement HDFS and HBase, not to replace them, but for many use cases and smaller data sets, all you might need is Kudu with Impala or Spark.

Cloudera did it again. Apache Impala set a standard for SQL engines on Hadoop back in 2013 and Apache Kudu is changing the game again. Not sure about you, but I am really excited about the possibilities of having a general storage engine that can power both "fast" and "slow" data.

The next step for us is to build a near real-time pipeline to stream data from Cerner, using Apache Kudu, Apache NiFi and Apache Kafka. Stay tuned!

Big thanks to Greg Haskell and Brian Macleod for all your help and Nick Scartz and Todd Carlson for your continuous support!

Bonus

For those who made it to the end of the post, I have one last chart to share. I do not think I need to comment on it, but I ran the same queries on Hive and on our Oracle database (a beefy 4-node Oracle 12c RAC cluster):

Oracle makes this chart hard to read, so let me try again with Oracle out of the picture: