Is a Snappy-compressed Parquet file splittable?
I spent a few hours trying to find a definitive answer to this question, and hopefully this post will save someone else the time and trouble. The short answer is yes: if you compress Parquet files with Snappy, they are indeed splittable.
Read on to see how I arrived at that answer.
First off, why should you even care about compression? A typical Hadoop job is IO bound, not CPU bound, so a light and fast compression codec will actually improve performance. There are plenty of posts on the web if you want more detail about the various codecs, and you will find that both Cloudera and Hortonworks recommend Snappy. Snappy is designed for speed and does not put much load on your CPU cores. The downside, of course, is that it does not compress as well as gzip or bzip2.
If you've read about the Parquet format, you know that Parquet already compresses and encodes your data in clever ways, using delta encoding, run-length encoding, dictionary encoding and so on. It is still a very good idea to use Snappy compression, though. In my tests (and your mileage will vary), Snappy made my Parquet files at least two times smaller while improving job processing time by 10-20%. Not bad at all.
Once you figure out that Snappy is the way to go and learn how to tweak the settings for intermediate and output compression, you will stumble upon the notion of a codec being "splittable" or not. (By the way, I do not believe "splittable" is an actual English word.) And if you pay attention, you will quickly notice that Snappy IS NOT splittable, and the next thing you read is that this is a really bad thing. It means that if an HDFS file spans more than one block, a MapReduce job cannot hand each block to a separate task: a single task has to decompress the entire file (all the blocks) from the start, which hurts parallelism a lot.
This is when I started looking frantically for an answer and ended up spending hours.
Earlier versions of the Cloudera documentation were plainly wrong in stating that Snappy is splittable, and we know it is not. The Hortonworks docs were even vaguer on the subject.
The recent version of the CDH documentation fortunately delivers a better message (link):
For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.
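To make the difference concrete, here is a minimal sketch of a single compressed stream versus independently compressed blocks. It uses zlib from the Python standard library purely as a stand-in for Snappy (an assumption made so that the example runs without extra packages), and the sample records and block size are made up.

```python
import zlib

# Hypothetical sample data: 20,000 text records standing in for a large file.
records = [f"row-{i},value-{i * 7}\n".encode() for i in range(20000)]
raw = b"".join(records)

# Case 1: one compressed stream over the whole file (like a plain compressed text file).
whole = zlib.compress(raw)
second_half = whole[len(whole) // 2:]
try:
    zlib.decompress(second_half)
    mid_stream_readable = True
except zlib.error:
    # A reader dropped into the middle of the stream has no usable starting point.
    mid_stream_readable = False

# Case 2: a toy "container" in which each block of records is compressed on its own.
BLOCK_RECORDS = 5000  # arbitrary block size chosen for this sketch
blocks = [
    zlib.compress(b"".join(records[i:i + BLOCK_RECORDS]))
    for i in range(0, len(records), BLOCK_RECORDS)
]

# Any block can be decompressed without touching the others, so separate
# tasks could each take one block.
third_block = zlib.decompress(blocks[2])
first_row_of_third_block = third_block.split(b"\n", 1)[0]

print("middle of single stream readable:", mid_stream_readable)
print("first row of third block:", first_row_of_third_block.decode())
```

Real container formats such as SequenceFile and Avro also need a way to find block boundaries (sync markers); the list of blocks above simply sidesteps that part.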
My only question was why Cloudera did not mention its favorite format, Parquet. I had to dig further to see whether the Parquet/Snappy combo is indeed splittable. Parquet stores rows and columns in so-called row groups, and you can think of them as the above-mentioned containers.
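The property parquet.block.size defines the Parquet block size (that is, the row group size) and would normally be the same as the HDFS block size. Snappy compresses the data inside each row group rather than the file as a whole, so every row group can be read and decompressed on its own, and that is what makes the Parquet file splittable.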
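Here is a toy sketch of how row groups might be matched to input splits. It assumes a simple rule: a row group belongs to the split whose byte range contains the row group's midpoint. That is my reading of the general approach and should be treated as an assumption, not as the exact logic of any particular Parquet reader. The sizes, the RowGroup class and both helper functions are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class RowGroup:
    start: int   # byte offset of the row group in the file
    length: int  # compressed size of the row group in bytes


def make_splits(file_size, split_size):
    """Cut the file into fixed-size byte ranges, the way HDFS blocks would."""
    return [(off, min(off + split_size, file_size))
            for off in range(0, file_size, split_size)]


def assign_row_groups(row_groups, splits):
    """Give each row group to the split whose range contains its midpoint."""
    assignment = {split: [] for split in splits}
    for index, rg in enumerate(row_groups):
        midpoint = rg.start + rg.length // 2
        for lo, hi in splits:
            if lo <= midpoint < hi:
                assignment[(lo, hi)].append(index)
                break
    return assignment


# Hypothetical file: four row groups whose compressed sizes do not line up
# with the 128 MB HDFS block size, followed by a 1 MB footer.
sizes = [100 * MB, 90 * MB, 110 * MB, 84 * MB]
row_groups = []
offset = 4  # row groups start after the 4-byte magic at the top of the file
for size in sizes:
    row_groups.append(RowGroup(offset, size))
    offset += size
file_size = offset + 1 * MB

splits = make_splits(file_size, split_size=128 * MB)
assignment = assign_row_groups(row_groups, splits)
for (lo, hi), groups in assignment.items():
    print(f"split [{lo // MB} MB, {hi // MB} MB) -> row groups {groups}")
```

With these made-up numbers the second split ends up with two row groups and the last split with none, which is one reason to keep the row group size in line with the HDFS block size.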
Tom White's excellent book Hadoop: The Definitive Guide, 4th Edition, also confirms this:
The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata. Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don’t need sync markers since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).
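The two seeks described in the quote can be sketched in a few lines. A Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic "PAR1". The function name locate_footer and the synthetic file contents below are made up for illustration; a real footer holds Thrift-encoded metadata, which this sketch does not attempt to parse.

```python
import io
import struct

MAGIC = b"PAR1"


def locate_footer(f):
    """Return (footer_start, footer_length) for a Parquet-style file object."""
    f.seek(-8, io.SEEK_END)  # first seek: end of file minus 8 bytes
    tail = f.read(8)
    footer_length = struct.unpack("<I", tail[:4])[0]  # 4-byte little-endian length
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file: trailing magic bytes missing")
    file_size = f.seek(0, io.SEEK_END)
    footer_start = file_size - 8 - footer_length  # target of the second seek
    return footer_start, footer_length


# Synthetic stand-in for a real file: the row group and footer bytes are fake.
fake_row_groups = b"x" * 1000
fake_footer = b"hypothetical-footer-metadata"
data = MAGIC + fake_row_groups + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

f = io.BytesIO(data)
start, length = locate_footer(f)
f.seek(start)  # second seek: backward to the start of the footer
footer_bytes = f.read(length)
print(start, length, footer_bytes)
```

Once the footer has been decoded, a reader knows where every row group starts without scanning the data, which is why no sync markers are needed.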