[Big Data] Building Near Real-time Big Data Lake: Part 2. This is the second part of the series. In Part 1, I wrote about our use case for the Data Lake architecture and shared our success story. Requirements: Before we embarked on…
[Big Data] Building Near Real-time Big Data Lake: Part 1. Preface: A lot has been said and done about the Data Lake architecture. It was 10 years ago that James Dixon defined the Data Lake concept in his viral blog post.
[Big Data] How to ingest a large number of tables into a Big Data Lake, or why I built MetaZoo. MetaZoo what?? When I started learning about Big Data and Hadoop, I got excited about Apache Sqoop. I was naïve enough to believe that the ingestion of database tables was…
[Big Data] How to hot swap Apache Kudu tables with Apache Impala. Sometimes, there is a need to re-process production data (a process known as a historical data reload, or a backfill). A source table schema might change, or a data discrepancy might…
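The teaser cuts off before the how-to, but the pattern the title points at is a rename-based swap: load the reprocessed history into a staging table, then rename tables so readers barely notice. A minimal sketch, assuming the impyla Python client and made-up host, database, and table names (none of these come from the post itself):

```python
# Rough sketch of a rename-based "hot swap" via Impala, using the impyla
# client. Host and table names are placeholders, not the post's own setup.
from impala.dbapi import connect

conn = connect(host="impala-host.example.com", port=21050)  # hypothetical host
cur = conn.cursor()

# The backfilled data is assumed to already sit in a staging table.
# Two quick renames swap it in with minimal downtime for readers.
cur.execute("ALTER TABLE sales.fact_orders RENAME TO sales.fact_orders_old")
cur.execute("ALTER TABLE sales.fact_orders_staging RENAME TO sales.fact_orders")

# Keep the old table around until the backfill is verified, then drop it.
cur.execute("DROP TABLE sales.fact_orders_old")
```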
[NiFi] How to run Sqoop from NiFi. Sqoop + NiFi = ? Apache Sqoop is still the best tool for bulk data transfers between relational databases and Apache Hadoop. One sunny day in Florida, I was able to ingest 5 billion rows from a remote Oracle database in just 4 hours,…
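NiFi wiring aside, the core of the idea is shelling out to the sqoop CLI from a processor. A stand-alone Python sketch of that idea, with placeholder connection details (the JDBC URL, paths, and table name are all hypothetical):

```python
# Minimal sketch of invoking the Sqoop CLI from code, mirroring what a NiFi
# command-execution processor does. All connection details are placeholders.
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//oracle-host.example.com:1521/ORCL",
    "--username", "etl_user",
    "--password-file", "/user/etl/.oracle_pw",  # keep secrets off the command line
    "--table", "SALES.ORDERS",
    "--target-dir", "/data/landing/orders",
    "--num-mappers", "8",
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
result.check_returncode()  # fail loudly if the import did not succeed
```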
[NiFi] How to connect Apache NiFi to Apache Impala. I spent 4 interesting hours trying to connect Apache NiFi to Apache Impala. It turned out to be very easy, and not really any different from connecting to any other JDBC-compliant database, but at the same time frustrating enough to make me post about it, hoping…
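For the curious, what any JDBC client (NiFi included) needs to reach Impala boils down to a driver class and a URL. This sketch uses the jaydebeapi Python library and Cloudera's Impala JDBC driver; the host, jar path, and table are placeholders, and it illustrates the connection shape rather than the exact NiFi setup from the post:

```python
# Sketch of an Impala connection over JDBC from Python via jaydebeapi.
# Driver class is for Cloudera's JDBC41 driver; adjust to your driver version.
# Host, jar path, and table name below are hypothetical.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.cloudera.impala.jdbc41.Driver",
    "jdbc:impala://impala-host.example.com:21050/default;AuthMech=0",
    jars="/opt/drivers/ImpalaJDBC41.jar",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM sample_07")
print(cur.fetchall())
```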
[Alteryx] Quick evaluation of Alteryx In-Database tools. Back in 2015, Alteryx announced a brand new set of In-Database tools, available to all customers with no additional license required. Alteryx keeps bringing amazing value to its customers without…
[Big Data] Benchmarking Impala on Kudu vs. Parquet. Why Apache Kudu? Apache Kudu is a recent addition to Cloudera's CDH distribution, open sourced and fully supported by Cloudera with an enterprise subscription. Created by Cloudera and HBase veterans,…
[Hadoop] Watch out for timezones with Sqoop, Hive, Impala and Spark. My head was spinning as I tried to accomplish a simple thing (or so it seemed at first): loading data from 3 Oracle databases located in different time zones, using…
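To illustrate the class of bug the post warns about: engines interpret timestamps in a session or JVM time zone, so the same source value can land differently depending on where the job runs. A small PySpark sketch that makes the shift visible (all values are made up):

```python
# The same timestamp string maps to a different underlying instant depending
# on the Spark session time zone; pinning the zone explicitly (UTC is the
# usual choice) avoids the surprise. Values below are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timezone-check").getOrCreate()
df = spark.createDataFrame([("2017-01-15 10:00:00",)], ["ts_str"])

for tz in ["UTC", "America/New_York"]:
    spark.conf.set("spark.sql.session.timeZone", tz)
    # unix_timestamp() parses the string as wall-clock time in the session
    # zone, so the epoch value shifts by the zone offset between the runs.
    df.selectExpr(f"'{tz}' AS session_tz",
                  "unix_timestamp(ts_str) AS epoch_seconds").show()
```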
[Hadoop] Is a Snappy-compressed Parquet file splittable? I spent a few hours trying to find a definitive answer to this question, and hopefully my post will save someone time and trouble. The short answer is yes: if you compress Parquet files with Snappy, they are indeed splittable. Read below how I…
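An easy way to verify the claim empirically, assuming you have one large Snappy-compressed Parquet file at hand (the path below is a placeholder): if Spark reads the single file as multiple input partitions, it was able to split it.

```python
# Quick empirical check of splittability. Parquet applies Snappy inside the
# file (per page, within row groups), so readers can still split the file on
# row-group boundaries. The path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-parquet-split").getOrCreate()

df = spark.read.parquet("/data/large_snappy_file.parquet")
# More than one partition for a single large file means the reader split it
# despite the Snappy compression.
print(df.rdd.getNumPartitions())
```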