Building Near Real-time Big Data Lake: Part I
Preface
A lot has been said and done about the Data Lake architecture. It was 10 years ago that James Dixon defined the Data Lake concept in his viral blog post. I know plenty of people who can explain what a Data Lake is, yet cannot explain (or agree on) what a Data Warehouse is. You can find thousands of articles about Data Lake architecture (Martin Fowler's explanation is my favorite).
As a matter of fact, I did not think this subject was even worth blogging about in 2020. But looking back at my notes and recalling all the challenges we faced, I felt proud and wanted to share our journey.
The Data Lake concept is hardly new. For many years, a landing or persistent staging area was one of the components of a large Enterprise Data Warehouse. It is important to mention this because a Data Lake is not a technology, but a concept. And this concept can be implemented in a traditional RDBMS, or even as a bunch of files stored on disk.
The advances in Big Data technology and open-source software made this concept feasible, economical, powerful and performant. And in 2020, it is not about Hadoop anymore. Some of the open-source products we used to build our Data Lake do not need a Hadoop cluster. A scalable Data Lake can be implemented on any distributed data storage, whether in the Cloud, on-premises, or both. And if your data is "small", you can save yourself a lot of headache by implementing it in a traditional RDBMS, or better yet, in a columnar database.
One of the questions you should ask is whether you want to build a Data Lake yourself. There are a lot of players on the market who offer commercial products to build and manage Data Lakes (search for Data Lake Management products). They promise easy ingestion of hundreds of tables from hundreds of supported data systems. They promise to do all the work for you and even offer "intelligent" or "smart" Data Lakes.
In my opinion, most of them were immature at the time, or targeted a non-technical audience. I did not believe they would handle the complexity of our use cases, nor would they save us significant time. There were also quite a few challenges with our Cerner EMR (Electronic Medical Record) system that required special treatment (no pun intended). And some of those products are cloud-based only, which did not work for us. Certainly check these products out, as you might get what you need and save on implementation and support time.
Another option is a logical Data Lake. Data Virtualization products are getting better and better. While performance will never be as good as with a physical Data Lake, this approach does bring a whole slew of benefits. It was not an option for us due to the volumes of data. Besides, one of our goals was to reduce the load on mission-critical production systems, as our analytical workloads have grown over the years.
Finally, some of you may ask: "But what about cloud data platforms like Snowflake or Amazon Redshift? Can't you just use them and be done in days?" Not quite... You still need to design a pipeline to ingest data from the source systems, pick the right tools for the job, map columns and use proper data types. Someone still needs to make the hard choices and connect all the pieces, and someone still needs to support all of that. Cost is another factor - if you run tens of thousands of queries non-stop every day, cloud solutions can get quite pricey.
That said, the cloud technology landscape and cloud-native solutions such as AWS Lake Formation did not exist in 2016, but they are what I am most excited about now in 2020. Certainly make sure you understand your options, and use fully-managed solutions if you can.
Use-case
A lot of our nightly jobs used to take hours to finish. Healthcare analytics can be quite challenging, and EMR vendors do not make it any easier. In large organizations like the one I work for, ETL and BI developers become really creative at getting around performance issues. I am not kidding when I say that we spent more time trying to make database queries finish in a reasonable time than working on actual project requirements. There are normally countless tickets to DBAs, lots of frustration, new indexes that make current production queries crawl, ETL jobs freezing for hours, and anxious users asking why the 8 o'clock report was late. You get the picture.
But what a wonderful opportunity it was to start our Big Data journey!
Our original plan was to build a Data Lake the old-fashioned way, by running incremental daily extracts from various systems into our newly acquired Big Data platform. I put together a quick prototype, using Apache Airflow and MetaZoo, to ingest hundreds of Cerner Millennium tables into Apache Hive and Apache Impala. Cerner data is stored in an Oracle database, and in our case it was 3 separate Oracle RAC clusters (see my previous blogs about Cerner Millennium).
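To give a flavor of that prototype, here is a minimal sketch of a daily-ingest Airflow DAG. The table list, connection details and extract/load helpers are hypothetical placeholders, not the actual MetaZoo-generated code; the real pipeline handled hundreds of tables and far more edge cases.

```python
# A minimal sketch of a per-table daily ingest DAG (Airflow 1.10-era style).
# Table names and helper bodies are placeholders, not production code.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical subset of Cerner Millennium tables to replicate daily.
TABLES = ["clinical_event", "order_action", "encounter"]


def extract_from_oracle(table):
    # Placeholder: pull the daily increment for `table` from the Oracle RAC
    # source (e.g. via a JDBC/Sqoop extract) into a staging location.
    print(f"extracting increment for {table}")


def load_into_hive(table):
    # Placeholder: merge the staged increment into the Hive/Impala table.
    print(f"loading {table} into Hive")


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="cerner_daily_ingest",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        extract = PythonOperator(
            task_id=f"extract_{table}",
            python_callable=extract_from_oracle,
            op_kwargs={"table": table},
        )
        load = PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_into_hive,
            op_kwargs={"table": table},
        )
        # Load only runs after its table's extract succeeds.
        extract >> load
```

In practice the list of tables and their incremental keys came from MetaZoo metadata rather than being hard-coded, but the extract-then-load shape per table is the same.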
At the same time, our leadership decided to switch gears and replicate 300+ Cerner tables in near real-time for an upcoming major project. Based on initial conversations, we learned that the project would require frequent data pulls out of Cerner, and those queries were going to run against the largest and most painful Cerner tables (clinical_event and order_action). In our organization, these two tables alone have 40B+ rows combined. To make things worse, I saw some queries that would join these tables to each other to get parent events.
The Big Data Lake would allow us to:
- Offload hundreds of heavy data processing ETL/EDW jobs to Big Data, and shift from ETL to ELT.
- Minimize load on production systems.
- Support new major initiatives requiring near real-time data and a scalable data infrastructure.
- Process and store semi-structured data, without doing extensive data modeling upfront. Great examples are HL7 FHIR documents or patient satisfaction surveys in XML format from Press Ganey.
- Prepare for future data-intensive use-cases (AI, ML, NLP, etc.)
Success Story
Fast forward to today:
- Our near real-time Cerner pipeline has been in production for 8 months now. 300+ tables from 3 Cerner domains are replicated into a central place. Our average latency is 6 seconds. While we are not in millisecond territory, we are proud: there is a lot going on in the operational databases, plus network latency (our analytics infrastructure is in a different part of the county) and all of the steps before data lands in the Data Lake.
- We ingested 22 source systems with thousands of tables into the Data Lake. Our Big Data Engineering team is small, but with our home-built tool MetaZoo and the right open-source tools, we can do a lot with a few resources.
- SQL queries that used to take hours now run in minutes or seconds. One of the legacy jobs used to take 18 hours in Oracle; it now finishes in 11 minutes.
A few before/after benchmarks:
- Over 10,000 queries are executed per day.
- We can scale out our data infrastructure by adding new nodes, and they do not cost us an arm and a leg. We are reducing the analytics footprint on production operational systems, which is a big deal as our Cerner system is used by tens of thousands of clinicians every day!
- We can pick the right tool for the job, and we do not need to buy yet another "enterprise-grade" tool. We can use Apache Spark to process complex FHIR documents or to profile genomics data in VCF format (see the sketch below). We have Apache Kafka to exchange real-time messages with our partners. We can use Apache Hive to crunch billions of rows of data in minutes, or enjoy fast interactive SQL with Apache Impala and Apache Kudu.
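As an illustration of that "right tool for the job" point, here is a minimal PySpark sketch of flattening FHIR documents, assuming they land as newline-delimited JSON Bundles on HDFS. The path and the handful of fields kept are hypothetical; real FHIR processing in our pipeline deals with many more resource types and attributes.

```python
# A minimal sketch: profile incoming FHIR Bundle documents with Spark.
# Assumes newline-delimited JSON bundles at a hypothetical HDFS path.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("fhir_profile").getOrCreate()

# Read raw FHIR Bundles; Spark infers the nested schema from the JSON.
bundles = spark.read.json("hdfs:///landing/fhir/bundles/*.json")

# Explode each bundle into individual entries and keep a few common fields
# (resourceType and id exist on every FHIR resource).
resources = (
    bundles
    .select(explode(col("entry")).alias("entry"))
    .select(
        col("entry.resource.resourceType").alias("resource_type"),
        col("entry.resource.id").alias("resource_id"),
    )
)

# Quick profile: how many resources of each type did we receive today?
resources.groupBy("resource_type").count().show()
```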