Big Data AWS Well-Architected Folks at AWS publish a really great resource for anyone who is designing cloud architecture. Even if you are already using or considering Azure or GCP, it is a really good read, and it is not your typical sleep-inducing dry white paper. AWS did an awesome job packing a lot
Big Data Building Near Real-time Big Data Lake: Part 2 This is the second part of the series. In Part 1 [https://boristyukin.com/building-near-real-time-big-data-lake-part-i/] I wrote about our use-case for the Data Lake architecture and shared our success story. Requirements Before we embarked on our journey, we had identified high-level requirements and guiding principles. It is crucial to think
Big Data Building Near Real-time Big Data Lake: Part I Preface A lot has been said and done about a Data Lake architecture. It was 10 years ago when James Dixon defined a Data Lake concept in his viral blog post [https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/]. I know more people who can explain what a Data Lake
Big Data How to ingest a large number of tables into a Big Data Lake, or why I built MetaZoo MetaZoo what?? When I started learning about Big Data and Hadoop, I got excited about Apache Sqoop [https://sqoop.apache.org/]. I was naïve enough to believe that the ingestion of database tables was a very easy task. All I needed to do was to use Sqoop and it would
Big Data Connecting to Kafka on VirtualBox from Windows After many hours of frustration, I was finally able to push messages into Apache Kafka, running on my VirtualBox guest machine, from a Windows host. Since I did not find complete steps on the web, I wanted to document them quickly, hoping to save someone's time. Big thanks to my teammate
Healthcare Featured Sepsis Dashboard A silent killer > Sepsis is the body’s extreme response to an infection. It is a life-threatening medical emergency. Sepsis happens when an infection you already have — in your skin, lungs, urinary tract, or somewhere else — triggers a chain reaction throughout your body. Without timely treatment, sepsis can rapidly lead
Big Data How to hot swap Apache Kudu tables with Apache Impala Sometimes, there is a need to re-process production data (a process known as a historical data reload, or a backfill). The source table schema might change, or a data discrepancy might be discovered, or a source system might be switched to use a different time zone for date/time fields. One
Groovy Dynamic SQL query with Groovy Groovy’s groovy-sql module provides a higher-level abstraction over Java’s JDBC technology, and it is very easy to use. You can find good examples using the groovy.sql.Sql class in the API documentation [http://docs.groovy-lang.org/latest/html/api/groovy/sql/Sql.html] or the Groovy documentation [http://groovy-lang.org/
NiFi How to run Sqoop from NiFi Sqoop + NiFi = ? Apache Sqoop [https://sqoop.apache.org/] is still the best tool to do a bulk data transfer between relational databases and Apache Hadoop [https://hadoop.apache.org/]. One sunny day in Florida, I was able to reluctantly ingest 5 billion rows from a remote Oracle database in just
NiFi How to connect Apache NiFi to Apache Impala I spent 4 interesting hours trying to connect Apache NiFi to Apache Impala. It turned out to be very easy and not really any different from connecting to any other JDBC-compliant database, but at the same time frustrating enough to make me post about it, hoping it will save someone's time. First,
Alteryx Quick evaluation of Alteryx In-Database tools Back in 2015, Alteryx announced [https://www.youtube.com/watch?v=GGkEd3KoMj0] a brand new set of In-Database tools, available to all customers with no additional license required. Alteryx keeps bringing amazing value to its customers without making them pay every time a great new feature is announced. This is
Big Data Featured Benchmarking Impala on Kudu vs Parquet Why Apache Kudu Apache Kudu is a recent addition to Cloudera's CDH distribution, open sourced and fully supported by Cloudera with an enterprise subscription. Created by Cloudera and HBase veterans and getting so much traction and press recently, Kudu is worth considering for your next Big Data architecture platform. Apache
Hadoop Watch out for timezones with Sqoop, Hive, Impala and Spark My head was spinning as I tried to accomplish a simple thing (as it seemed at first). I load data from 3 Oracle databases, located in different time zones, using Sqoop and Parquet. Then I load data to Hive using external tables and finally do some light processing and load
Cerner Healthcare Analytics with Cerner: Part 2 - Cerner Millennium Data Model This is the second part of my blog series, dedicated to healthcare analytics with Cerner. In the first part [https://boristyukin.com/healthcare-analytics-with-cerner-part-1-data-acquisition/] we've looked at different options on how to extract data from Cerner and this time we will focus on Cerner Millennium Data Model. > If you do not
Qlik 2017 Qlik Luminary Qlik just broke the news and congratulated me for being selected along with 60 other Qlik enthusiasts from around the globe! I am very proud and grateful to be part of this group of extremely talented and passionate people who I've learned a lot from! According to Qlik: > The Qlik
Cerner Featured Healthcare Analytics with Cerner: Part 1 - Data Acquisition Introduction. I am starting a series of blog posts, dedicated to healthcare analytics with Cerner. My intent is to get someone new to Cerner started with her/his analytics projects whether it is a Business Intelligence (BI) project, ad-hoc research or a quick proof of concept using Cerner's data. I
QlikView How to raise or throw an error in QlikView or Sense Sometimes you need to stop your LOAD script if some conditions are not met and raise a custom exception error (for example, required files are not found). You will also want to show a user-friendly error message so it is clear why the process was stopped, and record the error
Hadoop Is Snappy compressed Parquet file splittable? I spent a few hours trying to find a definite answer to this question, and hopefully my post will save someone time and trouble. The short answer is yes: if you compress Parquet files with Snappy, they are indeed splittable. Read below how I came up with the answer. First
QlikView QlikView, Qlik Sense and Oracle Database - Tips for Performance Here are a few tips on how to load data faster into QlikView or Qlik Sense from an Oracle database. 0. Optimize the performance of your SQL statement. This is really step 0 and assumes that you have basic knowledge and experience working with Oracle databases. If you are coming from other
QlikView Alteryx: To QVX, or Not To QVX Question Alteryx [http://alteryx.com] is an awesome tool for Data Analysts and can do a lot of things. As of version 10, not only can Alteryx write data to a QVX file, but it can also read QVX files. It is a bit difficult to find information about the QVX format itself
QlikView QlikView Readmission Tracking App I was super excited when one of my QlikView apps was nominated for a webinar organized by Qlik. Even more surprising is that this application was built in just 6 weeks, showing Qlik's agility and amazing time-to-market value. This was also one of the few operational dashboards we've built for
Find and replace text in a very large file fast I needed to replace a specific text string in a 6.5 GB file for one of my projects. This is a pretty easy task if you are on Linux (using a tool like sed), but it is not that easy if you are on Windows. First, I tried my favorite PowerShell.
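The post describes doing this with sed or PowerShell; as a cross-platform sketch, here is one way to do a streaming find-and-replace in Python without loading the whole file into memory. The function name, parameters, and chunk size are my own illustrations, not from the article; the key detail is carrying a small raw tail between chunks so a match that straddles a chunk boundary is still found.

```python
import io

def stream_replace(fin, fout, old: bytes, new: bytes, chunk_size: int = 1 << 20):
    """Copy fin to fout, replacing every occurrence of `old` with `new`.

    Reads in fixed-size chunks; keeps the last len(old) - 1 raw bytes
    as a tail so matches spanning two chunks are not missed.
    """
    assert old, "search string must be non-empty"
    keep = len(old) - 1
    tail = b""
    while True:
        data = fin.read(chunk_size)
        if not data:
            # Last chunk: the remaining tail can be replaced safely now.
            fout.write(tail.replace(old, new))
            return
        buf = tail + data
        out = []
        pos = 0
        while True:
            i = buf.find(old, pos)
            if i == -1:
                break
            out.append(buf[pos:i])
            out.append(new)
            pos = i + len(old)
        rest = buf[pos:]  # contains no complete match
        if len(rest) > keep:
            out.append(rest[: len(rest) - keep])
            tail = rest[len(rest) - keep:]
        else:
            tail = rest
        fout.write(b"".join(out))
```

For a real file you would pass `open(src, "rb")` and `open(dst, "wb")` instead of in-memory buffers. Working in binary mode sidesteps encoding issues on multi-gigabyte files, and memory use stays bounded by the chunk size regardless of file size.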
QlikView QlikView How to check if a table already exists Here is a simple (and fast) one-line code to check in a QlikView load script if a table already exists and then drop it: If (len(TableNumber('Table Name To Check')) > 0) THEN; DROP TABLE [Table Name To Check]; End If Note, there is a ; after THEN. The TableNumber function returns a number
Qlik Book review: QlikView Your Business Just finished reading a recently published book for QlikView developers and super users QlikView Your Business: An expert guide to Business Discovery with QlikView and Qlik Sense [http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118949552.html] by Oleg Troyansky, Tammy Gibson and Charlie Leichtweis. I met Oleg Troyansky [https://www.linkedin.
QlikView How To drop multiple fields using a pattern (wildcard) I was looking for a good, neat solution to remove multiple fields with similar names from my QlikView data model. These were temporary fields, needed during execution of the load script but no longer needed once the script finished. Unfortunately, QlikView 11 does not allow something like that: DROP FIELD