How to connect Apache NiFi to Apache Impala
I spent 4 interesting hours trying to connect Apache NiFi to Apache Impala. It turned out to be very easy and not really any different from connecting to any other JDBC-compliant database, but at the same time frustrating enough to make me post about it, hoping it will save someone else some time.
First, download the Impala JDBC connector and unzip it. In the archive you will find a very nice document describing all the JDBC parameters, with lots of examples. You then have a choice between the JDBC 4 and JDBC 4.1 drivers, depending on your JRE (Java Runtime Environment) version; I used 4.1. In the 4.1 archive you will see a bunch of jar files. Copy all of them to a folder that NiFi can access. For my test, I put them under the /home/oracle/database-drivers/impala/ClouderaImpalaJDBC41_2.5.42 folder. Again, it is important that the account NiFi runs under has permission to read that location.
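A quick way to double-check the permissions part is to list the jars from a small Java program, run as the same OS account that NiFi runs under. This is only a sketch: the class and method names are made up for illustration, and the default path is just the folder from my test. The comma-separated output is meant for the driver location property of the connection pool; as far as I know that property also accepts a plain folder path, so treat the joined list as optional.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of NiFi or the Impala driver.
class DriverJars {

    // Returns the absolute paths of all readable *.jar files in dir, sorted.
    // Files.list throws an IOException if the folder is missing or the
    // current account is not allowed to list it.
    static List<String> readableJars(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .filter(Files::isReadable)
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Default is the folder from this post; pass your own as the first argument.
        String folder = args.length > 0
                ? args[0]
                : "/home/oracle/database-drivers/impala/ClouderaImpalaJDBC41_2.5.42";
        List<String> jars = readableJars(Paths.get(folder));
        System.out.println(jars.size() + " readable jar(s) found");
        System.out.println(String.join(",", jars));
    }
}
```

If this prints zero jars or fails with an access error when run as the NiFi account, the connection pool will not be able to load the driver either.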
Second, go to NiFi and create a new DBCPConnectionPool controller service, like you would for any other JDBC database. The connection pool controller service conveniently stores the JDBC connection details, serves connections from the pool, and lets you limit concurrent connections to your source system. It stores the password encrypted as well.
Once you save it, you need to activate (enable) it.
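For reference, here is roughly what goes into the controller service, written out as a small Java sketch that assembles the values. The driver class name is the one from the error messages further down and the folder is the one from my test. Everything else is an assumption for illustration: the host name is a placeholder, 21050 is the port Impala usually listens on for JDBC clients, and the property labels are the ones I remember from the NiFi UI. Check the document shipped with the connector for authentication options and the exact URL syntax for your setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that only assembles strings; it does not talk to NiFi or Impala.
class ImpalaPoolSettings {

    // Builds a basic Impala JDBC URL of the form jdbc:impala://host:port/schema.
    static String jdbcUrl(String host, int port, String schema) {
        return "jdbc:impala://" + host + ":" + port + "/" + schema;
    }

    // Collects the values to type into the DBCPConnectionPool properties.
    static Map<String, String> settings(String host, int port, String schema, String driverDir) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Database Connection URL", jdbcUrl(host, port, schema));
        props.put("Database Driver Class Name", "com.cloudera.impala.jdbc41.Driver");
        props.put("Database Driver Location(s)", driverDir);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder host and assumed default port; replace with your own values.
        Map<String, String> props = settings(
                "impala-host.example.com", 21050, "default",
                "/home/oracle/database-drivers/impala/ClouderaImpalaJDBC41_2.5.42");
        props.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

User name and password, if your cluster needs them, go into their own properties of the same controller service.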
Third, you should now be able to use NiFi SQL processors such as PutSQL and ExecuteSQL. And this is when I got frustrated. I dropped an ExecuteSQL processor on the canvas to run a simple SELECT 123 as col and got some nasty Java errors:
failed to process session due to java.lang.NoClassDefFoundError: Could not initialize class com.cloudera.impala.core.ImpalaJDBCDriver
Unable to execute query select
org.apache.commons.dbcp.SQLNestedException: Cannot create JDBC driver of class 'com.cloudera.impala.jdbc41.Driver' for connect URL
I tried a bunch of different JDBC connection string settings, and I even tried to use the Hive connector instead (a little-known fact: the Hive connector can be used to connect to Impala just fine). Nothing helped. Then I stumbled upon this SO question, where Joe Dankers mentioned that restarting the NiFi service helped. And guess what? I restarted NiFi and everything magically started working.
Now that I think about it, it makes sense. NiFi runs in a single JVM, and because it has to load external jars for the Impala connector, I guess the entire JVM needs to be restarted to pick up these dependencies. Maybe this is common knowledge among Java developers, but it cost me 4 hours. The old Microsoft Windows trick of restarting your system when nothing works actually works :)
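There is one piece of JVM behavior that fits the first error message well. When a class's static initializer fails once, the JVM marks that class as unusable, and every later attempt to use it in the same JVM fails with NoClassDefFoundError: Could not initialize class, even if the original cause has been fixed in the meantime. I have not verified that this is exactly what happened inside NiFi, so take it as a plausible explanation rather than a diagnosis. The sketch below reproduces the effect with a made-up FlakyDriver class standing in for the real driver.

```java
// Demonstrates that a failed static initializer "sticks" for the life of the JVM.
// FlakyDriver is a hypothetical stand-in, not the real Impala driver class.
class StickyInitFailure {

    static class FlakyDriver {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final int VALUE = init();

        static int init() {
            // Simulates a driver whose setup fails, e.g. because a dependency is missing.
            throw new IllegalStateException("dependency missing");
        }
    }

    // Tries to use the class and reports which error type the JVM raised.
    static String tryUse() {
        try {
            return "ok " + FlakyDriver.VALUE;
        } catch (LinkageError e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First use: the initializer runs and fails (ExceptionInInitializerError).
        System.out.println("first attempt:  " + tryUse());
        // Second use: the initializer is not retried (NoClassDefFoundError).
        System.out.println("second attempt: " + tryUse());
    }
}
```

The second attempt never re-runs the initializer, which is why retrying from the NiFi canvas would keep failing while a restart, which starts from a fresh JVM, clears the state.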
Bonus
Did you know that you can access Impala and run queries from the ExecuteScript and InvokeScriptedProcessor processors, using your connection pool controller service? Head over to Matt Burgess's blog to learn how. Thanks to Matt, I picked up some basic Groovy. I was not planning to learn a new language, but Groovy turned out to be pretty awesome and handy with NiFi. If you are interested in learning more about NiFi scripts, Matt has put together some nice recipes here.