Laying out data infrastructure for a new product and big-data research is no small task. In many cases, it involves working in the dark with a lot of unknowns and juggling many analysis tools and methods to gain even a shred of meaningful insight into an endless sea of data.
Any data engineer or data scientist will tell you that getting through it takes a toolbox full of tools that are fast to use, scalable, and able to support multiple use-cases. Means of exploring different types of data are also paramount, given that our data infrastructure deals with several different file types, databases and storage locations, including:
- Parquet files over S3 (Amazon Simple Storage Service)
- Parquet files over GCS (Google Cloud Storage)
- Parquet files stored locally or on a mounted drive
- JSON files (Local, on GCS and on S3)
- Google BigQuery
- And more
In an ideal world, we’d also be able to query all data from the same place and in the same language, preferably in SQL.
For the last few years, I’ve used Presto to query multiple data sources and combine the results for data exploration. Unfortunately, while it is effective in many areas of my work, it doesn’t support querying local parquet files, an important function in my day-to-day tasks. In my search for something that could, I decided to try out Apache Drill. If the title of this post is any indication, it went far better than I expected, to the point that I decided to share my experience.
TL;DR While Apache Drill is mainly meant for “data exploration”, it’s also great for simplifying ETL (Extract, Transform, Load) processes. However, once we graduate to well-defined use-cases, we’ll have to consider other tools for resource and latency reasons. Nonetheless, it is, for now, one of the most useful tools in our arsenal.
Here are some of the technical challenges and triumphs from my implementation of Apache Drill for Satori’s data exploration.
A quick breakdown of Apache Drill
Apache Drill is an SQL engine that lets you query files and databases. It connects to many different formats, including Apache Parquet (a compressed columnar format that is great for querying large amounts of data), JSON files, delimited text files (CSV, TDF, etc.), log files, as well as most RDBMSs (configured via their JDBC drivers). Analysis is done in memory, which makes it very fast.
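To make that concrete, here is a minimal sketch of querying a local file directly (the path and column aliases are hypothetical; dfs is the storage plugin Drill ships with for the local file system):

```sql
-- Query a local CSV file with no schema definition at all.
-- Without a header row, Drill exposes the fields as a single `columns` array.
SELECT columns[0] AS user_id,
       columns[1] AS event_type
FROM   dfs.`/data/events.csv`
LIMIT  5;
```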
Some of its most exciting features allow you to:
- Not only query a lot of different data formats, but also join data from multiple sources in one query (see the sketch right after this list).
- Enjoy a full SQL engine. Apache Drill isn’t an “almost-SQL” engine. It’s a fully compliant ANSI SQL engine with all of the perks we engineers love (subqueries, window functions, WITH statements, etc).
- Query your data right away, without defining any schemas, and get going quickly. Queries also return results fast.
- Write quick SQL queries against a very wide range of data sources. It even allows you to SQL query streaming data from Apache Kafka.
- Benefit from a number of cool enhancements, such as parsing user-agent strings and returning an object with user-agent properties (e.g. browser version, operating system, etc).
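As an example of the multi-source joins, here is a minimal sketch, assuming an S3-backed storage plugin named s3 alongside the default dfs plugin (the paths and column names are hypothetical):

```sql
-- Join parquet data on S3 with a local JSON lookup file in a single query
SELECT o.o_custkey,
       c.country_name,
       COUNT(*) AS num_of_orders
FROM   s3.`orders.parquet`        AS o   -- parquet file in an S3 bucket
JOIN   dfs.`/data/customers.json` AS c   -- JSON file on the local file system
  ON   o.o_custkey = c.custkey
GROUP BY o.o_custkey, c.country_name
LIMIT  10;
```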
Getting started with Apache Drill:
Apache Drill can work as a cluster, where each node is called a “Drillbit”, but it can also run as a single instance (“embedded mode”). My current use-cases only require single instances, so I did not need to delve into setting up an Apache Drill cluster.
I won’t give a full tutorial for starting Apache Drill, as there’s plenty of information about that (this is a good place to start your Apache Drill journey). What I will share is that it was easy, quick and relatively seamless. Suffice it to say that running it locally or in a docker container is as straightforward as bin/drill-embedded or docker run -it apache/drill respectively.
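Once the shell is up, you can sanity-check the installation against the sample data Drill bundles on its classpath, exposed through the cp storage plugin (the column names below come from the bundled employee.json sample):

```sql
-- Query the sample data shipped with Drill via the classpath ('cp') storage plugin
SELECT full_name, position_title
FROM   cp.`employee.json`
LIMIT  5;
```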
Connecting Apache Drill to Google Cloud Storage:
This is where some of the sparkle wears off. Any technologist reading this will empathize with the experience of a task that should have taken 30 minutes tops, but ended up taking a full day of spitting blood on the keyboard to get the code working or the infrastructure ready. This is exactly what happened when I tried to connect Apache Drill to Google Cloud Storage. I searched high and low for any working example of using GCS as a storage engine for parquet files queried by Apache Drill, but only managed to find examples for AWS S3.
After wiping the blood off my keyboard, I used a hint from the (awesome) “Learning Apache Drill” book by Charles Givre & Paul Rogers, which sent me to the GCP connectors page.
From there, I downloaded the Hadoop connector file, and placed it in the /jars/3rdparty directory, to ensure that it was loaded when Apache Drill started. I spent a long time trying to understand how to configure my GCP API key to be used by Apache Drill. I finally figured it out by following the Hadoop connector repository of Google Cloud Dataproc. After providing the appropriate role to the IAM service account, I was up and running!
To spare anyone else the struggle, I consolidated what I learned into these steps:
- Add the JAR file from here to the jars/3rdparty directory. I used the “Cloud Storage Connector for Hadoop 3.x”, but other versions may work as well.
- Add the following to the core-site.xml file in the conf directory (replace the upper-case placeholders such as YOUR_PROJECT_ID with your own details):
```xml
<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address is associated with the service account used for GCS access when
    fs.gs.auth.service.account.enable is true. Required when authentication key specified
    in the Configuration file (Method 1) or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory relative gs: uris resolve in inside of the default bucket.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects with / in their path
    e.g. creating gs://bucket/foo/ upon deleting or renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list request to minimize
    calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform copy operation using Rewrite requests. Allows to copy files
    between different locations and storage classes.
  </description>
</property>
```
- Start Apache Drill (note: if you’re working in a cluster, you need to propagate config to the nodes. Perhaps we’ll cover this in a future post!)
- In the web panel, add a new storage plugin with the configuration below, copied from the ‘dfs’ storage configuration and adjusted. I named it ‘gcs’, but feel free to use any other name, especially since you might use several different buckets (and will need to define a connector for each one). Change YOUR_BUCKET_NAME to your bucket name, and customize to your needs:
{ "type": "file", "connection": "gs://YOUR_BUCKET_NAME", "config": null, "workspaces": { "tmp": { "location": "/tmp", "writable": true, "defaultInputFormat": null, "allowAccessOutsideWorkspace": false }, "root": { "location": "/", "writable": false, "defaultInputFormat": null, "allowAccessOutsideWorkspace": false } }, "formats": { "psv": { "type": "text", "extensions": [ "tbl" ], "delimiter": "|" }, "csv": { "type": "text", "extensions": [ "csv" ], "delimiter": "," }, "tsv": { "type": "text", "extensions": [ "tsv" ], "delimiter": "\t" }, "httpd": { "type": "httpd", "logFormat": "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" }, "parquet": { "type": "parquet" }, "json": { "type": "json", "extensions": [ "json" ] }, "pcap": { "type": "pcap", "extensions": [ "pcap" ] }, "pcapng": { "type": "pcapng", "extensions": [ "pcapng" ] }, "avro": { "type": "avro" }, "sequencefile": { "type": "sequencefile", "extensions": [ "seq" ] }, "csvh": { "type": "text", "extensions": [ "csvh" ], "extractHeader": true, "delimiter": "," }, "image": { "type": "image", "extensions": [ "jpg", "jpeg", "jpe", "tif", "tiff", "dng", "psd", "png", "bmp", "gif", "ico", "pcx", "wav", "wave", "avi", "webp", "mov", "mp4", "m4a", "m4p", "m4b", "m4r", "m4v", "3gp", "3g2", "eps", "epsf", "epsi", "ai", "arw", "crw", "cr2", "nef", "orf", "raf", "rw2", "rwl", "srw", "x3f" ] }, "excel": { "type": "excel" }, "ltsv": { "type": "ltsv", "extensions": [ "ltsv" ] }, "shp": { "type": "shp" }, "syslog": { "type": "syslog", "extensions": [ "syslog" ] } }, "enabled": true }
- Upload a sample parquet file to the bucket. In my case, I uploaded the sample orders.parquet file from
- Now try the following queries just to make sure you’re good to go:
```sql
-- Using our new gcs connector :)
USE gcs.root;

-- Fetch a single record to see if we're getting data
SELECT o_orderkey FROM `orders.parquet` LIMIT 1;
+------------+
| o_orderkey |
+------------+
| 1          |
+------------+
1 row selected (6.458 seconds)

-- Top 10 customers by number of orders
SELECT o_custkey, COUNT(*) AS num_of
FROM `orders.parquet`
GROUP BY o_custkey
ORDER BY 2 DESC
LIMIT 10;
+-----------+--------+
| o_custkey | num_of |
+-----------+--------+
| 898       | 32     |
| 712       | 32     |
| 79        | 32     |
| 1282      | 32     |
| 643       | 32     |
| 4         | 31     |
| 1213      | 30     |
| 334       | 30     |
| 1078      | 30     |
| 73        | 30     |
+-----------+--------+
```
Using Apache Drill to prepare data for Machine Learning
Preparing data for Machine Learning, especially when large datasets are involved, can be particularly annoying. Sure, you can use a number of solutions to prep your data, such as handling it in Python scripts, but they’re often too cumbersome. This is one of those instances where Apache Drill really helped us through the less exciting minutiae of our projects. It saved us precious time on large JSON/CSV files by allowing us to do much of the preparation as part of a simple SQL query, freeing up a significant amount of time to spend on the actual problems we’re trying to solve.
So, while using Python is possible, the ability to clean and prepare data for processing via a simple, easy SQL query is a far more appealing option. Useful functions include REGEXP_REPLACE, SPLIT, the quite robust TO_NUMBER and TO_TIMESTAMP, and the other conversion functions.
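As a hedged illustration of what such a prep query can look like (the file path, column names and format patterns here are hypothetical):

```sql
-- A sketch of SQL-based data prep; path and columns are made up for illustration
SELECT TO_TIMESTAMP(raw_ts, 'yyyy-MM-dd HH:mm:ss')                        AS event_time, -- parse string timestamps
       TO_NUMBER(REGEXP_REPLACE(amount_raw, '[^0-9.]', ''), '######.##')  AS amount,     -- strip non-numeric characters
       SPLIT(tags, ';')                                                   AS tag_list    -- delimited string to array
FROM   dfs.tmp.`raw/events.csvh`   -- csvh = CSV with a header row
WHERE  raw_ts IS NOT NULL;
```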
Apache Drill makes ETLs fun
At Satori, we often find ourselves processing large datasets and carrying out large-scale enrichments, feature extraction, and so on. These processes, which fall under the dreaded ETL label, can often be reduced to a simple SQL query.
To achieve this, we’re using CTAS (Create Table As Select), meaning that we create a new table from the results of a SELECT statement. While this is useful in any SQL database, combined with Apache Drill’s ability to read different data sources and join them, it becomes a way to write very simple and readable SQL code for such conversions and transformations.
Note that you can define what output format you want by changing the store.format setting, for example:
```sql
ALTER SESSION SET `store.format`='json';
```
"workspaces": { "etl": { "location": "/tmp/etl_output", # ← change to your own value "writable": true, "defaultInputFormat": null, "allowAccessOutsideWorkspace": false }
```sql
CREATE TABLE dfs.etl.`enrichedIncidents` AS
SELECT *
FROM dfs.tmp.`data/incidents` AS incidents
JOIN dfs.tmp.`data/enrichmentData` AS ed
  ON (incidents.incident_id = ed.incident_id);
```
Combining Apache Drill with Apache Zeppelin
The Satori team often uses Apache Zeppelin when we want a quick backoffice dashboard or a notebook. In most cases, we use Apache Zeppelin when we’re playing with data to get quick visualizations.
Apache Zeppelin is a smooth, web-based platform (thanks to its use of websockets) that makes it easy to create quick notebooks and dashboards with out-of-the-box visualizations for a wide variety of data sources. While Apache Zeppelin deserves a blog post of its own, let’s talk about how we combine Apache Zeppelin and Apache Drill.
While you can configure Zeppelin to work with many different data sources directly, using it with Apache Drill both removes that need and lets you write queries that combine different data stores.
To add Apache Drill to Apache Zeppelin, take the JDBC driver from the /jars/jdbc-driver directory of your Apache Drill installation (in my case, drill-jdbc-all-1.17.0.jar) and create a new JDBC interpreter in Zeppelin with the following configuration:
- default.url: jdbc:drill:drillbit=localhost
- default.driver: org.apache.drill.jdbc.Driver
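Once the interpreter is saved and bound to a note, a paragraph can query Drill directly. A minimal sketch, assuming the interpreter was named ‘drill’ (the paragraph prefix depends on the name you chose), reusing the sample table from earlier:

```sql
%drill
SELECT o_custkey, COUNT(*) AS num_of
FROM   gcs.root.`orders.parquet`
GROUP BY o_custkey
ORDER BY 2 DESC
LIMIT  10;
```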
Bottom line:
Apache Drill is a great tool for anyone dealing with data from a variety of sources and of varying quality. It’s quick to get started with and packs a serious punch. I’ve found it most useful for exploring and preparing data before use-cases are well defined. Once you do assign it a defined use-case, it is naturally advisable to find or create a more tailored solution to better serve you. The examples I included in this post are pulled directly from my own workloads, and while I verified them thoroughly on my end, it’s possible that they may not work for you. Please feel free to get in touch if this is the case (via email, Twitter or LinkedIn)!