Big Data Technical Articles: 2017

Monday, April 17, 2017

Using Data Partition with Apache Drill

In this article we will walk through how to creation partitions using apache drill on data residing in operating system.

Many a times , we have data generated or received as files at a certain frequency i.e. daily , hourly etc and we want in effective way to query only a certain section of data without scanning all the files.

For our e.g , we will assume that we receive files every quarter within a year and or requirement is that when we query the data for a particular query , only that quarter file gets scanned and not all the files. The advantage of drill is that it doesn't generate and persist metadata before hand and generates the same on the fly.

Lets assume that our files resides in folder with a year and with year , folders for quarters are created and each quarterly folder has a file for that particular quarter. So our folder structure will look something like this:

<<BasePath>>/2016/Q1/file.csv
<<BasePath>>/2016/Q2/file.csv
<<BasePath>>/2016/Q3/file.csv
<<BasePath>>/2016/Q4/file.csv
<<BasePath>>/2017/Q1/file.csv
<<BasePath>>/2017/Q2/file.csv
<<BasePath>>/2017/Q3/file.csv
<<BasePath>>/2017/Q4/file.csv

Each of the file will have the sample data in the below format:

Now we need to create a Apache drill view on top of this data till the base path i.e. <<BasePath>>
Visit the <<DrillHome>> path log into the Apache Drill terminal console using below command:

Just run a query on the base path to ensure that the data is accessible.

You would have observed above that the output of the above query gives dir0 and dir1 as 2 separate columns. Now we will create a view on top of this data using the generated columns dir0 and dir1.
First set the workspace by executing the command use dfs.root;

Now create a view by executing the below command.

Once the view is created , you can fire a query to fetch the relevant data for a year and quarter.

We can view the plan to see how many files were scanned to fetch the data to ensure that the partition pruning happened.

As seen above only the relevant file was scanned and not all the files.
Do revert/comments if you have any observations.

Thursday, April 13, 2017

Querying Data on S3 using Apache Drill

In this post ,we will have look how to query data which is made available on S3 buckets using Apache Drill. Although the approach is pretty straight forward but since there are a couple of issues that I faced, I thought of putting it over a post:

Pre-requisites:

Apache Drill [I used version 1.10 but I believe it will work the same for others too]
AWS S3 storage.

Steps to query the data on S3 storage using the Apache Drill component:

Lets 1st create a bucket on AWS and upload a file. We will upload a normal csv file in the below format and upload it on S3. The file can be of any format i.e. json , csv, tsv , parquet etc. For the sake of simplicity , we have taken a csv file. Name it as Sales.csv

Create a bucket on AWS S3 with a unique name. In my case I created a bucket with the name amazondrill-s3-testing and uploaded the above file Sales.csv
Now permission needs to be applied on the bucket and and to the file. For bucket , ensure you enable the permission of grantee as "Everyone" and check the "List" & "View Permissions" check boxes. For file , enable the permission of grantee as "Everyone" and check the "Open/ Download" check boxes

You would also need to create a user from the AWS IAM component and allow Programmatic access and note down the AccessID and Secret Key on completion of user creation. Also note down the link that the AWS generate for the file. In the above case , the link is "s3-us-west-2.amazonaws.com". This completes the setup needed for AWS.
Now login into the server where Apache Drill is installed and drillbit instance is running.
Goto the folder <<Drill Home>>/conf and create a copy of the core-site-example.xml as core-site.xml if it already doesn't exist.
In the file core-site.xml ,add the following configuration:

In the configuration as shown above update the Access Id , Secret Key and the link obtained in step4. Once updated , save and close the file. These are the most important fields and any incorrect values will result into access issues.
Goto the folder <<Drill Home>>/jars/3rdparty and download the latest zip file from the https://jets3t.s3.amazonaws.com/downloads.html page. Do not unzip the file after download. The drillbit nodes uses the file to access the code from S3 and hence its important to download this file on all the drill servers on the path mentioned above.
Goto the folder <<Drill Home>>/bin and open the file hadoop-excludes.txt. Remove the word jets3t from the list of words and save the file. Restart the drillbit service.
Visit the browser to http://<<host>>:8047 and click on Storage. Enable the s3 plugin if not already enabled.
Click on update against the S3 plugin and update the connection value as "connection": "s3a://<<Your bucket name>>/".