Monday, April 17, 2017

Using Data Partitioning with Apache Drill

In this article, we will walk through how to create partitions with Apache Drill on data residing on the local file system.

Often, we have data generated or received as files at a certain frequency, e.g. daily or hourly, and we want an effective way to query only a certain section of the data without scanning all the files.

For our example, we will assume that we receive files every quarter within a year, and our requirement is that when we query the data for a particular quarter, only that quarter's file gets scanned and not all the files. The advantage of Drill is that it doesn't generate and persist metadata beforehand; it generates it on the fly.

Let's assume that our files reside in folders organized by year; within each year, folders for the quarters are created, and each quarterly folder has a file for that particular quarter. So our folder structure will look something like this:

<<BasePath>>/2016/Q1/file.csv
<<BasePath>>/2016/Q2/file.csv
<<BasePath>>/2016/Q3/file.csv
<<BasePath>>/2016/Q4/file.csv
<<BasePath>>/2017/Q1/file.csv
<<BasePath>>/2017/Q2/file.csv
<<BasePath>>/2017/Q3/file.csv
<<BasePath>>/2017/Q4/file.csv

1. Each of the files will have sample data in the below format:
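
The original sample is not reproduced here, so the rows below are a hypothetical illustration: a headerless CSV with an id, a name, and a sales amount, matching the Id/Name/Sales columns used in the view created later in this post.

1,Ajay,1000
2,Rahul,2500
3,Meera,1800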

2. Now we need to create an Apache Drill view on top of this data at the base path, i.e. <<BasePath>>.

3. Go to the <<DrillHome>> path and log into the Apache Drill terminal console using the below command:

4. bin/sqlline -u "jdbc:drill:schema=dfs;zk=<<IP:Port>>"

5. Run a query on the base path to ensure that the data is accessible:

6. select * from `dfs`.`root`.`/data/dwhbi/logs/`;

7. You will observe that the output of the above query includes dir0 and dir1 as two separate columns; Drill generates these automatically from the two directory levels under the base path. Now we will create a view on top of this data using the generated columns dir0 and dir1.

8. First set the workspace by executing the command use dfs.root;

9. Now create a view by executing the below command:

10. CREATE VIEW data_partition AS SELECT dir0 AS `Year`, dir1 AS `Qtr`, columns[0] AS `Id`, columns[1] AS `Name`, columns[2] AS `Sales` FROM `dfs`.`/data/dwhbi/logs/`;

11. Once the view is created, you can fire a query to fetch the relevant data for a year and quarter, as shown below.
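
Here is a sample query against the view; the literal values assume the 2017/Q2 folder from the structure above (dir0 and dir1 are read as strings, hence the quotes):

select * from data_partition where `Year` = '2017' and `Qtr` = 'Q2';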

12. We can view the query plan to see how many files were scanned, to verify that partition pruning happened; see the sketch below.
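
The plan can be inspected by prefixing the query with EXPLAIN PLAN FOR; in the scan section of the output, the list of files should contain only the file under 2017/Q2:

explain plan for select * from data_partition where `Year` = '2017' and `Qtr` = 'Q2';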

13. As seen above, only the relevant file was scanned and not all the files.

14. Please drop a comment if you have any observations.

Thursday, April 13, 2017

Querying Data on S3 using Apache Drill

In this post, we will have a look at how to query data that is made available in S3 buckets using Apache Drill. Although the approach is pretty straightforward, I faced a couple of issues along the way, so I thought of putting it in a post:

Prerequisites:
1. Apache Drill [I used version 1.10, but I believe it will work the same for other versions too]
  2. AWS S3 storage.
Steps to query the data on S3 storage using Apache Drill:
1. Let's first create a bucket on AWS and upload a file. We will upload a normal CSV file in the below format. The file can be of any format, i.e. JSON, CSV, TSV, Parquet, etc.; for the sake of simplicity, we have taken a CSV file. Name it Sales.csv.
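
The original file contents are not reproduced here; the rows below are a hypothetical example of what a simple headerless Sales.csv might contain (an id, a region, and a sales amount):

1,North,1200
2,South,3400
3,East,2100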

2. Create a bucket on AWS S3 with a unique name. In my case, I created a bucket with the name amazondrill-s3-testing and uploaded the above file Sales.csv.

3. Now permissions need to be applied to the bucket and to the file. For the bucket, set the grantee to "Everyone" and check the "List" and "View Permissions" check boxes. For the file, set the grantee to "Everyone" and check the "Open/Download" check box.

4. You would also need to create a user from the AWS IAM component, allow Programmatic access, and note down the Access ID and Secret Key on completion of user creation. Also note down the endpoint that AWS generates for the file; in the above case, the endpoint is "s3-us-west-2.amazonaws.com". This completes the setup needed on the AWS side.

5. Now log into the server where Apache Drill is installed and the drillbit instance is running.

6. Go to the folder <<Drill Home>>/conf and create a copy of core-site-example.xml as core-site.xml, if it doesn't already exist.

7. In the file core-site.xml, add the following configuration.
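
A minimal sketch of the configuration is shown below, assuming the s3a connector (which matches the s3a:// connection string used in the storage plugin later in this post); the values are placeholders to be replaced with the Access ID, Secret Key, and endpoint from step 4:

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value><<Your Access ID>></value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value><<Your Secret Key>></value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3-us-west-2.amazonaws.com</value>
  </property>
</configuration>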

8. In the configuration shown above, update the Access ID, Secret Key, and the endpoint obtained in step 4. Once updated, save and close the file. These are the most important fields, and any incorrect values will result in access issues.

9. Go to the folder <<Drill Home>>/jars/3rdparty and download the latest zip file from the https://jets3t.s3.amazonaws.com/downloads.html page. Do not unzip the file after downloading. The drillbit nodes use this file to access data on S3, so it is important to download it to the path mentioned above on all the Drill servers.

10. Go to the folder <<Drill Home>>/bin and open the file hadoop-excludes.txt. Remove the word jets3t from the list of words and save the file. Restart the drillbit service.

11. Point your browser to http://<<host>>:8047 and click on Storage. Enable the s3 plugin if it is not already enabled.

12. Click on Update against the s3 plugin and set the connection value as "connection": "s3a://<<Your bucket name>>/".

13. Now you can go to the Query tab and run the query "select * from `s3`.`root`.`Sales.csv`;"

14. You should get your output as sketched below.
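
Assuming the hypothetical Sales.csv contents from step 1, and given that Drill exposes a headerless CSV file as a single array column named columns, the output would look roughly like this:

+-----------------------+
| columns               |
+-----------------------+
| ["1","North","1200"]  |
| ["2","South","3400"]  |
| ["3","East","2100"]   |
+-----------------------+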

15. Hope this post was helpful. Please drop a comment if you face any issues.

Sunday, June 26, 2016

Using Apache Drill REST API to query Hive Data

This article will guide you on how to use the Apache Drill REST API interface to query Hive data. We will show how Hive can be queried, but the same approach can be used to query data from HBase, MongoDB, flat files, etc.

Prerequisites:

1. Apache Hadoop should be installed.
2. Apache Hive should be installed.
3. Apache Drill should be installed.

Implementation:
1. Go to the path where Hive is installed.

2. Start the Hive metastore service, as shown below.
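
The screenshot of the command is not reproduced here; from the Hive bin directory, a typical way to start the metastore is:

./hive --service metastore &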

3. Start a new terminal and go to the path where Apache Drill is installed.

4. Start the Drill service in embedded mode, as shown below.
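
A typical command for this, run from the Drill home directory, is:

bin/drill-embedded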


5. Open a web browser and go to http://localhost:8047.
6. Go to the Storage tab and enable the hive storage plugin.

7. Click on Update against the hive storage plugin and change the values as shown below.
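
The screenshot of the plugin configuration is not reproduced here; a minimal sketch of a hive storage plugin configuration pointing at a local metastore looks like the following, where the metastore URI and HDFS address are assumptions to be replaced with your own values:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "fs.default.name": "hdfs://localhost:9000/",
    "hive.metastore.sasl.enabled": "false"
  }
}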

8. Let's fire a query on the Hive tables now. We have a transaction_orc table and a store table in Hive, and we want to find the total sales for each store; a sketch of such a query follows.
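
The exact query from the screenshot is not reproduced here; assuming hypothetical column names (store_id and sales_amount on transaction_orc; store_id and store_name on store), it would look something like:

select s.store_name, sum(t.sales_amount) as total_sales
from hive.`transaction_orc` t
join hive.`store` s on t.store_id = s.store_id
group by s.store_name;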

9. Now we will fire the same query via the REST API.

10. Open a web browser with any REST API client. We will use Firefox with the REST Client plugin.


11. Select the method as "POST", the URL as "http://localhost:8047/query.json", and the header as "Content-Type": "application/json".


12. In the body section, paste JSON along the lines shown below.
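
The screenshot of the body is not reproduced here; Drill's REST API expects a queryType field and a query field, so with the hypothetical query from step 8 the body would be:

{
  "queryType": "SQL",
  "query": "select s.store_name, sum(t.sales_amount) as total_sales from hive.`transaction_orc` t join hive.`store` s on t.store_id = s.store_id group by s.store_name"
}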


13. Now click on the Send button and validate the response.
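
The response screenshot is not reproduced here; Drill returns the result as JSON with a columns list and a rows array, so with our hypothetical tables it would look roughly like:

{
  "columns": ["store_name", "total_sales"],
  "rows": [
    {"store_name": "Store A", "total_sales": "5400"},
    {"store_name": "Store B", "total_sales": "3200"}
  ]
}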



14. As seen above, we have got the result in JSON format. We can use this approach from any programming framework with a REST client, parse the data, and display it on the UI.

Conclusion:
Hope this helps you understand how to configure Apache Drill and use its REST API to query data.
Sourabh Jain
Big Data & Analytics Architect

Tuesday, June 21, 2016

Using Beeline to connect to HiveServer2

This article will guide you on how to connect to HiveServer2 using Beeline.

Prerequisites:

1. Hadoop should be installed.
2. Hive should be installed.

Implementation:
1. Go to the path where Hive is installed.

2. Start the HiveServer2 process, as shown below.
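
The screenshot of the command is not reproduced here; from the Hive bin directory, a typical command is:

./hive --service hiveserver2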


3. This will start the hiveserver2 process as a foreground process. In order to stop the process, press Ctrl + C. If the process is to be started as a background process, execute "./hive --service hiveserver2 &" instead.

4. Start a new terminal.

5. Go to the Hive installation path as mentioned in step 1.

6. If there is a metastore_db folder, go inside it and remove all the *.lck files. This is required because the default database, Derby, supports only one connection at a time. The recommended metastore DB for production is MySQL.

7. Start the Beeline shell, as shown below.
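
From the Hive bin directory, this is typically:

./beeline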


8. Once the Beeline shell has started, let's connect to the hiveserver2 process.


9. On executing the "!connect jdbc:hive2://" command, you will be prompted for a username and password; just press Enter to continue. As we are running both the Beeline client and the hiveserver2 process on the same node, we execute the command as "!connect jdbc:hive2://". If the hiveserver2 process were running on a different server, we would need to connect as "!connect jdbc:hive2://hostname:port". The default port for the hiveserver2 process is 10000.

10. Let's display all the Hive tables.
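
The command, not reproduced from the screenshot, is:

show tables;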


11. Let's execute some queries against the tables, as sketched below.
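
The screenshots are not reproduced here; assuming a hypothetical table named store, the queries could look like:

select count(*) from store;
select * from store limit 5;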


12. In order to quit the Beeline shell, execute the "!quit" command.



Conclusion:
Hope this helps you understand how to connect to the hiveserver2 process using Beeline.

Sourabh Jain
Big Data & Analytics Architect

Monday, June 20, 2016

Using Hive with Apache Drill

This article will guide you on how to connect Apache Drill with Hive.

Prerequisites:

1. Hadoop should be installed.
2. Hive should be installed.
3. Apache Drill should be installed.

Implementation:
1. Go to the path where Hive is installed.

<<Screenshot: Go to Hive Home>>

2. Start the Hive metastore service.

<<Screenshot: Start Hive Metastore Service>>

3. You will observe the below screen once the Hive metastore service has started successfully.

<<Screenshot: Hive Metastore Service Started Successfully>>

4. Start a new terminal.

5. Go to the path where Apache Drill is installed.

<<Screenshot: Go to Drill Home>>

6. Start the Drill server in embedded mode.

<<Screenshot: Start Drill in Embedded Mode>>

7. Once the Drill server is started in embedded mode, you will observe the Drill prompt:

<<Screenshot: Apache Drill Shell>>

8. Go to the Apache Drill web UI at http://localhost:8047. You will observe the below screen:

<<Screenshot: Apache Drill Web UI>>

9. Click on Storage and then enable the hive storage plugin. The plugin should now appear under enabled storage plugins, as shown below.

<<Screenshot: Enable Hive Storage Plugin>>

10. Click on Update against the hive storage plugin and update the values as shown on the screen.

<<Screenshot: Configure Hive Storage Plugin>>

11. Go back to the terminal where you started the Apache Drill server, i.e. step 7.

12. Change the schema to use hive, using the command shown below.
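
The command is simply:

use hive;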

<<Screenshot: Change Hive Schema>>

13. Now you can run show tables; to list all the Hive tables.

<<Screenshot: List Hive Tables>>

14. You can run any query on the Hive tables. Remember, the query will not invoke a MapReduce job; Drill executes it with its own engine. A sketch follows.
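
As a hypothetical example, assuming a Hive table named transaction_orc:

select count(*) from hive.`transaction_orc`;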

<<Screenshot: Execute Queries on Hive via Apache Drill>>

Conclusion:
Hope this helps you understand how to configure Apache Drill with the Hive metastore interface to query Hive tables directly.

Sourabh Jain
Big Data & Analytics Architect