In this post ,we will have look how to query data which is made available on S3 buckets using Apache Drill. Although the approach is pretty straight forward but since there are a couple of issues that I faced, I thought of putting it over a post:
Pre-requisites:
Pre-requisites:
- Apache Drill [I used version 1.10 but I believe it will work the same for others too]
- AWS S3 storage.
Steps to query the data on S3 storage using the Apache Drill component:
- Lets 1st create a bucket on AWS and upload a file. We will upload a normal csv file in the below format and upload it on S3. The file can be of any format i.e. json , csv, tsv , parquet etc. For the sake of simplicity , we have taken a csv file. Name it as Sales.csv
- Create a bucket on AWS S3 with a unique name. In my case I created a bucket with the name amazondrill-s3-testing and uploaded the above file Sales.csv
- Now permission needs to be applied on the bucket and and to the file. For bucket , ensure you enable the permission of grantee as "Everyone" and check the "List" & "View Permissions" check boxes. For file , enable the permission of grantee as "Everyone" and check the "Open/ Download" check boxes
- You would also need to create a user from the AWS IAM component and allow Programmatic access and note down the AccessID and Secret Key on completion of user creation. Also note down the link that the AWS generate for the file. In the above case , the link is "s3-us-west-2.amazonaws.com". This completes the setup needed for AWS.
- Now login into the server where Apache Drill is installed and drillbit instance is running.
- Goto the folder <<Drill Home>>/conf and create a copy of the core-site-example.xml as core-site.xml if it already doesn't exist.
- In the file core-site.xml ,add the following configuration:
- In the configuration as shown above update the Access Id , Secret Key and the link obtained in step4. Once updated , save and close the file. These are the most important fields and any incorrect values will result into access issues.
- Goto the folder <<Drill Home>>/jars/3rdparty and download the latest zip file from the https://jets3t.s3.amazonaws.com/downloads.html page. Do not unzip the file after download. The drillbit nodes uses the file to access the code from S3 and hence its important to download this file on all the drill servers on the path mentioned above.
- Goto the folder <<Drill Home>>/bin and open the file hadoop-excludes.txt. Remove the word jets3t from the list of words and save the file. Restart the drillbit service.
- Visit the browser to http://<<host>>:8047 and click on Storage. Enable the s3 plugin if not already enabled.
- Click on update against the S3 plugin and update the connection value as "connection": "s3a://<<Your bucket name>>/".
- Now you can go to the query tab and run the query "Select * from `s3`.`root`.`Sales.csv`;"
- You would get your output as shown below:
- Hope this post was helpful. Please drop comments if you face any issues.
No comments:
Post a Comment