Apache Drill: Connecting to and Querying AWS S3 (Amazon S3) Storage

Apache Drill is a schema-free SQL query engine for Hadoop, JSON files, NoSQL databases, and cloud storage such as AWS S3. Using Drill you can query HBase, MongoDB, or S3 by writing ANSI-standard SQL. For a more detailed introduction, see [https://blogs.ashrithgn.com/2019/09/20/apache-drill-why-and-how-basic-usage-features-drill-bit/]

As the linked article explains, Apache Drill can query a variety of sources such as S3, the local file system, HBase, and other NoSQL databases. In this article we will concentrate on connecting to S3 and querying a sample CSV file, along with a few queries that come up frequently in real-world use.

Getting Started

  1. Create an S3 bucket in the region US East (N. Virginia).
  2. Upload a sample CSV file.
  3. Start Drill in embedded mode: {drill-path}/bin/drill-embedded.
  4. Navigate to the Drill web interface and open the Storage tab [http://localhost:8047/storage].
  5. Enable the S3 plugin.
  6. Update the S3 plugin with the configuration below.
{
 "type": "file",
 "connection": "s3a://<BUCKET NAME>/",
 "config": {
   "fs.s3a.access.key": "<ACCESSKEY>",
   "fs.s3a.secret.key": "<ACCESS SECRET>",
   "fs.s3a.impl.disable.cache": "true"
 },
 "workspaces": {
   "tmp": {
     "location": "/home/ashrith/data",
     "writable": true,
     "defaultInputFormat": null,
     "allowAccessOutsideWorkspace": false
   },
   "root": {
     "location": "/test",
     "writable": false,
     "defaultInputFormat": null,
     "allowAccessOutsideWorkspace": false
   }
 },
 "formats": {
   "psv": {
     "type": "text",
     "extensions": [
       "tbl"
     ],
     "delimiter": "|"
   },
   "csv": {
     "type": "text",
     "extensions": [
       "csv"
     ],
     "extractHeader": true,
     "delimiter": ","
   },
   "tsv": {
     "type": "text",
     "extensions": [
       "tsv"
     ],
     "delimiter": "\t"
   },
   "parquet": {
     "type": "parquet"
   },
   "json": {
     "type": "json",
     "extensions": [
       "json"
     ]
   },
   "avro": {
     "type": "avro"
   },
   "sequencefile": {
     "type": "sequencefile",
     "extensions": [
       "seq"
     ]
   },
   "csvh": {
     "type": "text",
     "extensions": [
       "csvh"
     ],
     "extractHeader": true,
     "delimiter": ","
   }
 },
 "enabled": true
}
  7. Open {drill-path}/conf/core-site.xml (if it does not exist, copy it from core-site.xml.example) and set the credentials provider. Once this is done, you can verify connectivity with the queries sketched after this list.
  <!-- When running on an EC2 instance that has an IAM role attached -->
  <configuration>
    <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
    </property>
  </configuration>
  <!-- When using an access key / secret key pair directly -->
  <configuration>
    <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
    </property>
  </configuration>
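
With the S3 plugin enabled and credentials in place, a quick sanity check from the Drill shell (or the Query tab of the web interface) confirms that Drill can reach the bucket. A minimal sketch, assuming the /test folder referenced in the root workspace actually exists in your bucket:

-- list the schemas Drill knows about; s3.root and s3.tmp should appear
SHOW SCHEMAS;
-- list the files under the root workspace (the /test folder of the bucket)
SHOW FILES IN s3.root;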

Querying

1. Query syntax: select * from s3.`workspaceName`.`folder/filename.csv`
Our configuration defines the workspaces root and tmp; the queries below use Drill's built-in `default` workspace, which points to the root of the bucket.
2. Example query:

select * from s3.`default`.`test/listings.csv` limit 1;

3. Casting data for aggregation. Drill reads CSV columns as VARCHAR, so numeric columns must be cast before they can be aggregated.

select CAST(id AS INT) from s3.`default`.`test/listings.csv` limit 1;
select SUM(CAST(id AS INT)) from s3.`default`.`test/listings.csv` limit 10;
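
Casting applies to grouped aggregations as well. The sketch below is hypothetical: `room_type` is an assumed column name, so replace it with a text column that actually exists in your CSV.

-- group by an assumed text column; numeric aggregates still need a CAST because CSV columns arrive as VARCHAR
select room_type, COUNT(*) AS listings, MAX(CAST(id AS INT)) AS max_id from s3.`default`.`test/listings.csv` group by room_type;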

4. Conditional query using a WHERE clause. The cast is repeated inside the WHERE clause because the column alias from the SELECT list is not visible there.

select CAST(id AS INT) as id from s3.`default`.`test/listings.csv` WHERE CAST(id AS INT) > 100 limit 10;
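
5. Materializing results with CREATE TABLE AS (CTAS). A common real-world follow-up is to persist the cast result so it does not have to be re-cast on every query. A minimal sketch, assuming you want the output in the writable tmp workspace configured above; `listings_clean` is a hypothetical table name, and Drill writes it in the format set by the store.format option (Parquet by default).

ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE s3.tmp.`listings_clean` AS select CAST(id AS INT) as id from s3.`default`.`test/listings.csv`;
select * from s3.tmp.`listings_clean` limit 10;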