September 1st 2017

Overview

Big data exploration in almost all fields has led to the development of multiple big data technologies such as Hadoop (Hive, HDFS, Pig, HBase), NoSQL databases (MongoDB), and so on for accessing, exploring, and reporting huge volume of data. Amazon Athena, a serverless, interactive query service, is used to easily analyze big data using standard SQL in Amazon S3. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data.

In this blog, let us compare data partitioning in Apache Drill and AWS Athena and the distinct features of both.

Dataset Description

A sample dataset, containing census data of a particular country in the USA, is used in this use case. For sample dataset, consider Reference section.

Partitioning Data

In this section, let us discuss data partitioning based on male and female fertility rate in a predefined age group in Apache Drill and Athena.

Partitioning Data in Apache Drill

To perform data partition in Drill, perform the following:

Change data storage format to Parquet using the following command:

ALTER SESSION SET `store.format`='parquet';

Create table and partition data using the following command:

CREATE TABLE dfs.`csvOut`.AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION(country_code, country_name, 
`year`, fertility_rate_15_19, fertility_rate_20_24, fertility_rate_25_29, fertility_rate_30_34, fertility_rate_35_39, 
fertility_rate_40_44, fertility_rate_45_49, total_fertility_rate, gross_reproduction_rate, sex_ratio_at_birth,gender) 
PARTITION BY (gender)
AS
SELECT columns[0] as country_code,columns[1] as country_name,columns[2] as `year`,columns[3] as 
fertility_rate_15_19,columns[4] as fertility_rate_20_24,columns[5] as fertility_rate_25_29,columns[6] as 
fertility_rate_30_34,columns[7] as fertility_rate_35_39,columns[8] as fertility_rate_40_44,columns[9] as 
fertility_rate_45_49,columns[10] as total_fertility_rate,columns[11] as gross_reproduction_rate,columns[12] as 
sex_ratio_at_birth,columns[13] as gender FROM 
dfs.`/user/tsldp/drillathena/age_specific_fertility_rates_gender.csv`;

The table created is as shown below:

The time taken to create a table is as shown below:

You can check the data loaded into the database using the following command:

select * from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` ;

The time taken to select the required data in a table is as shown below:

Get total count of male and female fertility data using the following command:

select count(*),gender from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` group by gender;

The count of males and females in a country is shown below:

The file size after partitioning data using Apache drill is as shown below:

Partitioning Data in Athena

Athena uses Hive data partitioning and provides improved query performance by reducing the amount of data scanned.

In Athena, data partitioning can be done in two separate ways as follows:

With already partitioned data stored on Amazon S3 and accessed on Athena.
With unpartitioned data.

In both methods, specify the partitioned column in create statement.

To perform data partition in Athena, perform the following:

Create table using the below query:

create external table sampledb.age_fertility_rates_gender_parq_part(
country_code string,
country_name string,
year string,
fertility_rate_15_19 decimal(10,5),
fertility_rate_20_24 decimal(10,5),
fertility_rate_25_29 decimal(10,5),
fertility_rate_30_34 decimal(10,5),
fertility_rate_35_39 decimal(10,5),
fertility_rate_40_44 decimal(10,5),
fertility_rate_45_49 decimal(10,5),
total_fertility_rate decimal(10,5),
gross_reproduction_rate decimal(10,5),
sex_ratio_at_birth decimal(10,5))
PARTITIONED BY (gender string)
stored as parquet
LOCATION 's3://cps3bucket/data_gender_parquet/';

Add partitions to the catalog by using the below command:

lMSCK REPAIR TABLE age_fertility.age_fertility_rates_gender_parq_part;

Check partitioned data using the below query:

select * from age_fertility.age_fertility_rates_gender_parq_part;

Data Partition Comparison between Apache Drill and Amazon Athena

The time taken to perform create partition and select partition is as follows:

Distinct Features of Drill and Athena

Conclusion

In Apache Drill, data partitioning concepts can be applied directly. In Athena, we need to convert the files into Parquet format using EMR to perform data partitioning. A separate storage is not required in Athena as you can query the data directly from Amazon S3.

References

Amazon Athena:
https://aws.amazon.com/athena/
Apache Drill:
https://drill.apache.org/
Sample Dataset in GitHub:
https://github.com/treselle-systems/apache_drill_vs_amazon_athena

Discover the Top 5 Beaches in Spain: …
$10 billion renewable energy project …

This post first appeared on Front-end Code Review & Validation Tools | Treselle Systems, please read the originial post: here

People also like

Discover the Top 5 Beaches in Spain: Spainâ€™s Best Beaches

$10 billion renewable energy project paused over concerns for Native American historic sites

Apache Drill vs Amazon Athena â€“ A Comparison on Data Partitioning

Overview

Dataset Description

Partitioning Data

Partitioning Data in Apache Drill

Partitioning Data in Athena

Data Partition Comparison between Apache Drill and Amazon Athena

Distinct Features of Drill and Athena

Conclusion

References

Related Articles

Apache Drill vs Amazon Athena â€“ A Comparison on Data Partitioning

Overview

Dataset Description

Partitioning Data

Partitioning Data in Apache Drill

Partitioning Data in Athena

Data Partition Comparison between Apache Drill and Amazon Athena

Distinct Features of Drill and Athena

Conclusion

References

Related Articles

Share the post

Subscribe to Front-end Code Review & Validation Tools | Treselle Systems

Thank you for your subscription