Monday, December 16, 2013

Hive vs Impala @ EMR

A colleague recently tipped me off that Impala arrived at Amazon's Elastic MapReduce (EMR).  I've been waiting for some time for this functionality, so I got started right away with a little benchmark.

Launching impala @ EMR is a breeze, just download the latest ruby client and replace "hive-interactive" with "impala-interative" in your EMR creation script.  I also followed the instructions to use the most recent AMI.

/home/ec2-user/emr2/elastic-mapreduce --create --alive --impala-interactive --name "Dev Impala - 1 hs1.8xlarge, 3 m2.4xlarge"  --master-instance-type "hs1.8xlarge" --slave-instance-type "m2.4xlarge" --num-instances 4 --ami-version 3.0.2 --region us-east-1 | awk -F' ' '{print $4}' > impala_master

Here is my benchmarking code ....

### impala test ###
impala_start=$(date +"%s")

hadoop fs -mkdir /data/jobs/
hadoop distcp s3n://XXXXXXXXXX/csvfeed/jobs/ /data/
hive -e 'drop table jobs'
hive -e 'create external table jobs (id string, title string, post_date string, job_board_url string, company string, city string, state string, description string, industry string, price string, update_date string, feed_date string, job_board string, hash string) row format delimited fields terminated by "\t" LOCATION "/data/jobs/";'
impala-shell -r --query="select count(1) from jobs";
impala_end=$(date +"%s")


### hive test ###
hive_start=$(date +"%s")

hadoop fs -mkdir /temp20/
hadoop distcp s3n://XXXXXXXXXX/csvfeed/jobs/ /temp20/
hive -e 'drop table jobs'
hive -e 'create table jobs (id string, title string, post_date string, job_board_url string, company string, city string, state string, description string, in
dustry string, price string, update_date string, feed_date string, job_board string, hash string) row format delimited fields terminated by "\t";
load data inpath "/temp20/jobs/" overwrite into table jobs;'
hive -e 'select count(1) from jobs';
hive_end=$(date +"%s")


diff=$(($impala_end-$impala_start))
echo "Impala - $(($diff / 60)) minutes and $(($diff % 60)) seconds elapsed."

diff=$(($hive_end-$hive_start))

echo "Hive - $(($diff / 60)) minutes and $(($diff % 60)) seconds elapsed."


And here is my output .... Not too shabby.

I loaded about 130GB of data, and then counted the 1387230376 rows.  Now some of the performance gains are likely due to the use of an externally managed Hive table (needed so Impala can see the Hive tables), but the actual query is darn near instantaneous in Impala.


Impala - 8 minutes and 50 seconds elapsed.
Hive - 20 minutes and 2 seconds elapsed.


Of course, Impala and Hive are great for different use cases and leaving Hive's internally managed cocoon is not for every situation.  I'm still debating the ramifications of this on our product, but it is going to be hard to argue with the performance gains.

Hope this was useful!