Wednesday, June 17, 2015

EMR cluster and selection of EC2 instance type - Cost Optimization!

AWS Elastic MapReduce (EMR) is Amazon’s service providing Hadoop in the cloud.
EMR runs its Hadoop nodes on EC2 instances. When launching an EMR cluster, we can choose an instance type that matches the resource requirements and profile of our Hadoop jobs.

In our case, we achieved a direct cost saving of about 30% by moving our EMR cluster from m2.4xlarge to r3.2xlarge.
In addition, our Hadoop jobs ran faster because of the newer-generation CPUs and the SSD storage in r3 instances.
SSDs help both HDFS and MapReduce, because a MapReduce job reads and writes its intermediate data on local disk.

Earlier we used m2.4xlarge for all transient clusters, but as part of this optimization I observed that some jobs took barely 15 minutes.
EMR is billed at hourly rates, so even if we use a cluster for only 15 to 20 minutes, we pay for a full hour.
I moved these jobs to smaller r3.xlarge instances, which resulted in cost savings of 65%!
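The effect of hourly billing on short jobs can be sketched in a few lines of Python. The m2.4xlarge rate is the EC2 + EMR total quoted in this post. The r3.xlarge rate is an assumption (half of the r3.2xlarge total of $0.880), since this post does not list it, so check the pricing page for the actual figure.

```python
import math

# Hourly rate per node (EC2 + EMR), in USD.
M2_4XLARGE_RATE = 1.226   # total price quoted in this post
R3_XLARGE_RATE = 0.440    # ASSUMPTION: half the r3.2xlarge total of 0.880

def billed_cost(runtime_minutes, hourly_rate, nodes=1):
    """Cost when every started hour is billed as a full hour."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * hourly_rate * nodes

def saving_percent(old_cost, new_cost):
    """Relative saving of new_cost over old_cost, in percent."""
    return (1 - new_cost / old_cost) * 100

old = billed_cost(15, M2_4XLARGE_RATE)
new = billed_cost(15, R3_XLARGE_RATE)
print(f"15-minute job: ${old:.3f} -> ${new:.3f} per node "
      f"({saving_percent(old, new):.0f}% saving)")
```

Under the assumed rate this gives a saving of roughly 64%, in line with the 65% reported here. It only holds as long as the job still finishes within the same billed hour on the smaller instance.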

We can push these cost savings further, up to 90%, by moving our transient clusters to spot instances.
However, this requires more effort to make our workflows fault-tolerant and to avoid delays in job execution when spot instances are lost because of price spikes.
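One common way to cope with lost spot instances is to retry the job a few times and then fall back to on-demand capacity. The sketch below shows only that control flow. `run_job`, `SpotCapacityLost` and the "spot" / "on-demand" labels are hypothetical placeholders, not EMR APIs.

```python
class SpotCapacityLost(Exception):
    """Hypothetical error raised when a spot cluster is reclaimed mid-job."""

def run_with_fallback(run_job, max_spot_attempts=2):
    """Try the job on spot capacity first, then fall back to on-demand.

    run_job is a hypothetical callable that takes "spot" or "on-demand"
    and raises SpotCapacityLost if the spot cluster disappears.
    """
    for attempt in range(1, max_spot_attempts + 1):
        try:
            return run_job("spot")
        except SpotCapacityLost:
            print(f"spot attempt {attempt} lost, retrying")
    # Spot kept failing: pay the on-demand price to protect the deadline.
    return run_job("on-demand")
```

Each failed spot attempt adds delay, so `max_spot_attempts` would need to be chosen against the deadline of the workflow.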

In summary, appropriate instance type selection can significantly reduce your EMR bills!

The following page lists the available EC2 instance types and their prices: http://aws.amazon.com/elasticmapreduce/pricing/
After comparing various instance types, I decided to move our transient clusters from m2.4xlarge to r3.2xlarge.

For example, let’s compare m2.4xlarge vs r3.2xlarge:

              m2.4xlarge         r3.2xlarge             Notes
vCPU          8 core             8 core (new gen CPU)
RAM           68.4GB             61GB
SSD           No (2 x 840 GB)    Yes (1 x 160 GB)       SSD is better for performance of hadoop jobs
EC2 Price     $0.980 per Hour    $0.700 per Hour
EMR Price     $0.246 per Hour    $0.180 per Hour
Total Price   $1.226 per Hour    $0.880 per Hour        ~30% cost saving
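
The saving follows directly from the hourly prices in the comparison above. A minimal sketch of the arithmetic, using only the prices quoted in this post:

```python
# Hourly prices in USD, as listed in the comparison above.
prices = {
    "m2.4xlarge": {"ec2": 0.980, "emr": 0.246},
    "r3.2xlarge": {"ec2": 0.700, "emr": 0.180},
}

def total_hourly(instance_type):
    """EC2 price plus EMR surcharge for one node-hour."""
    p = prices[instance_type]
    return p["ec2"] + p["emr"]

old_total = total_hourly("m2.4xlarge")
new_total = total_hourly("r3.2xlarge")
saving = (1 - new_total / old_total) * 100
print(f"{old_total:.3f} -> {new_total:.3f} USD/hour, saving {saving:.1f}%")
```

The exact figure comes out at about 28%, which this post rounds to ~30%.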





Default Memory Allocation:
Another interesting aspect of EMR is that it configures different default memory sizes for the various YARN containers (mapper, reducer, ApplicationMaster, etc.) depending on the instance type.
This means that looking only at the resources available on an instance type is not sufficient to decide which instance type to use for our EMR cluster.
m2.4xlarge

Configuration Option                    Default Value
mapreduce.map.java.opts                 -Xmx1280m
mapreduce.reduce.java.opts              -Xmx2304m
mapreduce.map.memory.mb                 1536
mapreduce.reduce.memory.mb              2560
yarn.scheduler.minimum-allocation-mb    256
yarn.scheduler.maximum-allocation-mb    8192
yarn.nodemanager.resource.memory-mb     61440

r3.2xlarge

Configuration Option                    Default Value
mapreduce.map.java.opts                 -Xmx2714m
mapreduce.reduce.java.opts              -Xmx5428m
mapreduce.map.memory.mb                 3392
mapreduce.reduce.memory.mb              6784
yarn.scheduler.minimum-allocation-mb    3392
yarn.scheduler.maximum-allocation-mb    54272
yarn.nodemanager.resource.memory-mb     54272
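
To see why these defaults matter, one can estimate how many map or reduce containers fit on a node by dividing yarn.nodemanager.resource.memory-mb by the per-task container size. A rough sketch using the default values listed above; it ignores the ApplicationMaster container and any vCPU limits, which is a simplifying assumption:

```python
# Default values in MB, taken from the two lists above.
defaults = {
    "m2.4xlarge": {"node": 61440, "map": 1536, "reduce": 2560},
    "r3.2xlarge": {"node": 54272, "map": 3392, "reduce": 6784},
}

def containers_per_node(instance_type, task):
    """Upper bound on concurrent containers of one task type per node."""
    d = defaults[instance_type]
    return d["node"] // d[task]

for itype in defaults:
    mappers = containers_per_node(itype, "map")
    reducers = containers_per_node(itype, "reduce")
    print(f"{itype}: up to {mappers} mappers or {reducers} reducers per node")
```

With the defaults, an m2.4xlarge node can run up to 40 mappers at once while an r3.2xlarge node runs up to 16, even though both have 8 vCPUs and a similar amount of RAM.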


The following pages show the default memory allocations for various instance types:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H2.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html
These default allocations do not suit all Hadoop jobs, so we wrote a script that reduces the memory allocation on r3.2xlarge instances when the cluster is bootstrapped.
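
The script itself is not shown in this post. As an illustration only, the sketch below computes a set of reduced overrides for r3.2xlarge. The target container sizes (the m2.4xlarge defaults), the 80% heap-to-container ratio (taken from the r3.2xlarge defaults, 2714/3392) and the function name are assumptions, and how the values are applied to the cluster is left out. It also lowers yarn.scheduler.minimum-allocation-mb, on the assumption that the scheduler rounds container requests up to a multiple of that value.

```python
def reduced_overrides(map_mb, reduce_mb, min_alloc_mb, heap_percent=80):
    """Build Hadoop property overrides for smaller task containers.

    heap_percent is the share of each container given to the JVM heap
    (ASSUMPTION: 80, matching the ratio in the r3.2xlarge defaults).
    """
    return {
        "mapreduce.map.memory.mb": str(map_mb),
        "mapreduce.map.java.opts": f"-Xmx{map_mb * heap_percent // 100}m",
        "mapreduce.reduce.memory.mb": str(reduce_mb),
        "mapreduce.reduce.java.opts": f"-Xmx{reduce_mb * heap_percent // 100}m",
        # ASSUMPTION: requests are rounded up to a multiple of the minimum
        # allocation, so it has to be lowered along with the task sizes.
        "yarn.scheduler.minimum-allocation-mb": str(min_alloc_mb),
    }

# ASSUMPTION: reuse the m2.4xlarge container sizes on r3.2xlarge.
overrides = reduced_overrides(map_mb=1536, reduce_mb=2560, min_alloc_mb=256)
for name, value in overrides.items():
    print(f"{name}={value}")
```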

Overall, selecting appropriate EC2 instance types gave us direct cost savings of 30% to 65% on our EMR clusters, with up to 90% possible by moving to spot instances.

- Sarang Anajwala