Sunday, November 16, 2014

Hive Performance Improvement - statistics and file format checks

Following are couple of important configuration changes that can improve the performance of hive queries (specially insert queries) great deal.

hive.stats.autogather

By default 'hive.stats.autogather' is 'true'. This setting governs collection of statistics for newly created tables/partitions.
Following are the stats collected:
  • Number of rows
  • Number of files
  • Size in Bytes
For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default.
In case of tables with high number of partitions, this can have big impact on performance.
It is advisable to disable this setting with set hive.stats.autogather=false;
The stats can be collected when required using following query:
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS 
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [NOSCAN];

hive.fileformat.check

This property governs whether to check file format or not when loading data files.
In our case, where the format of data file is governed by our own processes, checking file format every time we load data file may not add any value.
Hence, it is recommended to disable the file format check to gain some performance.
set hive.fileformat.check=false;

Performance

More than 50% of Performance improvement is observed by using these configuration changes. The time taken by hive queries for metering reduced from >100 minutes to <50 minutes.

References

https://cwiki.apache.org/confluence/display/Hive/StatsDev
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties