How ORC Split Strategies Work (Hive)

Tamil Selvan K
Oct 25, 2020

The Hive property hive.exec.orc.split.strategy controls which strategy ORC uses to create splits for execution. The available options are "BI", "ETL", and "HYBRID".
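The property is normally set per session with SET hive.exec.orc.split.strategy=ETL; or in hive-site.xml. Below is a minimal sketch using the Hadoop Configuration API to show the same property and its accepted values; the class name is only illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class OrcSplitStrategyConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Accepted values: BI, ETL, HYBRID (HYBRID is the default).
        conf.set("hive.exec.orc.split.strategy", "ETL");
        System.out.println("Split strategy: "
                + conf.get("hive.exec.orc.split.strategy", "HYBRID"));
    }
}
```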

The HYBRID mode reads the footers for all files if there are fewer files than the expected mapper count, switching over to generating one split per file if the average file size is smaller than the default HDFS block size. The ETL strategy always reads the ORC footers before generating splits, while the BI strategy quickly generates per-file splits without reading any data from HDFS.

BI: One split per file. It is faster when the number of files is small, because no footers are read from HDFS.

ETL: The ORC reader reads the file footers and then decides the number of splits. A SearchArgument (SARG) is passed to the reader, which can eliminate ORC stripes/splits based on the filter conditions in the query. Use this strategy when you can afford to spend extra time computing splits, typically for large queries.
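Hive builds the SearchArgument automatically from the query's WHERE clause; the sketch below only illustrates what such a pushed-down predicate looks like using the SearchArgument builder from hive-storage-api. The column names and literal values are made up for the example:

```java
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;

public class SargExample {
    public static void main(String[] args) {
        // Roughly what Hive would push down for:
        //   WHERE state = 'CA' AND age < 30
        // The reader compares the predicate against stripe/row-group
        // min/max statistics and skips stripes that cannot match.
        SearchArgument sarg = SearchArgumentFactory.newBuilder()
                .startAnd()
                    .equals("state", PredicateLeaf.Type.STRING, "CA")
                    .lessThan("age", PredicateLeaf.Type.LONG, 30L)
                .end()
                .build();
        System.out.println(sarg);
    }
}
```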

HYBRID: Chooses between ETL and BI based on the number of files and the average file size. If the number of files is less than the expected mapper count, it reads the footers (ETL); otherwise, if the average file size is smaller than the HDFS block size, it generates per-file splits (BI).
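A simplified sketch of that decision, following the description above; the actual logic lives in OrcInputFormat.java (reference [2]) and uses Hive's internal thresholds, so the parameter names here are illustrative only:

```java
public class HybridStrategySketch {
    enum SplitStrategy { BI, ETL }

    // expectedMapperCount and hdfsBlockSize stand in for Hive's internal
    // thresholds; they are not actual configuration property names.
    static SplitStrategy choose(long totalFiles, long totalBytes,
                                long expectedMapperCount, long hdfsBlockSize) {
        long avgFileSize = totalFiles == 0 ? 0 : totalBytes / totalFiles;
        if (totalFiles <= expectedMapperCount || avgFileSize > hdfsBlockSize) {
            // Few files, or large files: pay the cost of reading footers (ETL).
            return SplitStrategy.ETL;
        }
        // Many small files: one split per file without reading footers (BI).
        return SplitStrategy.BI;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        // 10,000 files of ~10 MB each, 50 expected mappers -> BI
        System.out.println(choose(10_000, 10_000L * 10 * 1024 * 1024, 50, blockSize));
        // 20 files of ~1 GB each, 50 expected mappers -> ETL
        System.out.println(choose(20, 20L * 1024 * 1024 * 1024, 50, blockSize));
    }
}
```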

References:

[1] https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data
[2] https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
[3] https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
