Hive LLAP Caching
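The LLAP cache mostly benefits BI queries rather than ETL queries, since BI workloads tend to re-read the same hot data while ETL jobs typically scan their input once.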
hive.llap.io.threadpool.size is set at the node level and defines the number of low-level I/O threads. The daemon offloads I/O and the transformation from compressed formats to these I/O threads. The data is then passed on to the executor threads in the JVM, where the actual vectorized processing happens.
It is recommended to set both hive.llap.io.threadpool.size and the number of executors per daemon to the number of cores on the node.
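As a rough illustration of the sizing rule above, the following Python sketch derives both settings from a node's core count. The helper name llap_thread_settings is hypothetical, and the sketch assumes the executor count is configured through hive.llap.daemon.num.executors; check the property names against your Hive version before applying them.

```python
def llap_thread_settings(cores):
    """Return the two LLAP thread settings sized to the node's core count.

    `cores` is the number of cores on the LLAP node (not on the machine
    running this script), so it is passed in explicitly.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {
        # Low-level I/O threads (read and decode compressed data).
        "hive.llap.io.threadpool.size": cores,
        # Executor threads per daemon (assumed property name).
        "hive.llap.daemon.num.executors": cores,
    }


# Example: print the settings for a 16-core node as key=value pairs.
for name, value in llap_thread_settings(16).items():
    print(f"{name}={value}")
```

Keeping both values equal means every core can have one I/O thread and one executor thread available, which is the recommendation stated above.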
Cache storage:
LLAP’s cache is columnar, automatic and decentralised. When a query uses a new column or partition, it is added to the cache automatically, and columns that are no longer used (dead columns) are not kept. The daemon caches metadata for input files as well as the data. Metadata and index information can be cached even for data that is not currently cached. Metadata is stored in-process as Java objects.
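● Eviction policy. LRFU (Least Recently/Frequently Used) is currently used, but the policy is pluggable. Because LRFU weighs access frequency as well as recency, a single large scan is less likely to push frequently used data out of the cache.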
● Caching granularity. The column chunk is the unit of data in the cache.
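The two points above can be sketched with a small LRFU cache in Python. This is a simplified illustration of the general LRFU idea, not LLAP's actual implementation: the class name LrfuCache, the lam parameter, and the (file, column, row group) tuple used as a column-chunk key are all assumptions made for the example. Each entry carries a combined recency/frequency (CRF) score that decays with age and grows by 1 on every access; the entry with the lowest current score is evicted.

```python
class LrfuCache:
    """Toy LRFU cache; keys stand in for column chunks (hypothetical layout)."""

    def __init__(self, capacity, lam=0.01):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.lam = lam          # decay rate: near 0 acts like LFU, near 1 like LRU
        self.clock = 0          # logical time, advanced on every get/put
        self.entries = {}       # key -> (value, crf, last_access_time)

    def _decay(self, age):
        # Weighting function F(age) = (1/2) ** (lam * age).
        return 0.5 ** (self.lam * age)

    def _priority(self, key):
        # CRF score of an entry as seen at the current logical time.
        _, crf, last = self.entries[key]
        return crf * self._decay(self.clock - last)

    def get(self, key):
        self.clock += 1
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, crf, last = entry
        # A hit adds 1 to the decayed score and refreshes the access time.
        crf = 1.0 + crf * self._decay(self.clock - last)
        self.entries[key] = (value, crf, self.clock)
        return value

    def put(self, key, value):
        self.clock += 1
        if key in self.entries:
            _, crf, last = self.entries[key]
            crf = 1.0 + crf * self._decay(self.clock - last)
        else:
            if len(self.entries) >= self.capacity:
                # Evict the entry with the lowest current CRF score.
                victim = min(self.entries, key=self._priority)
                del self.entries[victim]
            crf = 1.0
        self.entries[key] = (value, crf, self.clock)


# A hot chunk is read several times, then a scan touches new chunks once each.
cache = LrfuCache(capacity=2)
hot = ("orders.orc", "price", 0)
cache.put(hot, b"hot-data")
for _ in range(3):
    cache.get(hot)
for row_group in range(1, 4):
    cache.put(("events.orc", "ts", row_group), b"scan-data")
print(hot in cache.entries)
```

In this scenario the scan chunks evict each other while the frequently read chunk stays cached, which is the behaviour a plain LRU policy would not give.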
To disable the LLAP cache from the command line