You might have used Tez extensively if you are on the HDP distribution of Hive, but if you are new to HDP/CDP or have used Hive on MapReduce only, this article gives you a quick overview of what Apache Tez is and how it uses the existing YARN architecture to speed up query execution for Hive and Pig. Tez relies on YARN for running its tasks, and it requires a durable shared filesystem accessible through the Hadoop FileSystem interface.

Traditional MapReduce execution has proved very reliable for data processing in a cluster environment, but at the same time…
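For context, switching a Hive session between the two engines is a one-line setting; a minimal sketch using the standard values of hive.execution.engine:

set hive.execution.engine=tez; -- run queries on Tez
set hive.execution.engine=mr;  -- fall back to classic MapReduce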

ConnectionPool: A “connection pool” is a cache of database connection objects. Connection pools promote the reuse of connection objects and reduce the number of times connection objects are created. They significantly improve performance for database-intensive applications because creating a connection object is costly in both time and resources.

In Hive, this value is controlled by the property “datanucleus.connectionPool.maxPoolSize”, which is set to 10 by default.
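A minimal sketch of overriding the default in hive-site.xml (the value 20 is purely illustrative):

<property>
  <name>datanucleus.connectionPool.maxPoolSize</name>
  <value>20</value>
</property>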

The total number of connections to the backend database (MySQL or Postgres) can be found using the formula below:

(No. of Hiveserver2 Instance x 2 x datanucleus.connectionPool.maxPoolSize) + (No. of HMS…
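As a worked example, assuming the truncated second term mirrors the first (i.e. each Hive Metastore instance also opens 2 x maxPoolSize connections): with 2 HiveServer2 instances, 2 HMS instances, and the default pool size of 10, that gives (2 x 2 x 10) + (2 x 2 x 10) = 80 connections to the backend database.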

Databases have locks, and Hive is no different. Locks in a database can be either read locks or write locks. Locks are used when concurrent applications try to access the same table; they prevent data from being corrupted or invalidated when multiple users read while others write to the database.

There may be specific scenarios in which, when we try to delete a table, the table does not respond at all. Check hive.log in /tmp/$user and you can see that the table is locked. The first test to confirm that a table is locked is to run…
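The truncated step above most likely refers to Hive's SHOW LOCKS statement; a minimal sketch (my_table is a hypothetical table name):

SHOW LOCKS my_table;
SHOW LOCKS my_table EXTENDED; -- include details such as the lock holder and query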

Cache is used mostly for BI queries as compared to ETL queries. hive.llap.io.threadpool.size is set at the node level and defines the number of low-level I/O threads. Basically, the daemon offloads I/O and transformation from compressed formats to these I/O threads; the data is then passed on to execution, where the actual vectorized processing happens on executor threads in the JVM. hive.llap.io.threadpool.size and the number of executors per daemon are recommended to be set to the same value as the number of cores. Cache storage: LLAP’s cache is columnar, automatic and decentralised. When a new column or partition is used, it adds…
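A minimal sketch of the corresponding settings, assuming a 16-core LLAP node (hive.llap.daemon.num.executors is the executors-per-daemon knob; the value 16 is illustrative):

<property>
  <name>hive.llap.io.threadpool.size</name>
  <value>16</value>
</property>
<property>
  <name>hive.llap.daemon.num.executors</name>
  <value>16</value>
</property>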

Hive 3 has seen a lot of changes in terms of architecture, like the default table type becoming ACID, deprecating the Hive CLI (the thick client) and supporting only the thin JDBC client (Beeline), etc.
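A minimal sketch of connecting with Beeline (hostname, port and user are hypothetical):

beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n hive_user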

Below is the high-level architecture (I tried to make some changes to the existing Hive architecture diagram, which had JobTracker in it).

Design: components

Server: Thrift Server API (server impls, serialization, network I/O)
Processor: application logic (session, operation, driver, etc.)
Client: JDBC/ODBC/Beeline; Thrift Client API (discouraged)
ZooKeeper: service discovery
Authentication: Kerberos/LDAP/pluggable
Metastore & RDBMS:
  Remote: TCP connection to the Metastore Thrift Server to access RDBMS data
  Embedded: Access RDBMS data…
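To illustrate the ZooKeeper service-discovery piece: instead of pointing at a single HiveServer2 host, the JDBC URL can point at the ZooKeeper ensemble, which resolves a live HS2 instance (hostnames are hypothetical):

jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2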

For a YARN application, we can use REST APIs on the Application Timeline Server (ATS) to fetch the application’s data. Below are some reference links:

https://community.hortonworks.com/content/supportkb/221899/how-to-export-information-from-tez-view.html

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_V1
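For example, Tez DAG entities can be pulled from the Timeline Server in the same GET style as below (the hostname is hypothetical; 8188 is the default ATS port, and TEZ_DAG_ID is the entity type Tez publishes):

GET "http://Timeline-Server-Address:8188/ws/v1/timeline/TEZ_DAG_ID?limit=10"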

We can use the ResourceManager (RM) REST APIs to get application-related data. Some examples are below, followed by a scripted sketch:

a) Failed apps for a specific time period

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=10&startedTimeBegin={time in epoch}&startedTimeEnd={time in epoch}&states=FAILED"

b) To get all the apps in FINISHED or KILLED state for a specific user and time period

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=20&states=FINISHED,KILLED&user=<user-id>&startedTimeBegin={time in epoch}&startedTimeEnd={time in epoch}"

c) To get failed jobs for a specific user and time period

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&startedTimeBegin={time in epoch}&startedTimeEnd={time in epoch}&states=FAILED&user=<user-id>"

d) To get finished jobs for a specific user

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&user=<user-id>"

e) To get finished jobs based on the application type

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=tez"

f) To get Spark application type jobs

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=spark"
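As a rough sketch, the same queries can be scripted in Python with the requests library; this assumes the standard RM response shape ({"apps": {"app": [...]}}) and a hypothetical RM host:

import requests

# Hypothetical ResourceManager address; replace with your RM host.
RM = "http://Resource-Manager-Address:8088"

# Example: up to 10 FAILED apps started within a window (epoch milliseconds).
params = {
    "limit": 10,
    "states": "FAILED",
    "startedTimeBegin": 1609459200000,  # illustrative value
    "startedTimeEnd": 1609545600000,    # illustrative value
}
resp = requests.get(RM + "/ws/v1/cluster/apps", params=params)
resp.raise_for_status()

# The RM wraps results as {"apps": {"app": [...]}}; "apps" is null when
# nothing matches the filters.
apps = (resp.json().get("apps") or {}).get("app", [])
for app in apps:
    print(app["id"], app["user"], app["state"], app["finalStatus"])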

This setting (hive.exec.orc.split.strategy) controls what strategy ORC should use to create splits for execution. The available options are “BI”, “ETL” and “HYBRID”.

The HYBRID mode reads the footers for all files if there are fewer files than the expected mapper count, switching over to generating one split per file if the average file size is smaller than the default HDFS block size. The ETL strategy always reads the ORC footers before generating splits, while the BI strategy quickly generates per-file splits without reading any data from HDFS.

BI: Per-file split. It will be faster when the number of files is small.

ETL: Here the ORC reader…
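A minimal sketch of switching the strategy for a session:

set hive.exec.orc.split.strategy=BI;     -- fast per-file splits, no footer reads
set hive.exec.orc.split.strategy=ETL;    -- always read footers before splitting
set hive.exec.orc.split.strategy=HYBRID; -- default: decide based on file count and size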

Tamil Selvan K

Support Engineer @DataRobot
