Tamil Selvan K

Databases use locks, and Hive is no different. A lock in a database can be either a read lock or a write lock. Locks are used when concurrent applications try to access the same table. They prevent data from being corrupted or invalidated when some users try to read while others write to the database.

--

The cache is used mostly for BI queries rather than ETL queries.
hive.llap.io.threadpool.size is set at the node level and defines the number of low-level I/O threads. Basically, the daemon offloads I/O and transformation from compressed formats to these I/O threads. Then, the data will be passed on to…

--

Hive 3 has seen a lot of architectural changes, such as making ACID the default table type, deprecating the Hive CLI (the thick JDBC client), and supporting only the thin JDBC client (Beeline).

Below is the high-level architecture (I tried to make some changes to the existing Hive architecture diagram which…

--

For a YARN application, we can fetch the application's data using the REST APIs on ATS (the Application Timeline Server). Below are some reference links:

https://community.hortonworks.com/content/supportkb/221899/how-to-export-information-from-tez-view.html

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_V1

We can also use the RM (ResourceManager) REST APIs to get application-related data. Some examples are below:

a) To get failed apps for a specific time period

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=10&startedTimeBegin={epoch time in ms}&startedTimeEnd={epoch time in ms}&states=FAILED"

b) To get all apps in state FINISHED or KILLED for a specific user and a specific time period

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=20&states=FINISHED,KILLED&user=<user-id>&startedTimeBegin={epoch time in ms}&startedTimeEnd={epoch time in ms}"

c) To get failed jobs for a specific user and a specific time period

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&startedTimeBegin={epoch time in ms}&startedTimeEnd={epoch time in ms}&states=FAILED&user=<user-id>"

d) To get finished jobs for a specific user

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&user=<user-id>"

e) To get finished jobs based on the application type (here, Tez)

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=tez"

f) To get finished jobs of application type Spark

GET "http://Resource-Manager-Address:8088/ws/v1/cluster/apps?limit=1&states=FINISHED&applicationTypes=spark"
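The queries above can also be built programmatically. Below is a minimal Python sketch, not a definitive client: the helper names (to_epoch_ms, build_apps_url, fetch_apps) are made up for illustration, Resource-Manager-Address is the same placeholder used above, and it assumes the RM is reachable without authentication (a Kerberized cluster would need SPNEGO, which this sketch does not handle).

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder host, same as in the examples above; replace with your RM address.
RM_ADDRESS = "http://Resource-Manager-Address:8088"


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)


def build_apps_url(rm_address=RM_ADDRESS, **filters):
    """Build a cluster apps URL, skipping any filter whose value is None."""
    params = {k: v for k, v in filters.items() if v is not None}
    # safe="," keeps comma-separated lists (states=FINISHED,KILLED) unescaped.
    return f"{rm_address}/ws/v1/cluster/apps?{urlencode(params, safe=',')}"


def fetch_apps(url):
    """Call the RM and return the list of application records (may be empty)."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        body = json.load(resp)
    # When nothing matches, "apps" is null rather than an empty object.
    return (body.get("apps") or {}).get("app", [])


# Example (b): FINISHED or KILLED apps for one user within a time window.
# The user name and dates are illustrative values only.
example_url = build_apps_url(
    limit=20,
    states="FINISHED,KILLED",
    user="some-user",
    startedTimeBegin=to_epoch_ms(datetime(2020, 1, 1, tzinfo=timezone.utc)),
    startedTimeEnd=to_epoch_ms(datetime(2020, 1, 2, tzinfo=timezone.utc)),
)
print(example_url)
# To actually query the cluster: apps = fetch_apps(example_url)
```

Keeping URL construction separate from the HTTP call makes it easy to print and check a query before running it against the cluster.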

--

The hive.exec.orc.split.strategy property controls which strategy ORC uses to create splits for execution. The available options are "BI", "ETL" and "HYBRID".

The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are…

--
