Hive DataNucleus ConnectionPool

ConnectionPool: A “connection pool” is a cache of database connection objects. Connection pools promote the reuse of connection objects and reduce the number of times connection objects have to be created. They significantly improve performance for database-intensive applications because creating connection objects is costly in both time and resources.

In Hive, the pool size is configured through the property “datanucleus.connectionPool.maxPoolSize”, which is set to 10 by default.
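For reference, a minimal hive-site.xml sketch for overriding the default pool size might look like the following (the value 20 is purely illustrative):

    <!-- hive-site.xml: DataNucleus connection pool size used by the metastore -->
    <property>
      <name>datanucleus.connectionPool.maxPoolSize</name>
      <!-- Illustrative override; the default is 10 -->
      <value>20</value>
    </property>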

The total number of connections to the backend database (MySQL or Postgres) can be estimated using the formula below:

    Total connections = (no. of HiveServer2 instances * 2 * maxPoolSize)
                      + (no. of HMS instances * 2 * maxPoolSize)

The 2 in the above equation denotes the TxnHandler and ObjectStore connection pools that each service opens.

(TxnHandler is a handler to answer transaction related calls that come into the metastore server.)

Refer: https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1137

Based on datanucleus.connectionPool.maxPoolSize, you can calculate the maximum number of connections Hive will open to the backend DB.

For example, if you set datanucleus.connectionPool.maxPoolSize to 10, the number of connections is calculated as below:

Assuming 2 HiveServer2 instances and 2 HMS instances:

(2 * 2 * 10) + (2 * 2 * 10) = 80. So 80 is the maximum number of connections to the backend DB from Hive.

How to check the max_connections in your backend DB?

Postgres:

To determine the maximum number of connections your Postgres database can hold, you can check this link.
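As a quick check from psql, the configured limit and the connections currently open can be read with standard Postgres commands; a minimal sketch:

    -- Maximum client connections the Postgres server will accept
    SHOW max_connections;

    -- Connections currently open across all databases
    SELECT count(*) FROM pg_stat_activity;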

Mysql:

Check the connection details:

Refer: https://dev.mysql.com/doc/refman/8.0/en/show-processlist.html
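On MySQL, the configured limit and live connection counts can be inspected with standard statements; a minimal sketch:

    -- Maximum simultaneous client connections allowed
    SHOW VARIABLES LIKE 'max_connections';

    -- Highest number of connections seen since the server started
    SHOW STATUS LIKE 'Max_used_connections';

    -- Per-connection details (user, host, command, state), as described in the link above
    SHOW PROCESSLIST;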

The maximum number of connections that can be supported depends on the following variables:

  • The available RAM — The system will need to have enough RAM to handle the additional workload.
  • The thread library quality of the platform — This will vary by
    platform. For example, Windows can be limited by the POSIX
    compatibility layer it uses (though the limit no longer applies to
    MySQL 5.5 and up). Even so, memory usage remains a concern
    depending on the architecture (x86 vs. x64) and how much memory can
    be consumed per application process.
  • The required response time — Increasing the number of connections
    could increase the time taken to respond to requests. This should
    be tested before going into production to ensure it meets your needs.
  • The amount of RAM used per connection — Again, RAM is important,
    so you need to know whether the RAM used per connection will
    overload the system.
  • The workload required for each connection — The workload also
    factors into what system resources are needed to handle the
    additional connections.

