Hive DataNucleus ConnectionPool

Tamil Selvan K
Oct 29, 2020

ConnectionPool: A "connection pool" is a cache of database connection objects. Connection pools promote the reuse of connection objects and reduce the number of times connection objects are created. Connection pools significantly improve performance for database-intensive applications because creating connection objects is costly in both time and resources.

In Hive, we can configure this value through the property "datanucleus.connectionPool.maxPoolSize", which is set to 10 by default.
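
If you want to confirm the value your cluster is actually using, you can print it from a Beeline session (a quick check; the property itself is normally configured in hive-site.xml):

-- Print the effective value of the pool size property from Beeline
SET datanucleus.connectionPool.maxPoolSize;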

The total number of connections to the backend database (MySQL or PostgreSQL) can be found using the formula below:

(No. of HiveServer2 instances x 2 x datanucleus.connectionPool.maxPoolSize) + (No. of HMS instances x 2 x datanucleus.connectionPool.maxPoolSize) = total connections to DB

The factor of 2 in the above equation denotes the two connection pools, TxnHandler and ObjectStore.

(TxnHandler is a handler that answers transaction-related calls coming into the metastore server.)

Refer: https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1137

METASTORE_CONNECTION_POOLING_TYPE("datanucleus.connectionPoolingType", "BONECP",
    "Specify connection pool library for datanucleus"),
METASTORE_CONNECTION_POOLING_MAX_CONNECTIONS("datanucleus.connectionPool.maxPoolSize", 10,
    "Specify the maximum number of connections in the connection pool. Note: The configured size will be used by\n" +
    " 2 connection pools (TxnHandler and ObjectStore). When configuring the max connection pool size, it is \n" +
    "recommended to take into account the number of metastore instances and the number of HiveServer2 instances \n" +
    "configured with embedded metastore. To get optimal performance, set config to meet the following condition\n" +
    "(2 * pool_size * metastore_instances + 2 * pool_size * HS2_instances_with_embedded_metastore) = \n" +
    "(2 * physical_core_count + hard_disk_count)."),
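
As a hypothetical illustration of that condition: on a metastore database host with 16 physical cores and 4 hard disks, the right-hand side works out to 2 x 16 + 4 = 36, so the combined pool usage on the left-hand side should be kept at roughly 36 connections or below.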

Based on datanucleus.connectionPool.maxPoolSize, you can calculate the maximum number of connections that will be made to the backend DB.

For example, if datanucleus.connectionPool.maxPoolSize is set to 10, the number of connections is calculated as below:

(No. of HiveServer2 instances x 2 x datanucleus.connectionPool.maxPoolSize) + (No. of HMS instances x 2 x datanucleus.connectionPool.maxPoolSize) = total connections to DB

Assuming 2 HiveServer2 instances and 2 HMS instances:

(2 x 2 x 10) + (2 x 2 x 10) = 80. So 80 is the maximum number of connections to the backend DB from Hive.

How to check the max_connections in your backend DB?

Postgres:

The current max_connections value from your configuration file:
$ cat postgresql.conf | grep -i "max_connections"
max_connections = 100 # (change requires restart)

To determine the maximum number of connections your Postgres database can support, you can check this link.
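
You can also check both the limit and the current usage from psql itself. A quick sketch (the 'hive' user name below is only an assumption; substitute whatever user your metastore connects as):

-- Show the configured limit without reading postgresql.conf
SHOW max_connections;

-- Count current connections, overall and for the (assumed) metastore user
SELECT count(*) AS total_connections FROM pg_stat_activity;
SELECT count(*) AS hive_connections FROM pg_stat_activity WHERE usename = 'hive';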

MySQL:

mysql> show variables like 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 151   |
+-----------------+-------+

Check the connection details:

mysql> SHOW FULL PROCESSLIST;

Refer: https://dev.mysql.com/doc/refman/8.0/en/show-processlist.html
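
To compare actual usage against the number calculated above, a few standard MySQL checks can help (the 'hive' user name is again an assumption):

-- Current and historical peak connection counts
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';

-- Connections grouped by user; the metastore user (e.g. 'hive') should stay
-- within the limit calculated earlier
SELECT user, COUNT(*) AS connections
FROM information_schema.PROCESSLIST
GROUP BY user
ORDER BY connections DESC;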

The maximum number of connections that can be supported depends on the following factors:

  • The available RAM: the system needs enough RAM to handle the additional workload.
  • The thread library quality of the platform: this varies by platform. For example, Windows can be limited by the POSIX compatibility layer it uses (though the limit no longer applies to MySQL 5.5 and up). However, there are still memory usage concerns depending on the architecture (x86 vs. x64) and how much memory can be consumed per application process.
  • The required response time: increasing the number of connections can increase the time taken to respond to requests. This should be tested to ensure it meets your needs before going into production.
  • The amount of RAM used per connection: again, RAM is important, so you need to know whether the RAM used per connection will overload the system (see the sketch after this list).
  • The workload required for each connection: the workload also factors into what system resources are needed to handle the additional connections.
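
For the RAM-per-connection point above, a very rough MySQL-side estimate can be sketched from the per-session buffers (only an approximation: the buffer list is not exhaustive and actual usage depends heavily on the workload):

-- Rough upper bound on per-connection memory, in MB
SELECT (@@read_buffer_size
      + @@read_rnd_buffer_size
      + @@sort_buffer_size
      + @@join_buffer_size
      + @@thread_stack
      + @@binlog_cache_size) / 1024 / 1024 AS approx_mb_per_connection;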
