Hive DataNucleus Connection Pool
ConnectionPool: A "connection pool" is a cache of database connection objects. Connection pools promote the reuse of connection objects and reduce the number of times that connection objects are created. They significantly improve performance for database-intensive applications because creating a connection object is costly in both time and resources.
In Hive, this is controlled by the property "datanucleus.connectionPool.maxPoolSize", which defaults to 10.
The total number of connections to the backend database (MySQL or PostgreSQL) can be found using the following formula:
(No. of Hiveserver2 Instance x 2 x datanucleus.connectionPool.maxPoolSize) + (No. of HMS Instance x 2 x datanucleus.connectionPool.maxPoolSize) = total connections to db
The 2 in the above equation denotes the two connection pools per instance: TxnHandler and ObjectStore.
(TxnHandler is a handler that answers transaction-related calls coming into the metastore server.)
Refer: https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1137
METASTORE_CONNECTION_POOLING_TYPE("datanucleus.connectionPoolingType", "BONECP",
    "Specify connection pool library for datanucleus"),
METASTORE_CONNECTION_POOLING_MAX_CONNECTIONS("datanucleus.connectionPool.maxPoolSize", 10,
    "Specify the maximum number of connections in the connection pool. Note: The configured size will be used by\n" +
    "2 connection pools (TxnHandler and ObjectStore). When configuring the max connection pool size, it is\n" +
    "recommended to take into account the number of metastore instances and the number of HiveServer2 instances\n" +
    "configured with embedded metastore. To get optimal performance, set config to meet the following condition\n" +
    "(2 * pool_size * metastore_instances + 2 * pool_size * HS2_instances_with_embedded_metastore) =\n" +
    "(2 * physical_core_count + hard_disk_count)."),
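The sizing condition quoted above can be solved for the pool size itself. The sketch below uses hypothetical hardware counts (cores, disks, instance counts); substitute the numbers from your own deployment.

```python
# Sizing condition from HiveConf:
#   2*pool_size*metastore_instances + 2*pool_size*HS2_instances = 2*cores + disks
# All values below are illustrative assumptions, not recommendations.
physical_cores = 16
hard_disks = 8
metastore_instances = 2
hs2_embedded_instances = 2

# Solve the condition for pool_size.
pool_size = (2 * physical_cores + hard_disks) / (
    2 * (metastore_instances + hs2_embedded_instances)
)
print(pool_size)  # 5.0
```

With this hardware, datanucleus.connectionPool.maxPoolSize would be set to 5 to satisfy the recommended condition.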
Based on datanucleus.connectionPool.maxPoolSize, you can calculate the maximum number of connections going to the backend DB.
For example, if datanucleus.connectionPool.maxPoolSize is set to 10, the number of connections is calculated as below:
(No. of Hiveserver2 Instance x 2 x datanucleus.connectionPool.maxPoolSize) + (No. of HMS Instance x 2 x datanucleus.connectionPool.maxPoolSize) = total connections to db
Assuming 2 HiveServer2 instances and 2 HMS instances:
(2 * 2 * 10) + (2 * 2 * 10) = 80. So Hive can open up to 80 connections to the backend DB.
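The worked example above can be reproduced in a few lines; the instance counts here match the example and should be replaced with your own:

```python
# Each HiveServer2 (with embedded metastore) and each HMS instance holds
# two pools (TxnHandler + ObjectStore) of max_pool_size connections each.
max_pool_size = 10   # datanucleus.connectionPool.maxPoolSize
hs2_instances = 2
hms_instances = 2

total = (hs2_instances * 2 * max_pool_size) + (hms_instances * 2 * max_pool_size)
print(total)  # 80
```

Compare this total against the backend database's max_connections (checked below) to make sure Hive alone cannot exhaust the limit.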
How to check the max_connections in your backend DB?
Postgres:
The current "max_connections" setting can be read from your configuration file:
$ cat postgresql.conf | grep -i "max_connections"
max_connections = 100 # (change requires restart)
To determine the maximum number of connections your PostgreSQL database can support, you can check this link
Mysql:
mysql> show variables like 'max_connections';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 151 |
+-----------------+-------+
Check the connection details:
mysql> SHOW FULL PROCESSLIST;
Refer: https://dev.mysql.com/doc/refman/8.0/en/show-processlist.html
The maximum number of connections that can be supported depends on the following variables:
- The available RAM: the system needs enough RAM to handle the additional workload.
- The thread library quality of the platform: this varies by platform. For example, Windows can be limited by the POSIX compatibility layer it uses (though the limit no longer applies to MySQL 5.5 and up). Even so, memory usage remains a concern depending on the architecture (x86 vs. x64) and how much memory can be consumed per application process.
- The required response time: increasing the number of connections can increase the time taken to respond to requests. Test this before going into production to ensure it meets your needs.
- The amount of RAM used per connection: again, RAM is important, so you need to know whether the RAM used per connection will overload the system.
- The workload required for each connection: the workload also factors into what system resources are needed to handle the additional connections.
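Several of the variables above reduce to a simple RAM budget check. The sketch below uses purely illustrative numbers (both the RAM budget and the per-connection footprint are assumptions); measure your own workload before relying on such an estimate.

```python
# Rough upper bound on connections from available RAM.
# Both values are hypothetical; profile your database to get real numbers.
ram_for_connections_mb = 4096  # RAM you can spare for client connections
ram_per_connection_mb = 12     # assumed per-connection footprint (buffers, thread stack)

max_supported = ram_for_connections_mb // ram_per_connection_mb
print(max_supported)  # 341
```

If the Hive total computed earlier (e.g. 80) plus connections from other clients approaches this bound, either lower datanucleus.connectionPool.maxPoolSize or add RAM.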
Additional Readings: