Confluence Unresponsive Due to High Database Connection Latency
Platform Notice: Server and Data Center Only - This article only applies to Atlassian products on the server and data center platforms.
Confluence is intermittently unable to serve requests, with all HTTP execution threads waiting on a database connection from the C3P0 connection pool. Thread dumps show all C3P0 connection pool threads in the process of obtaining connections.
In addition, excessive database connections may be observed, with the pool attempting to obtain more connections than there is demand for.
This issue can be observed in any environment with excessive database connection establishment latency.
It is commonly observed in Oracle Database 12c environments when Oracle Data Guard or RAC products are implementing the Advanced Security Option feature. This features encrypts data across the network, but the connection establishment process adds significant time to obtaining new connections.
In order to test for this issue in a platform agnostic way, follow the below steps
- In Confluence Logging and Profiling set the com.mchange.v2.resourcepool.BasicResourcePool package to DEBUG
Tail the atlassian-confluence.log file to monitor the connection pool size in the below log
2019-12-05 06:30:16,194 DEBUG [http-nio-8090-exec-81 url:/confluence/res...notification/count] [mchange.v2.resourcepool.BasicResourcePool] log acquire test -- pool size: 40; target_pool_size: 125; desired target? 42
- Run a script to repeatedly query <confluence-URL>/status every second
- When a stall is observed, monitor the log until the pool size reaches the configured target_pool_size, when this happens, the status endpoint should return along with all other threads
- This test can be repeated numerous times to confirm the issue
In the case of Oracle Database 12c issues, it can be confirmed using threads dumps. Review the C3P0PooledConnectionPoolManager threads for the below stack trace
at oracle.net.aso.e.b(Unknown Source) at oracle.net.aso.e.a(Unknown Source) at oracle.net.aso.n.a(Unknown Source) at oracle.net.ano.DataIntegrityService.a(Unknown Source) at oracle.net.ano.Ano.negotiation(Unknown Source) at oracle.net.ns.NSProtocol.connect(NSProtocol.java:345) at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1438) at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:518) at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688) at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:39) at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:691) at com.mchange.v2.c3p0.DriverManagerDataSource.getConnection(DriverManagerDataSource.java:175) at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:220) at com.mchange.v2.c3p0.WrapperConnectionPoolDataSource.getPooledConnection(WrapperConnectionPoolDataSource.java:206) at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.acquireResource(C3P0PooledConnectionPool.java:203) at com.mchange.v2.resourcepool.BasicResourcePool.doAcquire(BasicResourcePool.java:1138) at com.mchange.v2.resourcepool.BasicResourcePool.doAcquireAndDecrementPendingAcquiresWithinLockOnSuccess(BasicResourcePool.java:1125) at com.mchange.v2.resourcepool.BasicResourcePool.access$700(BasicResourcePool.java:44) at com.mchange.v2.resourcepool.BasicResourcePool$ScatteredAcquireTask.run(BasicResourcePool.java:1870) at com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:696)
Note that this only confirms Oracle Advanced Security Option latency issues. Network issues can also introduce this latency, and this second test does not exclude the results of the first test.
The root cause for this problem is excessive latency in obtaining a connection to the database, though this condition exacerbates a design choice in the C3P0 database connection pool manager.
C3P0 has a design choice wherein it will prioritise obtaining new connections to meet perceived requirements of the pool. During this time where obtaining connections is prioritised, dispatching database connections to the HTTP pool is deprioritised and thus blocked by the obtain work queue. As HTTP execution threads are unable to get database connections, they are also blocked by this behaviour.
The subsequent issue is that, by default, the pool manager will destroy connections that have been idle for 30 seconds in order to keep connection states fresh. This mechanism will also bring the current pool size below the minimum pool size whilst cleaning up these aged connections.
In normal conditions, the pool is rapidly rebuilt and the connection pool continues releasing connections to HTTP execution threads. When there is excessive latency in establishing connections, the pool is slow to rebuild, and takes a long time to rebuild to the minimum database connection pool value.
The issue is further exacerbated as HTTP execution threads build up and request additional connections from C3P0. C3P0 interprets this as there not being enough available connections in the pool, and begins to increase the target number of connections it's building towards. This number will increase until it hits the maximum connection pool size, during which time C3P0 continues to prioritise obtaining connections rather than releasing them to the HTTP execution threads.
Ultimately, this condition persists until C3P0 meets its target number of database connections, at which point it releases all of the database connections required to the HTTP execution pool and unblocking HTTP execution.
There are two workarounds that can mitigate this issue. The first mitigation is preferable as it is more reliable and less likely to encounter additional issues
Confluence supports using Tomcat as a database pool manager. This is commonly referred to as a Tomcat datasource.
Full documentation on how to change to this connection manager can be found at Configuring a datasource connection
This works around the behaviour seen in C3P0 and will allow dispatching of database connections to HTTP execution pools whilst obtaining additional connections.
This does not solve the core problem though, and should demand exceed the available database connections pooled, Tomcat will still be restricted by the connection establishment latency, and will be slow to respond.
Confluence should remain responsive during this time, but some page loads may be impacted.
The connection timeout value on the C3P0 connection pool can be changed from its default 30 second value, to 10 minutes.
This is achieved by changing the following value in the confluence.cfg.xml file
By reducing the regularity with which connections are timed out, we reduce the number of times the connection pool available will drop below the minimum connection pool value. This reduces the risk of C3P0 prioritising obtaining connections and thus blocking HTTP execution.
Additionally, significantly reducing the minimum and maximum connection pool values can further reduce the impact of this problem, but at the risk of introducing other problems.
This method is still likely to run into the problem detailed above, but significantly less frequently.
The only resolution to this issue is to remove the latency in establishing new database connections.
In the case of Oracle 12c with Advanced Security Option, this can be done by turning off the Advanced Security Option setting. If this isn't possible, there is currently no way to mitigate this behaviour in Oracle Database.
Oracle have additional information on this issue at Slow Connection Using 12c Client When Network Encryption Is Enabled
In other database scenarios, further investigation will be required to identify and mitigate the cause of the connection latency.