Confluence throws 500 error due to lock timeout in VCacheLock
Platform notice: Server and Data Center only. This article only applies to Atlassian products on the Server and Data Center platforms.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Problem
Sometimes Confluence may respond with 500x error pages, have a general performance issue, or even a full outage, due to the lock timeout in VCacheLock which is used by I18NBean.
Either of the following appears in the atlassian-confluence.log:
WARN [http-nio-8090-exec-143] [internal.core.service.VCacheLock] lockWithTimeout Timed out waiting for lock on cache: com.atlassian.confluence.util.i18n.I18NBeanFactory.by.locale
or:
ERROR [http-nio-8090-exec-1] [xhtml.view.macro.ViewMacroMarshaller] handleMacroExecutionException Error rendering macro: content-column
-- url: /display/SPACEKEY/Page+Name | page: 123456789 | traceId: 7a94e4c4b2be88a0 | userName: user.name@example.com | referer: https://confluence.example.com/display/SPACEKEY/Page+Name | action: viewpage
java.lang.StackOverflowError
at java.security.AccessController.doPrivileged(Native Method)
at java.io.PrintWriter.<init>(PrintWriter.java:116)
at java.io.PrintWriter.<init>(PrintWriter.java:100)
at com.atlassian.confluence.util.PatternLayoutWithStackTrace.appendStackTrace(PatternLayoutWithStackTrace.java:43)
at com.atlassian.confluence.util.PatternLayoutWithStackTrace.format(PatternLayoutWithStackTrace.java:30)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at com.atlassian.confluence.logging.ConfluenceHomeLogAppender.append(ConfluenceHomeLogAppender.java:147)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.error(Category.java:322)
at com.comalatech.confluence.workflow.DefaultWorkflowAccessor.getWorkflows(DefaultWorkflowAccessor.java:225)
at com.comalatech.confluence.workflow.DefaultWorkflowAccessor.getWorkflows(DefaultWorkflowAccessor.java:303)
at com.comalatech.confluence.workflow.DefaultWorkflowAccessor.getPageWorkflows(DefaultWorkflowAccessor.java:290)
at com.comalatech.confluence.ipc.api.DefaultStateService.getCurrentState(DefaultStateService.java:118)
at com.comalatech.confluence.ipc.api.DefaultStateService.getCurrentState(DefaultStateService.java:53)
at com.brikit.comalaworkflowsservice.Comala.getPublishedVersion(Comala.java:74)
at com.brikit.core.confluence.Confluence.getPublishedComalaWorkflowsVersion(Confluence.java:3137)
at com.brikit.core.confluence.Confluence.getVersionForUserRequest(Confluence.java:3906)
at com.brikit.core.confluence.Confluence.getBodyAsString(Confluence.java:1717)
at com.brikit.core.confluence.MacroParser.extractMacros(MacroParser.java:160)
at com.brikit.core.confluence.MacroParser.extractMacros(MacroParser.java:151)
at com.brikit.core.confluence.MacroParser.extractMacros(MacroParser.java:144)
at com.brikit.themepress.model.PageWrapper.contentMacros(PageWrapper.java:623)
at com.brikit.themepress.model.PageWrapper.layers(PageWrapper.java:1348)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:995)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:998)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:998)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:998)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:998)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:998)
at com.brikit.themepress.model.PageWrapper.getLayerMacros(PageWrapper.java:998)
.... Infinite recursion here
Symptoms
The system becomes unresponsive. Thread dumps show HTTP threads hung waiting for the lock. Thread dumps do not show what threads owns the lock. After a heap dump investigation, it will be clear that lock owner is either dead or spending time in an unrelated piece of code.
The lock has been 'leaked' somehow. Threads are blocking on it, but it is never going to be unlocked. Once the system enters this state, it can only be healed by a restart.
Diagnosis
Environment
Confluence Server
Confluence Data Center
- Brikit Themepress 1.x through 2.2.x installed
Diagnostic Steps
Capture Thread Dumps while the issue is occurring
First, check the status of all thread states across the thread dumps, using the following:
$ grep -A1 '-exec-' confluence_threads.* | grep State | sort | uniq -c
If the result looks like this, with all threads in a WAITING (parking) state, you may have this issue and should follow the second grep below:
$ grep -A1 '-exec-' confluence_threads.* | grep State | sort | uniq -c 300 confluence_threads.1535641226.txt- java.lang.Thread.State: WAITING (parking) 300 confluence_threads.1535641241.txt- java.lang.Thread.State: WAITING (parking) 300 confluence_threads.1535641255.txt- java.lang.Thread.State: WAITING (parking) 300 confluence_threads.1535641270.txt- java.lang.Thread.State: WAITING (parking) 300 confluence_threads.1535641285.txt- java.lang.Thread.State: WAITING (parking) 300 confluence_threads.1535641301.txt- java.lang.Thread.State: WAITING (parking)
Second, grep each thread dump using the following method to look for the lock:
$ cat thread_dump.txt | egrep "(MemoryStore.put|MemoryStore.get)" -B 10
If the results match this example, you have likely run into this issue:
$ cat confluence_threads.1535641285.txt | egrep "(MemoryStore.put|MemoryStore.get)" -B 1 java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007f1e048cd408> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) at net.sf.ehcache.store.chm.SelectableConcurrentHashMap$Segment.put(SelectableConcurrentHashMap.java:641) at net.sf.ehcache.store.chm.SelectableConcurrentHashMap.put(SelectableConcurrentHashMap.java:383) at net.sf.ehcache.store.MemoryStore.put(MemoryStore.java:283) -- java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007f1e048cd408> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at net.sf.ehcache.store.chm.SelectableConcurrentHashMap$Segment.get(SelectableConcurrentHashMap.java:721) at net.sf.ehcache.store.chm.SelectableConcurrentHashMap.get(SelectableConcurrentHashMap.java:324) at net.sf.ehcache.store.MemoryStore.get(MemoryStore.java:344)
- For full confirmation you have run into this issue, contact Atlassian Support.
Cause
This happens when thread-X and thread-Y both lock the cache, then other threads try to lock the cache also and must for 30 seconds, then an exception occurs.
Multiple third party vendors have been identified with plugins that can run into this issue. Below are some known vendors which have fixes to close this issue:
The Brikit Themepress plugin 1.x through 2.2.x uses code which causes this performance issue to occur.
CelesteCS Math Plugin for Confluence prior to version 21.1.0 can run into this issue.
- https://marketplace.atlassian.com/apps/1212655/celestecs-math-for-confluence/version-history - Version history and update notes for the plugin
Workaround
Capture thread dumps and restart Confluence to recover the system.
- Disable the Brikit Themepress plugin until a fixed version is available
- If you are unable to disable the Brikit Themepress plugin via the user interface, you may need to delete the plugin from the database
- Contact Brikit Support for a patched version of Themepress, until a fixed version is available
Resolution
Brikit's bugs linked above note the issue should be resolved in Themepress 2.3.x.
Upgrading Themepress to the fixed version should prevent the issue from occurring again.