Jira applications stall due to StackOverflowError exception
Problem
Jira applications can stall when a java.lang.StackOverflowError exception leaves internal data structures in an inconsistent state.
Symptoms include:
- Jira application crashes
- Jira application stalls
- A large number of threads waiting for a non-existent lock
- A large number of threads waiting for the DB pool; see the related KB: JIRA applications stalls due to lost lock in dbcp pool caused by StackOverflowError
The following appears in the atlassian-jira.log:
2017-11-28 08:21:19,815 http-nio-8080-exec-8 ERROR user 501x750485x1 13esc 10.0.16.1 /secure/AssignIssue.jspa [c.o.scriptrunner.runner.AbstractScriptListener] Script function failed on event: com.atlassian.jira.event.issue.IssueEvent, file: com.org.EpicLinkListener
java.lang.StackOverflowError
Note that the stack trace of the exception may vary depending on the underlying operation being performed in Jira.
Diagnosis
Environment
- The problem can be triggered by any bug in code running inside the JVM. The chance of hitting a StackOverflowError is higher if you run 3rd-party code or write your own code (for example, Groovy scripts in ScriptRunner), as illustrated in the sketch below.
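A typical trigger is code that re-enters itself without a termination condition, for example an event listener that performs an update which fires the same event again. The following self-contained Java sketch uses hypothetical names (not the Jira API) to show how such a loop exhausts the thread's stack:

import java.util.ArrayList;
import java.util.List;

public class RecursiveListenerDemo {

    interface Listener {
        void onIssueUpdated(String issueKey);
    }

    private static final List<Listener> LISTENERS = new ArrayList<>();

    // Hypothetical update routine: persists a change, then notifies listeners synchronously.
    static void updateIssue(String issueKey) {
        for (Listener listener : LISTENERS) {
            listener.onIssueUpdated(issueKey);
        }
    }

    public static void main(String[] args) {
        // Buggy listener: updating the issue from inside the handler fires the
        // event again, so the call chain never unwinds and the stack fills up.
        LISTENERS.add(issueKey -> updateIssue(issueKey));
        try {
            updateIssue("DEMO-1");
        } catch (StackOverflowError e) {
            System.err.println("StackOverflowError after unbounded listener recursion");
        }
    }
}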
Diagnostic Steps
If you check the logs, you can see a large number of StackOverflowErrors:
grep -ic 'StackOverflowError' atlassian-jira.log*
63
Cause
The JVM running Jira applications has hit a StackOverflowError triggered by the code. StackOverflowError is an asynchronous exception that can be thrown by the Java Virtual Machine whenever the computation in a thread requires a larger stack than is permitted. The Java Language Specification permits a StackOverflowError to be thrown synchronously by method invocation. This mechanism is a clean way to report that a stack overflow has occurred while preserving the JVM's integrity, but it doesn't provide a safe way for the application to recover from this situation. A stack overflow could occur in the middle of a sequence of modifications which, if not complete, could leave a data structure in an inconsistent state.
Quote from http://openjdk.java.net/jeps/270
For instance, when a StackOverflowError is thrown in a critical section of the java.util.concurrent.locks.ReentrantLock class, the lock status can be left in an inconsistent state, leading to potential deadlocks. The ReentrantLock class uses an instance of AbstractQueuedSynchronizer to implement its locking logic.
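As a rough illustration of why recovery is unsafe, the following Java sketch (a simplification, not Jira code) shows a thread overflowing the stack while holding a ReentrantLock. In this sketch the error is caught and the lock is released in the finally block, but if the overflow struck inside lock() or unlock() itself, the lock's internal queue could be left inconsistent and other threads would wait on it forever:

import java.util.concurrent.locks.ReentrantLock;

public class StackOverflowLockDemo {

    private static final ReentrantLock LOCK = new ReentrantLock();

    // Unbounded recursion: each frame consumes stack until the JVM throws StackOverflowError.
    static long recurse(long depth) {
        return recurse(depth + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            LOCK.lock();
            try {
                recurse(0); // throws StackOverflowError deep in the call chain
            } catch (StackOverflowError e) {
                System.err.println("caught: " + e);
            } finally {
                // If the StackOverflowError had been raised here instead (while the lock's
                // internal AbstractQueuedSynchronizer state was being updated), the lock
                // could remain held and every other thread would block on it indefinitely.
                LOCK.unlock();
            }
        });
        worker.start();
        worker.join();
        System.out.println("Lock still held after worker finished? " + LOCK.isLocked());
    }
}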
After a StackOverflowError, the Jira application will likely be in an unstable state, so it is essential to restart your Jira applications immediately.
Workaround
- Restart Jira
- Upgrade to Oracle JVM version 11
- There are some JVM-level fixes that should allow Jira to handle these exceptions more gracefully: https://blogs.oracle.com/poonam/stackoverflowerror-and-threads-waiting-for-reentrantreadwritelock
Resolution
- Fix the problem in the code that caused the StackOverflowError. The add-on vendor may need to be contacted about this.
Sometimes it may also be possible to identify the offending component by thoroughly examining the actual stack trace (the application vendor can help with that). While third-party applications are not supported by Atlassian, as a best effort, here is an example of a case where the issue was identified to be caused by the 3rd-party JQL function PreviousSprint:
1. The stack trace referenced a class "PreviousSprint":
at com.onresolve.jira.groovy.jql.plugins.PreviousSprint.getRelevantSprint(PreviousSprint.groovy:27)
at com.onresolve.jira.groovy.jql.plugins.PreviousSprint$getRelevantSprint.callCurrent(Unknown Source)
2. It was then identified that the JQL function "previousSprint" was used in a JQL filter on one of the newly created dashboards;
3. The JQL function previousSprint was then disabled, and once the board was reloaded, the issue was resolved.
Related
Threads may wait indefinitely for a lock when another thread encountered a StackOverflowError before it could execute unlock() on the ReadLock in its finally block.