Deadlocking in Jira Service Management when frequently updating the same issue
If you have an automated process that keeps updating the same issue many times, it might lead to deadlocks after you upgrade to Jira Service Management 4.3 or later. Read on to identify whether your Jira Service Management instance is affected.
Context
In Jira Service Management 4.3, we’ve fixed two issues to improve overall performance. One of the changes bounded the thread pools, limiting the number of concurrent threads.
Our tests have shown significant performance improvements across the whole of Jira Service Management. However, bounded thread pools can lead to problems in some cases.
Problem
We’ve noticed that a bounded thread pool can result in a deadlock in the following scenario:
- An instance has OffThreadExecution enabled
- An instance has an automated process that keeps updating the same issue (many times in one minute)
Diagnosis
To check if your Jira Service Management is affected, you can run the following query periodically during peak times:
PostgreSQL
select p.pkey, i.issuenum, issueid, count(*) count_updated
from (
SELECT g.issueid, g.created as date
FROM changegroup g -- all issue edits
UNION ALL
SELECT a.issueid, a.updated as date
FROM jiraaction a -- all comments
) as all_events
join jiraissue i on i.id = issueid
join project p on p.id = i.project
WHERE date > now() - interval '1 minute'
group by 1, 2, 3
order by 4 desc;
Oracle
select p.pkey, i.issuenum, issueid, count(*) count_updated
from (
SELECT g.issueid, g.created as ddate
FROM changegroup g -- all issue edits
UNION ALL
SELECT a.issueid, a.updated as ddate
FROM jiraaction a -- all comments
) all_events
join jiraissue i on i.id = issueid
join project p on p.id = i.project
WHERE ddate > CURRENT_DATE - interval '1' minute
group by p.pkey, i.issuenum, issueid
order by 4 desc;
If the query shows that your Jira Service Management is updating any issue many times per minute, your instance may be affected by this issue. Tests have shown that up to 60 updates per minute on a single issue shouldn’t be a problem.
A sudden spike in the number of updates for an issue, which exceeds the number of threads in the thread pool, might also result in a deadlock. Such a deadlock will be resolved eventually, but some issues might end up with a corrupted SLA.
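If you want to list only the issues above that rough threshold, you can add a HAVING clause to the diagnosis query. A minimal sketch for PostgreSQL, reusing the query above together with the 60-updates-per-minute figure from our tests:
select p.pkey, i.issuenum, issueid, count(*) count_updated
from (
SELECT g.issueid, g.created as date
FROM changegroup g -- all issue edits
UNION ALL
SELECT a.issueid, a.updated as date
FROM jiraaction a -- all comments
) as all_events
join jiraissue i on i.id = issueid
join project p on p.id = i.project
WHERE date > now() - interval '1 minute'
group by 1, 2, 3
having count(*) > 60 -- threshold taken from the tests mentioned above
order by 4 desc;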
Alternate Diagnosis
Another query that may indicate your Jira Service Management is affected by this issue is the following:
select *
from "AO_319474_MESSAGE"
where "CLAIMANT" = NULL and "CLAIM_COUNT" > 0;
If the query returns a small number of unclaimed events with a high claim count, your instance may be affected by this issue.
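To get a quick summary instead of the raw rows, you can aggregate the claim counts. A minimal sketch using the same table (PostgreSQL-style quoting):
-- Count the unclaimed events and the highest claim count among them
select count(*) as unclaimed_events,
       max("CLAIM_COUNT") as max_claim_count
from "AO_319474_MESSAGE"
where "CLAIMANT" is null and "CLAIM_COUNT" > 0;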
Solution
Jira Service Management 4.9 and above
In Jira Service Management 4.9, we've improved the reliability of SLA processing. These changes are hidden behind a feature flag, so if this problem occurs, enable the sd.internal.base.db.backed.completion.events feature flag by following the steps in this KB article: https://confluence.atlassian.com/jirakb/enable-dark-feature-in-jira-959286331.html
Jira Service Management 4.3 - 4.8
If you are on a Jira Service Management version between 4.3 and 4.8, you can fix this issue by making changes in the database.
Run the following query against your database to check if the sd.event.processing.async.thread.pool.count property exists:
select * from propertyentry where property_key='sd.event.processing.async.thread.pool.count';
Complete one of these steps, depending on whether you have this property or not.
If the property doesn’t exist, use these queries. Take into consideration that the default value is 5.
-- This gives the id to use in the next queries.
select max(id) + 1 from propertyentry;
insert into propertyentry(id, entity_name, entity_id, property_key, propertytype)
values (<id from previous query>, 'sd-internal-base-plugin', 1, 'sd.event.processing.async.thread.pool.count', 3);
insert into propertynumber values (<id from the first query>, <new pool size value>);
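For illustration, here is the same pair of inserts with hypothetical values filled in, assuming the max(id) + 1 query returned 10500 and you want a pool size of 3. Both numbers are examples only; note that the trailing 3 in the propertyentry insert is the property type, not the pool size.
-- Hypothetical: 10500 came from the max(id) + 1 query above
insert into propertyentry(id, entity_name, entity_id, property_key, propertytype)
values (10500, 'sd-internal-base-plugin', 1, 'sd.event.processing.async.thread.pool.count', 3);
-- Hypothetical: 3 here is the desired pool size
insert into propertynumber values (10500, 3);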
If the property exists, use this query.
update propertynumber set propertyvalue=<new pool size value> where id=<id present in the propertyentry table>;
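Whichever path you took, you can verify the stored value afterwards. A minimal verification sketch against the same property tables used above:
-- Show the property and its current numeric value
select e.id, e.property_key, n.propertyvalue
from propertyentry e
join propertynumber n on n.id = e.id
where e.property_key = 'sd.event.processing.async.thread.pool.count';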
Updating sd.event.processing.async.thread.pool.count to a value not greater than the number of available threads on a node should improve throughput. Any larger value will very likely not result in further performance improvements.
Additional steps
Increasing the OffThreadEventJobRunner pool to a large number can lead to one of the problems we were trying to solve in the first place, so you’ll need to increase the number of available database connections as well.
- JSDSERVER-5732
To increase available database connections, see Tuning database connections.