Deadlocking in Jira Service Management when frequently updating the same issue
If you have an automated process that keeps updating the same issue many times, it might lead to deadlocks after you upgrade to Jira Service Management 4.3 or later. Read on to identify whether your Jira Service Management instance is affected.
In Jira Service Management 4.3, we’ve fixed two issues to improve overall performance. One of the changes involved bounding the thread pools, which limits the number of concurrent threads.
Our tests have shown significant performance improvements across the whole of Jira Service Management. However, bounded thread pools can lead to problems in some cases.
We’ve noticed that a bounded thread pool can result in a deadlock in the following scenario:
An instance has an automated process that keeps updating the same issue (many times in one minute)
To check if your Jira Service Management is affected, run one of the following queries periodically during peak times, depending on your database. The first variant uses PostgreSQL-style syntax; the second avoids now() and positional group by references, for databases such as Oracle:

select p.pkey, i.issuenum, issueid, count(*) count_updated
from (
    select g.issueid, g.created as date from changegroup g  -- all issue edits
    union all
    select a.issueid, a.updated as date from jiraaction a   -- all comments
) as all_events
join jiraissue i on i.id = issueid
join project p on p.id = i.project
where date > now() - interval '1 minute'
group by 1, 2, 3
order by 4 desc;

select p.pkey, i.issuenum, issueid, count(*) count_updated
from (
    select g.issueid, g.created as ddate from changegroup g  -- all issue edits
    union all
    select a.issueid, a.updated as ddate from jiraaction a   -- all comments
) all_events
join jiraissue i on i.id = issueid
join project p on p.id = i.project
where ddate > CURRENT_DATE - interval '1' minute
group by p.pkey, i.issuenum, issueid
order by 4 desc;
If the query shows that your Jira Service Management is updating any issue many times per minute, your instance may be affected by this issue. Tests have shown that up to 60 updates per minute on a single issue shouldn’t be a problem.
A sudden spike in the number of updates for an issue, which exceeds the number of threads in the thread pool, might also result in a deadlock. Such a deadlock will be resolved eventually, but some issues might end up with a corrupted SLA.
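If you want the query to return only the issues that exceed the safe rate, you can add a having clause. This is a sketch based on the first variant of the query above; the 60-updates threshold comes from the tests mentioned earlier:

select p.pkey, i.issuenum, issueid, count(*) count_updated
from (
    select g.issueid, g.created as date from changegroup g  -- all issue edits
    union all
    select a.issueid, a.updated as date from jiraaction a   -- all comments
) as all_events
join jiraissue i on i.id = issueid
join project p on p.id = i.project
where date > now() - interval '1 minute'
group by 1, 2, 3
having count(*) > 60  -- updates per minute beyond the rate tests showed to be safe
order by 4 desc;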
Another query that may indicate your Jira Service Management is affected by this issue is the following:
select * from "AO_319474_MESSAGE" where "CLAIMANT" is null and "CLAIM_COUNT" > 0;
If the query returns a small number of unclaimed events with a high claim count, your instance may be affected by this issue.
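To gauge how severe the problem is, you can also look at the distribution of claim counts among unclaimed events. This is a diagnostic sketch against the same table:

select "CLAIM_COUNT", count(*) as message_count
from "AO_319474_MESSAGE"
where "CLAIMANT" is null
group by "CLAIM_COUNT"
order by "CLAIM_COUNT" desc;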
Jira Service Management 4.9 and above
In Jira Service Management 4.9, we've improved the reliability of SLA processing. These changes are hidden behind a feature flag, so if this problem occurs, enable the sd.internal.base.db.backed.completion.events feature flag by following the steps in this KB article: https://confluence.atlassian.com/jirakb/enable-dark-feature-in-jira-959286331.html
Jira Service Management 4.3 - 4.8
If you are on a Jira Service Management version between 4.3 and 4.8, you can fix this issue by making changes in the database.
Run the following query against your database to check if the sd.event.processing.async.thread.pool.count property exists:
select * from propertyentry where property_key = 'sd.event.processing.async.thread.pool.count';
Complete one of these steps, depending on whether you have this property or not.
If the property doesn’t exist, use these queries. Take into consideration that the default value is 5.
-- This gives the id to use in the next queries.
select max(id) + 1 from propertyentry;

insert into propertyentry (id, entity_name, entity_id, property_key, propertytype)
values (<id from previous query>, 'sd-internal-base-plugin', 1, 'sd.event.processing.async.thread.pool.count', 3);

insert into propertynumber
values (<id from the first query>, <new pool size value>);
If the property exists, use this query.
update propertynumber set propertyvalue=<new pool size value> where id=<id present in the propertyentry table>;
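Whichever path you took, you can verify the result by joining the two tables (this sketch assumes the property now exists):

select pe.id, pe.property_key, pn.propertyvalue
from propertyentry pe
join propertynumber pn on pn.id = pe.id
where pe.property_key = 'sd.event.processing.async.thread.pool.count';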
Setting sd.event.processing.async.thread.pool.count to a value not greater than the number of available threads on a node should improve throughput performance. Any larger value will very likely not result in further performance improvements.
Keep in mind that setting the number of threads available to OffThreadEventJobRunner to a large value can lead to one of the problems that we were trying to solve in the first place, so you’ll need to increase the number of available database connections as well.
To increase available database connections, see Tuning database connections.