Outgoing MailQueueService and other services are not running in Jira
Platform Notice: Data Center - This article applies to Atlassian products on the Data Center platform.
Note that this knowledge base article was created for the Data Center version of the product. Data Center knowledge base articles for non-Data Center-specific features may also work for Server versions of the product; however, they have not been tested. Support for Server* products ended on February 15, 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Problem
The main symptom customers notice is that the outgoing mail function does not work; however, when the mail queue is flushed manually, the emails are sent without any problems.
Often, other Jira services, including those from third-party plugins, are also not running or are running with unusually long delays that are inconsistent with their scheduled times.
No errors appear in the logs.
Diagnosis
Diagnostic Steps
We need to look into three places to diagnose this problem:
Thread dumps: Thread dumps will show that the Jira scheduler threads are busy all the time. Jira uses Caesium as its scheduler, and Jira has only four Caesium threads that execute Jira services and jobs. Looking at what those threads are doing in the thread dumps can help us identify the issue. The dumps will show something like the following:
"Caesium-1-1" #189 daemon prio=5 os_prio=0 tid=0x0000000039af1000 nid=0x19b0 runnable [0x000000006103e000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.SocketDispatcher.read0(Native Method) at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:226) at org.ofbiz.core.entity.jdbc.SQLProcessor.executeUpdate(SQLProcessor.java:562) at org.ofbiz.core.entity.GenericDAO.singleInsert(GenericDAO.java:198) at org.ofbiz.core.entity.GenericDAO.insert(GenericDAO.java:171) at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.createValue(DefaultOfBizDelegator.java:296) at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.createValue(WrappingOfBizDelegator.java:172) at com.atlassian.jira.entity.EntityEngineImpl.createValue(EntityEngineImpl.java:29) at com.atlassian.jira.scheduler.OfBizRunDetailsDao.addRunDetails(OfBizRunDetailsDao.java:112) at com.atlassian.scheduler.core.AbstractSchedulerService.addRunDetails(AbstractSchedulerService.java:151) at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:98) at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:443) at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJob(CaesiumSchedulerService.java:438) at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:59) at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:34) at java.lang.Thread.run(Thread.java:748)
As can be seen, the Caesium thread is not doing much work: it is just hitting the database to record run details and get the next job to be executed.
All four Caesium threads will show similar stack traces, performing database checks looking for the next job.
In normal cases, Caesium threads should either be idle or busy executing Jira service/job-related code.
SQL logs: We need to enable Jira SQL logging and check what kind of queries the Caesium threads are executing. The logs will show that the Caesium jobs are running a lot of queries, both selects and updates, on the clusteredjob table. The queries are very generic, mainly serving to let Caesium figure out which job to run next. The logs look like the following:
$ grep "Caesium" atlassian-jira-sql.log* |grep Caesium-1-2 |grep -iv Connection|grep c3458b07-7b07-49d8-b2a9-efed1509d27f atlassian-jira-sql.log:2018-11-04 21:21:48,422 Caesium-1-2 ServiceRunner 0ms "SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM clusteredjob WHERE JOB_ID='c3458b07-7b07-49d8-b2a9-efed1509d27f'" atlassian-jira-sql.log:2018-11-04 21:21:48,438 Caesium-1-2 ServiceRunner 16ms "UPDATE clusteredjob SET VERSION = '659053' , NEXT_RUN = '1541362968422' WHERE JOB_ID='c3458b07-7b07-49d8-b2a9-efed1509d27f' AND VERSION='659052'" atlassian-jira-sql.log:2018-11-04 21:21:48,438 Caesium-1-2 ServiceRunner 0ms "SELECT DISTINCT ID FROM rundetails WHERE JOB_ID='c3458b07-7b07-49o8-b2a9-efed87659d27f'" atlassian-jira-sql.log:2018-11-04 21:21:48,469 Caesium-1-2 ServiceRunner 31ms "INSERT INTO rundetails (ID, JOB_ID, START_TIME, RUN_DURATION, RUN_OUTCOME, INFO_MESSAGE) VALUES ('391194760', 'c3458b07-7b07-49d8-b2a9-efed1509d27f', '2018-11-04 21:21:48.422', '16', 'S', '')" .... $ grep "Caesium" atlassian-jira-sql.log* |grep Caesium-1-2 |grep -iv Connection|grep 47d35268-25c4-4655-bb8d-f321af5876cf atlassian-jira-sql.log.1:2018-11-04 21:10:31,014 Caesium-1-2 ServiceRunner 0ms "SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM clusteredjob WHERE JOB_ID='47d35268-25c4-4655-bb8d-f321af5876cf'" atlassian-jira-sql.log.1:2018-11-04 21:10:31,061 Caesium-1-2 ServiceRunner 47ms "UPDATE clusteredjob SET VERSION = '1012145' , NEXT_RUN = '1541362451014' WHERE JOB_ID='47d35268-25c4-4655-bb8d-f321aol876cf' AND VERSION='1012144'" atlassian-jira-sql.log.1:2018-11-04 21:10:31,061 Caesium-1-2 ServiceRunner 0ms "SELECT DISTINCT ID FROM rundetails WHERE JOB_ID='47d35268-25c4-4655-bb8d-f321af5876cf'" atlassian-jira-sql.log.1:2018-11-04 21:10:31,124 Caesium-1-2 ServiceRunner 63ms "INSERT INTO rundetails (ID, JOB_ID, START_TIME, RUN_DURATION, RUN_OUTCOME, INFO_MESSAGE) VALUES ('391187558', '47d35268-25c4-4655-bb8d-f321af5876cf', '2018-11-04 21:10:31.014', '47', 'S', '')" .... $ grep "Caesium" atlassian-jira-sql.log* |grep Caesium-1-2 |grep -iv Connection|grep 24efa140-eb2d-456e-ab3a-56b6834821e5|head atlassian-jira-sql.log:2018-11-04 21:22:27,672 Caesium-1-2 ServiceRunner 0ms "SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM clusteredjob WHERE JOB_ID='24efa140-eb2d-456e-ab3a-56b6834821e5'" atlassian-jira-sql.log:2018-11-04 21:22:27,688 Caesium-1-2 ServiceRunner 16ms "UPDATE clusteredjob SET VERSION = '538802' , NEXT_RUN = '1541363007521' WHERE JOB_ID='24efa140-eb2d-456e-ab3a-56b6866821e5' AND VERSION='538801'" atlassian-jira-sql.log:2018-11-04 21:22:27,688 Caesium-1-2 ServiceRunner 0ms "SELECT DISTINCT ID FROM rundetails WHERE JOB_ID='24rty140-eb2d-456e-ab3a-56b6834821e5'" atlassian-jira-sql.log:2018-11-04 21:22:27,719 Caesium-1-2 ServiceRunner 31ms "INSERT INTO rundetails (ID, JOB_ID, START_TIME, RUN_DURATION, RUN_OUTCOME, INFO_MESSAGE) VALUES ('391195053', '24efa140-eb2d-456e-ab3a-56b6839821e5', '2018-11-04 21:22:27.672', '16', 'S', '')"
$ egrep "UPDATE.*JOB_ID=" atlassian-jira-sql.log* |awk -F"JOB_ID=" '{print $2}'|cut -d" " -f1 |sort -u |wc -l 482 $ egrep "UPDATE.*JOB_ID=" atlassian-jira-sql.log* |awk -F"JOB_ID=" '{print $2}'|cut -d" " -f1 |sort -u |head '5f1c18a1-8055-4b2a-ac03-9ccfeadce715' ... '02fd00d7-75d2-4f4b-bc38-acf9676fd1c9' $ egrep "UPDATE.*JOB_ID=" atlassian-jira-sql.log* |awk -F"JOB_ID=" '{print $2}'|cut -d" " -f1 |sort -u |tail 'febb5da5-63ce-4a4a-b7cf-b95a2f660471' 'ff3d1c63-03cc-4172-8830-9ac34869dd8b' ... 'fff81787-e893-4dd9-92c1-04473466ecec' 'PluginUpdateCheckJob-job' $
At this point, it is clear that there is a group of jobs starving the other jobs and services, as the Caesium threads are always busy with those jobs.
If the number of jobs is big enough, it will also put a huge load on the Jira database: thousands of selects and updates will hit the clusteredjob table.
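To get a rough sense of this load, you can count how many job runs Caesium has recorded recently in the rundetails table (the same table the INSERT statements above write to). This is a minimal sketch assuming PostgreSQL-style date arithmetic; adjust the interval syntax for your database (the stack trace above, for instance, is from an Oracle-backed instance):
-- Count how many scheduler job runs were recorded in the last hour
-- (PostgreSQL syntax; adapt the date arithmetic for Oracle, SQL Server, etc.)
SELECT COUNT(*)
FROM rundetails
WHERE START_TIME > NOW() - INTERVAL '1 hour';
Counts in the thousands per hour are consistent with the runaway scheduling described in this article.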
The clusteredjob table: We need to check what kind of jobs are being executed by Caesium and where they are coming from. To do that, we query the table and see what we get:
select * from clusteredjob;
The table should contain records that look like this:
id    job_id                               job_runner_key                                                    sched_type interval_millis first_run     cron_expression time_zone next_run      version parameters
----- ------------------------------------ ----------------------------------------------------------------- ---------- --------------- ------------- --------------- --------- ------------- ------- ----------
10607 5f1c18a1-8055-4b2a-ac03-9ccfeadce715 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1540807318676 NULL            NULL      1544091537860 43159   NULL
11709 6daa6545-f172-4536-88fc-1d755ac98f56 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1543582964397 NULL            NULL      1544091537935 7169    NULL
11811 9078d8a2-dbab-4dd9-9caa-578762415eca com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1543845066251 NULL            NULL      1544091537992 2808    NULL
11112 c382c1b0-57cf-4b62-a5a0-5704b82aebbe com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1542106455820 NULL            NULL      1544091538007 27807   NULL
11606 c650b072-231a-4cca-b253-d07992037b78 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1543572442972 NULL            NULL      1544091538012 7188    NULL
11911 c97ec842-5ebf-4819-8b83-8a11e2741b8e com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1543912925241 NULL            NULL      1544091538019 1679    NULL
12010 dd0f60da-d075-4f86-b8d6-225da617360f com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1543913901768 NULL            NULL      1544091538086 1664    NULL
10411 e23888dd-2ca6-48f0-8927-24991c2c0600 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob  I          60000           1540454936120 NULL            NULL      1544091538117 49030   NULL
Those records correspond to the Bamboo PlanStatusUpdateJob, which is part of Jira's built-in integration functionality with Bamboo.
This problem happens even if Jira is not integrated with Bamboo.
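To confirm which job runner is flooding the table without reading the full output, you can aggregate the rows by job_runner_key. This is a simple sketch using only the columns shown above:
-- Count clustered jobs per runner key; a runaway runner will dominate the list
SELECT job_runner_key, COUNT(*) AS job_count
FROM clusteredjob
GROUP BY job_runner_key
ORDER BY job_count DESC;
If a single runner key such as PlanStatusUpdateJob accounts for hundreds of rows while most keys have only a few, you are hitting the scenario described in this article.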
Cause
There is a known bug that could be the root cause of this problem; please check the bug report JRASERVER-66593 - Duplicated "com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob" job created during each JIRA restart. It has been fixed in Jira 8.2.2.
The bug causes duplicate jobs to pile up in the Jira clusteredjob table.
Every time Jira is restarted, a new job is inserted into the table. The jobs have unique job IDs, so even putting a unique constraint on that field will not prevent this issue.
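Since one duplicate is inserted per restart, counting the affected rows gives a rough idea of how many restarts' worth of duplicates have accumulated. A simple check:
-- Roughly one row accumulates per Jira restart while the bug is present
SELECT COUNT(*)
FROM clusteredjob
WHERE job_runner_key = 'com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob';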
Workaround
For the time being, the only way to avoid this problem is to delete the duplicate jobs from the clusteredjob table before starting up Jira.
To do this, follow the steps below:
- Stop Jira.
- Execute the query below to remove the duplicate jobs from the clusteredjob table:
Always back up your data before performing any modifications to the database. If possible, test any alter, insert, update, or delete SQL commands on a staging server first.
delete from clusteredjob where job_runner_key = 'com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob';
- Start Jira.
Once Jira is started, it will insert a single new record, so there will be only one such job in the table.
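To verify the cleanup, you can re-check the table after Jira has started; only a single row for that runner key should remain:
-- After a clean restart, this should return exactly one row
SELECT id, job_id, job_runner_key, next_run
FROM clusteredjob
WHERE job_runner_key = 'com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob';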