Out Going MailQueueService among other Services are not running on JIRA

Still need help?

The Atlassian Community is here for you.

Ask the community

Platform Notice: Server and Data Center Only - This article only applies to Atlassian products on the server and data center platforms.

Problem

The main symptom noticed by customers is that the outgoing mail function does not work, when the mail queue is flushed, the emails are sent without seeing any problems.

At many times also other Jira services including ones from 3rd party plugins might not be working or getting unusual long delays not consistent with the scheduled times for them.

No errors appear in the logs.


Diagnosis

Diagnostic Steps

We need to look into 3 places to diagnose this problem

  • Thread dumps: Thread dumps will show that all the Jira scheduler jobs are always busy all the time. Jira uses Caesium as a scheduler and Jira has only 4 Caesium threads that execute Jira services and looking at the thread dumps for what those threads are doing, can help us identify the issue. The dumps will show something like the below:

    "Caesium-1-1" #189 daemon prio=5 os_prio=0 tid=0x0000000039af1000 nid=0x19b0 runnable [0x000000006103e000]
       java.lang.Thread.State: RUNNABLE
            at sun.nio.ch.SocketDispatcher.read0(Native Method)
            at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:226)
            at org.ofbiz.core.entity.jdbc.SQLProcessor.executeUpdate(SQLProcessor.java:562)
            at org.ofbiz.core.entity.GenericDAO.singleInsert(GenericDAO.java:198)
            at org.ofbiz.core.entity.GenericDAO.insert(GenericDAO.java:171)
            at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.createValue(DefaultOfBizDelegator.java:296)
            at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.createValue(WrappingOfBizDelegator.java:172)
            at com.atlassian.jira.entity.EntityEngineImpl.createValue(EntityEngineImpl.java:29)
            at com.atlassian.jira.scheduler.OfBizRunDetailsDao.addRunDetails(OfBizRunDetailsDao.java:112)
            at com.atlassian.scheduler.core.AbstractSchedulerService.addRunDetails(AbstractSchedulerService.java:151)
            at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:98)
            at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:443)
            at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJob(CaesiumSchedulerService.java:438)
            at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:59)
            at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:34)
            at java.lang.Thread.run(Thread.java:748)

    As can be seen the Caesium thread is not doing much just check the database to get the next job to be executed.

    All 4 Caesium threads will show similar stacktraces, doing database checks looking for next job.

    In normal cases, Caesium threads should either be idle or busy executing Jira Service / job related code.


  • SQL Logs: We need to enable Jira SQL logging and check what kind of queries are being executed by the Caesium Jobs. The logs will show that the Caesium jobs are doing a lot of queries, both selects and updates on the clusteredjob table. The queries are very generic, mainly to allow the Caesium to figure out which Job to run next. The logs look like below:

    $ grep "Caesium" atlassian-jira-sql.log* |grep Caesium-1-2 |grep -iv Connection|grep c3458b07-7b07-49d8-b2a9-efed1509d27f
    atlassian-jira-sql.log:2018-11-04 21:21:48,422 Caesium-1-2 ServiceRunner    0ms "SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM clusteredjob WHERE JOB_ID='c3458b07-7b07-49d8-b2a9-efed1509d27f'"
    atlassian-jira-sql.log:2018-11-04 21:21:48,438 Caesium-1-2 ServiceRunner    16ms "UPDATE clusteredjob SET  VERSION = '659053' ,  NEXT_RUN = '1541362968422'  WHERE JOB_ID='c3458b07-7b07-49d8-b2a9-efed1509d27f' AND VERSION='659052'"
    atlassian-jira-sql.log:2018-11-04 21:21:48,438 Caesium-1-2 ServiceRunner    0ms "SELECT DISTINCT ID FROM rundetails WHERE JOB_ID='c3458b07-7b07-49o8-b2a9-efed87659d27f'"
    atlassian-jira-sql.log:2018-11-04 21:21:48,469 Caesium-1-2 ServiceRunner    31ms "INSERT INTO rundetails (ID, JOB_ID, START_TIME, RUN_DURATION, RUN_OUTCOME, INFO_MESSAGE) VALUES ('391194760', 'c3458b07-7b07-49d8-b2a9-efed1509d27f', '2018-11-04 21:21:48.422', '16', 'S', '')"
    ....
     $ grep "Caesium" atlassian-jira-sql.log* |grep Caesium-1-2 |grep -iv Connection|grep 47d35268-25c4-4655-bb8d-f321af5876cf
    atlassian-jira-sql.log.1:2018-11-04 21:10:31,014 Caesium-1-2 ServiceRunner    0ms "SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM clusteredjob WHERE JOB_ID='47d35268-25c4-4655-bb8d-f321af5876cf'"
    atlassian-jira-sql.log.1:2018-11-04 21:10:31,061 Caesium-1-2 ServiceRunner    47ms "UPDATE clusteredjob SET  VERSION = '1012145' ,  NEXT_RUN = '1541362451014'  WHERE JOB_ID='47d35268-25c4-4655-bb8d-f321aol876cf' AND VERSION='1012144'"
    atlassian-jira-sql.log.1:2018-11-04 21:10:31,061 Caesium-1-2 ServiceRunner    0ms "SELECT DISTINCT ID FROM rundetails WHERE JOB_ID='47d35268-25c4-4655-bb8d-f321af5876cf'"
    atlassian-jira-sql.log.1:2018-11-04 21:10:31,124 Caesium-1-2 ServiceRunner    63ms "INSERT INTO rundetails (ID, JOB_ID, START_TIME, RUN_DURATION, RUN_OUTCOME, INFO_MESSAGE) VALUES ('391187558', '47d35268-25c4-4655-bb8d-f321af5876cf', '2018-11-04 21:10:31.014', '47', 'S', '')"                                       
    ....
     $ grep "Caesium" atlassian-jira-sql.log* |grep Caesium-1-2 |grep -iv Connection|grep 24efa140-eb2d-456e-ab3a-56b6834821e5|head
    atlassian-jira-sql.log:2018-11-04 21:22:27,672 Caesium-1-2 ServiceRunner    0ms "SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM clusteredjob WHERE JOB_ID='24efa140-eb2d-456e-ab3a-56b6834821e5'"
    atlassian-jira-sql.log:2018-11-04 21:22:27,688 Caesium-1-2 ServiceRunner    16ms "UPDATE clusteredjob SET  VERSION = '538802' ,  NEXT_RUN = '1541363007521'  WHERE JOB_ID='24efa140-eb2d-456e-ab3a-56b6866821e5' AND VERSION='538801'"
    atlassian-jira-sql.log:2018-11-04 21:22:27,688 Caesium-1-2 ServiceRunner    0ms "SELECT DISTINCT ID FROM rundetails WHERE JOB_ID='24rty140-eb2d-456e-ab3a-56b6834821e5'"
    atlassian-jira-sql.log:2018-11-04 21:22:27,719 Caesium-1-2 ServiceRunner    31ms "INSERT INTO rundetails (ID, JOB_ID, START_TIME, RUN_DURATION, RUN_OUTCOME, INFO_MESSAGE) VALUES ('391195053', '24efa140-eb2d-456e-ab3a-56b6839821e5', '2018-11-04 21:22:27.672', '16', 'S', '')"
     Digging further into the logs you can see that there are hundreds of Jobs with strange Job ids that are being queried and updated by Caesium logged int the SQL logs:
    $ egrep "UPDATE.*JOB_ID=" atlassian-jira-sql.log* |awk -F"JOB_ID=" '{print $2}'|cut -d" " -f1 |sort -u |wc -l
    482
     $ egrep "UPDATE.*JOB_ID=" atlassian-jira-sql.log* |awk -F"JOB_ID=" '{print $2}'|cut -d" " -f1 |sort -u |head
    '5f1c18a1-8055-4b2a-ac03-9ccfeadce715'
    ...
    '02fd00d7-75d2-4f4b-bc38-acf9676fd1c9'
    
     $ egrep "UPDATE.*JOB_ID=" atlassian-jira-sql.log* |awk -F"JOB_ID=" '{print $2}'|cut -d" " -f1 |sort -u |tail
    'febb5da5-63ce-4a4a-b7cf-b95a2f660471'
    'ff3d1c63-03cc-4172-8830-9ac34869dd8b'
    ...
    'fff81787-e893-4dd9-92c1-04473466ecec'
    'PluginUpdateCheckJob-job'
     $

    At this point its clear that there is a group of Jobs that are starving the other Jobs and Services from being executed as the Caesium threads are always busy with those job.

    Also if the number of Jobs is big enough, it will cause a huge load on the Jira database, there will be thousands of selects and updates hitting the clusteredjob table.


  • Looking into the clusteredjob table: We need to check what kind of jobs are being executed by Caesium and where those are coming from, to do that, we need to query the table and see what we get:

    select * from clusteredjob;
    The table should contain records that look like this:
    id    job_id                               job_runner_key                                                   sched_type interval_millis first_run     cron_expression time_zone next_run      version parameters 
     ----- ------------------------------------ ---------------------------------------------------------------- ---------- --------------- ------------- --------------- --------- ------------- ------- ---------- 
     10607 5f1c18a1-8055-4b2a-ac03-9ccfeadce715 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1540807318676 NULL            NULL      1544091537860   43159 NULL      
     11709 6daa6545-f172-4536-88fc-1d755ac98f56 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1543582964397 NULL            NULL      1544091537935    7169 NULL      
     11811 9078d8a2-dbab-4dd9-9caa-578762415eca com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1543845066251 NULL            NULL      1544091537992    2808 NULL      
     11112 c382c1b0-57cf-4b62-a5a0-5704b82aebbe com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1542106455820 NULL            NULL      1544091538007   27807 NULL      
     11606 c650b072-231a-4cca-b253-d07992037b78 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1543572442972 NULL            NULL      1544091538012    7188 NULL      
     11911 c97ec842-5ebf-4819-8b83-8a11e2741b8e com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1543912925241 NULL            NULL      1544091538019    1679 NULL      
     12010 dd0f60da-d075-4f86-b8d6-225da617360f com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1543913901768 NULL            NULL      1544091538086    1664 NULL      
     10411 e23888dd-2ca6-48f0-8927-24991c2c0600 com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob I                    60000 1540454936120 NULL            NULL      1544091538117   49030 NULL      

    Those are records corresponding to the Bamboo PlanStatusUpdateJob that is part of the Jira integration built in functionality with Bamboo.

    This problem even happens if Jira is not integrated with Bamboo.

Cause

There is a known bug that is the root cause of this problem. Please check the bug report:  JRASERVER-66593 - Getting issue details... STATUS

The causes duplicate jobs to pile up in the Jira clusteredjob table.

Every time Jira is restarted a new job is inserted into the table, the Jobs have unique Job_Ids and thus even putting a unique constraint on that field will not help prevent this issue.

Workaround

For the time being, the only way to avoid this problem is to delete the duplicate Jobs from the clusteredjob table before starting up JIRA.

To do this follow the below steps:

  1. Stop JIRA.
  2. Execute the below query to remove the duplicate jobs from the clusteredjob table:

    Always back up your data before performing any modifications to the database. If possible, test any alter, insert, update, or delete SQL commands on a staging server first.

    delete from clusteredjob where job_runner_key = 'com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob';
  3. Start JIRA.

    Jira then will insert another record once it is started and then there will be only 1 Job in the table.

DescriptionOutGoing MailQueueService among other Services are not running on JIRA
ProductJIRA,
Jira Service Desk
Last modified on Jan 4, 2019

Was this helpful?

Yes
No
Provide feedback about this article
Powered by Confluence and Scroll Viewport.