Troubleshooting slow/stuck notification issues in Jira/Service Management Server

Platform Notice: Server and Data Center Only - This article only applies to Atlassian products on the server and data center platforms.


Problem

With the introduction of batched notifications in Jira 8.x, and because Service Desk implements its own notification system, it has become more difficult to investigate why some notifications are sent with a long delay or never sent at all.

To troubleshoot this type of issue efficiently, it is important to first understand all the types of notifications that exist in Jira and Service Desk, how they work, and which services are involved in generating and sending them. This knowledge will help steer the investigation in the right direction and focus it on the most likely root causes.

Understanding what types of notifications exist in Jira and Service Desk and how they work

When Jira and Service Desk are running together, 3 types of notifications might come into play, depending on how Jira is configured and which projects are involved:

  • the Jira non-batched notifications
  • the Jira batched notifications
  • the Service Desk customer notifications

The Jira notifications:

  • are sent to:
    • any user working on Jira Core or Jira Software tickets
    • users acting as agents in Service Desk tickets
  • are configured through the Notification Schemes, via the page Project Settings > Notifications
  • can be:
    • either batched, if the setting ⚙ > System > Batching email notifications is enabled
    • or non-batched, if the setting ⚙ > System > Batching email notifications is disabled

The Service Desk customer notifications:

  • are sent to users acting as customers in Service Desk tickets
  • are configured through the Customer Notification configuration, via the page Project Settings > Customer Notifications
  • have their own batching mechanism, which is different from the one used by the Jira batched notifications. They are batched over a 1-minute period, which is hardcoded and cannot be configured

Regardless of the type of notification that is triggered, each notification ends up in the same Mail Queue:

  • this queue can be monitored in the page ⚙ > System > Mail Queue
  • this queue is automatically flushed by the Mail Queue Service every 1 minute by default
  • the Mail Queue Service sends the notification to the recipient via the SMTP mail server configured in ⚙ > System > Outgoing Mail

The diagram below shows how the 3 types of notifications work, including:

  • which services/jobs are involved (in green) 
  • which database tables are involved (in blue)

Preliminary troubleshooting steps

As you can see in the diagram above, Jira services, database tables, and a mail server are all involved in the process, from the moment an event triggers a notification until the notification email is actually sent.

There are mainly 4 areas where things can go wrong:

  • the connection between Jira and the SMTP server
  • the Mail Queue Service
  • the Jira Batched notification job
  • the Service Desk customer notification job

Before conducting a complex investigation involving log analysis, thread dumps, and database queries, it is important to first ask yourself some preliminary questions. These questions, summed up in the diagram below, will help you focus on the right area of the notification workflow:



Once you have gone through this diagram, move on to the section of this KB article that the diagram points you to.

Troubleshooting SMTP mail server issues


If the issue lies between the Jira application and the SMTP mail server, the page ⚙ > System > Mail Queue should show emails piling up in the error queue, or in the mail queue with a pink background color, as shown below:

There can be various reasons why the Mail Queue Service is unable to send the notification email via the SMTP server. These include, but are not limited to:

  • Network connectivity issue between the Jira server and the SMTP Mail server
  • Mail throttling configuration on the SMTP Mail Server side
  • SSL configuration issue on Jira side
  • Anti-virus or Firewall blocking traffic between the Jira server and the SMTP Mail Server

Look for errors in the <JIRA_HOME>/log/atlassian-jira-outgoing-mail.log file that match the ones listed below. Note that these errors are only relevant if they occur often in the logs and are thrown for most emails. If you only see them occasionally, they might be false positives:

com.atlassian.mail.MailException: com.sun.mail.smtp.SMTPSendFailedException: 421 4.4.2 Message submission rate for this client has exceeded the configured limit
java.net.SocketTimeoutException: Read timed out
Too many login attempts, please try again later
com.atlassian.mail.MailException: com.sun.mail.util.MailConnectException: Couldn't connect to host, port: smtp.office365.com, 587; timeout -1;
      nested exception is:
        java.net.ConnectException: Connection timed out (Connection timed out)
        at com.atlassian.mail.server.impl.SMTPMailServerImpl.sendWithMessageId(SMTPMailServerImpl.java:222) [atlassian-mail-5.0.0.jar:?]
com.atlassian.mail.MailException: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.0 Must issue a STARTTLS command first. 7sm25921297qkx.49 - gsmtp
	at com.atlassian.mail.server.impl.SMTPMailServerImpl.sendWithMessageId(SMTPMailServerImpl.java:225) [atlassian-mail-2.7.18.jar:?]
com.atlassian.mail.MailException: javax.mail.MessagingException: Could not convert socket to TLS;
      nested exception is:
    	java.io.IOException: Can't verify identity of server: <SERVER_NAME>
com.atlassian.mail.MailException: javax.mail.MessagingException: Could not convert socket to TLS;
      nested exception is:
    	javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target


If you see any of these errors thrown frequently in the logs, then one of the KB articles listed below might apply:
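Because these errors only matter when they are frequent, it can help to count them instead of scrolling through the log. The Python sketch below is one way to do that: it counts how many lines of atlassian-jira-outgoing-mail.log contain each error signature. The signature substrings are copied from the examples above; the function name and labels are hypothetical, and the "sent" count is only populated once outgoing mail debugging is enabled (see the logging steps at the end of this article).

```python
from collections import Counter
from typing import Iterable

# Substrings copied from the example errors above (not an exhaustive list).
# The labels are hypothetical names chosen for this sketch.
SIGNATURES = {
    "rate limit (421 4.4.2)": "421 4.4.2",
    "read timeout": "SocketTimeoutException: Read timed out",
    "too many login attempts": "Too many login attempts",
    "cannot connect to host": "Couldn't connect to host",
    "STARTTLS required": "Must issue a STARTTLS command first",
    "TLS identity check": "Can't verify identity of server",
    "certificate trust (PKIX)": "PKIX path building failed",
}
# Only written to the log once outgoing mail debugging is enabled.
SENT_MARKER = "was sent with Message-Id"


def summarize_mail_log(lines: Iterable[str]) -> Counter:
    """Count log lines matching each known error signature, plus successful sends."""
    counts: Counter = Counter()
    for line in lines:
        if SENT_MARKER in line:
            counts["sent"] += 1
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts
```

For example, open the log file and print `summarize_mail_log(log).most_common()`. If an error count is close to the number of emails Jira tried to send, that error is a good candidate; a handful of hits over several days probably is not.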

Troubleshooting Mail Queue Service issues

If the problem lies with the Mail Queue Service, you should see at least one of the symptoms listed below, indicating that the Mail Queue Service is stuck, running too slowly, or not scheduled on time:

  • A high number of emails are piling up in the mail queue in ⚙ > System > Mail Queue as shown in the image below:
  • The size of the mail queue is not decreasing automatically, or it's decreasing very slowly
  • The mail queue gets emptied when manually flushing it

There can be various reasons why the mail queue service is not running as expected. The most common root causes are listed below.

Root cause 1 - Mail Queue Service Configuration

By default, the Mail Queue Service is scheduled to run once per minute. However, an administrator may have changed this schedule to a longer interval, which directly increases the delay before notifications reach end users.

You can verify this by navigating to ⚙ > System > Advanced > Services and checking the "Schedule" value:

  • The default value is 0 * * * * ? (every minute)
  • If you observe a different cron expression than the default value, then it is possible that it was modified to run less frequently than every minute. In this case, try to change the expression to the default one and check if the issue is resolved
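If you need to check this schedule on many instances, a small heuristic can flag any schedule that does not fire every minute. The Python sketch below assumes the Quartz-style syntax shown above: six or seven space-separated fields (seconds, minutes, hours, day-of-month, month, day-of-week, optional year). It only recognizes the simple "every minute" shapes; anything else is reported as non-default so that a person can review it. The function name is hypothetical.

```python
def fires_every_minute(expr: str) -> bool:
    """Heuristic: does a Quartz-style cron expression fire once per minute?

    Assumes Quartz field order: seconds minutes hours day-of-month month
    day-of-week [year]. Only simple shapes are recognized; anything else
    returns False so that it gets reviewed by hand.
    """
    fields = expr.split()
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 Quartz fields, got {len(fields)}: {expr!r}")
    seconds, minutes, hours, day_of_month, month, day_of_week = fields[:6]
    year = fields[6] if len(fields) == 7 else "*"
    every = {"*", "*/1", "0/1"}  # equivalent ways to say "every minute"
    return (
        seconds.isdigit() and 0 <= int(seconds) <= 59  # one fixed second per minute
        and minutes in every
        and hours == "*"
        and day_of_month in {"*", "?"}
        and month == "*"
        and day_of_week in {"*", "?"}
        and year == "*"
    )
```

With this heuristic, the default `0 * * * * ?` is accepted, while `0 0/5 * * * ?` (every 5 minutes) or `0 0 * * * ?` (once per hour) are flagged.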

Root cause 2 - Too many Jira mail handlers taking too long to be executed

We have also seen issues where Incoming Mail Handlers take a long time to scan the mailboxes connected to Jira because these mailboxes hold a very large number of messages. This keeps the Caesium threads busy for longer than expected, which delays the execution of the Mail Queue Service.

To verify whether the Mail Queue Service is not running because of the execution time of the mail handlers, please refer to the KB article below:

Jira notifications piling up in the mail queue due to mail handlers using too much resource

Root cause 3 - Outgoing Mail Server configured with infinite timeout value

When the Outgoing Mail Server (SMTP mail server) is configured with a connection timeout set to 0, or with no connection timeout at all, the Mail Queue Service waits indefinitely for a connection from the SMTP server. If the connection becomes unstable while the Mail Queue Service is in the middle of sending a notification, the service might get stuck waiting forever. As a consequence, the Mail Queue Service is no longer scheduled every minute, and emails pile up in the mail queue indefinitely.

To verify if the SMTP server timeout value is preventing the mail queue service from running, please refer to the KB article below:

Jira notifications piling up in the mail queue due to SMTP server infinite timeout setting
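To see why a bounded timeout matters, and to test connectivity to the SMTP server independently of Jira, you can run a small probe from the Jira server. The Python sketch below is not how Jira itself connects (the stack traces above show that Jira uses JavaMail with the timeout configured in ⚙ > System > Outgoing Mail). It only illustrates that, with a finite timeout, an unresponsive server produces an error after a few seconds instead of blocking forever. The function name is hypothetical.

```python
import smtplib
import socket


def probe_smtp(host: str, port: int, timeout: float = 10.0) -> str:
    """Connect, read the SMTP greeting and send NOOP, all bounded by `timeout` seconds.

    Returns "ok" on success, otherwise a short description of the failure.
    """
    try:
        # The timeout applies to the TCP connect and to every later socket read,
        # so a server that accepts the connection but never answers cannot hang us.
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            code, _ = smtp.noop()
            return "ok" if code == 250 else f"unexpected NOOP reply {code}"
    except socket.timeout:
        # Typically raised when the TCP connect itself times out.
        return "timed out"
    except (smtplib.SMTPException, OSError) as exc:
        # smtplib reports a read timeout during the greeting as SMTPServerDisconnected;
        # refused connections and DNS failures also end up here.
        return f"failed: {exc}"
```

For example, `probe_smtp("smtp.example.com", 587, timeout=10)`, with the placeholder host and port replaced by the values from the Outgoing Mail page. A result that takes far longer than the timeout to come back, or never comes back, would itself point to a network or firewall problem.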

Root Cause 4 - Jira server hostname DNS resolution issue

If you observe the following behaviors:

  • the mail queue is automatically emptied very slowly
  • when manually flushing the queue, the mail queue is still emptied slowly
  • it takes a long time for each email to be sent (for example, 20-40 seconds), as shown in the <JIRA_HOME>/log/atlassian-jira-outgoing-mail.log file after outgoing mail logging and debugging have been enabled:

    grep 'was sent with Message-Id' atlassian-jira-outgoing-mail.log
    
    2020-07-27 04:53:41,256+0200 DEBUG [] Sending mailitem To='email1@test.com' Subject='Updated: (ABC-123)' From='null' FromName='Test User (Jira)' Cc='null' Bcc='null' ReplyTo='null' InReplyTo='null' MimeType='text/html' Encoding='UTF-8' Multipart='javax.mail.internet.MimeMultipart@24d56930' MessageId='null' ExcludeSubjectPrefix=false' anonymous    Mail Queue Service Message was sent with Message-Id <JIRA.428775.1594212067000.32599.1595818401102@Atlassian.JIRA>
    2020-07-27 04:54:01,456+0200 DEBUG [] Sending mailitem To='email2@test.com' Subject='Updated: (ABC-123)' From='null' FromName='Test User (Jira)' Cc='null' Bcc='null' ReplyTo='null' InReplyTo='null' MimeType='text/html' Encoding='UTF-8' Multipart='javax.mail.internet.MimeMultipart@3b62f1d5' MessageId='null' ExcludeSubjectPrefix=false' anonymous    Mail Queue Service Message was sent with Message-Id <JIRA.428775.1594212067000.32600.1595818441274@Atlassian.JIRA>
    2020-07-27 04:54:21,639+0200 DEBUG [] Sending mailitem To='email3@test.com' Subject='Updated: (ABC-123)' From='null' FromName='Test User (Jira)' Cc='null' Bcc='null' ReplyTo='null' InReplyTo='null' MimeType='text/html' Encoding='UTF-8' Multipart='javax.mail.internet.MimeMultipart@19310596' MessageId='null' ExcludeSubjectPrefix=false' anonymous    Mail Queue Service Message was sent with Message-Id <JIRA.428775.1594212067000.32601.1595818481472@Atlassian.JIRA>

In this case, one of the 2 knowledge base articles below might apply:

Jira notifications piling up in the mail queue due to IPv6 issues on JVM
Jira notifications piling up in the mail queue due to a server hostname resolution issue
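To measure the per-email delay described in the third bullet above without reading timestamps by hand, you can compute the gap between consecutive successful sends. The Python sketch below assumes the log line format shown above, with a leading timestamp such as 2020-07-27 04:53:41,256+0200, and only looks at lines containing "was sent with Message-Id". The function name is hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List

# Leading timestamp as in the excerpt above, e.g. "2020-07-27 04:53:41,256+0200".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4})")
SENT_MARKER = "was sent with Message-Id"


def send_gaps(lines: Iterable[str]) -> List[float]:
    """Return the seconds elapsed between consecutive successful sends."""
    times = []
    for line in lines:
        if SENT_MARKER not in line:
            continue
        match = TS_RE.match(line.strip())
        if match:
            # %f accepts the 3-digit milliseconds, %z accepts offsets such as +0200.
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f%z"))
    # Pair each send with the next one and measure the difference.
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
```

Applied to the three log lines above, this yields gaps of about 20.2 and 20.18 seconds. Consistently large gaps while the queue is being flushed match the symptom described in this section.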

Root Cause 5 - Group Filter subscriptions sent to a big list of users, or a high number of private filter subscriptions

The Mail Queue Service might frequently get stuck, as it is busy queueing and sending a large number of filter subscription emails.

To check if you are impacted by this situation, please check the following knowledge base article:

Jira notifications sent with a long delay due to group filter subscriptions or high number of private filter subscriptions

Root Cause 6 - A 3rd party add-on is taking up all the resources needed by the Mail Queue Service

The Mail Queue Service relies on Caesium threads to be scheduled every minute (by default). If any of the add-ons listed below is installed in your Jira instance, there is a chance that it constantly uses all the Caesium threads, preventing the Mail Queue Service (and any other Jira service) from running on schedule:

To verify if you are impacted by this scenario, please check the following knowledge base article:

Jira incoming mail and outgoing mail functionalities are not running due to an add-on taking all the Caesium resources
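When you review thread dumps for this scenario, a quick way to see what the Caesium threads are busy with is to list the top frames of every thread whose name starts with "Caesium-". The Python sketch below assumes a standard HotSpot (jstack-style) dump, in which each thread starts with a quoted name line followed by "at ..." frame lines, like the dump excerpt later in this article. JDK frames are skipped because they rarely identify the job or add-on involved. The function name and parsing rules are assumptions.

```python
import re
from typing import Dict, List, Optional, Tuple

# Thread header lines start with the quoted thread name, e.g. "Caesium-2-1" #20668 ...
HEADER_RE = re.compile(r'^"([^"]+)"')
# Stack frame lines look like: at package.Class.method(File.java:123)
FRAME_RE = re.compile(r"^\s*at\s+(\S+)")
# JDK frames rarely say which job is running, so they are skipped by default.
JDK_PREFIXES: Tuple[str, ...] = ("java.", "javax.", "sun.", "jdk.")


def caesium_frames(dump: str, depth: int = 3,
                   skip: Tuple[str, ...] = JDK_PREFIXES) -> Dict[str, List[str]]:
    """Map each Caesium thread name to its top `depth` non-JDK stack frames."""
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in dump.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            name = header.group(1)
            current = name if name.startswith("Caesium-") else None
            if current is not None:
                result[current] = []
            continue
        frame = FRAME_RE.match(line)
        if current is None or frame is None:
            continue
        location = frame.group(1)
        if not location.startswith(skip) and len(result[current]) < depth:
            result[current].append(location)
    return result
```

If, across several dumps, all Caesium threads show frames from the same add-on package, that add-on is a strong candidate for this root cause.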

Root Cause 7 - Too many instances of the Bamboo Service

If you are running Jira on a version lower than 8.2.2, then you might be impacted by the bug below:
JRASERVER-66593

Due to this bug, after every Jira restart, a new duplicate entry for the job com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob is added to the clusteredjob table. Since all scheduled jobs share the same resources (4 Caesium threads), once these duplicates clog the clusteredjob table, other jobs such as the Mail Queue Service may never get a chance to run.

Running the following SQL query will help validate this root cause:

SELECT COUNT(*) FROM clusteredjob
WHERE job_runner_key = 'com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob';

If this query returns a high count (many duplicate entries for this job), then you might be impacted by the bug listed above.
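The same check can be generalized to list every job runner key that has more than one scheduled entry. The Python sketch below runs such a GROUP BY query against an in-memory SQLite table that stands in for clusteredjob, reduced to the two columns the query needs (job_id, job_runner_key) and filled with made-up rows. Against a real Jira database you would run only the SQL string, adapted to your database's dialect. Some runner keys may legitimately own several jobs, so treat the output as a list of candidates rather than a diagnosis.

```python
import sqlite3
from typing import List, Tuple

# Generalization of the query above: every runner key with more than one scheduled job.
DUPLICATE_RUNNERS_SQL = """
SELECT job_runner_key, COUNT(*) AS entries
FROM clusteredjob
GROUP BY job_runner_key
HAVING COUNT(*) > 1
ORDER BY entries DESC
"""


def duplicate_runners(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (job_runner_key, entries) pairs for runner keys scheduled more than once."""
    return conn.execute(DUPLICATE_RUNNERS_SQL).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Hypothetical in-memory stand-in for clusteredjob with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clusteredjob (job_id TEXT, job_runner_key TEXT)")
    bamboo = "com.atlassian.jira.plugin.ext.bamboo.service.PlanStatusUpdateJob"
    rows = [(f"bamboo-job-{i}", bamboo) for i in range(5)]       # simulated duplicates
    rows.append(("other-job", "hypothetical.single.RunnerKey"))  # simulated normal job
    conn.executemany("INSERT INTO clusteredjob VALUES (?, ?)", rows)
    return conn
```

For example, `duplicate_runners(demo_connection())` reports only the simulated PlanStatusUpdateJob runner key, with its 5 entries.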

Troubleshooting Jira Notification issues (Batched and Non-Batched)


If you verified that no Jira notifications are sent, whether the batched notifications setting (in the page ⚙ > System > Batching Notifications) is enabled or not, then the entire Jira notification module is not working.

Only 1 root cause has been identified so far.

Root Cause - The Insight add-on is disabled due to an incorrect upgrade path

If the Jira notifications (batched and non-batched) stopped working after upgrading Jira to 8.16.0 or higher (or Service Management 4.16.0 or higher), you might be impacted by the issue described in the KB article below:

Jira Mail Notifications fail if Insight plugin is disabled

Troubleshooting Jira Batched Notification mail issues

If you verified that Jira notifications are properly sent after you disable batched notifications in the page ⚙ > System > Batching Notifications, then the jobs responsible for the Jira batched notifications are likely failing to execute, or running slowly.

There can be various reasons why the Jira batched notifications job is not running as expected. The most common root causes are listed below.

Root Cause 1 - Expected behavior

If you are observing a constant delay with the Jira notifications (for example: 30 min, 10 min, etc.), then this delay is expected if batched notifications are enabled in the page ⚙ > System > Batching email notifications. On this page, batched notifications can be configured to be sent at different frequencies, as often as every 2 minutes.

Please note that the delay observed with the reception of the notifications might be slightly longer than the frequency that was set. For example, if the frequency is set to 10 minutes, notifications will be sent about 12-13 minutes after the event happened. This is expected, due to the way batched notifications were implemented in the Jira code.

Root Cause 2 - MSSQL Database configuration issue in Jira

If you observe that Jira Batched Notifications are not sent at all and you are using an MSSQL database, you might be impacted by the issue described in the KB article below:

Batched notifications are not working when using MS SQL database

Root Cause 3 - Oracle Database driver incompatibility with Jira

If you observe that Jira Batched Notifications are not sent at all and if you are using an Oracle database, you might be impacted by the issue described in the KB article below:

Batch notifications are not working when using Oracle database

Root Cause 4 - Performance issue impacting the batched notification jobs when Jira is connected to an MSSQL or Oracle Database

If you observe that Batched notifications are intermittently sent with a long delay (for example 40 minutes, or several hours), you might be impacted by the performance bug below:

JRASERVER-71350

Root Cause 5 - Batched notification job does not recover from a database connectivity issue (Jira Server only - Does not apply to Jira Data Center)

If Jira experiences a temporary database connectivity issue (for example, during DB server maintenance or a restart), some scheduled jobs such as the Batched Notification job might never recover, and the scheduler will not try to execute them again.

In this case, you might be impacted by the bug listed below:

JRASERVER-62072

Root Cause 6 - Batched notification job does not recover from a database connectivity issue (Jira Data Center only - Does not apply to Jira Server)

When using Jira Data Center:

  • whenever a node from the cluster starts a job (for example, the batched notification job), a lock is set in the clusterlockstatus database table to prevent any other node from running the same job

  • when the node completes its job, this node is supposed to release the lock from the clusterlockstatus table, so that other nodes can run the job in the future

  • if the node fails to release the lock (for example, due to a database connection problem), the node does not try to unlock it again, and the job remains locked until the next restart of all the Jira nodes

If you are using Jira Data Center on a version lower than 8.3.0, then you might be impacted by the bug below:

JRASERVER-66597
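To spot a lock that was never released, as described above, you can export the rows of the clusterlockstatus table and look for locks that have been held by a node for a long time. The Python sketch below assumes that each exported row provides the lock name, the node holding the lock (empty when the lock is free) and the last update time in epoch milliseconds. These column meanings (LOCK_NAME, LOCKED_BY_NODE, UPDATE_TIME) are assumptions to verify against your schema, and the one-hour threshold is arbitrary.

```python
from typing import Iterable, List, NamedTuple, Optional


class LockRow(NamedTuple):
    """One exported row of clusterlockstatus (assumed columns)."""
    lock_name: str
    locked_by_node: Optional[str]  # assumed None or "" when the lock is free
    update_time_ms: int            # assumed epoch milliseconds


def stale_locks(rows: Iterable[LockRow], now_ms: int,
                max_age_ms: int = 3_600_000) -> List[LockRow]:
    """Return locks still held by a node whose last update is older than max_age_ms."""
    return [
        row for row in rows
        if row.locked_by_node and now_ms - row.update_time_ms > max_age_ms
    ]
```

A lock that is still held long after the related job should have finished is a candidate for the scenario described above. Confirm it against the logs before taking any action.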

Troubleshooting Service Desk Customer notification mail issues

If you verified that Jira notifications are sent as expected, but only Service Desk Customer Notifications are not sent (or are sent with a long delay), then the job responsible for customer notifications is likely failing to execute, or running slowly.

There can be various reasons why the customer notification job is not running as expected. The most common root causes are listed below.

Root Cause 1 - Customer notifications are delayed due to complex SLA configurations

The Service Management SLAs, Customer Notifications and Automations all share the same threads (SdOffThreadEventJobRunner:thread). Because of that, if the Service Desk project contains SLAs configured with a long list of goals using complex JQL queries, you might experience the following symptoms:

  • delays and inconsistencies with the SLA displayed in the tickets
  • delays with the triggering of Automations
  • delays with the Customer Notifications

This root cause is detailed in the KB article below:

Service Management customer notifications are sent with a long delay due to complex SLA configurations

Root Cause 2 - Customer notification job does not recover from a database connectivity issue (Jira Server only - Does not apply to Jira Data Center)

If Jira experiences a temporary database connectivity issue (for example, during DB server maintenance or a restart), some scheduled jobs such as the Customer Notification job might never recover, and the scheduler will not try to execute them again.

In this case, you might be impacted by the bug listed below:

JRASERVER-62072

Root Cause 3 - Customer notification job does not recover from a database connectivity issue (Jira Data Center only - Does not apply to Jira Server)

When using Jira Data Center:

  • whenever a node from the cluster starts a job (for example, the customer notification job), a lock is set in the clusterlockstatus database table to prevent any other node from running the same job

  • when the node completes its job, this node is supposed to release the lock from the clusterlockstatus table, so that other nodes can run the job in the future

  • if the node fails to release the lock (for example, due to a database connection problem), the node does not try to unlock it again, and the job remains locked until the next restart of all the Jira nodes

If you are using Jira Data Center on a version lower than 8.3.0, then you might be impacted by the bug below:

JRASERVER-66597

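Root Cause 4 - Customer notification job gets stuck when a comment contains a huge number of links

If a comment includes a huge number of links (for example, 100k), the job responsible for the Customer notifications might get completely stuck. In such a case: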

  • restarting the Jira application won't fix the issue
  • the following long running thread might be found while collecting thread dumps:

    "Caesium-2-1" #20668 daemon prio=5 tid=0x0000000004770000 nid=0x5b42 runnable [0x00007f740d93f000]
       java.lang.Thread.State: RUNNABLE
    	at java.lang.String.indexOf(String.java:1769)
    	at java.lang.String.indexOf(String.java:1718)
    	at org.apache.commons.lang.StringUtils.replaceEach(StringUtils.java:4075)
    	at org.apache.commons.lang.StringUtils.replaceEach(StringUtils.java:3868)
    	at com.atlassian.servicedesk.internal.feature.customer.request.IssueUrlConverterImpl.replaceIssueUrlsWithPortalRequestUrls(IssueUrlConverterImpl.java:69)
    	at com.atlassian.servicedesk.internal.feature.customer.request.CustomerTextRendererImpl.updateCustomerTextIntertal(CustomerTextRendererImpl.java:159)
    	at com.atlassian.servicedesk.internal.feature.customer.request.CustomerTextRendererImpl.updateEmailTextForCustomer(CustomerTextRendererImpl.java:154)
    	at com.atlassian.servicedesk.internal.notifications.render.StylingBodyFinaliserImpl.buildMultiPartHtmlEmailBody(StylingBodyFinaliserImpl.java:79)
    	at com.atlassian.servicedesk.internal.notifications.render.StylingBodyFinaliserImpl.buildMessageBodyForRecipient(StylingBodyFinaliserImpl.java:72)
    	at com.atlassian.servicedesk.internal.notifications.render.StylingBodyFinaliserImpl.lambda$buildHtmlBody$0(StylingBodyFinaliserImpl.java:55)
    	at com.atlassian.servicedesk.internal.notifications.render.StylingBodyFinaliserImpl$$Lambda$2582/1308621309.apply(Unknown Source)
    	at io.atlassian.fugue.Either$RightProjection.map(Either.java:872)
    	at io.atlassian.fugue.Either.map(Either.java:217)
    	at com.atlassian.servicedesk.internal.notifications.render.StylingBodyFinaliserImpl.buildHtmlBody(StylingBodyFinaliserImpl.java:55)

If you are using a Service Management version lower than 4.5.0, then you might be impacted by the bug listed below:

JSDSERVER-6516
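A single thread dump can catch a busy thread at an unlucky moment. A thread can be considered long running if it shows the same frame in several dumps taken some seconds apart. The Python sketch below takes a list of thread dump texts and returns the names of the threads that are inside IssueUrlConverterImpl.replaceIssueUrlsWithPortalRequestUrls, the frame from the dump shown above, in every one of them. It assumes the HotSpot dump format shown above and matches threads by name. The function names and the "present in every dump" rule are illustrative choices.

```python
import re
from typing import List, Optional, Set

# Thread header lines start with the quoted thread name, e.g. "Caesium-2-1" #20668 ...
HEADER_RE = re.compile(r'^"([^"]+)"')
SUSPECT_FRAME = "IssueUrlConverterImpl.replaceIssueUrlsWithPortalRequestUrls"


def threads_in_frame(dump: str, frame: str = SUSPECT_FRAME) -> Set[str]:
    """Names of the threads whose stack contains `frame` in a single thread dump."""
    names: Set[str] = set()
    current: Optional[str] = None
    for line in dump.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            current = header.group(1)
        elif current is not None and frame in line:
            names.add(current)
    return names


def stuck_in_all(dumps: List[str], frame: str = SUSPECT_FRAME) -> Set[str]:
    """Threads that show `frame` in every dump, i.e. likely stuck there."""
    if not dumps:
        return set()
    return set.intersection(*(threads_in_frame(d, frame) for d in dumps))
```

Thread names such as "Caesium-2-1" come from a fixed pool, so matching by name across dumps is usually meaningful. If in doubt, also compare the thread id (the "#20668" part of the header).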

Root Cause 5 - Customer notification job gets stuck when a ticket is shared with thousands of participants

If a Service Management request is shared with a high number of participants, the customer notification job might get stuck and stop sending any customer notifications. In such a case, restarting the Jira application won't fix the issue.

This behavior is caused by a Service Management bug tracked in the link below:
JSDSERVER-7346


Providing data to Atlassian Support

If you were not able to identify what is causing the notifications to be stuck or delayed, please reach out to Atlassian Support via this link.

To help the Atlassian support team investigate the issue faster, you can follow the steps below and attach all the collected data to the support ticket raised with Atlassian:

  1. Go to the page ⚙ > System > Logging and Profiling
  2. Click Enable for outgoing mail logging
  3. Underneath it, click Enable Debugging
  4. From the same page, click Configure logging level for another package
  5. Use com.atlassian.jira.service as the package name, and select "DEBUG" for the "Logging Level"
  6. Repeat the previous step for the packages com.atlassian.servicedesk.plugins.notifications and com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob
  7. Wait about 30 minutes to collect enough logs
  8. Generate a support zip, making sure to tick the Thread dumps option.
    1. To include the thread dumps when creating the support zip, go to ⚙ > System > Troubleshooting and support tools > Create Support zip > Customize Zip, tick the Thread dumps option, and click the Save button
    2. If you are using Jira Data Center, generate a support zip from each Jira node
  9. Collect the following screenshots:
    1. The page ⚙ > System > Outgoing Mail after clicking on the Edit button
    2. The page ⚙ > System > Mail Queue
    3. The page ⚙ > System > Batching email notifications
  10. Run the following SQL Query against the Jira Database:

    select * from rundetails where job_id in (
    'com.atlassian.jira.service.JiraService:10000',
    'com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJobSchedulerImpl',
    'sd.custom.notification.batch.send');
  11. Attach to the support ticket:
    1. The screenshots
    2. The support zip
    3. The result from the SQL query
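Before attaching the query result, you can also read it yourself. If the most recent recorded run of a job is far in the past, this points to the corresponding section of this article: the Mail Queue Service (JiraService:10000 in the query), the batched notification job, or the customer notification job. The Python sketch below summarizes exported rows and assumes each row provides job_id, start_time, run_duration (ms) and run_outcome, in that order. These column names are assumptions to verify against your rundetails table, and a sensible threshold depends on each job's schedule (for example, one minute by default for the Mail Queue Service).

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Assumed export format: (job_id, start_time, run_duration_ms, run_outcome)
Row = Tuple[str, datetime, int, str]


def latest_runs(rows: Iterable[Row]) -> Dict[str, Row]:
    """Keep only the most recent run for each job_id."""
    latest: Dict[str, Row] = {}
    for row in rows:
        job_id, start_time = row[0], row[1]
        if job_id not in latest or start_time > latest[job_id][1]:
            latest[job_id] = row
    return latest


def overdue_jobs(rows: Iterable[Row], now: datetime,
                 max_gap: timedelta) -> Dict[str, timedelta]:
    """Jobs whose latest recorded start is older than max_gap, with how long ago it was."""
    return {
        job_id: now - row[1]
        for job_id, row in latest_runs(rows).items()
        if now - row[1] > max_gap
    }
```

For example, with a 10-minute threshold, a customer notification job whose latest recorded start is 3 hours old would be reported, while a Mail Queue Service that ran a minute ago would not.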




Description

This page covers the steps needed to resolve the following issue: notification emails for all events are slow to reach users, sometimes taking hours to reach the intended recipient. No warnings or errors appear in the mail queue; emails leave the queue gradually, but slowly.

Product: Jira



Last modified on Jun 29, 2021
