Application metrics reference
App monitoring gives you deeper insight into what apps are doing in your instance. This is useful when troubleshooting issues with a specific app, or when determining whether an app has contributed to a drop in overall performance or stability.
Learn how to set up app monitoring
Full list of app performance metrics
This is the full list of metrics that are exposed by the app monitoring agent. This is in addition to any JMX beans that are exposed by the application.
indexing.field.addIndex |
Measures how long it takes for a custom field indexer to index a single value of a particular custom field. Action Custom field indexers impact indexing and reindexing times. If a certain field indexer is taking too long, contact the app vendor to investigate. You can find out which app is responsible for the indexer using the To find a custom field's usages and detailed information using its custom field ID, see How to find any custom field's IDs. Sample query
|
indexing.field.isFieldIndexableForIssue |
Measures how long it takes for a custom field indexer to determine whether a field should be indexed or not. Action Custom field indexers impact indexing and reindexing times. If this determination is taking too long, contact the app vendor to investigate. You can find out which app is responsible for the indexer using the To find a custom field's usages and detailed information using its custom field ID, see How to find any custom field's IDs. Sample query
|
search.index |
Measures how long it takes for a Lucene index search to take place. Action If an app triggers searches that are abnormal in quantity or duration, contact the app vendor to investigate. You can find out which app is responsible for the index searches using the Sample query
|
issue.reindexing |
Measures how long it takes for an Issue to be reindexed. Action If an app triggers issue reindexing that is abnormal in quantity or duration, contact the app vendor to investigate. You can find out which app is responsible for the Issue reindexing using the Sample query
|
comment.reindexing |
Measures how long it takes for a Comment to be reindexed. Action If an app triggers comment reindexing that is abnormal in quantity or duration, contact the app vendor to investigate. You can find out which app is responsible for the comment reindexing using the Sample query
|
db.core.executionTime |
Measures how long it takes for a database query to be executed from when a SQL statement is provided to providing the results. This underpins all database operations in Jira, which means the Action If an app triggers SQL queries that are abnormal in quantity or duration, contact the app vendor to investigate. The There is an optional tag SQL that can be enabled, which can be used for debugging exactly what the database queries are. We don't recommend enabling this optional tag in production as it will lead to rapidly growing memory consumption. Sample query
|
|
Measures how long an app is taking to upgrade a part of the data it stores in the database. Upgrade tasks can happen when an app is updated or enabled. During this time the app's functionality will be unavailable, and the upgrade task may temporarily increase load on the database and on the node it is running on. Action To reduce the impact of upgrade tasks, consider upgrading apps during off-peak hours. This is especially important for apps that store lots of data. Upgrade tasks should not take more than a few minutes. If one takes more than an hour, contact the vendor. Sample query
|
|
Measures how long a database transaction takes. Action Transactions should not take more than a few seconds. If one takes longer than 10 minutes, consider contacting the vendor. Sample query
|
|
Measures the duration of various database operations on records (create, find, delete, deleteWithSQL, get, stream, count). Action If operations coming from an app are taking an abnormally long time (for example more than 10 minutes), this could mean the operation query might be long running, or the database is under load. Contact the vendor and investigate if long running queries are expected. Sample query
name="<operation>" attribute, for example name="find" . |
|
Measures how long a database cluster lock was held. Used by Jira in a clustered environment. Action Lock contention can lead to performance degradation. It may be normal for a thread to hold on to a lock for a long time, if there aren't any threads waiting for the lock. See Sample query
|
|
Measures how long a database cluster lock was waited for. Used by Jira in a clustered environment. Action If many threads are waiting for the same lock, it can lead to performance degradation. Contact the vendor responsible to flag and investigate the issue. Sample query
|
db.sal.transactionalExecutor |
Measures how long a Shared Application Layer (SAL) transaction takes, when executed inside the Action A transaction can contain many SAL operations. A slow transaction may mean that there are too many operations, that the query is long running, or that the database is under load. Sample query
|
web.resource.condition |
Measures how long a web resource condition takes to determine whether a resource should be displayed or not. Action Slow web resource conditions can lead to slow page load times, especially if they are not cached. Reach out to the app vendor responsible to flag and investigate. Sample query
|
webTemplateRenderer |
Measures how long a Soy template or Velocity template web panel takes to render. Action The template renderer might be long running. Contact the vendor responsible and investigate if long running queries are expected. Sample query
|
web.fragment.condition |
Measures how long a web fragment condition takes to determine whether a web fragment should be displayed or not. Action Web fragment conditions determine whether a link, a menu section, or a panel on a page should be displayed. Slow web fragment conditions lead to slow page load times, especially if they are not cached. Reach out to the app vendor responsible to flag and investigate. Sample query
|
cacheManager.flushAll |
Indicates that all caches are being flushed by an app. This operation should not be triggered by external apps and can lead to product slowdowns. Action Use the Additionally, the Sample query
|
cache.removeAll |
Indicates that a single cache has had all of its entries removed. This may or may not cause slowdowns in products or apps. Action Check how often these cache removals occur, and from which product. Use the Additionally, the Sample query
|
cachedReference.reset |
Indicates that a single entry in a cache has been reset. This may or may not cause slowdowns in products or apps. Action Check how often these cache resets occur, and from which product. Use the Additionally, the Sample query
|
rest.request |
Measures HTTP requests of the REST APIs that use the Action Check the frequency and duration of the REST requests. Sample query
|
|
Measures HTTP requests of the given unique URL that uses the Action Check the frequency and duration of the HTTP requests. If excessive or very slow, consider reaching out to the app vendor and flag this issue to them. You could also enable the optional Sample query
|
|
Measures how long long-running tasks are taking. Action Check the duration of the task and, if it’s taking too long, look for the Sample query
|
|
Measures how long a task in a queue is taking. Generally used for email queues or specific short-running tasks. Action Check the duration of the task and, if it’s taking too long, look for the Sample query
|
|
Measures how many times apps have been enabled or disabled since the instance started. Action Some caches are cleared when apps are disabled or enabled, which may have a performance impact. If you see high counts, check the UPM or application logs to investigate which app is contributing to them. Sample query
|
Recommended alerts
Automated alerts help you identify issues early, without needing to wait for an end-user to bring problems to your attention. Most APM tools provide alerting capabilities.
The following alerts are based on our research into common issues with apps. We've used Prometheus and Grafana, but you may be able to adapt these rules for other APM tools.
To find out how to set up alerting in Prometheus, see Alerting overview in the Prometheus documentation.
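Prometheus loads alerting rules from rule files, where each `- alert:` entry sits under a named group. The following is a minimal sketch of that wrapper, assuming a hypothetical file name, group name, and placeholder rule; the `up` metric is the standard per-target health series that Prometheus records for every scrape target.

```yaml
# jira-alerts.yml (hypothetical file name), referenced from prometheus.yml via:
#   rule_files:
#     - "jira-alerts.yml"
groups:
  - name: app-monitoring-alerts   # hypothetical group name
    rules:
      # Placeholder rule for illustration: fires when a scrape target is unreachable.
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Instance down (instance {{ $labels.instance }})
          description: "Prometheus could not scrape this target for 2 minutes"
      # Add further "- alert:" entries here, at the same indentation level.
```

You can validate a rule file before loading it with `promtool check rules jira-alerts.yml`.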
Heap memory usage
Excessive heap memory consumption often leads to out of memory errors (OOME). While fluctuations in heap memory consumption are expected and normal, a consistent increase, or a failure to release this memory, can lead to issues. We suggest creating an alert that is triggered when there is less than 10% free heap memory left on a node for an amount of time, such as 2 minutes.
- alert: OutOfMemory
  expr: 100*(jvm_memory_bytes_used{area="heap"}/jvm_memory_bytes_max{area="heap"}) > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Out of memory (instance {{ $labels.instance }})
    description: "Memory is filling up (< 10% left)"
CPU utilisation
Consistently high CPU usage can be caused by numerous issues, such as process-intensive jobs, inefficient code (loops), or too little memory.
We recommend creating an alert that is triggered when CPU load exceeds 80% for an amount of time, such as 2 minutes.
- alert: HighCpuLoad
  # ProcessCpuLoad is reported as a fraction between 0 and 1, so multiply by 100 to get a percentage.
  expr: (java_lang_OperatingSystem_ProcessCpuLoad) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: High CPU load (instance {{ $labels.instance }})
    description: "CPU load is > 80%"
Full GC
Full garbage collection (GC) occurs when both young and old Heap generations are collected. This is time consuming and pauses the application. Full GC can happen for a number of reasons, but a sudden spike may happen when too many large objects are loaded into memory.
We recommend monitoring any significant increase in the number of full GCs. How you do this will vary depending on the type of collector being used. For the G1 Garbage Collector (G1GC), monitor the java_lang_G1_Old_Generation_CollectionCount metric.
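As a rough sketch of what a full GC alert for G1GC could look like, the rule below fires when the collection count grows by more than a chosen amount within a time window. The alert name, the 5 minute window, and the threshold of 2 are assumptions for illustration only; tune them against the baseline you observe on your own nodes.

```yaml
- alert: FullGcIncrease   # hypothetical alert name
  # increase() estimates how much the collection count grew over the last 5 minutes.
  # The window and the threshold (2) are illustrative assumptions, not recommendations.
  expr: increase(java_lang_G1_Old_Generation_CollectionCount[5m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Full GC increase (instance {{ $labels.instance }})
    description: "More than 2 full GCs in the last 5 minutes"
```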
Blocked threads
A high number of blocked or stuck threads means there are fewer threads available to process requests. An increase in blocked threads could indicate a problem.
We recommend creating an alert that is triggered when the number of blocked threads exceeds 10%.
- alert: BlockedThreads
  expr: avg by(instance) (rate(jvm_threads_state{state="BLOCKED"}[5m])) * 100 > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blocked Threads (instance {{ $labels.instance }})
    description: "Blocked Threads are > 10%"
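If you would rather alert on the share of threads that are currently blocked, one possible variant divides the blocked thread count by the total thread count per instance. This is a sketch only: it assumes jvm_threads_state is exposed as a gauge with one series per thread state, and the alert name and the 2 minute duration are hypothetical choices.

```yaml
- alert: BlockedThreadsShare   # hypothetical alert name
  # Blocked threads as a percentage of all threads on each instance.
  expr: 100 * sum by (instance) (jvm_threads_state{state="BLOCKED"}) / sum by (instance) (jvm_threads_state) > 10
  for: 2m   # assumed duration, to ignore short spikes
  labels:
    severity: warning
  annotations:
    summary: Blocked threads share (instance {{ $labels.instance }})
    description: "More than 10% of threads are blocked"
```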
Database connection pool
The database connection pool should be tuned for the size of the instance (such as the number of users and plugins). It also needs to match what the database allows.
We recommend creating an alert that is triggered when the number of connections is consistently near the maximum for an amount of time.
Example alert:
- alert: DatabaseConnections
  expr: 100*(<domain>_BasicDataSource_NumActive{connectionpool="connections"}/<domain>_BasicDataSource_MaxTotal{connectionpool="connections"}) > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Database Connections (instance {{ $labels.instance }})
    description: "Database Connections are filling up (< 10% left)"
Replace <domain> with the product metric domain, such as com_atlassian_jira or com_atlassian_confluence.
Reacting to alerts
Some issues are transient or may resolve themselves, while others could be a warning sign of a major performance degradation.
When investigating the source of the problem, the app-specific metrics listed on this page can help. If it's clear from the metrics that one particular app is spending more time or calling an API more frequently than others, you could try disabling that app to see whether performance improves. If it's a critical app, raise a support ticket and include any relevant data extracts from your monitoring along with the support zip.