App metrics reference
App monitoring can give you a deeper insight into what apps are doing in your instance. This can be useful when troubleshooting issues with a specific app, or to help you determine whether an app may have contributed to a drop in overall performance or stability.
Learn how to set up app monitoring
Full list of app performance metrics
This is the full list of metrics exposed by the app monitoring agent, in addition to any JMX beans exposed by the application.
This metric indicates that information has been reindexed.
Action: Reindexing can degrade your site's performance. Ideally, you would reindex during off-peak times.

Measures how long a search request takes.
Action: If you notice an app is making a lot of searches, or consistently takes a long time to process search results, reach out to the app vendor.

Measures how long an app is taking to upgrade a part of the data it stores in the database. Upgrade tasks can happen when an app is updated or enabled. During this time the app's functionality will be unavailable, and the upgrade may temporarily increase load on the database and the node the task is running on.
Action: If an app stores a lot of data in the database, consider scheduling any updates for when Confluence is less busy.

Measures how long an Active Objects (AO) transaction takes.
Action: The transaction can contain many AO operations. The problem may be that there are too many operations, that a query is long running, or that the database is under load.

Measures how long an Active Objects (AO) operation (create, find, delete, deleteWithSQL, get, stream, count) takes. Each operation is reported with a name="<operation>" attribute, for example name="find".
Action: The operation's query may be long running, or the database is under load.

Measures how long a database cluster lock was held. Used by Confluence in a clustered environment.
Action: Lock contention can lead to performance degradation. It may be normal for a thread to hold on to a lock for a long time, if there aren't any threads waiting for the lock.

Measures how long a database cluster lock was waited for. Used by Confluence in a clustered environment.
Action: If many threads are waiting for the same lock, it can lead to performance degradation.

db.sal.transactionalExecutor
Measures how long a Shared Application Layer (SAL) transaction takes.
Action: The transaction can contain many SAL operations. There may be too many operations, a query may be long running, or the database may be under load.

web.resource.condition
Measures how long a web resource condition takes to determine whether a resource should be displayed or not.
Action: Slow web resource conditions can lead to slow page load times, especially if they are not cached.

plugin.disabled.counter
Measures how many times an app was disabled since the application started.
Action: Some caches are cleared when an app is disabled or enabled. This can have a performance impact. If this number increases, check UPM or the application logs to investigate which app is contributing to this number.

plugin.enabled.counter
Measures how many times an app was enabled since the application started.
Action: Some caches are cleared when an app is disabled or enabled. This can have a performance impact. If this number increases, check UPM or the application logs to investigate which app is contributing to this number.

soyTemplateRenderer
Measures how long a Soy Template web panel takes to render.
Action: The template renderer might be long running.

webTemplateRenderer
Measures how long an Atlassian Template web panel takes to render.
Action: The template renderer might be long running.

web.fragment.condition
Measures how long a web fragment condition takes to determine whether a web fragment should be displayed or not.
Action: Web fragment conditions determine whether a link or a section on a page should be displayed. Slow web fragment conditions lead to slow page load times, especially if they are not cached.

cacheManager.flushAll
Indicates that all caches are being flushed by an app. This operation should not be triggered by external apps and can lead to product slowdowns.
Action: Investigate which app is triggering the flush.

cache.removeAll
Indicates that a single cache has had all of its entries removed. This may or may not cause slowdowns in products or apps.
Action: Check how often these cache removals occur, and from which product.

cachedReference.reset
Indicates that a single entry in a cache has been reset. This may or may not cause slowdowns in products or apps.
Action: Check how often these cache resets occur, and from which product.

rest.request
Measures HTTP requests to REST APIs that use the atlassian-rest module.
Action: Check the frequency and duration of the REST requests.
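How each of these metrics appears in your monitoring tool depends on your exporter configuration, so treat the following as a rough sketch rather than exact query syntax. Assuming a timer metric exported using Prometheus histogram or summary conventions (the metric name confluence_app_operation_seconds and the invokerPluginKey label are placeholders, not the real exported names), the average duration per app could be charted like this:

# Hypothetical sketch only - substitute the metric and label names your exporter actually produces
sum by (invokerPluginKey) (rate(confluence_app_operation_seconds_sum[5m]))
/
sum by (invokerPluginKey) (rate(confluence_app_operation_seconds_count[5m]))

A query shaped like this makes it easier to spot a single app whose operations are suddenly slower or more frequent than its peers.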
Recommended alerts
Automated alerts help you identify issues early, without needing to wait for an end-user to bring problems to your attention. Most APM tools provide alerting capabilities.
The following alerts are based on our research into common issues with apps. We've used Prometheus and Grafana, but you may be able to adapt these rules for other APM tools.
To find out how to set up alerting in Prometheus, see Alerting overview in the Prometheus documentation.
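If you haven't used alerting rules before, the general shape is: the rules below live in a separate rules file that prometheus.yml points to, and fired alerts are forwarded to an Alertmanager. Here's a minimal sketch, assuming a locally running Alertmanager and a rules file named confluence-alerts.yml (both names are examples, not requirements):

# prometheus.yml (excerpt)
rule_files:
  - "confluence-alerts.yml"          # example file holding the alert rules shown below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # assumes Alertmanager runs locally on its default port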
Heap memory usage
Excessive Heap memory consumption often leads to out of memory errors (OOME). While fluctuations in Heap memory consumption are expected and normal, a consistent increase or failure to release this memory can lead to issues. We suggest creating an alert that is triggered when there is less than 10% free Heap memory left on a node for an amount of time, such as 2 minutes.
- alert: OutOfMemory
  expr: 100*(jvm_memory_bytes_used{area="heap"}/jvm_memory_bytes_max{area="heap"}) > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Out of memory (instance {{ $labels.instance }})
    description: "Memory is filling up (< 10% left)"
CPU utilisation
Consistently high CPU usage can be caused by numerous issues, such as processor-intensive jobs, inefficient code (loops), or too little memory.
We recommend creating an alert that is triggered when CPU load exceeds 80% for an amount of time, such as 2 minutes.
- alert: HighCpuLoad
  expr: java_lang_OperatingSystem_ProcessCpuLoad * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: High CPU load (instance {{ $labels.instance }})
    description: "CPU load is > 80%"
Full GC
Full garbage collection (GC) occurs when both young and old Heap generations are collected. This is time consuming and pauses the application. Full GC can happen for a number of reasons, but a sudden spike may happen when too many large objects are loaded into memory.
We recommend monitoring any significant increase in the number of full GCs. How you do this will vary depending on the type of Collector being used. For the G1 Garbage Collector (G1GC), monitor the java_lang_G1_Old_Generation_CollectionCount metric.
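As a starting point, an alert rule along these lines (a sketch that assumes G1GC and that the java_lang_G1_Old_Generation_CollectionCount counter mentioned above is being scraped) would fire whenever any full collection has happened in the last five minutes; tune the window and threshold to what's normal for your instance:

- alert: FullGC
  expr: increase(java_lang_G1_Old_Generation_CollectionCount[5m]) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Full GC occurred (instance {{ $labels.instance }})
    description: "One or more full garbage collections in the last 5 minutes"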
Blocked threads
A high number of blocked or stuck threads means there are fewer threads available to process requests. An increase in blocked threads could indicate a problem.
We recommend creating an alert that is triggered when the number of blocked threads exceeds 10%.
- alert: BlockedThreads
  expr: avg by(instance) (rate(jvm_threads_state{state="BLOCKED"}[5m])) * 100 > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blocked Threads (instance {{ $labels.instance }})
    description: "Blocked Threads are > 10%"
Database connection pool
The database connection pool should be tuned for the size of the instance (such as the number of users and plugins). It also needs to match what the database allows.
We recommend creating an alert that is triggered when the number of connections is consistently near the maximum for an amount of time.
Example alert:
- alert: DatabaseConnections
  expr: 100*(<domain>_BasicDataSource_NumActive{connectionpool="connections"}/<domain>_BasicDataSource_MaxTotal{connectionpool="connections"}) > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Database Connections (instance {{ $labels.instance }})
    description: "Database Connections are filling up (< 10% left)"
Reacting to alerts
Some issues are transient or may resolve themselves, while others could be a warning sign of major performance degradation.
When investigating the source of the problem, the app-specific metrics above can help. If it's clear from the metrics that one particular app is spending more time or calling an API more frequently, you could try disabling that app to see whether performance improves. If it's a critical app, raise a support ticket, and include any relevant data extracts from your monitoring with the support zip.