Application metrics reference

App monitoring can give you a deeper insight into what apps are doing in your instance. This can be useful when troubleshooting issues with a specific app, or to help you determine whether an app may have contributed to a drop in overall performance or stability.

Learn how to set up app monitoring

Full list of app performance metrics

This is the full list of metrics that are exposed by the app monitoring agent. This is in addition to any JMX beans that are exposed by the application. 

db.ao.upgradetask

Measures how long an app is taking to upgrade a part of the data it stores in the database.

Upgrade tasks can happen when an app is updated or enabled. During this time the app functionality will be unavailable, and may temporarily increase load on the database and the node the upgrade task is running on.

Action

To reduce the impact of upgrade tasks, consider upgrading apps during off-peak hours. This is especially important for apps that store lots of data.

Upgrade tasks should not take more than a few minutes. If an upgrade task takes more than an hour, contact the app vendor.

Sample query

com_atlassian_bitbucket_metrics_Value
  {
   category00="db",
   category01="ao",
   name="upgradetask"
  }

db.ao.executeInTransaction

Measures how long a database transaction takes.

Action

Transactions should not take more than a few seconds. If a transaction takes longer than 10 minutes, consider contacting the vendor.

Sample query

com_atlassian_bitbucket_metrics_Value
  {
   category00="db",
   category01="ao",
   name="executeInTransaction"
  }

db.ao.entityManager

Measures the duration of various database operations on records (create, find, delete, deleteWithSQL, get, stream, count).

Action

If operations coming from an app are taking an abnormally long time (for example, more than 10 minutes), the query behind the operation may be long running, or the database may be under load. Contact the vendor to find out whether long-running queries are expected.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   category00="db",
   category01="ao",
   category02="entityManager"
  }


Can be filtered further by adding a name="<operation>" attribute, for example name="find".
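As a rough sketch, the filtered selector can be combined with a threshold in a Prometheus alerting rule. The alert name, the 5 minute window, and the threshold are illustrative only. The threshold also assumes the metric is reported in milliseconds, which you should verify against your own data before relying on it.

```yaml
- alert: SlowAoFindOperations
  # Assumption: values are reported in milliseconds, so 600000 is about 10 minutes.
  expr: com_atlassian_bitbucket_metrics_95thPercentile{category00="db", category01="ao", category02="entityManager", name="find"} > 600000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Slow AO find operations (instance {{ $labels.instance }})
    description: "95th percentile of find operations is above the assumed 10 minute threshold"
```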

cluster.lock.held.duration

Measures how long a database cluster lock was held. Used by Bitbucket in a clustered environment.

Action

Lock contention can lead to performance degradation. It may be normal for a thread to hold on to a lock for a long time, if there aren't any threads waiting for the lock.

See cluster.lock.waited.duration to find out if there are any threads waiting for the lock.

Sample query

com_atlassian_bitbucket_metrics_Value
  {
   category00="cluster",
   category01="lock",
   category02="held"
  }

cluster.lock.waited.duration

Measures how long a database cluster lock was waited for. Used by Bitbucket in a clustered environment.

Action

If many threads are waiting for the same lock, it can lead to performance degradation. Contact the vendor responsible to flag and investigate the issue.

Sample query

com_atlassian_bitbucket_metrics_Value
  {
   category00="cluster",
   category01="lock",
   category02="waited"
  }

db.sal.transactionalExecutor

Measures how long a Shared Application Layer (SAL) transaction takes, when executed inside the DefaultTransactionalExecutor.

Action

A transaction can contain many SAL operations. A slow transaction may mean that it contains too many operations, that a query is long running, or that the database is under load.

Sample query

com_atlassian_bitbucket_metrics_Value
  {
   category00="db",
   category01="sal",
   name="transactionalExecutor",
   statistic="active"
  }

web.resource.condition

Measures how long a web resource condition will take to determine whether a resource should be displayed or not.

Action

Slow web resource conditions can lead to slow page load times, especially if they are not cached. Reach out to the app vendor responsible to flag and investigate the issue.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   category00="web",
   category01="resource",
   name="condition"
  }

webTemplateRenderer

Measures how long a Soy template or Velocity template web panel takes to render.

Action

The template renderer might be long running. Contact the vendor responsible to find out whether long rendering times are expected.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   name="webTemplateRenderer",
   templateRenderer="velocity"
  }

web.fragment.condition

Measures how long a web fragment condition will take to determine whether a web fragment should be displayed or not.

Action

Web fragment conditions determine whether a link, a menu section, or a panel on a page should be displayed. Slow web fragment conditions lead to slow page load times, especially if they are not cached. Reach out to the app vendor responsible to flag and investigate the issue.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   category00="web", 
   category01="fragment", 
   name="condition"
  }

cacheManager.flushAll

Indicates that all caches are being flushed by an app. This operation should not be triggered by external apps and can lead to product slowdowns.

Action

Use the invokerPluginKey tag to determine which app invoked the flush. Reach out to the app vendor and flag this issue to them.

Additionally, the className tag refers to the implementation of CacheManager invoked and may be helpful.
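Because this operation should not be triggered by apps at all, one option is to alert on any new occurrence. The rule below is a minimal sketch in the style of the recommended alerts on this page. The alert name and the 1 hour window are illustrative, and the rule assumes the Count value only grows between restarts, so that increase() reflects new flushes.

```yaml
- alert: AppFlushedAllCaches
  # Assumption: Count only increases between restarts, so increase() shows new flushes.
  expr: sum by (instance, invokerPluginKey) (increase(com_atlassian_bitbucket_metrics_Count{category00="cacheManager", name="flushAll"}[1h])) > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: All caches flushed (instance {{ $labels.instance }})
    description: "App {{ $labels.invokerPluginKey }} flushed all caches in the last hour"
```

Grouping by invokerPluginKey keeps the tag available in the alert description, so the notification names the app to follow up on. Note that a series that first appears with a non-zero value does not register as an increase.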

Sample query

com_atlassian_bitbucket_metrics_Count
  {
    category00="cacheManager",
    name="flushAll"
  }

cache.removeAll

Indicates that a single cache has had all of its entries removed. This may or may not cause slowdowns in products or apps.

Action

Check how often these cache removals occur, and from which product. Use the pluginKeyAtCreation tag to determine which app created the cache.  

Additionally, the className tag refers to the implementation of Cache, which may be helpful. If the frequency is excessive, consider reaching out to the app vendor and flag this issue to them.
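To keep an eye on the frequency, a Prometheus alerting rule can total the removals per app over a time window. This is only a sketch: the alert name, the 1 hour window, and the threshold of 100 are placeholders, so choose values based on what is normal for your instance.

```yaml
- alert: FrequentCacheRemoveAll
  # Assumption: 100 removals per hour is a placeholder; pick a value from your own baseline.
  expr: sum by (instance, pluginKeyAtCreation) (increase(com_atlassian_bitbucket_metrics_Count{category00="cache", name="removeAll", invokerPluginKey!="undefined"}[1h])) > 100
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Frequent cache removals (instance {{ $labels.instance }})
    description: "Caches created by {{ $labels.pluginKeyAtCreation }} were emptied more than 100 times in the last hour"
```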

Sample query

com_atlassian_bitbucket_metrics_Count
  {
   category00="cache",
   name="removeAll",
   invokerPluginKey!="undefined"
  }

cachedReference.reset

Indicates that a single entry in a cache has been reset. This may or may not cause slowdowns in products or apps.

Action

Check how often these cache resets occur, and from which product. Use the pluginKeyAtCreation tag to determine which app created the cache. 

Additionally, the className tag refers to the implementation of CachedReference, which may be helpful. If the frequency is excessive, consider reaching out to the app vendor and flag this issue to them.

Sample query

com_atlassian_bitbucket_metrics_Count
  {
   category00="cachedReference",
   name="reset",
   invokerPluginKey!="undefined"
  }

http.rest.request

Measures HTTP requests of the REST APIs that use the atlassian-rest module.

Action

Check the frequency and duration of the REST requests.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   category00="http", 
   category01="rest", 
   name="request"
  }

http.sal.request

Measures HTTP requests of the given unique URL that uses the atlassian-sal module.

Action

Check the frequency and duration of the HTTP requests. If they are excessive or very slow, consider reaching out to the app vendor to flag the issue. To identify which URLs are causing the issue, you can also enable the optional URL tag by setting the following system variable: atlassian.metrics.optional.tags.http.sal.request=url

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   category00="http", 
   category01="sal", 
   name="request"
  }

longRunningTask

Measures how long the long running tasks are taking.

Action

Check the duration of the task. If it is taking too long, look for the taskClass and pluginKey to identify the source, then contact the app vendor to flag the issue.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   name="longRunningTask",
   taskName="myLongRunningTask"
  }

task

Measures how long a task in queue is taking. Generally used for email queues or specific short running task.

Action

Check the duration of the task. If it is taking too long, look for the queueName and pluginKey to identify the source, then contact the app vendor to flag the issue.

Sample query

com_atlassian_bitbucket_metrics_95thPercentile
  {
   name="task",
   taskName="myEmailQueue"
  }

plugin.enabled.counter / plugin.disabled.counter

Measures how many times apps have been enabled or disabled since the application started.

Action

Some caches are cleared when apps are enabled or disabled, which may affect performance. If you see high counts, check the UPM or application logs to find out which app is contributing to them.

Sample query

com_atlassian_bitbucket_metrics_Count
  {
   category00="plugin",
   category01="enabled",
   name="counter"
  }
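
One way to watch both counters together is a single Prometheus alerting rule that totals recent changes. The rule below is a sketch under two assumptions: that plugin.disabled.counter is exposed with category01="disabled" (inferred from the metric name, not confirmed here), and that 10 state changes per hour is a reasonable placeholder threshold.

```yaml
- alert: FrequentAppEnableDisable
  # Assumptions: the disabled counter uses category01="disabled",
  # and 10 changes per hour is a placeholder threshold.
  expr: sum by (instance) (increase(com_atlassian_bitbucket_metrics_Count{category00="plugin", category01=~"enabled|disabled", name="counter"}[1h])) > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Frequent app enable or disable events (instance {{ $labels.instance }})
    description: "Apps were enabled or disabled more than 10 times in the last hour"
```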

Recommended alerts

Automated alerts help you identify issues early, without needing to wait for an end-user to bring problems to your attention. Most APM tools provide alerting capabilities.

The following alerts are based on our research into common issues with apps. We've used Prometheus and Grafana, but you may be able to adapt these rules for other APM tools.

To find out how to set up alerting in Prometheus, see Alerting overview in the Prometheus documentation.

Heap memory usage

Excessive heap memory consumption often leads to out of memory errors (OOME). Fluctuations in heap memory consumption are expected and normal, but a consistent increase, or a failure to release memory, can lead to issues. We suggest creating an alert that is triggered when less than 10% of heap memory is free on a node for a period of time, such as 2 minutes.

  - alert: OutOfMemory
    expr: 100*(jvm_memory_bytes_used{area="heap"}/jvm_memory_bytes_max{area="heap"}) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Out of memory (instance {{ $labels.instance }})
      description: "Memory is filling up (< 10% left)"

CPU utilisation

Consistently high CPU usage can be caused by numerous issues such as process intensive jobs, inefficient code (loops), or too little memory.

We recommend creating an alert that is triggered when CPU load exceeds 80% for an amount of time, such as 2 minutes.

  - alert: HighCpuLoad
    expr: java_lang_OperatingSystem_ProcessCpuLoad * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: High CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%"

Full GC

Full garbage collection (GC) occurs when both young and old Heap generations are collected. This is time consuming and pauses the application. Full GC can happen for a number of reasons, but a sudden spike may happen when too many large objects are loaded into memory.

We recommend monitoring any significant increase in the number of full GCs. How you do this will vary depending on the type of Collector being used. For the G1 Garbage Collector (G1GC), monitor the java_lang_G1_Old_Generation_CollectionCount metric.
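For G1GC, a rule along the following lines could flag a sudden rise in full GCs. It is a sketch only: the alert name, the 15 minute window, and the threshold of 3 collections are assumptions that should be tuned to what is normal for your instance.

```yaml
- alert: FrequentFullGC
  # Assumption: more than 3 old generation collections in 15 minutes is unusual here.
  expr: increase(java_lang_G1_Old_Generation_CollectionCount[15m]) > 3
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Frequent full GC (instance {{ $labels.instance }})
    description: "More than 3 old generation collections in the last 15 minutes"
```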

Blocked threads

A high number of blocked or stuck threads means there are fewer threads available to process requests. An increase in blocked threads could indicate a problem.

We recommend creating an alert that is triggered when the number of blocked threads exceeds 10%.

  - alert: BlockedThreads
    expr: avg by(instance) (rate(jvm_threads_state{state="BLOCKED"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blocked Threads (instance {{ $labels.instance }})
      description: "Blocked Threads are > 10%"

Database connection pool

The database connection pool should be tuned for the size of the instance (such as the number of users and plugins). It also needs to match what the database allows.

We recommend creating an alert that is triggered when the number of connections is consistently near the maximum for an amount of time.

Example alert:

  - alert: DatabaseConnections
    expr: 100*(<domain>_BasicDataSource_NumActive{connectionpool="connections"}/<domain>_BasicDataSource_MaxTotal{connectionpool="connections"}) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Database Connections (instance {{ $labels.instance }})
      description: "Database Connections are filling up (< 10% left)"

Replace <domain> with the product metric domain, such as com_atlassian_bitbucket or com_atlassian_confluence.

Reacting to alerts

Some issues are transient or may resolve themselves, while others could be a warning sign of a major performance degradation.

When investigating the source of the problem, the app-specific metrics on this page can help. If it's clear from the metrics that one particular app is spending more time or calling an API more frequently, you could try disabling that app to see whether performance improves. If it's a critical app, raise a support ticket, and include any relevant data extracts from your monitoring with the support zip.

Last modified on Sep 23, 2022
