Interpreting infrastructure metrics for in-product diagnostics
Infrastructure metrics for JMX monitoring and in-product diagnostics in Confluence Data Center help you monitor the health and performance of your site infrastructure.
Learn more about other Atlassian cross-product metrics for in-product diagnostics
Outgoing mail server connection state
The connection state metric for the outgoing mail server (SMTP) is created once the server is configured and removed when the server is deleted.
mail.outgoing.connection.state
mail.outgoing.connection.state.custom attempts to connect to the mail server and pings it with the NOOP or RSET SMTP command. This operation is performed once a minute. The metric reports the failed state when a connection can’t be established or the response to a command is invalid. During the measurement, the connection to the mail server times out after 10 seconds, and the metric reports the disconnected value.
Available custom metric values: connected (true or false), totalFailures (the sum of the false values since the restart).
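The check described above can be approximated with a plain socket. This is a minimal sketch, not the Confluence implementation: the class and method names are hypothetical, it skips TLS and authentication, and it treats any 2xx reply as healthy. Only the NOOP command and the 10-second timeout come from the description above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating an SMTP NOOP connectivity probe.
final class SmtpProbe {
    // 10-second limit, matching the timeout described for the metric.
    static final int TIMEOUT_MS = 10_000;

    /** True when the reply line starts with a three-digit 2xx code (for example 220 or 250). */
    static boolean isPositiveCompletion(String replyLine) {
        return replyLine != null
                && replyLine.length() >= 3
                && replyLine.charAt(0) == '2'
                && Character.isDigit(replyLine.charAt(1))
                && Character.isDigit(replyLine.charAt(2));
    }

    /** Reads one SMTP reply and returns its final line ("250-..." lines are continuations). */
    static String readReply(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() < 4 || line.charAt(3) != '-') {
                return line;
            }
        }
        return null; // the server closed the connection
    }

    /** Connects, waits for the greeting, sends NOOP, and reports whether both replies were 2xx. */
    static boolean isConnected(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), TIMEOUT_MS);
            socket.setSoTimeout(TIMEOUT_MS);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII);
            if (!isPositiveCompletion(readReply(in))) {
                return false; // no valid greeting
            }
            out.write("NOOP\r\n");
            out.flush();
            return isPositiveCompletion(readReply(in));
        } catch (IOException e) {
            return false; // refused, timed out, or broken connection
        }
    }
}
```

Calling `SmtpProbe.isConnected("smtp.example.com", 25)` (a placeholder host) returns false for an unreachable server, a timeout, or an invalid reply, which corresponds to the disconnected value.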
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |
Incoming mail server connection state
The connection state metric for each incoming mail server is created once it’s added to Confluence. Metrics are compatible with all types of mail servers and all authentication types.
mail.incoming.connection.state
mail.incoming.connection.state.custom attempts to connect to a remote server and perform a read-only open operation on the INBOX folder. This operation is performed once a minute.
Only one measuring process can run at a time to avoid overwriting results. Since the mail servers are remote and can have an individually defined connection timeout, during the measurement process, the timeouts are overridden to the standard maximum of 10 seconds.
For servers using basic authentication, the attempt to open a connection will fail if there is a problem. For servers using OAuth, a bad connection is only visible when trying to open the Inbox folder.
When an incoming mail server is deleted, its metric is unregistered too. Incoming mail servers are differentiated by using their names stored as metric tags. When changing the names of mail servers, note that this will create new metrics.
A sample MBean ObjectName for this metric will look as follows:
com.atlassian.confluence:type=metrics,category00=mail,category01=incoming,category02=connection,category03=state,name=custom,tag.serverName=<mailName>
The tag tag.serverName contains the name of your configured incoming mail server, but every space character is replaced with _. For example, the mail name Google mail box changes to Google_mail_box.
If you have two servers with the names Google mail box and Google_mail_box, there will be only one metric with the tag Google_mail_box.
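The naming rule and the resulting collision can be reproduced with the standard javax.management.ObjectName class. This is a sketch under stated assumptions: the class and method names are hypothetical, the prefix is copied from the sample ObjectName above, and only the documented space-to-underscore replacement is applied.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper that builds the ObjectName for an incoming mail server metric.
final class MailMetricNames {
    // Prefix copied from the sample ObjectName shown above.
    private static final String PREFIX =
            "com.atlassian.confluence:type=metrics,category00=mail,category01=incoming,"
            + "category02=connection,category03=state,name=custom,tag.serverName=";

    /** Replaces every space with an underscore, as described for tag.serverName. */
    static String tagValue(String mailServerName) {
        return mailServerName.replace(' ', '_');
    }

    /** Builds the ObjectName under which the metric for this mail server would be looked up. */
    static ObjectName objectNameFor(String mailServerName) {
        try {
            return new ObjectName(PREFIX + tagValue(mailServerName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(
                    "Name can't be used in an ObjectName: " + mailServerName, e);
        }
    }
}
```

With this sketch, `objectNameFor("Google mail box")` and `objectNameFor("Google_mail_box")` are equal, which is why two such servers end up sharing one metric.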
For OAuth, the authentication token is valid for about an hour. If there’s an issue with the authentication process or refreshing the token, you'll know about it once the currently active authentication token gets outdated.
Available custom metric values: connected (true or false), totalFailures.
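These values can be read over JMX with the standard javax.management API. The sketch below is generic and hypothetical: it queries every MBean that matches an ObjectName pattern and reads one attribute from each. It assumes the metric values are exposed as JMX attributes named connected and totalFailures, which you should confirm in a JMX browser such as JConsole.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical reader: one attribute value per MBean that matches the pattern.
final class JmxMetricReader {
    /**
     * Returns a map from canonical ObjectName to the attribute value.
     * The attribute name (for example "connected") is an assumption to verify on your instance.
     */
    static Map<String, Object> read(MBeanServerConnection connection, String pattern, String attribute) {
        Map<String, Object> values = new TreeMap<>();
        try {
            Set<ObjectName> names = connection.queryNames(new ObjectName(pattern), null);
            for (ObjectName name : names) {
                values.put(name.getCanonicalName(), connection.getAttribute(name, attribute));
            }
        } catch (JMException | IOException e) {
            throw new IllegalStateException("Could not read " + attribute + " for " + pattern, e);
        }
        return values;
    }
}
```

For example, a pattern such as `com.atlassian.confluence:type=metrics,category00=mail,category01=incoming,*` would match the metrics of all incoming mail servers, one entry per tag.serverName. The connection can come from `ManagementFactory.getPlatformMBeanServer()` inside the same JVM or from a remote JMX connector.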
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |
External user directories connectivity
Two factors for external user directories are measured:
Connection state: connected – true or false; totalFailures – the total number of failures
Latency: value – the current value in milliseconds; statistics
These values are measured for every type of user directory except for the internal user directory. This operation is performed once a minute. User directory metrics are differentiated by the directory names, which are stored as metric tags. If you change the name of a user directory, a new metric will be created. Swapping names between user directories will adversely affect the readability of metrics.
Disabling or removing a user directory will immediately remove its metrics from JMX.
Let’s assume you have two user directories: Example_UD_1 and Example_UD_2.
Example_UD_1 has an average latency of around 20 ms. Example_UD_2 has a latency of around 120 ms. Once you swap their names, from the metrics' perspective, you'll see that the latencies of Example_UD_2 have become significantly lower, while the latencies of Example_UD_1 have become higher.
We don't recommend swapping the names of user directories unless it's necessary.
user.directory.connection.state.custom
user.directory.connection.state.custom performs the same check as Test connection in the configuration under Administration > System > User directories. The check differs depending on the type of the user directory:
LDAP – checks the connection with the machine (connectivity only).
Internal with LDAP Authentication – checks the connection with the machine (connectivity only).
Active Directory – searches users (connectivity and authentication).
Crowd – searches users (connectivity and authentication).
Check the chart with a 2-hour outage on the external user directory.
Available custom metric values: connected (true or false), totalFailures (the sum of false values).
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |
user.directory.connection.latency
user.directory.connection.latency is measured by performing a user search on an uncached user directory. The query for users doesn’t specify any parameters or restrictions, and the maximum number of results is set to one.
Check the chart with the user.directory.connection.latency.value metric dropping to the -1 value.
Available metrics: value, statistics.
Warning markers | Signs of healthiness |
---|---|
The 50th percentile of the statistics is growing over time. This means that the median response time is higher than usual and users might experience delays when authenticating or being authorized. Example: [chart] | The 50th percentile is stable with a reasonable amount of latency. It can differ based on where your user directory is located. Example: [chart] |
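To make the warning marker concrete, the sketch below computes a nearest-rank 50th percentile from latency samples and compares a recent window against a baseline. The class name, the sample windows, and the growth factor are hypothetical. The statistics metric already exposes percentiles, so this only illustrates how a "growing median" might be detected by an external monitoring script.

```java
import java.util.Arrays;

// Hypothetical helper for spotting a growing median latency.
final class LatencyTrend {
    /** Nearest-rank percentile of the samples, for p in the range (0, 100]. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the recent median exceeds the baseline median multiplied by the given factor. */
    static boolean medianIsGrowing(double[] baseline, double[] recent, double factor) {
        return percentile(recent, 50) > percentile(baseline, 50) * factor;
    }
}
```

Because the median ignores a single outlier, a baseline of 20, 22, 19, 21, and 120 ms still has a 50th percentile of 21 ms. A recent window centered around 60 ms would be flagged with a factor of 1.5.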
Shared home write latency
For clustered Data Center instances, the time to write sample data to the shared home is measured. High shared home latency affects attachments, avatars, index snapshots, and other items stored there.
home.shared.write.latency
home.shared.write.latency measures the time in milliseconds to write a sample file on the shared home. A few measurements are taken every minute to improve metric accuracy.
home.shared.write.latency.value contains a calculated median latency from the last iteration.
home.shared.write.latency.statistics contains aggregated statistics from every individual measurement and should give better insight into the outliers and latency distribution.
Only one measuring process can run at a time. If three file writes take longer than the 15-second timeout, the .value metric is set to -1. In this case, the shared home can be considered unreachable.
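The measurement described above can be sketched as follows. This is an illustration under assumptions, not the Confluence code: the class name, file names, and payload are hypothetical, while the three writes, the median, the 15-second limit, and the -1 result follow the description.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical probe: median time of three sample writes, or -1 when the limit is breached.
final class SharedHomeWriteProbe {
    static final int WRITES = 3;
    static final long TIMEOUT_SECONDS = 15;
    static final long UNREACHABLE = -1;

    /** Median write time in milliseconds, or -1 on a timeout or an I/O failure. */
    static long medianWriteMillis(Path directory) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Long> task = () -> measure(directory);
        Future<Long> result = executor.submit(task);
        try {
            return result.get(TIMEOUT_SECONDS, TimeUnit.SECONDS);
        } catch (TimeoutException | ExecutionException e) {
            result.cancel(true);
            return UNREACHABLE;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return UNREACHABLE;
        } finally {
            executor.shutdownNow();
        }
    }

    // Writes a small sample file WRITES times and returns the median duration.
    private static long measure(Path directory) throws IOException {
        byte[] payload = "latency-probe".getBytes(StandardCharsets.UTF_8);
        long[] millis = new long[WRITES];
        for (int i = 0; i < WRITES; i++) {
            Path file = Files.createTempFile(directory, "probe", ".tmp");
            try {
                long start = System.nanoTime();
                Files.write(file, payload);
                millis[i] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            } finally {
                Files.deleteIfExists(file);
            }
        }
        Arrays.sort(millis);
        return millis[WRITES / 2];
    }
}
```

Running the measurement on a separate thread lets the caller give up after the time limit even if a write to an unresponsive network file system is still blocked.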
Check the chart with the healthy home.shared.write.latency.statistics metric.
Available metrics: value, statistics.
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |
Local home write latency
Local home write latency metrics measure the local disk write performance: home.local.write.latency.synthetic and home.local.write.latency.indexwriter.
High local disk latency significantly affects Confluence performance, especially when persisting the index.
home.local.write.latency.synthetic
home.local.write.latency.synthetic measures the time of a synthetic file write operation with a guarantee of persistence on the local home. A few measurements are taken every minute to improve metric accuracy.
home.local.write.latency.synthetic.value contains a calculated median latency from the last iteration. This value is reported in microseconds for better precision.
home.local.write.latency.synthetic.statistics contains aggregated statistics from every individual measurement and should give better insight into the outliers and the latency distribution.
If it takes more than five seconds to make seven writes, the local disk is considered unreachable. This should never happen, as the Confluence instance would become unusable.
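A write "with a guarantee of persistence" usually means the data is forced to the storage device before the timer stops. The sketch below times a single such write in microseconds by using FileChannel.force. The class name, file name, and 4 KB payload are assumptions, and the sketch doesn't implement the seven-write iteration or the five-second limit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.TimeUnit;

// Hypothetical timer for one durable (forced) write.
final class DurableWriteTimer {
    /** Writes a small payload, forces it to the storage device, and returns the elapsed microseconds. */
    static long timeDurableWriteMicros(Path directory) {
        try {
            Path file = Files.createTempFile(directory, "synthetic", ".tmp");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                ByteBuffer payload = ByteBuffer.wrap(new byte[4096]); // payload size is an assumption
                long start = System.nanoTime();
                while (payload.hasRemaining()) {
                    channel.write(payload);
                }
                channel.force(true); // flush file content and metadata to the device
                return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the force call, the write may only reach the operating system's page cache, so the measured time would say little about the disk itself.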
Check the chart with the healthy home.local.write.latency.synthetic.value metric with a latency lower than 2 ms.
Available metrics: value, statistics.
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |
home.local.write.latency.indexwriter
home.local.write.latency.indexwriter reports the time of flushing the index buffer to the local disk. The metric is based on the real traffic and represents the current status and performance of the index subsystem. This metric is updated only when the Lucene buffer is persisted, usually after index updates.
home.local.write.latency.indexwriter.statistics contains aggregated statistics from the measurements.
These metric values don’t reflect pure disk performance. The reported time is highly related to the volume of updated documents and may sporadically report high latency times unrelated to the disk performance.
Check the chart with the healthy home.local.write.latency.indexwriter.statistics metric.
Available metrics: statistics.
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |
Internode latency
node.latency.statistics measures the time in milliseconds to send a ping message to other nodes through Hazelcast. This can be interpreted as the internode communication latency. This metric is unregistered from JMX when the connection status is set to disconnected and appears again when the latency can be measured.
Check the chart of the internode latency during a Confluence redeployment to new machines.
The node.connection.state.custom metric reports the current connection status to other nodes in a cluster. If there is a response to a ping message, the status is reported as connected. If the ping message couldn't reach the other node due to a network connection issue, the status is reported as disconnected.
Check the chart of the internode connection state during a Confluence redeployment.
Both metrics are created for each node in the cluster and include the tag tag.destNode=<nodeId>.
Confluence can’t distinguish whether a node has crashed, lost connection, or was shut down. Whenever a node is unreachable, the metric reports the disconnected status for at most 15 minutes. After this time, the metric isn't reported anymore.
Metric | Warning markers | Signs of healthiness |
---|---|---|
connected | disconnected state for any node | All nodes are in the connected state. |
node.latency.statistics | Latency higher than 10 ms affects the node cache replication mechanism. Users may find inconsistent data on different nodes. | Latency lower than 10 ms indicates that Confluence can quickly replicate changes between nodes. |
Example | [chart] | [chart] |
Synchrony connectivity
Synchrony connectivity metrics require the Synchrony Interop Bootstrap plugin to be enabled. The plugin is enabled by default in Confluence Data Center.
The Synchrony connectivity metrics display the connectivity state to the Confluence-managed Synchrony application or the Standalone Synchrony clustered application, depending on the configuration.
A scheduled job automatically checks the Synchrony health check endpoint at <synchrony-url>/synchrony/heartbeat and records whether the response status was successful or not. This operation is performed once a minute. You can find out more about the Synchrony statuses in How to check the status of Synchrony for Confluence Data Center.
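The same endpoint can be probed manually with the HTTP client that ships with Java 11 and later. This is a sketch under assumptions: the class name is hypothetical, "successful" is taken to mean any 2xx status, and the 10-second timeouts are illustrative values, not documented Synchrony settings.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical probe for the Synchrony heartbeat endpoint.
final class SynchronyHeartbeat {
    /** Appends /synchrony/heartbeat to the base URL, tolerating a trailing slash. */
    static URI heartbeatUri(String synchronyUrl) {
        String base = synchronyUrl.endsWith("/")
                ? synchronyUrl.substring(0, synchronyUrl.length() - 1)
                : synchronyUrl;
        return URI.create(base + "/synchrony/heartbeat");
    }

    /** Assumption: any 2xx status counts as a successful heartbeat. */
    static boolean isSuccessful(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Sends one GET request and reports whether the heartbeat was successful. */
    static boolean isConnected(String synchronyUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // illustrative timeout
                .build();
        HttpRequest request = HttpRequest.newBuilder(heartbeatUri(synchronyUrl))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        try {
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            return isSuccessful(response.statusCode());
        } catch (IOException e) {
            return false; // unreachable, refused, or timed out
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

For example, `SynchronyHeartbeat.isConnected("http://synchrony.example.com:8091")` (a placeholder URL) returns false whenever the endpoint is unreachable or answers with a non-2xx status.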
The synchrony.connection.state.custom metric performs the same check as the Status option that you can find in the Collaborative editing configuration under Synchrony monitoring and configuration.
Check the chart of a successful connection to Synchrony for every node in a Confluence cluster.
Available custom metric values:
connected (true or false) reports the connectivity state to the Synchrony application.
totalFailures is a cumulative number of disconnected state outcomes. It’s reset after you turn off JMX or the Synchrony Interop Bootstrap plugin.
Warning markers | Signs of healthiness |
---|---|
Example: [chart] | Example: [chart] |