Problem management

What is problem management?

The goal of problem management is to reduce the impact of incidents that are caused by problems with IT infrastructure, and prevent the incidents from happening again. Problem investigations prioritize problems that have the greatest potential for causing serious disruption to critical IT services. 

Whereas incident management aims to find the shortest path to restoring normal service, problem management aims to find the underlying causes of an incident, and the best way to resolve and prevent them. When incidents occur, the incident response team restores normal service as quickly as possible, without necessarily identifying or resolving the underlying cause of the incident. If incidents occur rarely or have little impact, assigning resources to perform root cause analysis is difficult to justify. However, if a major service outage incident or a series of repeated incidents cause significant impact, the problem management team investigates the underlying cause of the incidents, and identifies the best method to eliminate the root cause.
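The trigger described above (a major outage, or a series of repeated incidents) can be sketched in a few lines of Python. This is only an illustration: the `Incident` type, its `category` and `major` fields, and the repeat threshold of 3 are assumptions made for the example, not part of ITIL or Jira Service Desk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    key: str            # hypothetical incident identifier
    category: str       # hypothetical grouping, such as the affected service
    major: bool = False  # True for a major service outage


def problem_candidates(incidents, repeat_threshold=3):
    """Return the categories that may justify a problem investigation.

    A category qualifies if it has at least `repeat_threshold` incidents
    or at least one major incident. The threshold is an assumed value.
    """
    counts = Counter(i.category for i in incidents)
    majors = {i.category for i in incidents if i.major}
    return sorted(
        category
        for category, count in counts.items()
        if count >= repeat_threshold or category in majors
    )


incidents = [
    Incident("INC-1", "email"),
    Incident("INC-2", "email"),
    Incident("INC-3", "email"),
    Incident("INC-4", "vpn"),
    Incident("INC-5", "payroll", major=True),
]
print(problem_candidates(incidents))  # ['email', 'payroll']
```

The single `vpn` incident is rare and minor, so it does not qualify, which mirrors the point that root cause analysis is hard to justify for isolated, low-impact incidents.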

Benefits of problem management

Problem management provides a logical extension of incident management for improving the overall quality of IT services. Some of the benefits of a formal approach to problem management include:

  • Improved quality of IT service and reliable services for the business.
  • Reduced incident volume and minimized service interruptions for the business.
  • Improved sharing of knowledge across the IT organization as the team learns from past mistakes and issues.
  • Valuable reporting and analytic insights from historical data that help the team identify trends and the means of preventing failures or reducing the impact of failures.
  • Permanent solutions that help reduce the number and impact of incidents over the long term.
  • Improved first-touch resolution by the service desk team because they have access to lessons learned, known errors, and workarounds documented in a central knowledge base.

Problem management process

This process shows a sample problem investigation based on ITIL recommendations. You can adapt it to suit your other ITIL and support processes.

Problem management uses analysis to identify the cause of the problem. Thus, it takes longer than incident management, and should only be done when the incident is no longer urgent. A problem investigation is often a direct result of a Post Incident Review (PIR) where the IT team needs to know the root cause for a recurring service outage. Problem management can take time, so it's important to set time limits to keep the cost of resolution low.

The first step of a problem investigation is to diagnose the problem, validate workarounds, and document findings. Once the problem has been diagnosed and a workaround identified, the problem is referred to as a “known error.” Known errors are documented in a centralized knowledge base referred to as the known error database (KEDB). The service desk team can use the knowledge in the known error database when they respond to new incidents.
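To make the idea of a known error database concrete, here is a minimal in-memory sketch of how a service desk agent might look up known errors when a new incident arrives. The `KnownError` and `KnownErrorDatabase` names, the keyword-matching approach, and the sample entry are all assumptions for illustration. A real KEDB would normally live in a knowledge base tool with proper search.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    title: str
    symptoms: set    # lowercase keywords describing the symptoms
    workaround: str


class KnownErrorDatabase:
    """Hypothetical in-memory KEDB with naive keyword matching."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def search(self, incident_summary):
        # Match any entry that shares at least one keyword with the summary.
        words = set(incident_summary.lower().split())
        return [e for e in self._entries if e.symptoms & words]


kedb = KnownErrorDatabase()
kedb.add(
    KnownError(
        title="Print spooler stalls",
        symptoms={"printer", "spooler"},
        workaround="Restart the print spooler service",
    )
)

matches = kedb.search("Printer on floor 3 not responding")
for match in matches:
    print(match.title, "->", match.workaround)
```

Whole-word matching is deliberately simple here. It is enough to show the flow from a new incident to a documented workaround.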

Problem management process summary

The ITIL problem management process provides a structure for completing a problem investigation. Here's a summary of the steps in a problem investigation:

  1. Detect the problem - The service desk team raises a problem because they see incidents occur across the organization with similar conditions, or they see a recurring incident. The problem investigation team reviews these problems, incident patterns, and alerts from event management to identify patterns that might impact the overall quality of IT services.
  2. Create a problem record - When the team detects a problem, the service desk or problem management team records it. The problem record captures the time and date of occurrence, the symptoms, related incident(s), previous troubleshooting steps, and the problem category and source. This information helps the problem management team research the root cause.
  3. Categorize the problem - The problem categorization helps the service desk sort and model common incidents, track problem trends, and assess the impact on service capacity, demand, and quality. The problem categorization should match the incident categorization.
  4. Determine priority - A problem's priority is determined by its urgency and impact on users and the business. Urgency is how quickly the organization requires a resolution to the problem. Impact measures the potential damage the problem can cause to services and the organization. 
  5. Investigate and diagnose - How fast a problem is investigated and diagnosed depends on its priority. High-priority problems should be addressed first since their impact on services and the organization is the greatest. The diagnosis phase involves analyzing the incidents that led to the problem along with any troubleshooting the service desk team performed to find a workaround. This level of analysis involves reviewing data and conferring with experts to gain the insights needed to find the root cause of the issue.
  6. Identify a workaround - A workaround enables the service desk to address open incidents and restore normal service while the problem is investigated. 
  7. Create a known error record - The workaround is shared with the service desk team as a known error. It’s good practice to record known errors in a knowledge base (KB) article. ITIL refers to this KB as a known error database (KEDB). Documenting the workaround allows the service desk team to resolve incidents quickly and avoid further problems being raised on the same issue.
  8. Resolve the problem - Problem investigations are resolved when they find the root cause of a set of incidents and prevent those incidents from recurring. Resolutions may require a change request to implement the final resolution. For example, you might need to apply a software update to production to address a known performance issue. The final step of the resolution phase is documenting the steps taken to implement the resolution. This should be published to the IT knowledge base so the service desk has the information for future reference.
  9. Review the problem - ITIL classifies this as a major problem review. During a review, the problem management team evaluates the problem investigation documentation to identify what happened and why. This allows the team to identify any process improvements that are needed. Lessons learned during this step should be documented and shared with the IT organization.
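Step 4 says that priority is determined by urgency and impact. One way to picture this is a small scoring function, sketched below. The level names follow the sample values used elsewhere on this page, but the scoring thresholds are purely an assumption. Your team should define the mapping according to its own processes.

```python
# Levels ordered from most to least severe.
URGENCY = ["Critical", "High", "Medium", "Low"]
IMPACT = ["Extensive", "Significant", "Moderate", "Minor"]


def priority(urgency, impact):
    """Derive a priority from urgency and impact.

    The score is the sum of the two positions (0 = most severe, 6 = least).
    The thresholds below are hypothetical, not an ITIL or Jira rule.
    """
    score = URGENCY.index(urgency) + IMPACT.index(impact)
    if score <= 1:
        return "Critical"
    if score <= 3:
        return "High"
    if score == 4:
        return "Medium"
    return "Low"


print(priority("Critical", "Extensive"))  # Critical
print(priority("High", "Moderate"))       # High
print(priority("Low", "Minor"))           # Low
```

A lookup table keyed on (urgency, impact) pairs would work just as well and is easier to tune cell by cell. The additive score is simply shorter to show.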

Set up problem management in Jira Service Desk

Configure the workflow and fields

We recommend the following workflow for managing problem records: 

Problem management fields

We recommend the following fields for your problem management process:

| Field | Description | Sample values |
| --- | --- | --- |
| Description | Captures basic information about the problem | |
| Status | The state of the problem | |
| Pending reason | Why the problem management process is pending | Waiting on vendor, More info required, Awaiting approval, Pending on change request |
| Priority | Determined by the urgency and impact of the problem. Your team can define the value according to your own processes. | Critical, High, Medium, Low |
| Urgency | How quickly the problem needs to be resolved | Critical, High, Medium, Low |
| Impact | The extent of the problem and the potential damage it causes while it's unresolved | Extensive / Widespread, Significant / Large, Moderate / Limited, Minor / Localized |
| Operational categorization | Classifies a problem for the purpose of assignment and reporting from the operational perspective | Configuration > Printer |
| Product categorization | Classifies a problem for the purpose of assignment and reporting from the product perspective | Hardware > Printer |
| Source | Where the problem was discovered | Phone, Email, Monitoring event |
| Component | The service impacted by the problem | |
| Resolution | Classifies how the problem was resolved | Known error |
| Root cause | The cause of the problem. Might need to be documented as a known error in the knowledge base. | |
| Workaround | Temporary solutions the team can use until the problem is solved | |
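As a rough illustration of how these fields fit together, the sketch below models a problem record as a Python dataclass. It is not the Jira Service Desk data model: the attribute names, the default status of "Open", and the rule used in `is_known_error` (a root cause and a workaround are both recorded) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemRecord:
    description: str
    status: str = "Open"  # hypothetical default; use your workflow's statuses
    pending_reason: Optional[str] = None
    priority: Optional[str] = None
    urgency: Optional[str] = None
    impact: Optional[str] = None
    operational_categorization: Optional[str] = None
    product_categorization: Optional[str] = None
    source: Optional[str] = None
    component: Optional[str] = None
    resolution: Optional[str] = None
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    def is_known_error(self):
        # Assumption: "diagnosed" means a root cause has been recorded.
        return bool(self.root_cause) and bool(self.workaround)


record = ProblemRecord(
    description="Print jobs stall on shared printers",
    urgency="High",
    impact="Moderate / Limited",
    product_categorization="Hardware > Printer",
    source="Monitoring event",
)
print(record.is_known_error())  # False

record.root_cause = "Spooler service exhausts memory"
record.workaround = "Restart the print spooler service"
print(record.is_known_error())  # True
```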

 

Publish a known error to the Confluence IT space

After a problem is diagnosed and a workaround identified, the problem is referred to as a “known error.” Known errors are documented in the team's knowledge base. Confluence is an ideal location for publishing knowledge articles to help incident management solve known errors. After the known error is identified, the next step is to determine how to fix it. This typically involves a change request for the impacted system.

The following is an example of a known error database (KEDB) published in Confluence.

Last modified on Aug 27, 2018
