Problem management
Benefits of problem management
Problem management provides a logical extension of incident management for improving the overall quality of IT services. Some of the benefits of a formal approach to problem management include:
- Improved quality of IT service and reliable services for the business
- Reduced incident volume and minimized service interruptions for the business
- Improved sharing of knowledge across the IT organization as the team learns from past mistakes and issues.
- Valuable reporting and analytic insights from historical data that help the team identify trends and the means of preventing failures or reducing the impact of failures.
- Permanent solutions that help reduce the number and impact of incidents over the long term.
- Improved first-touch resolution by the service desk team because they have access to lessons learned, known errors, and workarounds documented in a central knowledge base.
Problem management process
This process shows a sample problem investigation based on ITIL recommendations. You can adapt it to suit your other ITIL and support processes.
Problem management uses analysis to identify the cause of a problem. It therefore takes longer than incident management and should begin only once the related incident is no longer urgent. A problem investigation is often a direct result of a Post-Incident Review (PIR), where the IT team needs to know the root cause of a recurring service outage. Because problem management can take time, it's important to set time limits to keep the cost of resolution low.
The first step of a problem investigation is to diagnose the problem, validate workarounds, and document findings. Once the problem has been diagnosed and a workaround identified, the problem is referred to as a “known error.” Known errors are documented in a centralized knowledge base referred to as the known error database (KEDB). The service desk team can use the knowledge in the known error database when they respond to new incidents.
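To make the lookup concrete, here is a minimal sketch of how a service desk tool might match a new incident against known errors. It assumes a simple in-memory list and plain substring matching; the `KnownError` fields, the `find_known_errors` helper, and the sample entries are all hypothetical and not part of ITIL or any specific product.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    """One KEDB entry. All field names are illustrative assumptions."""
    title: str
    symptoms: list[str]
    workaround: str


def find_known_errors(kedb: list[KnownError], incident_summary: str) -> list[KnownError]:
    """Return entries with at least one symptom phrase found in the incident summary."""
    summary = incident_summary.lower()
    return [
        entry for entry in kedb
        if any(symptom.lower() in summary for symptom in entry.symptoms)
    ]


# Hypothetical sample data for illustration only.
kedb = [
    KnownError(
        title="Print spooler stalls after driver update",
        symptoms=["print queue stuck", "spooler"],
        workaround="Restart the spooler service and resubmit the job.",
    ),
    KnownError(
        title="VPN client drops on Wi-Fi roaming",
        symptoms=["vpn disconnect"],
        workaround="Reconnect the VPN after switching access points.",
    ),
]

matches = find_known_errors(kedb, "User reports print queue stuck on floor 3")
for entry in matches:
    print(f"{entry.title}: {entry.workaround}")
```

A real knowledge base would likely use full-text search rather than substring matching, but the idea is the same: the documented symptoms lead the agent to a validated workaround.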
Problem management process summary
The ITIL problem management process ensures that the problem investigation is successfully completed. Here's a summary of the steps included in a problem investigation:
- Detect the problem - The service desk team raises a problem because they see incidents with similar conditions occur across the organization, or they see a recurring incident. The problem investigation team reviews these problems, incident patterns, and alerts from event management to identify patterns that might impact the overall quality of IT services.
- Create a problem record - When the team detects a problem, the service desk or problem management team records it. The problem record captures the time and date of occurrence, the symptoms, related incident(s), previous troubleshooting steps, and the problem category and source. This information helps the problem management team research the root cause.
- Categorize the problem - The problem categorization helps the service desk sort and model common incidents, track problem trends, and assess the impact on service capacity, demand, and quality. The problem categorization should match the incident categorization.
- Determine priority - A problem's priority is determined by its urgency and impact on users and the business. Urgency is how quickly the organization requires a resolution to the problem. Impact measures the potential damage the problem can cause to services and the organization.
- Investigate and diagnose - How fast a problem is investigated and diagnosed depends on its priority. High-priority problems should be addressed first since their impact on services and the organization is the greatest. The diagnosis phase involves analyzing the incidents that led to the problem along with any troubleshooting the service desk team performed to find a workaround. This level of analysis involves reviewing data and conferring with experts to gain the insights needed to find the root cause of the issue.
- Identify a workaround - A workaround enables the service desk to address open incidents and restore normal service while the problem is investigated.
- Create a known error record - The workaround is shared with the service desk team as a known error. It’s good practice to record known errors in a knowledge base (KB) article. ITIL refers to this KB as a known error database (KEDB). Documenting the workaround allows the service desk team to resolve incidents quickly and avoid further problems being raised on the same issue.
- Resolve the problem - Problem investigations are resolved when they find the root cause of a set of incidents and prevent those incidents from recurring. The final resolution may require a change request. For example, you might need to apply a software update to production to address a known performance issue. The final step of the resolution phase is documenting the steps taken to implement the resolution. This should be published to the IT knowledge base so the service desk has the information for future reference.
- Review the problem - ITIL classifies this as a major problem review. During a review, the problem management team evaluates the problem investigation documentation to identify what happened and why. This allows the team to identify any process improvements that are needed. Lessons learned during this step should be documented and shared with the IT organization.
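The detection step in the summary above can be sketched in a few lines. This example assumes incidents are available as (id, category) pairs and flags any category that reaches a chosen count as a candidate for a problem record. The sample data, the `problem_candidates` name, and the threshold of 3 are hypothetical; a real team would tune the rule to its own incident patterns.

```python
from collections import Counter

# Hypothetical incident records: (incident id, category) pairs.
incidents = [
    ("INC-101", "Hardware > Printer"),
    ("INC-102", "Network > VPN"),
    ("INC-103", "Hardware > Printer"),
    ("INC-104", "Hardware > Printer"),
    ("INC-105", "Network > VPN"),
]


def problem_candidates(incidents, threshold=3):
    """Return (category, count) pairs whose count meets the threshold, most frequent first."""
    counts = Counter(category for _, category in incidents)
    return [(category, n) for category, n in counts.most_common() if n >= threshold]


candidates = problem_candidates(incidents)
print(candidates)
```

Counting by category is only one possible signal. It works here because the problem categorization is expected to match the incident categorization, so recurring incidents in one category map directly to a candidate problem.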
Set up problem management in Jira Service Desk
Configure the workflow and fields
We recommend the following workflow for managing problem records:
Problem management fields
We recommend the following fields for your problem management process:
Field | Description | Sample values |
---|---|---|
Description | Captures basic information about the problem | |
Status | The state of the problem | |
Pending reason | Why the problem management process is pending | Waiting on vendor, More info required, Awaiting approval, Pending on change request |
Priority | Determined by the urgency and impact of the problem. Your team can define the value according to your own processes. | Critical, High, Medium, Low |
Urgency | How quickly the problem needs to be resolved | Critical, High, Medium, Low |
Impact | The extent of the problem and the potential damage it causes while it's unresolved | Extensive / Widespread, Significant / Large, Moderate / Limited, Minor / Localized |
Operational categorization | Classifies a problem for the purpose of assignment and reporting from the operational perspective | Configuration > Printer |
Product categorization | Classifies a problem for the purpose of assignment and reporting from the product perspective | Hardware > Printer |
Source | Where the problem was discovered | Phone, Email, Monitoring event |
Component | The service impacted by the problem | |
Resolution | Classifies how the problem was resolved | Known error |
Root cause | The cause of the problem. Might need to be documented as a known error in the knowledge base. | |
Workaround | Temporary solutions the team can use until the problem is solved | |
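Because priority is determined by urgency and impact, some teams derive it automatically. The sketch below shows one possible rule, using the sample values from the table: it averages the two ranks and rounds toward the more severe level. The rule itself and the `derive_priority` name are assumptions for illustration; your team can define the mapping according to its own processes.

```python
# Sample values from the fields table, ordered from most to least severe.
URGENCY = ["Critical", "High", "Medium", "Low"]
IMPACT = [
    "Extensive / Widespread",
    "Significant / Large",
    "Moderate / Limited",
    "Minor / Localized",
]
PRIORITY = ["Critical", "High", "Medium", "Low"]


def derive_priority(urgency: str, impact: str) -> str:
    """Average the two ranks (0 = most severe), rounding toward the more severe level.

    This mapping is an illustrative assumption, not a prescribed ITIL rule.
    Unknown values raise ValueError.
    """
    rank = (URGENCY.index(urgency) + IMPACT.index(impact)) // 2
    return PRIORITY[rank]


print(derive_priority("Critical", "Minor / Localized"))
print(derive_priority("Medium", "Moderate / Limited"))
```

An explicit lookup table with one entry per urgency and impact pair is a common alternative when the mapping should not be symmetric.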
Publish a known error to the Confluence IT space
After a problem is diagnosed and a workaround identified, the problem is referred to as a “known error.” Known errors are documented in the team's knowledge base. Confluence is an ideal location for publishing knowledge articles to help incident management solve known errors. After the known error is identified, the next step is to determine how to fix it. This typically involves a change request for the impacted system.
The following is an example of a known error database (KEDB) published in Confluence.