Problem management
What is problem management?
While incident management is all about finding the shortest path to restoring normal service, problem management takes the long view with the primary goal of finding the underlying causes of an incident, and the best resolution and prevention. The aim of problem management is to reduce the adverse impact of incidents that are caused by errors within the IT infrastructure, and to prevent recurrence of incidents related to these errors. Problem investigations should be addressed with a prioritized approach that takes on problems that have the greatest potential for causing serious disruption to critical IT services.
When incidents occur, the responsibility of the incident response team is to restore normal service as quickly as possible, without necessarily identifying or resolving the underlying cause of the incident. If incidents occur rarely or have little impact, assigning resources to perform root cause analysis is difficult to justify. However, if a major service outage or a series of repeated incidents causes significant impact, the problem management team is tasked with investigating the underlying cause of the incidents and identifying the best method to eliminate the root cause.
Benefits
Problem management provides a logical extension of incident management for improving the overall quality of IT services. Some of the benefits of implementing a formal approach to problem management include:
- Improved quality of IT service, which results in reliable services for the business.
- Reduced incident volume, which minimizes service interruptions for the business.
- Improved sharing of knowledge across the IT organization as the team learns from past mistakes and issues.
- Valuable reporting and analytic insights from historical data to identify trends and the means of preventing failures or reducing the impact of failures.
- Permanent solutions that help reduce the number and impact of incidents over the long term.
- Improved first-touch resolution by the service desk team because they have access to lessons learned, known errors, and workarounds documented in a central knowledge base.
Problem management process
This process represents an example of problem investigation based on ITIL recommendations. It provides a streamlined example that most customers can use as a starting point to adapt their existing ITIL processes and support processes.
Problem management works by using analysis techniques to identify the cause of the problem. Incident management is not usually concerned with the cause, only the cure: restoration of service. Problem management, therefore, takes longer and should be done once the urgency of the incident has been resolved. A problem investigation is often a direct result of a post-incident review (PIR) where the IT team needs to know the root cause of a recurring service outage. Because problem management can take time, it is important to set time limits; otherwise the cost of resolution can grow out of proportion to the benefit.
The first activity of a problem investigation is to diagnose the problem and validate any workarounds. During this process, it's important that the IT team involved in the investigation document their findings and any workarounds they identify. Once the problem has been diagnosed and a workaround identified, the problem is referred to as a “known error.” These are documented in a centralized knowledge base referred to as the known error database (KEDB). The knowledge shared in the known error database is a significant resource for the service desk team responding to new incidents.
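To illustrate how a known error database can help the service desk respond to new incidents, here is a minimal sketch of a keyword lookup. The `KnownError` structure, the `find_known_errors` function, and the sample entries are all hypothetical; a real KEDB would normally rely on the search features of your knowledge base rather than on code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    """One hypothetical KEDB entry: a diagnosed problem plus its workaround."""
    title: str
    symptoms: frozenset  # lower-case keywords describing the symptoms
    workaround: str


def find_known_errors(incident_summary, kedb):
    """Return entries that share at least one symptom keyword with the
    incident summary, ordered by number of shared keywords (highest first)."""
    words = set(incident_summary.lower().split())
    scored = [(len(entry.symptoms & words), entry) for entry in kedb]
    matches = [(score, entry) for score, entry in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in matches]


# Sample data, invented for illustration only.
kedb = [
    KnownError(
        "Print spooler stalls",
        frozenset({"printer", "spooler", "queue"}),
        "Restart the print spooler service.",
    ),
    KnownError(
        "VPN drops on Wi-Fi roaming",
        frozenset({"vpn", "disconnect", "wifi"}),
        "Reconnect the VPN client after changing access points.",
    ),
]

hits = find_known_errors("Printer queue stuck on floor 3", kedb)
for entry in hits:
    print(f"{entry.title}: {entry.workaround}")
```

An agent handling the sample incident would be pointed at the print spooler entry and its workaround, which is the kind of shortcut the known error database is meant to provide.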
Problem management process summary
The ITIL problem management process includes many vital steps to ensuring a problem investigation is successfully completed. Here's a summary of the steps included in a problem investigation:
- Detect the problem - The service desk will typically raise a problem because they are seeing incidents occur across the organization with similar conditions, or they are seeing a recurring incident. The problem investigation team is responsible for reviewing these types of problems when they are reported. They are also responsible for proactively reviewing incidents and alerts from event management to identify patterns that may affect the overall quality of IT services.
- Create a problem record - When a potential problem is detected it should be recorded by either the service desk team or the problem management team. A problem record should capture the time and date of occurrence, the symptoms, all related incident(s), any previous troubleshooting steps, and the problem category and source. This information helps the problem management team research the root cause.
- Categorize the problem - When a problem is recorded, its categorization should match incident categorization. The use of incident and problem categorization allows the service desk to sort and model incidents that occur regularly. Categorizing these types of records also allows the IT organization to track problem trends and assess the impact on service capacity, demand, and quality.
- Determine a problem’s priority - The overall priority of a problem is determined by its impact on users and the business and by its urgency. Impact is a measure of the extent of potential damage the problem can cause to services and the organization. Urgency is how quickly the organization requires a resolution to the problem. Prioritizing the problem allows an IT team to use investigative resources most effectively.
- Investigate and diagnose the problem - The speed at which a problem is investigated and diagnosed depends on its assigned priority. High-priority problems should be addressed first since their impact on services and the organization is the greatest. Having the proper problem classification and categorization helps drive efficiency in the problem management process because it helps the problem investigation team identify trends more quickly. Once the problem team starts working on a problem, they move into the diagnosis phase of the process. This typically involves analyzing the incidents that led to the problem along with any previous troubleshooting steps the service desk team took to find a workaround. This level of analysis typically involves a detailed review of data and conferring with experts to gain the insights needed to find the root cause of the issue.
- Identify a workaround for the problem - In most cases when a problem investigation is in progress, incidents remain open. The primary goal in these cases is to find an appropriate workaround that will enable the service desk to restore normal service. Even after the incident is resolved with a workaround, the problem investigation continues. Because problem investigations can take considerable time to resolve, a workaround is vital. A workaround should only be considered a temporary measure.
- Create a known error record - Once the workaround has been identified it should be shared with the service desk team as a known error. It’s good practice to record a known error in a knowledge base (KB) article. ITIL refers to this KB as a known error database (KEDB). Documenting the workaround allows the service desk team to resolve incidents quickly and avoid further problems being raised on the same issue.
- Resolve the problem - Problem investigations are considered resolved when they find the underlying root cause of a set of incidents and prevent those incidents from recurring. Some resolutions may require a change request to implement the final resolution. An example of this is when a software update is required to a production application to address a known performance issue. The final important part of the resolution phase for a problem is documenting the steps taken to implement the resolution. This should be published to the IT knowledge base so the service desk has the information for future reference.
- Final review of the problem - ITIL classifies this as a major problem review. It's an important step that many IT organizations skip. A major problem review is an important activity that helps prevent future problems. During a review, the problem management team evaluates the problem investigation documentation to identify what happened and why. This allows the team to identify important lessons learned that help identify any process improvements that are needed. Lessons learned during this step should be documented and shared with the IT organization.
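The detection and prioritization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold of three incidents, the averaging rule that combines impact and urgency, and the function names `problem_candidates` and `priority` are all hypothetical, and your team should substitute its own matrix and thresholds.

```python
from collections import Counter

# Ordered from least to most severe. The value names follow the example
# values used for the fields in this guide.
IMPACT = ["Minor / Localized", "Moderate / Limited",
          "Significant / Large", "Extensive / Widespread"]
URGENCY = ["Low", "Medium", "High", "Critical"]
PRIORITY = ["Low", "Medium", "High", "Critical"]


def problem_candidates(incidents, threshold=3):
    """Return categories with at least `threshold` incidents, most frequent first.

    `incidents` is a list of (incident_key, category) pairs. The threshold
    is an assumption; tune it to your incident volume.
    """
    counts = Counter(category for _, category in incidents)
    return [cat for cat, n in counts.most_common() if n >= threshold]


def priority(impact, urgency):
    """Combine impact and urgency into a priority.

    Assumed rule: average the two ranks and round up. Replace this with
    the matrix defined by your own process.
    """
    score = IMPACT.index(impact) + URGENCY.index(urgency)
    return PRIORITY[(score + 1) // 2]


# Sample incidents, invented for illustration only.
incidents = [
    ("INC-1", "Hardware > Printer"),
    ("INC-2", "Hardware > Printer"),
    ("INC-3", "Network > VPN"),
    ("INC-4", "Hardware > Printer"),
]

candidates = problem_candidates(incidents)
print(candidates)
print(priority("Significant / Large", "Medium"))
```

Matching problem categories to incident categories, as recommended above, is what makes this kind of grouping possible in the first place.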
Setup for problem management in JIRA Service Desk
Configure the workflow and fields
We recommend the following workflow for managing problem records.
Fields
We recommend the use of the following fields for your problem management process:
- Description: Use this field to capture the basics about the problem.
- Status: Indicates the current state of the problem.
- Pending Reason: Specifies why a problem was moved into pending and what the problem is pending on.
- Example values: Waiting on vendor, More info required, Awaiting approval, Pending on change request.
- Priority: This is determined by the urgency and impact of the problem. Your team can define the value matching according to your own processes.
- Example values: Critical, High, Medium, Low
- Urgency: A measure of how quickly a resolution of the problem is required.
- Example values: Critical, High, Medium, Low
- Impact: A measure of the extent of the problem and of the potential damage it can cause before it is resolved.
- Example values: Extensive / Widespread, Significant / Large, Moderate / Limited, Minor / Localized
- Operational categorization: Classifies a problem for the purpose of assignment and reporting from the operational perspective.
- Example values: Configuration > Printer
- Product categorization: Classifies a problem for the purpose of assignment and reporting from the product perspective.
- Example values: Hardware > Printer
- Source: Indicates where the problem comes from.
- Example values: Phone, Email, Monitoring event
- Component: Indicates the service impacted by the problem.
- Resolution: Classifies the resolution of the problem, e.g. Known error.
- Root cause: Once the problem investigation is completed and the root cause is identified, it's good practice to document the result. The root cause field allows you to document the details of the root cause. Depending on the results of the problem investigation you may also need to publish the root cause as a known error to the knowledge base.
- Workaround: Documents the temporary solutions when the final solution is not available or implemented yet.
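As a rough illustration of how these fields fit together, the sketch below models a problem record as a Python data class with two consistency checks. The class name `ProblemRecord`, the status names "Open", "Pending", and "Resolved", and the validation rules are assumptions made for this example; they are not part of JIRA Service Desk, where the same fields would be configured on the issue type instead.

```python
from dataclasses import dataclass
from typing import Optional

# Example values taken from the Pending Reason field above.
PENDING_REASONS = {
    "Waiting on vendor",
    "More info required",
    "Awaiting approval",
    "Pending on change request",
}


@dataclass
class ProblemRecord:
    """Hypothetical in-memory model of the recommended problem fields."""
    description: str
    status: str = "Open"                  # status names are assumptions
    pending_reason: Optional[str] = None
    priority: str = "Medium"
    urgency: str = "Medium"
    impact: str = "Moderate / Limited"
    operational_categorization: str = ""
    product_categorization: str = ""
    source: str = ""
    component: str = ""
    resolution: str = ""
    root_cause: str = ""
    workaround: str = ""

    def validation_errors(self):
        """Return a list of messages for fields that are inconsistent."""
        errors = []
        if self.status == "Pending" and self.pending_reason not in PENDING_REASONS:
            errors.append("Pending problems need a valid pending reason.")
        if self.status == "Resolved" and not self.root_cause:
            errors.append("Resolved problems should document the root cause.")
        return errors


record = ProblemRecord(
    description="Printers on floor 3 stop responding",
    status="Pending",
    product_categorization="Hardware > Printer",
    source="Monitoring event",
)
print(record.validation_errors())

record.pending_reason = "Waiting on vendor"
print(record.validation_errors())
```

Checks like these mirror the intent of the fields: a pending problem should always say what it is waiting on, and a resolved problem should always record its root cause.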
Publishing a Known Error to the Confluence IT space
Once a problem has been diagnosed and a workaround identified, the problem is referred to as a “known error.” The results of these types of problem investigations should be published to the known error database. Confluence provides an ideal location for publishing these knowledge articles, which are a significant resource for incident management when resolving incidents caused by known errors. Once the known error has been identified, the next step is to determine how to fix it. This will typically involve a change request for the impacted system.
The following is an example of a known error database (KEDB) published in Confluence.