Incidents, Problems, Known Errors and Changes

Incident and Problem Management are valuable process domains in ITIL. This column explores how they're applied to real-world challenges.

Feb 13, 2006
By George Spafford

The IT Infrastructure Library (ITIL) uses specific wording in the incident and problem management process areas to describe the lifecycle of system errors through to structural resolution.

How these terms relate to one another is worth examining, because it lets us follow a service error through the incident management process and spot opportunities for improvement. *

An incident is any event that is not part of the normal operation of a service and impacts, or threatens to impact, the quality of the service delivered. In response, IT opens an incident record to try to quickly restore the service to operating within the parameters of the service level agreement (SLA).

The perspective is grounded in the SLA because the SLA should outline performance expectations from the customer's perspective, not just from IT's. This reflects the need to support the business, not just “push” technology.

If the cause is readily apparent and can be corrected, then a work-around is developed or a request for change (RFC) is created. Some corrections can be made without a change -- such as resetting a device -- necessitating only a work-around.

On the other hand, if a change is required, it needs to be handled through the proper change management processes. Even though incident management’s goal is the speedy restoration of service, it must not bypass change management or this will cause production build configurations to drift from their established baselines.

If the cause of the error is not readily apparent or it is felt that an investigation is required, then a problem record should be opened. This new problem record is then independent of the incident because the incident management function is tasked with restoring service as quickly as possible.

In contrast, the problem management function is tasked with identifying the underlying causal factor, which may relate to multiple incidents. It may take several incidents to transpire before problem management has enough data to understand the root cause. Once problem management identifies the causal factor and develops a work-around, then the problem becomes a “known error.”

The fact that sometimes problem management cannot immediately identify the root cause and establish a corrective action puts the two groups at odds, as incident management wants a quick fix, or work-around. If the incident management team develops a work-around, then the problem management record should be updated with the information so the problem management team can leverage the additional data.

In reviewing the incident management team’s work-around, problem management may elect to accept the work as the resolution because it addresses the root cause. If it does not, then problem management will dig deeper. If problem management develops a work-around that addresses the incident without solving the root cause, then the incident becomes a “known error.”

As mentioned above, if a change is needed, then an RFC must be filed and handled through change management. If problem management establishes the root cause and a resolution, they need to alert incident management so the “known error” tickets can benefit from the resolution and have their status shifted to “closed” once the corrective work is completed.
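The relationships described above can be sketched as a small data model. This is only an illustration of the terminology, not an ITIL-defined schema: the class names, field names and the three status values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OPEN = "open"
    KNOWN_ERROR = "known error"
    CLOSED = "closed"


@dataclass
class Incident:
    ident: str
    summary: str
    status: Status = Status.OPEN


@dataclass
class Problem:
    ident: str
    incidents: List[Incident] = field(default_factory=list)
    workaround: Optional[str] = None
    root_cause: Optional[str] = None
    rfc: Optional[str] = None  # hypothetical RFC identifier, set only if a change was needed
    status: Status = Status.OPEN

    def link(self, incident: Incident) -> None:
        """Relate another incident to this problem; one problem may cover many."""
        self.incidents.append(incident)

    def record_workaround(self, text: str) -> None:
        """A work-around exists but the root cause is not yet fixed: known error."""
        self.workaround = text
        self.status = Status.KNOWN_ERROR
        for incident in self.incidents:
            if incident.status is Status.OPEN:
                incident.status = Status.KNOWN_ERROR

    def complete_resolution(self, root_cause: str, rfc: Optional[str] = None) -> None:
        """Call once the corrective work (via change management, if an RFC
        was filed) is finished; related tickets are then closed."""
        self.root_cause = root_cause
        self.rfc = rfc
        self.status = Status.CLOSED
        for incident in self.incidents:
            incident.status = Status.CLOSED
```

For example, linking two incidents to one problem, recording a work-around and later completing the resolution moves both incidents from "open" to "known error" to "closed" without anyone touching them individually.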

Opportunities for Improvement

The above outlines the relationships between incidents, problems, known errors, RFCs and, finally, resolutions. Building on the topics discussed above, there are several opportunities for process improvement:

Be able to quickly identify changes. Most availability issues stem from changes. The sooner changes can be identified or excluded, the better. Consider using an automated integrity management control to detect and report on changes found in the production environment.
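As a rough illustration of what such a control does, the sketch below compares file digests against a recorded baseline and reports what was added, removed or modified. It is a minimal example, not a substitute for a real integrity management product; the function names are made up for this example, and it assumes the files of interest live under a single directory.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(baseline, current):
    """Report paths added, removed or modified since the baseline was taken."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    modified = sorted(
        path for path in set(baseline) & set(current)
        if baseline[path] != current[path]
    )
    return {"added": added, "removed": removed, "modified": modified}
```

Taking a snapshot after each approved change and diffing against it when an incident is opened gives a quick answer to whether anything changed outside of change management.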

Use a proper taxonomy in order to match existing incidents and problems. Speeding up the search for similar, or related, incidents and problems necessitates a classification system that supports the needs of the organization.
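One way to picture how a taxonomy speeds up matching is to treat each classification as a path from broad to specific and to rank existing tickets by how many leading levels they share with a new one. The sketch below assumes a hierarchical taxonomy; the category values, the `Ticket` fields and the `min_depth` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ticket:
    ident: str
    kind: str                  # "incident" or "problem"
    category: Tuple[str, ...]  # taxonomy path, broadest level first


def shared_depth(a, b):
    """Count how many leading taxonomy levels two category paths share."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def find_related(new_ticket, existing, min_depth=2):
    """Return existing tickets sharing at least min_depth levels, closest first."""
    scored = [
        (shared_depth(new_ticket.category, t.category), t)
        for t in existing
        if t.ident != new_ticket.ident
    ]
    matches = [(depth, t) for depth, t in scored if depth >= min_depth]
    matches.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [t for _, t in matches]
```

A free-text search would still be useful alongside this, but a consistent classification lets the first pass be a cheap, exact comparison.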

Record meaningful notes in the ticket. Personnel involved with incidents and problems need to enter notes in the ticket that are useful to other people. Terse or cryptic comments will not aid others who may need to read and understand the ticket.

Have a resolution editor. Task someone who can write clearly with reviewing resolutions to ensure they are complete, clearly written and follow any organizational documentation standards. This may also be warranted for known errors, depending on the organization’s needs.

Summary

Incident and Problem Management are valuable process domains in ITIL. As IT becomes more pervasive in mission-critical aspects of the business, their importance will continue to grow. As organizations look to ITIL to improve their processes, they will need to understand the relationship between incidents, problems, known errors, requests for change and resolutions.

* This article focuses on the relationship of the terminology used to denote incidents, problems, known errors, requests for change and resolutions. For details on the processes, review the ITIL Service Support volume or go to the Incident and Problem Service Management Functions of Microsoft Operations Framework site or to the Reactive and Proactive portions of the BECTA site for Incident and Problem Management.