Occam’s Razor at 2am – Help-Desk Escalation

Occam’s Razor at 2am (Incident Management)

The Principle:

Occam’s Razor is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory. Many people have heard it phrased more commonly as “All things being equal, the simplest answer tends to be the correct one,” or alternately, “we should not assert that for which we do not have some proof.” In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the hypothesis that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam’s razor is usually understood.

Now for the story:

It happens when you least expect it and are sleeping (for most Escalation Teams).

You get the phone call (at 2am) that something is not working and you have to dial in or join a call.

By this time, a series of people have already attempted to resolve the problem. It is very likely that they have tried the simple things, and that if the application has been down for more than several hours, they have moved on to more complicated solutions. I have found that in these calls, we typically fail to answer 4 questions.

1. When was the last time this was working correctly?

a. In my line of work, it was usually working within the last 12-24 hours.

b. It’s relevant because things don’t break for “no reason”… the cause may not be known, but it usually results from an action, or the omission of an action.

c. Is it working correctly in some locations and not others (e.g. the Web-based architecture is broken, but local networks are up)?

2. What Incidents were opened today (check all resources)?

a. We had several different queues and people helping in different locations.

b. Call ANY resolver and ask them if they touched anything today.

3. What upgrades or implementations occurred or were ATTEMPTED?

a. This can contribute to problems that were missed in Testing.

b. Attempted changes can cause breaks, and if a change is not rolled back, or not rolled back correctly, this can cause unknown issues.

4. When was the last time this server was rebooted?

a. Windows Patching can cause issues since the testing on these is not rigorous (a quick way to check the last boot time is sketched after this list).
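
As a side note on question 4, the last boot time can be checked programmatically. The snippet below is a minimal sketch that assumes the third-party psutil package is available; on Windows the same information is also reported by systeminfo.

```python
# Minimal sketch: report when the machine last booted.
# Assumes the third-party "psutil" package is installed (pip install psutil).
from datetime import datetime, timezone

import psutil

boot = datetime.fromtimestamp(psutil.boot_time(), tz=timezone.utc)
uptime = datetime.now(timezone.utc) - boot

print(f"Last boot : {boot:%Y-%m-%d %H:%M} UTC")
print(f"Uptime    : {uptime.days} days, {uptime.seconds // 3600} hours")
```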

These 4 questions usually lend themselves to resolution. On one 2am call, the IT team had been working for an extended time (13 hours) and was getting ready to roll back patches from 2 weeks earlier when I joined the call. I asked the four questions mentioned above and found some compelling information.

It was at question 2 that we took a step toward resolution. Earlier that day, someone had opened a ticket where the root cause of the incident was a missing .exe. The Resolver did nothing wrong by replacing the missing .exe; he resolved the incident as he should have.

I asked our IT guys to run a directory compare of .exes and .dlls between a working app (another site) and the broken app. We found 3 things missing. We copied them back in and magically things started working again.
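
For illustration, that directory compare can be scripted along these lines. This is a minimal sketch, and the two paths are placeholders rather than the actual servers from the incident.

```python
# Minimal sketch of the "directory compare" step: list .exe/.dll files that
# exist under a known-good install but are missing from the broken install.
# Both paths below are hypothetical placeholders.
import os

GOOD = r"\\good-site\app"      # working copy (another site)
BROKEN = r"\\broken-site\app"  # broken copy

def binaries(root: str) -> set[str]:
    """Return the relative paths of every .exe/.dll under root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".exe", ".dll")):
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
    return found

missing = binaries(GOOD) - binaries(BROKEN)
for path in sorted(missing):
    print("missing:", path)
```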

These 4 questions have helped me immensely, and they also help focus where to start looking.

In effect, everyone is looking for what changed. This helps refine the search and brings folks into the loop on what occurred. It is my contention that after a few hours of working a problem, we tend to dig deeper, when in reality we might want to think shallower and get back to basics.

I have found that the more people and specialties on the call, the further we get from the immediate break and the further we move into the less known.

Input and Output

The inputs of the Incident Management process are:

–          Incident details sourced from (for example) Service Desk, networks or computer operations.

–          Configuration details from the Configuration Management Database (CMDB)

–          Response from Incident matching against Problems and Known Errors.

–          Resolution details

–          Response on RFC to effect resolution for Incident(s).

 

The outputs of the Incident Management process are:

–          RFC for Incident resolution; updated Incident record (including resolution and / or Work-arounds)

–          Resolved and closed Incidents

–          Communication to Customers

–          Management information (reports)
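
Purely as an illustration of how those inputs and outputs might hang together in a ticketing system, here is a rough data-model sketch; the field names are my own shorthand, not an ITIL-mandated schema.

```python
# Rough sketch of an Incident record shaped by the inputs/outputs listed above.
# Field names are illustrative only, not an ITIL-mandated schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Incident:
    # Inputs
    details: str                                  # from Service Desk, networks, operations
    configuration_items: list[str] = field(default_factory=list)  # CMDB references
    matched_known_error: Optional[str] = None     # result of matching against Problems/Known Errors
    # Outputs
    workaround: Optional[str] = None
    resolution: Optional[str] = None
    rfc_reference: Optional[str] = None           # RFC raised to effect resolution
    status: str = "open"                          # open -> resolved -> closed

incident = Incident(details="Web app down at remote site",
                    configuration_items=["APPSRV01"])
incident.resolution = "Replaced missing .exe"
incident.status = "resolved"
```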

 


Incident Management

The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimise the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. ‘Normal service operation’ is defined here as service operation within the Service Level Agreement (SLA) limits.

 

The scope of Incident Management falls into three broad categories:

–          Applications

            –          Service not available

            –          Application bug / query preventing Customer from working

            –          Disk-usage threshold exceeded

–          Hardware

            –          System down

            –          Automatic alert

            –          Printer not printing

            –          Configuration Inaccessible

–          Service requests

            –          Forgotten passwords etc

            –          Requests for documentation

 

A request for new or additional service (i.e. software or hardware) is often not regarded as an Incident but as a Request for Change (RFC). However, practice shows that the handling of both infrastructure failures and service requests is similar, and both are therefore included in the definition and scope of the Incident Management process.

 


Benefits of Incident Management

The major benefits to be gained by implementing an Incident Management process are as follows:

–          For the business as a whole:

            –          Reduced business impact of Incidents by timely resolution, thereby increasing effectiveness

            –          The proactive identification of beneficial system enhancements and amendments

            –          The availability of business-focussed management information related to the SLA.

 

–          For the IT organisation in particular:

            –          Improved monitoring, allowing performance against SLAs to be accurately measured

            –          Improved management information on aspects of service quality

            –          Better staff utilisation, leading to greater efficiency

            –          Elimination of lost or incorrect Incidents and Service Requests

            –          More accurate CMDB information (giving an ongoing audit while registering Incidents)

            –          Improved User and Customer satisfaction.

 


Help Desk Concepts and Definitions

Incident:

An incident is defined as “any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service”.

 

An incident affects service delivery, although it can be small and in some cases even transparent (not noticeable) to the user.

 

Problem:

A problem is the as yet unknown cause of the occurrence of one or more incidents.

 

Known error:

This is the situation where a successful diagnosis of a problem has identified the cause and the CI at fault. A possible solution may also be available as to how the problem can be avoided.

 

Work around:

It is possible for Problem Management to identify work-arounds during the investigation of problems. These should be made known to Incident Management so that they can be passed to the user until the permanent fix is implemented.
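
To make the relationships between these concepts concrete, here is a small illustrative sketch (not an ITIL schema): an incident is linked to a problem, the problem becomes a known error once the cause and the faulty CI are diagnosed, and any work-around found is passed back to the open incident.

```python
# Illustrative sketch only: how Incident, Problem, Known Error and
# work-around relate to one another. Names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Problem:
    description: str                    # cause not yet known
    faulty_ci: Optional[str] = None     # filled in when diagnosed
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    @property
    def is_known_error(self) -> bool:
        # A problem becomes a known error once cause and faulty CI are known.
        return self.faulty_ci is not None and self.root_cause is not None

@dataclass
class Incident:
    description: str
    problem: Optional[Problem] = None
    workaround_applied: Optional[str] = None

# A work-around found by Problem Management is passed to the open incident.
prob = Problem("Users report application errors")
inc = Incident("App error on login", problem=prob)
prob.faulty_ci, prob.root_cause = "APPSRV01", "missing DLL"
prob.workaround = "Copy DLL from working site"
if prob.workaround:
    inc.workaround_applied = prob.workaround
```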

 
