Occam’s Razor at 2am (Incident Management)
Occam’s Razor is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory. Many people have heard it phrased more commonly this way: “All things being equal, the simplest answer tends to be the correct one,” or alternatively, “we should not assert that for which we do not have some proof.” In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the hypothesis that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam’s razor is usually understood.
Now for the story:
It happens when you least expect it and are sleeping (at least for most Escalation Teams).
You get the phone call (at 2am) that something is not working and you have to dial in or join a call.
By this point, a series of people have already tried to resolve the problem. It is very likely that they have tried the simple things and, if the application has been down for more than several hours, they have moved on to more complicated solutions. I have found that on these calls, we typically fail to answer 4 questions.
1. When was the last time this was working correctly?
a. In my line of work, it was usually working within the last 12-24 hours.
b. It’s relevant because things don’t break for “no reason”… the cause may not be known, but it usually results from an action, or the omission of an action.
c. Is it working correctly in some locations and not others (e.g., the web-based architecture is broken, but local networks are up)?
2. What Incidents were opened today (check all resources)?
a. We had several different queues and people who helped in different locations.
b. Call ANY resolver and ask them if they touched anything today.
3. What upgrades or implementations occurred or were ATTEMPTED?
a. These can introduce problems that were missed in testing.
b. Attempts can cause breaks, and if a failed attempt is not rolled back, or not rolled back correctly, it can cause unknown issues.
4. When was the last time this server was rebooted?
a. Windows Patching can cause issues, since the testing of these patches is not rigorous, and many patches only take effect after a reboot.
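Question 1 gives you a time window: if it was working 12-24 hours ago, anything that changed on disk since then is a suspect. A minimal sketch of that idea, assuming the application lives under a single directory you can read, is to list files whose modification time is newer than the last known-good time. The function name `files_changed_since` and the `<app_dir>` argument are hypothetical. Note that a deleted file leaves no modification time behind, so this will not catch missing files, and it says nothing about registry or database changes.

```python
import sys
import time
from pathlib import Path


def files_changed_since(root, since_epoch):
    """Hypothetical helper: return (relative_path, mtime) for every file
    under root whose modification time is later than since_epoch,
    newest first. Deleted files cannot appear here."""
    root = Path(root)
    changed = []
    for path in root.rglob("*"):
        if path.is_file():
            mtime = path.stat().st_mtime
            if mtime > since_epoch:
                changed.append((path.relative_to(root).as_posix(), mtime))
    # Most recent changes first: those are usually the best leads.
    changed.sort(key=lambda item: item[1], reverse=True)
    return changed


if __name__ == "__main__":
    # Usage (assumed): python changed_since.py <app_dir> [hours_since_last_good]
    hours = float(sys.argv[2]) if len(sys.argv) > 2 else 24.0
    cutoff = time.time() - hours * 3600
    for rel, mtime in files_changed_since(sys.argv[1], cutoff):
        print(time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)), rel)
```

Sorting newest first is a deliberate choice: at 2am, the most recent change is the simplest explanation to rule in or out.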
These 4 questions usually lend themselves to resolution. On one 2am call, the IT team had been working for an extended period of time (13 hours) and they were getting ready to roll back patches from 2 weeks ago when I joined the call. I asked the four questions mentioned above and found some compelling information.
It was at question 2 that we took a step toward resolution. Earlier that day, someone had opened a ticket where the root cause of the incident was a missing .exe. The resolver did nothing wrong by replacing the missing .exe. He resolved the incident as he should have.
I asked our IT guys to run a directory compare of .exes and .dlls between a working app (at another site) and the broken app, and we found 3 items missing. We copied them back in and, magically, things started working again.
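The directory compare described above can be sketched in a few lines. This is a minimal sketch, assuming both installs can be reached as directory paths (for example, a copy or share of the working site's install); the function name `missing_binaries` is hypothetical, and it only compares file names and relative paths, not file versions or contents.

```python
import sys
from pathlib import Path


def missing_binaries(working_dir, broken_dir, extensions=(".exe", ".dll")):
    """Hypothetical helper: return relative paths of files with the given
    extensions that exist under working_dir but not under broken_dir.
    Only presence is checked; versions and contents are not compared."""
    working_root = Path(working_dir)
    broken_root = Path(broken_dir)
    wanted = {ext.lower() for ext in extensions}
    missing = []
    for path in working_root.rglob("*"):
        if path.is_file() and path.suffix.lower() in wanted:
            rel = path.relative_to(working_root)
            if not (broken_root / rel).exists():
                missing.append(rel.as_posix())
    return sorted(missing)


if __name__ == "__main__":
    # Usage (assumed): python compare_dirs.py <working_app_dir> <broken_app_dir>
    for name in missing_binaries(sys.argv[1], sys.argv[2]):
        print("missing:", name)
```

Checking only for presence keeps the compare fast and simple, which fits the point of the story; a version or checksum compare is a natural next step if nothing is missing.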
These 4 questions have helped me immensely, and they also help focus where to start looking.
In effect, everyone is looking for what changed. These questions narrow the search and bring folks into the conversation about what occurred. It is my contention that after a few hours of working a problem, we tend to dig deeper, when in reality we might want to think shallower, back to basics.
I have found that the more people on the call, and the more specialties represented, the further we get from the immediate break and the further we go into the less known.