Skype didn't deliver on their promise of an explanation of the outage


On Saturday, Skype promised that on Monday it would give "a more detailed explanation of what happened" to cause the worldwide loss of Skype service that lasted nearly 2 days. They failed to give a convincing explanation.

Skype claims, "The disruption was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update."

Other bloggers have pointed out that Windows Update happens on a schedule based on each user's local time; so the updates do not happen everywhere at once and do not trigger a simultaneous restart of computers all over the world. Moreover, Windows Update happens regularly and Skype offered no explanation as to why this sort of problem hasn't happened in the past.

Skype claims to have experienced "a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact." Again, no explanation as to why this hasn't happened before. They do admit to "a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly."

Since they had already admitted that they were not the target of a malicious attack, it was obvious that they had a bug in their system; and since we had to wait nearly 2 days for the system to return to usability, it was obvious that the bug had prevented a quick fix and that it had a "critical impact". So what did they tell us that we didn't already know? Very little.

In a follow-up to its first explanation, Skype answered the question of why problems haven't occurred in the past when Microsoft released updates:

That’s because the update patches were not the cause of the disruption. In previous instances where a large number of supernodes in the P2P network were rebooted ... there had not been such a combination of high usage load during supernode rebooting. As a result, P2P network resources were allocated efficiently and self-healing worked fast enough to overcome the challenge.

The bottom line seems to be that this was a bug waiting to happen and it has now been fixed.

Posted: Tuesday - August 21, 2007 at 12:47 AM

