Skype didn't deliver on their promise of an explanation of the outage
On Saturday, Skype promised to give, on Monday,
"a more detailed explanation of what happened" to cause the worldwide loss of
Skype service that lasted nearly 2 days. The explanation they gave is not
convincing.
Skype claims, "The disruption was triggered by a
massive restart of our users' computers across the globe within a very short
timeframe as they re-booted after receiving a routine set of patches through
Windows Update."
Other bloggers have
pointed out that Windows Update happens on a schedule based on each user's local
time; so the updates do not happen everywhere at once and do not trigger a
simultaneous restart of computers all over the world. Moreover, Windows Update
happens regularly and Skype offered no explanation as to why this sort of
problem hasn't happened in the
past.
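To make the time-zone argument concrete, here is a rough sketch in Python. It assumes every machine installs its updates and reboots at 3:00 AM local time, which is an assumption for illustration and not something Skype or Microsoft stated in this context. The city list, the fixed UTC offsets (daylight saving ignored), and the date are all hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical sample of user locations with fixed UTC offsets in hours.
# Daylight saving time is ignored to keep the sketch simple.
UTC_OFFSETS = {
    "Honolulu": -10,
    "Los Angeles": -8,
    "New York": -5,
    "London": 0,
    "Moscow": 3,
    "Tokyo": 9,
}


def reboot_times_utc(day, local_hour=3):
    """Convert each city's assumed local install time on `day` to UTC."""
    result = {}
    for city, offset in UTC_OFFSETS.items():
        tz = timezone(timedelta(hours=offset))
        local = datetime(day.year, day.month, day.day, local_hour, tzinfo=tz)
        result[city] = local.astimezone(timezone.utc)
    return result


def spread_hours(times):
    """Hours between the earliest and the latest reboot, measured in UTC."""
    delta = max(times.values()) - min(times.values())
    return delta.total_seconds() / 3600


if __name__ == "__main__":
    # Arbitrary example date; any date gives the same spread.
    times = reboot_times_utc(date(2007, 8, 16))
    for city, t in sorted(times.items(), key=lambda kv: kv[1]):
        print(f"{city:12s} {t:%Y-%m-%d %H:%M} UTC")
    print(f"spread: {spread_hours(times):.0f} hours")
```

Under these assumptions the reboots in this sample are spread over most of a day rather than landing "within a very short timeframe".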
Skype claims to have experienced
"a flood of log-in requests, which, combined with the lack of peer-to-peer
network resources, prompted a chain reaction that had a critical impact."
Again, no explanation as to why this hasn't happened before. They do admit to
"a previously unseen software bug within the network resource allocation
algorithm which prevented the self-healing function from working
quickly."
Since they had already
admitted that they were not the target of a malicious attack, it was obvious
that they had a bug in their system; and since we had to wait nearly 2 days for
the system to return to usability, it was obvious that the bug had prevented a
quick fix and that it had a "critical impact". So what did they tell us that we
didn't already know? Very
little.
In a followup
to its first explanation, Skype answered the question of why problems
haven't occurred in the past when Microsoft released
updates:
"That’s because the update patches were not the cause of the disruption. In previous
instances where a large number of supernodes in the P2P network were rebooted
... there had not been such a combination of high usage load during supernode
rebooting. As a result, P2P network resources were allocated efficiently and
self-healing worked fast enough to overcome the
challenge."
The
bottom line seems to be that this was a bug waiting to happen and it has now
been fixed.
Posted: Tuesday - August 21, 2007 at 12:47 AM