However, the "OK, ready for work" signal was the *very thing that had caused the switch to go down in the first place.* And *all* the System 7 switches had the same flaw in their status-map software. As soon as they stopped to make the bookkeeping note that their fellow switch was "OK," then they too would become vulnerable to the slight chance that two phone calls would hit them within a hundredth of a second.
At approximately 2:25 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem. It went into fault recovery routines, a
Many of the switches, at first, completely escaped trouble. These lucky switches were not hit by the coincidence of two phone calls within a hundredth of a second. Their software did not fail -- at first. But three switches -- in Atlanta, St. Louis, and Detroit -- were unlucky, and were caught with their hands full. And they went down. And they came back up, almost immediately. And they too began to broadcast the lethal message that they, too, were "OK" again, activating the lurking software bug in yet other switches. As more and more switches did have that bit of bad luck and collapsed, the call-traffic became more and more densely packed in the remaining switches, which were groaning to keep up with the load. And of course, as the calls became more densely packed, the switches were *much more likely* to be hit twice within a hundredth of a second.
It only took four seconds for a switch to get well. There was no *physical* damage of any kind to the switches, after all. Physically, they were working perfectly. This situation was "only" a software problem. But the 4ESS switches were leaping up and down every four to six seconds, in a virulent spreading wave all over America, in utter, manic, mechanical stupidity. They kept *knocking* one another down with their contagious "OK" messages. It took about ten minutes for the chain reaction to cripple the network. Even then, switches would periodically luck out and manage to resume their normal work. Many calls -- millions of them -- were managing to get through. But millions weren't. The switching stations that used System 6 were not directly affected. Thanks to these old-fashioned switches, AT&T's national system avoided complete collapse. This fact also made it clear to engineers that System 7 was at fault.
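The chain reaction described above can be sketched as a toy simulation. This is a minimal Python sketch under stated assumptions, not a reconstruction of the 4ESS software: the function name `simulate_cascade`, the tick-based clock, and every parameter (`num_switches`, `hit_probability`, `recovery_ticks`, `ticks`, `seed`) are hypothetical and chosen only for illustration. A switch that hears another switch's "OK" message becomes briefly vulnerable, and the chance of an unlucky pair of calls rises as traffic crowds onto the switches that are still up.

```python
import random


def simulate_cascade(num_switches=20, hit_probability=0.05,
                     recovery_ticks=4, ticks=60, seed=0):
    """Toy model of the "OK"-message chain reaction (illustrative only).

    Returns a list holding the number of switches that are down
    at the end of each tick. All parameter values are assumptions.
    """
    rng = random.Random(seed)
    # Ticks remaining until each switch is back up (0 means it is up).
    down_for = [0] * num_switches
    # Switch 0 has the one legitimate fault that starts everything.
    down_for[0] = recovery_ticks
    history = []

    for _ in range(ticks):
        # Step 1: down switches recover; any switch that finishes
        # recovering this tick broadcasts its "OK" message.
        announcers = set()
        for i in range(num_switches):
            if down_for[i] > 0:
                down_for[i] -= 1
                if down_for[i] == 0:
                    announcers.add(i)

        # Step 2: fewer working switches means denser traffic on the
        # survivors, so a badly timed pair of calls is more likely.
        up = [i for i in range(num_switches) if down_for[i] == 0]
        p_hit = min(1.0, hit_probability * num_switches / max(len(up), 1))

        # Step 3: a switch that hears someone else's "OK" is vulnerable
        # for this tick and goes down if the unlucky pair arrives.
        for i in up:
            heard_ok = len(announcers - {i}) > 0
            if heard_ok and rng.random() < p_hit:
                down_for[i] = recovery_ticks

        history.append(sum(1 for d in down_for if d > 0))

    return history


if __name__ == "__main__":
    for tick, down in enumerate(simulate_cascade(), start=1):
        print(f"tick {tick:2d}: {down} switch(es) down")
```

With `hit_probability=0.0` the model never loses more than the first switch, which mirrors the lucky switches that escaped at first. Raising the value makes the wave of failures feed on itself, because each recovery triggers a fresh round of "OK" messages.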
Bell Labs engineers, working feverishly in New Jersey, Illinois, and Ohio, first tried their entire repertoire of standard network remedies on the malfunctioning System 7. None of the remedies worked, of course, because nothing like this had ever happened to any phone system before.
By cutting out the backup safety network entirely, they were able to reduce the frenzy of "OK" messages by about half. The system then began to recover, as the chain reaction slowed. By 11:30 p.m. on Monday, January 15, sweating engineers on the midnight shift breathed a sigh of relief as the last switch cleared up.
By Tuesday they were pulling all the brand-new 4ESS software and replacing it with an earlier version of System 7. If these had been human operators, rather than computers at work, someone would simply have eventually stopped screaming. It would have been *obvious* that the situation was not "OK," and common sense would have kicked in. Humans possess common sense -- at least to some extent. Computers simply don't.
On the other hand, computers can handle hundreds of calls per second. Humans simply can't. If every single human being in America worked for the phone company, we couldn't match the performance of digital switches: direct-dialling, three-way calling, speed-calling, call-waiting, Caller ID, all the rest of the cornucopia of digital bounty. Replacing computers with operators is simply not an option any more.
And yet we still, anachronistically, expect humans to be ru
And they would look in a lot of places.
Come 1991, however, the outlines of an apparent new reality would begin to emerge from the fog. On July 1 and 2, 1991, computer-software collapses in telephone switching stations disrupted service in Washington DC, Pittsburgh, Los Angeles and San Francisco. Once again, seemingly minor maintenance problems had crippled the digital System 7. About twelve million people were affected in the Crash of July 1, 1991. Said the New York Times Service: "Telephone company executives and federal regulators said they were not ruling out the possibility of sabotage by computer hackers, but most seemed to think the problems stemmed from some unknown defect in the software ru
And sure enough, within the week, a red-faced software company, DSC Communications Corporation of Plano, Texas, owned up to "glitches" in the "signal transfer point" software that DSC had designed for Bell Atlantic and Pacific Bell. The immediate cause of the July 1 Crash was a single mistyped character: one tiny typographical flaw in one single line of the software. One mistyped letter, in one single line, had deprived the nation's capital of phone service. It was not particularly surprising that this tiny flaw had escaped attention: a typical System 7 station requires *ten million* lines of code.
On Tuesday, September 17, 1991, came the most spectacular outage yet. This case had nothing to do with software failures -- at least, not directly. Instead, a group of AT&T's switching stations in New York City had simply run out of electrical power and shut down cold. Their back-up batteries had failed. Automatic warning systems were supposed to warn of the loss of battery power, but those automatic systems had failed as well.
This time, Ke
Stranded passengers in New York and New Jersey were further infuriated to discover that they could not even manage to make a long-distance phone call to explain their delay to loved ones or business associates. Thanks to the crash, about four and a half million domestic calls, and half a million international calls, failed to get through.