When systems run amok
- 29 October, 2007 22:00
We ask tech veterans near and far for tales of personal woe and destruction at the hands of IT run amok. First: Nightmare at Napster
In 2000, Michael T. Halligan (now CTO at BitPusher) was an overworked, underpaid grunt at Napster. "How cool!" you think. But remember, Napster's day had waned by the new millennium, and Halligan's horror story had just begun.
We had no money, and everybody in the music industry was suing us. Suffice it to say, we had very few technical resources. The day I started there was the last day of a sysadmin who had literally gone crazy working at Napster (but that's another story). That brought us down to a very senior admin and me. A week went by, and the other sysadmin called in sick for a few days, then he stopped answering phone calls, leaving me on my own with no documentation whatsoever for approximately 350 servers.
After about six months of this, the stress started taking its toll on my health, and I caught a flu that took a month to kick. Unable to stay home to recover, I brought a beanbag and some blankets so I could sleep under my desk. One morning during this, I awoke at about 4:00 to a ringing phone and a beeping pager. The voice on the other end of the phone was someone from CNET asking if Napster had shuttered its doors. I groggily said, "No comment," and started to go back to sleep when the question struck a chord. Shuttered our doors? I got online, and sure enough, we were completely down.
Insanely sick as I was, I called our network service provider, Above.net, and found out that their network was fine. But they had noticed a 300Mb/s drop on our circuit. I crawled into my Jetta and sped the 30-mile trip to Above.net to investigate the problem.
What I found out was that our supervisor cards in our 6509 were apparently not functioning. This was a bit disconcerting since we didn't have access to them. Our former system administrator hadn't left it to us. I called him a dozen times to no avail. (I admit that I did it mostly because it was 4 a.m. and I wanted to annoy him for causing me this misery.)
After an hour of creativity, I found a backup of a config file from an office router, ran it through a Cisco password crack utility and discovered it had the same password as our production routers (joy!). During this time, I called a friend who owns a Cisco consulting shop. I started running through commands and found out that the first Supervisor module had failed, but the second one was fine. Unfortunately, the second one wasn't configured.
My friend had Cisco dispatch a technician in record time, and he got us back up in about three hours.
The world was abuzz with news that Napster had been down most of the day. For eight hours, teenagers couldn't steal their Green Day, housewives couldn't download Rod Stewart and programmers couldn't trade the latest Amon Tobin MP3s. Two hours after we went back up, we hit a record of a million concurrent users, pushing 1.2GB/s of text searches-and promptly went back down about a half hour after I had fallen asleep in my car.
Back when he was still working in IT, Ray Wang suffered one of those classic "How did they let that happen?" moments. It's no wonder he went into analysis (as a principal analyst with Forrester Research, of course...what did you think we meant?)
I was working at a company that was set to roll out a $15 million ERP implementation as part of a multichannel order management project. Our CIO was fired two weeks before the project was supposed to go live. The new CIO had us come in to review architecture. His current-state processes were mapped to future-state systems. Without going into details, it looked like the team had done a detailed mapping. We'd cut off legacy systems at the same time go-live occurred. Teams were nonstop testing for weeks, making sure that any current functionality wasn't lost.
On the day we went live, we flicked on the switch and everything was humming along. We breathed a sigh of relief as orders kept coming in with no issues. Fulfillment looked fine.
But sometime around midday, we discovered that the staff couldn't keep up with the orders, and we couldn't figure out why. In the background we heard that the printers were running out of paper, which seemed like a normal issue. Otherwise, all the systems were chugging along. The website was getting good hits and orders seemed to be coming in OK. The goal of the rollout was to move most of the orders to customer self-service, and so far it seemed that the self-service Web orders were coming into the system with no problems.
However, we found out later that the system was automatically taking the data and pumping out a fax, which then printed out via the company's printers for someone on the team to reenter the data into the system! The process consultants replicated existing business processes without optimizing them and so the company ended up having to hire 20 more order entry clerks to keep up with the 400 percent increase in volume of faxes!
In 1989, consultant Elizabeth Zwicky was working her way through another fall day. Then the earth started to move...
My scariest moment came just after the Loma Prieta earthquake. I was working on the third story of a three-story building, which had never seemed particularly tall to me until the very second that I could tell that the earth was standing still and the building wasn't.
We evacuated the building, and it became clear that the structure was still standing, as was civilization. But thanks to bridge and road damage, few people could actually get home. As the first shock wore off, we became a crowd of bored people standing around in a parking lot watching the asphalt ripple, and the novelty of that wears off pretty quickly.
The company I worked for had its own generator, and the big ol' Sun systems we ran had so much capacitance that they didn't even crash when we flipped to generator power, so we had all sorts of computers and lights and whatnot still running when we left the building. Eventually somebody from management came around and said, "You know how in 1906 San Francisco burned down after the earthquake? We really need to do our part to avoid recreating that, so who'd like to volunteer to go turn stuff off in case floating clouds of gas come by?" Everybody volunteered, to which they said, "I gotta tell you, the structural engineers haven't been by, so technically, the building might fall down at any moment." We contemplated it. It didn't look like it was going to fall down. And by golly, it was our job! To save Menlo Park from fire! And it was not boring! So we kept on volunteering, and they carefully wrote down all our names, counted us off, regretted that they had no hard hats and advised us to leap for the middle of the building if the outside started to peel off. Then we moved inside to turn off everything electrical we could find.
I didn't get the most interesting wing, it turned out. That would be the one where they found the researcher still working. The people who found him had to explain that we were going to turn off the servers so there was really no point sticking it out. But I did get the wing with the most damage, forcing me at one point to climb over a downed filing cabinet to get to a computer located in one of the offices. It was an outside office, in the bit they had said might peel off. Worrisomely, it was also a low, stable, two-drawer file cabinet, not some tipsy high thing. But I climbed over it and started looking for power switches on the computer.
None on the front. Reached for the back. Couldn't reach. Lay on my stomach on the desk with my head under the bookshelves, groping for the power switch and...felt things shift underneath me. Immediate thought: "I am about to have my head crushed because I am an idiot. And if not, I am going to fall to my death when the building crumbles." Then I thought up a number of obscenities inappropriate for publication, directed in approximately equal proportions at myself and at the loose leg on the desk on which I was lying. The leg, as it turned out, was the only thing moving.
I got out from under the bookshelf, ripped all the power cords I could find out of the wall and moved on, though much faster-if more carefully.
The next day, we moved back into the building.
The week after that, we reevacuated the wing I'd been in because the structural engineers declared it unsafe.
Back in the mid-90s, Brad Knowles was senior Internet mail administrator at America Online, at the time the largest online service provider in the world. But with great power comes great responsibility...
It's known as Black Wednesday, August 7, 1996, the day all of AOL's routers went down, and no one could get any packets to our systems; they all just got thrown away. But computers could still contact our backup name servers at ANS (a subsidiary of AOL that ran all of our external WAN connections), so they knew who all of our mail servers were and how many IP addresses we had listed.
Now, it's important to know that the Internet RFCs require that you wait at least two minutes when you start to set up a standard TCP/IP connection before you finally declare the other end to be dead. The standard practice is also to attempt to connect to each of the IP addresses you know for a given name, usually in the sequence in which you received them. At the time, the standard practice for mail servers was that you contacted all listed mail servers for a given domain before you gave up.
Now, step back and do the math for seven names with seven IP addresses each, and two minutes per IP address:
7 x 7 x 2 = 49 x 2 = 98
So, just making one delivery attempt to a single user at AOL was taking 98 minutes to time out. Then another 98 minutes to time out for the next user or the next message for a user at AOL.
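The arithmetic above can be restated as a tiny Python sketch. The function name and parameters are made up for illustration; the only figures taken from the story are seven names, seven IP addresses each and a two-minute wait per address, and the sketch assumes every connection attempt runs to its full timeout.

```python
def worst_case_attempt_minutes(mail_server_names, ips_per_name, timeout_minutes):
    """Minutes one delivery attempt takes to fail when every address is tried
    in turn and each connection attempt runs to its full timeout.

    Hypothetical helper for illustration; not taken from any mail software.
    """
    return mail_server_names * ips_per_name * timeout_minutes


# Figures from the story: 7 names, 7 IP addresses each, 2 minutes per address.
minutes = worst_case_attempt_minutes(7, 7, 2)
print(f"One delivery attempt ties up the sender for {minutes} minutes")
```

The point of writing it this way is that the delay is a product: adding one more mail server name or one more address per name lengthens every failed attempt.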
At the time, most sites were running Sendmail. They were set to rerun their queue once an hour, and many sites would typically have just the one queue runner process. Each time you'd start up a queue runner, if you had even a single message queued up to a single person at AOL, that process would sit there and spin its wheels for at least 98 minutes trying to talk to the AOL mail servers before giving up-and it would block and not do anything else while it was spinning its wheels. But less than 60 minutes after that happened, another queue runner would get fired up-and would almost certainly hang on the same message going to AOL, or on another message going to AOL.
Do that often enough, and you get enough queue runners hung up to AOL that your queue is clogged and you're not getting mail through to anywhere else in the world. Do that long enough, and you've got so many queue runners hung up to AOL that you run out of RAM and swap space and your mail servers crash.
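To see how the runners pile up, here is a rough Python sketch, not a model of Sendmail itself. It assumes a new queue runner starts every 60 minutes, that each runner works through every queued AOL message one after another at 98 minutes apiece, and that nothing remembers the host is down. The function name and the message counts are hypothetical.

```python
def hung_runners(elapsed_minutes, aol_messages, interval_minutes=60, hang_minutes=98):
    """Count queue runners still blocked on AOL mail at a given moment.

    Simplifying assumptions (labeled, not from the story): runners start at
    minute 0 and every interval_minutes after that, and each one stays blocked
    for aol_messages * hang_minutes before it finishes.
    """
    blocked_for = aol_messages * hang_minutes
    start_times = range(0, elapsed_minutes + 1, interval_minutes)
    return sum(1 for start in start_times if start + blocked_for > elapsed_minutes)


# Compare a site with one queued AOL message against one with ten,
# checked ten hours (600 minutes) into the outage.
for queued in (1, 10):
    print(queued, "queued message(s):", hung_runners(600, queued), "runner(s) stuck")
```

Under these assumptions a site with a single AOL message never has more than a couple of runners stuck at once, while a site with a deep AOL backlog keeps every runner it has ever started, which is where the memory goes.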
Well, that's what happened, and I was personally blamed for taking out all Internet email across the entire world. As a result, angry spammers publicly handed out my private telephone numbers and people were asked to complain directly to me. I was also told about at least one business that went bankrupt because it was waiting on a time-critical RFP to come in and it didn't get its bid into the system in time, so it lost the contract.
Once we finally did come back up, it literally took days for us to recover and to catch up to all the backlog that was created for us on the Internet-and it took the rest of the world a few more days beyond that to recover from the rest of their backlog.
You'll have to forgive us for leaving this one anonymous-but read it all and you'll understand why. Let's just say that the fellow who submitted it was kind enough to supply evidence that his tale is terrifyingly true...
I have a website. It has a mixture of stuff on it, mostly out of date, because being a company director I don't often have time to keep it up to date. One day I had just finished a new webpage with my most recent photos and had decided to send the link to interested friends so they could have a look at them (and tell me if any of them were particularly rubbish).
So there I am, going through my address book selecting those people who are good enough friends to be interested, when the phone rings. It's my mother and she's lost the installation disk for her printer. So being the dutiful son, I go to Epson's website, download the drivers for her and attach them to an email to send to her. The phone rings again and it's a customer with a load of questions about his service that if he had read the instructions he would have known all the answers anyway-except he's lost the instructions-again. So I click on "Create Mail" and start a third mail going. I then remember I haven't yet sent the email to my mother so I go back to the other one, add her address to the "To" section and send.
Problem: I have sent her the wrong email. I've sent her the one about the website. Blood running cold time. Mad panic. The reason: I am a transvestite and the new page was of me dressed up as the female me-and she didn't have a clue. OMG! What can I do? I start trying to think of excuses-you can imagine the blind panic. Anyway, five minutes of absolute mayhem ensues, she is almost 70 and would NOT understand, I am close to tears, then suddenly I hear a noise. Incoming mail. I am expecting it to be from her.
Instead, it was: "This is the automated sender verification system at XYZ.com. Your email will be held in a queue for 7 days until you verify you are a real person by clicking on the following link. After that we will assume your email was spam and it will be deleted without being delivered."
Normally I slag off Challenge Response systems (from the antispam point of view)-but in this case-Phew!
It sounds like a tale from the Dark Ages, but it happened this year! ZaReason CEO Cathy Malmrose was helping upgrade computers at a number of schools. The experience taught her more than she'd ever imagined possible....
At a school district the size of a college campus, I worked my way up the ladder until I finally got to talk to the head honcho in charge of IT for the school. He explained that they just updated "all the school's computers". I asked him what they had updated since I know that most classrooms currently have old, nonfunctional systems. His reply: "We got that new thing, you know with the wires?" His hands were making big gestures, so I ask, "You mean a network?" His reply: "Yeah, that." This guy was in charge of all IT decisions. I screamed silently.
But it didn't stop there. When I was working with a small, poor inner-city school, the head honcho told me they had three new computers that are "great, but they don't work." I know she won't know the specs off the top of her head so I ask if I can see them. She leads me to a room with three Dell monitors sitting on the floor. "Here they are." I had to desperately find a way to explain that they wouldn't "work" without a little more hardware.
Typing in anger leads to careless mistakes-especially if you hit "Enter" too soon. Just ask one of our own, Senior Editor Kim Nash. We're happy to say her friend survived the scare, but he needed some serious help....
I remember when a friend of mine in California was spewing vitriol about his boss to me, in Massachusetts, via email. Yet he mistakenly addressed the message to said boss-which he realized only upon hitting the "Enter" key. It was after-hours on the East Coast, and my friend, now panicky and more than a bit sweaty, paged his IT manager at home for help.
Can you say, "Desperate to keep my job? Please, please, please?" In those days, the IT manager couldn't do a thing from his house, so he bundled up, got in his car and motored into the office to obliterate the offending email from the system before his boss opened it the next morning.
That act of human kindness got the IT manager a bottle of fine, fine California wine FedExed to him the very next morning.
Don't you wish all these stories could have such happy endings?