It's been an IT mantra for years: "When all else fails, reboot." Rebooting often works, but isn't there a better approach to the problem of buggy software that crashes your computer and takes your valuable data with it? That question has been the focus of researchers at Stanford University and the University of California, Berkeley, who have been working feverishly to find better ways to bring computers back from the brink of disaster.
The researchers are seeking a fresh alternative to rebooting. Thinking backward, they reasoned that it might be a good idea to give up on the impossible job of making bug-free software and instead look for ways to recover from failures without losing data or time.
That's the concept behind "recovery-oriented computing," a 180-degree turn from traditional thinking. The idea is that since software can't be created without crash-causing flaws, it should be built to reboot much faster, allowing users to get back to work almost instantly.
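One way to see why recovery speed matters is the textbook definition of availability: mean time to failure (MTTF) divided by the sum of MTTF and mean time to repair (MTTR). The sketch below uses that standard formula with made-up numbers; neither the formula's notation nor the figures come from the researchers.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative numbers only (assumptions, not measurements from the project).
baseline = availability(mttf_hours=1000, mttr_hours=1.0)
fewer_crashes = availability(mttf_hours=2000, mttr_hours=1.0)    # crash half as often
faster_recovery = availability(mttf_hours=1000, mttr_hours=0.5)  # recover twice as fast

print(f"baseline:        {baseline:.6f}")
print(f"fewer crashes:   {fewer_crashes:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
```

Under this definition, halving recovery time buys the same availability as halving the crash rate. The case for recovery-oriented computing is that speeding up recovery is the more achievable of the two.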
"The idea is pretty simple: If availability is the fraction of time that you're up, then recovering fast is more critical than reducing the number of times that crashes happen," says David Patterson, a computer science professor at UC Berkeley.
"In the dawn of computing, people thought software bugs would go away, and they haven't, so now we need ways to co-exist with them," he says. "I think it's a fact to live with rather than a problem to be solved."
One way to do that is through an evolving technique called micro-rebooting, which quickly reboots just enough of the program processes to get the system stabilized and back on track for the user.
A micro-reboot is specific to one problematic area of the software's code and doesn't affect other parts of the application, so data in the processing pipeline is unaffected by the reboot.
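As a rough illustration of that isolation, here is a minimal sketch in which an application is a set of independent components and only the faulty one is restarted. The class names, component names and state fields are all hypothetical; this is not the researchers' implementation.

```python
class Component:
    """One module of an application, holding its own private state."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.healthy = True
        self.restarts = 0

    def micro_reboot(self):
        """Discard this component's state and start it fresh."""
        self.state = {}
        self.healthy = True
        self.restarts += 1


class Application:
    """A collection of components that can be restarted individually."""

    def __init__(self, names):
        self.components = {name: Component(name) for name in names}

    def micro_reboot(self, name):
        # Only the named component restarts; the others are never touched.
        self.components[name].micro_reboot()


app = Application(["catalog", "cart", "checkout"])
app.components["cart"].state["items"] = ["book"]

# Simulate a fault confined to one component.
app.components["checkout"].state["corrupt"] = True
app.components["checkout"].healthy = False

app.micro_reboot("checkout")
print(app.components["cart"].state)
```

The cart's data survives because the reboot never reaches it, which is the property the article describes.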
Led by Patterson and Armando Fox, an assistant professor of computer science at Stanford, the project began in late 2000. Patterson, Fox and a team of graduate students had seen evidence that systems dependability could be improved. Some IT systems for use in avionics, spacecraft and health care were ultradependable because they had to be, but they were costly and complex, and that kind of reliability was impractical for typical IT use. Another way had to be found.
Heading off a crash
The researchers are experimenting with algorithms that watch over system processes and sense when something has gone awry and a crash is imminent. The algorithms first establish the normal baseline behavior of an application; when they see a deviation from that baseline, the system can quickly do a micro-reboot without the user even knowing that a problem has occurred.
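The article does not say which statistical methods the team uses. As one simple possibility, the sketch below learns a baseline from normal response times and flags any sample more than three standard deviations away. The threshold, the latency figures and the function names are assumptions made for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """True if value lies more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times (ms) observed while the component was healthy.
normal_latencies_ms = [20, 22, 19, 21, 20, 23, 18, 21]
baseline = build_baseline(normal_latencies_ms)

flagged = []
for latency in [21, 19, 95]:
    if is_anomalous(latency, baseline):
        # In a real system this is where a micro-reboot would be triggered.
        flagged.append(latency)

print(flagged)
```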
Keys to the research have been isolating the faults and providing redundancy so the system stays alive while the instant recovery takes place. The researchers are exploring techniques that could encourage software and hardware designs that drastically improve the "restartability" of programs and devices.
The problem with the traditional reboot is that the CTRL-ALT-DEL process takes too long. A micro-reboot is several orders of magnitude faster, Fox says. "It's not guaranteed to fix the problem, but it's guaranteed not to make things worse, so there's no reason not to try it," he says.
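Fox's reasoning suggests a cheapest-first recovery policy: try the micro-reboot, and fall back to a full reboot only if the fault persists. The sketch below simulates that policy. The fault table and function names are hypothetical stand-ins, not part of the project's software.

```python
def recover(component, micro_reboot, full_reboot, is_healthy):
    """Try the cheap fix first; escalate only if it did not help."""
    micro_reboot(component)
    if is_healthy(component):
        return "micro-reboot"
    full_reboot()
    return "full reboot"


# Simulated faults: one clears on a component restart, one does not.
faults = {"cart": "transient", "db-pool": "persistent"}


def micro_reboot(name):
    if faults.get(name) == "transient":
        del faults[name]


def full_reboot():
    faults.clear()


def is_healthy(name):
    return name not in faults


first = recover("cart", micro_reboot, full_reboot, is_healthy)
second = recover("db-pool", micro_reboot, full_reboot, is_healthy)
print(first, "/", second)
```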
The researchers have been using an open-source Java 2 Enterprise Edition (J2EE) application server in their studies because the platform is so widely used and because open-source software is readily modifiable.
J2EE is also a good starting point for the research because its applications use a modular design structure with clear boundaries between software modules, making it easier to stop and reboot one process within a part of the application.
While J2EE-based Internet applications on corporate servers have been the focus of this research so far, Fox says the technology will trickle down. "Desktops have so much performance today, maybe some of that can eventually be traded away for dependability," he says.
But challenges remain. "There's work to be done on other microrecovery methods," explains Fox. "We've identified a way that we can work together with people who use statistical monitoring algorithms," but that has created issues with the algorithms and security, he says.
What the researchers still hope to learn is exactly what's good and what's bad in J2EE that helps to solve the rebooting problems, Fox says.
Also needed is a body of research that will address the same issues in other widely used computing systems in the future.
Micro-rebooting could be built into J2EE application server software within the next two to three years, Fox says. However, "to make micro-rebooting industrial-strength, obviously there's still more work to be done," he says.
It would be more challenging to include micro-rebooting capabilities in large, proprietary applications, because the applications aren't modularized. But researchers could certainly work on that capability for future versions of the software, Fox says.
The ultimate micro-rebooting system would prophylactically go through your PC or server and reboot it frequently in the background, refreshing it without causing you to have any visible failures, Patterson says.
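A background refresh of this kind could be pictured as a rolling restart: components are rebooted one at a time so that redundant copies keep serving. The sketch below is a simplified simulation under that assumption. The replica names are hypothetical, and the article does not describe how such a system would actually be scheduled.

```python
def rolling_refresh(replicas):
    """Restart replicas one at a time; return the fewest seen serving at once."""
    min_serving = len(replicas)
    for name in replicas:
        replicas[name] = "restarting"
        serving = sum(1 for status in replicas.values() if status == "up")
        min_serving = min(min_serving, serving)
        replicas[name] = "up"  # the refreshed replica rejoins before the next restarts
    return min_serving


replicas = {"app-1": "up", "app-2": "up", "app-3": "up"}
lowest = rolling_refresh(replicas)
print(lowest, "of", len(replicas), "replicas were always serving")
```

Because only one replica is ever down, users would see no outage during the refresh.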
Corporate IT is beginning to accept the idea of fast, automatic rebooting, according to Fox. "Now you don't have to convince people about the desirability of this," he says. "Now they understand why it's important to prevent crashes on their corporate IT systems." -- Computerworld (US)