Don’t do it all in one go
Dunedin Casino started a business continuity upgrade in 2007, using specialists Standby Consulting to review its requirements.
“The previous system was a traditional, disk-based backup system based on Veritas products. The business case for the upgrade was to reduce the recovery timeframe from a backup system to a near-online replication,” explains Bertie Enochs, head of IT at the casino.
The casino also had four disparate servers running multiple environments, with some applications incapable of handling newer Microsoft software such as Windows Vista.
Following the detailed analysis, VMware consultant ViFX was appointed to implement an infrastructure based on VMware virtualisation.
The new infrastructure featured two clusters with two servers in each cluster. One cluster is dedicated to server virtualisation and hosts 12 virtual machines, while the other supports 40 virtualised desktops.
Dunedin Casino’s production site is now also linked by a 100 Mbps cable to an IBM data centre.
Previously, the casino’s Oracle database would be backed up once every four hours; and in the case of a disaster impacting on production systems, four to six hours would have been required to load the operating system and database onto a new machine and restore the application.
Now, in a virtualised environment, the live virtual servers can be replaced in a DR environment, a process that takes just a few minutes and does not affect users at all.
“For a casino, ensuring 24x7 unrestricted performance of revenue critical business, which is to say minimal downtime of computing systems, constitutes a good business plan,” explains Enochs.
This was helped by choosing “industry standard” products, considering product performance and evaluating vendor support.
Furthermore, Enochs built “a strong relationship” with Neil Cresswell, ViFX’s technical director. This led to mutual trust and open communication between the vendor and the casino.
Organisations looking to improve their business continuity, explains Enochs, should look at virtualised BC solutions whether they use virtual servers or not.
“IT managers need to look from ‘backups’ [Disaster Recovery] to ‘near online replication’ [Business Continuity],” he says. “Traditionally, replicating servers meant an investment of tens of thousands of dollars. However, some virtual replication products are available for less than $2000, which do a far better virtual replication job than a physical replicated solution.” As well, the process should not be done all in one go.
“Virtualisation is spoken by many, but understood conceptually by just a few. Virtualisation is a totally different paradigm that even industry leaders need to get their heads around to understand it.
“For Dunedin Casino virtualisation was a journey and each step was a steep learning curve, and a precursor to the next step.
“Like any journey into the unknown it is daunting in the beginning, but with the completion of each step comes the exciting stage of taking the next step, and on and on, and more boldly ... until [it is] mission accomplished,” Enochs says.
Know your options and the costs
This kind of evolution is reflected in the experience of Onslow College in Wellington.
Tom Cummings, a teacher and network manager, says business continuity is about the tension between the available technologies and the budget you have.
Onslow College had used Novell technologies and decided to stick with the company in its upgrade, given that the college had just one unscheduled period of downtime from hardware failure in seven years of using Novell products.
Prior to the upgrade the school had two NetWare servers, but now has a combination of NetWare, Linux and Windows all managed by Novell eDirectory.
The main Novell Open Enterprise server is Linux-based and is virtualised using open source Xen technology. There are several other virtualised servers using Xen and these virtualised servers are hosted on an IBM blade centre running SUSE Linux Enterprise Server. A second blade provides redundancy.
Tape backup is used for daily backups of critical data and regular backups of other data. Some disks are stored offsite and, should the system go down, Cummings believes that in a worst-case scenario it would be down for less than a day.
But business continuity depends on application continuity too, so applications are deployed using Novell ZENworks.
“This means if the system is damaged, the user can self-heal the broken application by simply right clicking on its icon and choosing the verify option. This process will restore the application to its working state. If all else fails, the ZENworks can re-image the workstation automatically, after using the ZENworks personality migration tool to capture any personal settings. A worst-case scenario for a user is then around 45 to 60 minutes [downtime],” he says.
Another change is having 95 per cent of the college data on a SAN, rather than the former “data in a server” arrangement. So if the server goes down, the data is still available. Redundancy has also been increased in main switches.
Cummings confirms a good business continuity plan comes from knowing your options, what the costs are and what you can afford. Central to this is stability of platforms and having redundancy in the network infrastructure.
Onslow College has a network management committee that discusses all the options available, so management is familiar with the costs and benefits of what is done, what the weaknesses of its system are and the potential downside of failure.
“Before making changes, I consult with a wide variety of industry colleagues and bring that back to the committee,” he says.
The most important factors in selecting the final product were stability, costs, getting support and the ‘cost’ of change, including the TCO and any interruption to service.
Challenges in upgrading the system included understanding and implementing the Blade/SAN solution, which had a few quirks and caught the college off-guard. But a careful choice of business partners that understood the solution helped it through.
Cummings advises people to choose their backend services carefully and not be forced by providers pushing their solutions. When the college said it wished to stay with Novell, it was often told a more market-orientated solution would offer better support. “Understanding what solutions are available in the market and listening and consulting with a variety of vendors is crucial to your decision-making. Just because your current business partner may be keen on their known preferred solution, should not deter you from a solution that meets your needs, not theirs.”
Cummings says continuity strategy is about balancing what you can afford with what you need, as well as what’s available. It is also important IT staff and other managers understand these limitations and are prepared to accept them.
Novell, combined with virtualisation, gave the college an affordable continuity strategy.
“Continuity is about having redundancy in your server and having the ability to automatically deliver and self-heal applications, [through] using ZENworks in our case,” he adds.
Take your time and don’t rush
Kevin Drinkwater, CIO of supply chain logistics provider Mainfreight, says a good continuity plan is one that works, which means regular testing.
Mainfreight uses a mix of disks, tapes, virtualisation and data centres for its business continuity strategy.
Its Auckland-based production centre is connected using dark fibre to a datacentre elsewhere in the city. Resiliency is built into both centres so if equipment fails, other equipment will take over.
Onsite generators ensure constant power, with tapes backed up daily at both centres.
At a people level, Mainfreight, which has 50 depots across New Zealand, ensures it can move people from one of its sites to another should one site fail.
“Mainfreight is virtually a 24-hour operation. There is very little downtime and little chance to do backup. We are now going live. Recently, our Asian operations have moved all their data processing requirements to New Zealand and we host some of our US operations. We also run all systems for all of Australia,” Drinkwater says.
“What constitutes a good plan is testing it and making sure everybody knows what the plan is.”
Mainfreight began upgrading its business continuity in 2006 as part of a technology refresh, which meant it bought both the production and disaster recovery equipment at the same time.
“There was considerable analysis and consultation undertaken by our infrastructure team along with vendor representatives. This working team presented the strategy and a recommendation to myself and the board.”
Leasing dark fibre was groundbreaking, as it meant the telecommunications link between the two data centres would be virtually instantaneous.
“We have improved our business continuity planning over eight years,” Drinkwater says. “We have been through a stepped programme whereby first we started with resiliency, then duplicate systems, [then] duplicate hardware on the same site, to the point where today we have a proper DR facility.”
Technology purchases were determined by what worked and how fast it could cut over to the DR centre.
To reduce implementation challenges, the DR infrastructure was implemented physically alongside the production environment until they knew it was working properly. Once it was working, the DR infrastructure was then moved to its site.
“It takes longer than you expect. It’s not a job that you should rush,” he says.
Drinkwater says a significant outage in the company’s SAN in July 2008 forced a switch to the DR system, which took two hours. Most users did not notice any change, but moving back to the production system presented a few problems because the two systems had drifted out of alignment during the three days the main system was down.
He advises people to use the best technical expertise they can to assess the concept and implement the solution.
“Business continuity does not need to start with the ultimate of having a DR centre. You need to look at what you can afford. Implement that and then continually look at your ability to move up within the business continuity scale. You put it in over time. The first thing to do is backup the tapes. It’s all about affordability,” he says.
Now, Mainfreight plans to look at having a second DR centre offshore, so systems can continue should there be a major disaster in New Zealand or its international communications are disrupted.
“We are one of the few who test our DR in a live disaster recovery. We cannot emphasise enough to test on a regular basis. The best analogy is a car that has been sitting in a garage for 12 months. It’s more than likely the battery will be flat,” Drinkwater concludes. CIO NZ
Sidebar A: Anatomy of a good marriage
Disaster recovery is the process by which you resume business after a disruptive event. The event might be something huge — like an earthquake or the terrorist attacks on the World Trade Center — or something small, like malfunctioning software caused by a computer virus.
Given the human tendency to look on the bright side, many business executives are prone to ignoring “disaster recovery” because a disaster seems an unlikely event. While “business continuity planning” suggests a more comprehensive approach to making sure you can keep making money, the two terms are often married under the acronym BC/DR. At any rate, DR and/or BC determine how a company will keep functioning after a disruptive event until its normal facilities are restored.
All BC/DR plans need to encompass how employees will communicate, where they will go and how they will keep doing their jobs. The details can vary greatly, depending on the size and scope of a company and the way it does business. For some businesses, issues such as supply chain logistics are the most critical and are the focus of the plan. For others, information technology may play a more pivotal role, with the BC/DR plan having more of a focus on systems recovery. For example, the plan at one global manufacturing company would restore critical mainframes with vital data at a backup site within four to six days of a disruptive event, obtain a mobile PBX unit with 3000 telephones within two days, recover the company’s 1000-plus LANs in order of business need, and set up a temporary call centre for 100 agents at a nearby training facility.
Constant communication critical
The critical point is that none of these elements can be ignored, and physical, IT and human resources plans cannot be developed in isolation from each other. At its heart, BC/DR is about constant communication. Business leaders and IT leaders should work together to determine what kind of plan is necessary and which systems and business units are most crucial to the company. Together, they should nominate which staff are responsible for declaring a disruptive event and mitigating its effects. Most importantly, the plan should establish a process for locating and communicating with employees after such an event. In a catastrophic event (Hurricane Katrina being a classic example) the plan will also need to take into account that many of those employees will have more pressing concerns than getting back to work. CIO US
Sidebar B: The cornerstone of a business continuity strategy
Every year I hear horror stories from companies about server and network outages and the resulting loss of data and productivity. Some network users may briefly find an outage a bit charming, as older colleagues lean back and reflect, ‘This is the way it was back in the ‘70s - no internet, email, not even a fax machine. Just typewriters, phones and the mail.’
Today, it’s all about the immediate access to information, applications and to one another. Even small enterprises are increasingly online, mobile and Web 2.0-driven, to the point where IT is no longer just a business tool. It is business — the heart and the circulatory system through which most transactions flow. If your IT systems fail, your daily operations follow — and if the outage lasts too long, your business may fail.
So small to midsize businesses should ask themselves how they can create a consistently available infrastructure that responds robustly to new-age business challenges and disruptions. Server clustering and data mirroring can play an important role in implementing high availability. They can also serve as a cornerstone to an effective business continuity and disaster-recovery strategy, and — good news — they can be very affordable.
Clustering and mirroring for high availability
Server clustering serves several objectives: creating scalability, load balancing and, of course, increasing system availability. Clustering for high availability allows automated failover between servers in the cluster and provides close monitoring of applications and all their components, including the operating system, server hardware, networking and storage.
The clustering software determines when to perform a failover by continually checking each application’s “heartbeat” signal. If one system has a problem, the application on another server in the cluster takes over. To the outside world the cluster appears to be a single system, but intelligent redundancy within it creates high availability.
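The heartbeat check described above can be sketched in a few lines of Python. This is an illustration only, not how any particular clustering product works: the `Node` class, the `choose_active` function, the server names and the three-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time the last heartbeat was received, in seconds


def is_healthy(node, now, timeout):
    # A node counts as healthy if its heartbeat arrived within the timeout.
    return now - node.last_heartbeat <= timeout


def choose_active(nodes, active_name, now, timeout=3.0):
    """Return the name of the node that should be running the application."""
    by_name = {node.name: node for node in nodes}
    if is_healthy(by_name[active_name], now, timeout):
        return active_name  # no failover needed
    for node in nodes:
        if is_healthy(node, now, timeout):
            return node.name  # fail over to the first healthy node
    raise RuntimeError("no healthy node available for failover")


# Hypothetical two-server cluster: server-a last reported 5 seconds ago.
cluster = [
    Node("server-a", last_heartbeat=100.0),
    Node("server-b", last_heartbeat=104.0),
]
print(choose_active(cluster, "server-a", now=105.0))  # prints server-b
```

Real clustering software also has to guard against both servers believing they are active at once, which this sketch ignores.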
Application availability is only half of the IT requirement. The data that applications create and use must be equally available in order for business to continue. Disk mirroring is the recording of redundant data on two partitions of the same disk or two separate disks, for fault-tolerant operation.
Mirroring is a central component in the highest level of data protection and disaster recovery, and it differs from ordinary backups, which simply replicate a complete volume at specific points in time, often for use in testing. Mirroring creates dynamic, real-time copies of data volumes, which further reduces the amount of data at risk of loss. Mirroring can be done using the level 1 features of a Redundant Array of Independent Disks (RAID). RAID can be provided through the motherboard or a controller card, or built into a dedicated disk array.
Benefits and challenges
Server clustering provides three key benefits.
- High availability: Designed to avoid a single point of failure.
- Scalability: Computing power can be increased by adding more processors or computers.
- Manageability: Appears as a single-system image with a single point of control.
Data mirroring has benefits of its own.
- Protects against data loss: Added redundancy offers backup in case of hardware failure.
- Disaster protection: Offers quick recovery against site and region-wide incidents.
- Individual disk access: Each disk or set of disks in the mirror can be accessed separately for reading purposes.
Although mirroring is essential to ensure high availability of data, it’s not a complete data protection solution by itself. Mirroring is ineffective if the data is corrupted. For example, a virus might corrupt or erase data, or a user might accidentally delete data. This is why data protection in the form of regular backups is also necessary for file-level protection.
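A toy model makes the distinction concrete. The sketch below is an assumption-laden illustration, not a real storage API: `MirroredVolume` stands in for a two-disk mirror using plain dictionaries, and the file name is hypothetical. It shows that an accidental delete is faithfully mirrored to both copies, while a point-in-time backup taken earlier still holds the data.

```python
import copy


class MirroredVolume:
    """Toy two-disk mirror: every change is applied to both copies."""

    def __init__(self):
        self.disk_a = {}
        self.disk_b = {}

    def write(self, path, data):
        self.disk_a[path] = data
        self.disk_b[path] = data

    def delete(self, path):
        # The mirror cannot tell a mistake from an intended change,
        # so the delete is replicated to both disks.
        self.disk_a.pop(path, None)
        self.disk_b.pop(path, None)

    def snapshot(self):
        # A point-in-time backup is an independent copy of the contents.
        return copy.deepcopy(self.disk_a)


volume = MirroredVolume()
volume.write("ledger.db", "month-end figures")
backup = volume.snapshot()   # regular backup taken
volume.delete("ledger.db")   # user deletes the file by accident

print("ledger.db" in volume.disk_a, "ledger.db" in volume.disk_b)  # prints False False
print(backup["ledger.db"])  # prints month-end figures
```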
Advice for IT
When SMEs decide to implement clustering and mirroring as part of a healthy, high-availability solution and business continuity/disaster-recovery plan, the implementation should be managed seamlessly to maximise the benefits. Consider the following.
It’s all about the bucks: Systems that provide data protection and recovery in an hour, day or week are less expensive than ones that deliver business-critical service, which should experience close to zero downtime. You and the key managers of your business need to look at all of the business functions and processes that are dependent on IT. Then ask, “What is the financial impact on each of these services if IT goes down?”
Always start with the application: A critical first step is determining which applications require 24x7 availability. To help with this task, SMEs can build a dependency tree for each application that should be available. List everything the application needs in order to work, such as its switch, server and desktop.
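A dependency tree of this kind can be as simple as a lookup table. The sketch below is illustrative only: the application and component names in `DEPENDS_ON` are hypothetical, and `all_dependencies` is just one way of walking the tree to list everything an application ultimately relies on.

```python
# Hypothetical dependency map: each item lists what it directly relies on.
DEPENDS_ON = {
    "order-entry": ["app-server", "desktop"],
    "app-server": ["database", "core-switch"],
    "database": ["san", "core-switch"],
    "desktop": ["core-switch"],
    "san": [],
    "core-switch": [],
}


def all_dependencies(component, graph):
    """Return every component the given one depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(component, []))
    while stack:
        item = stack.pop()
        if item not in seen:
            seen.add(item)
            stack.extend(graph.get(item, []))
    return seen


print(sorted(all_dependencies("order-entry", DEPENDS_ON)))
# prints ['app-server', 'core-switch', 'database', 'desktop', 'san']
```

Anything that appears in the result is a component whose failure can take the application down, which makes it a candidate for redundancy.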
RPO and RTO: Determine your business’s recovery point objective (RPO) and recovery time objective (RTO). The RPO, in effect, is the amount of data loss your business can sustain, while the RTO is the amount of time you can afford your systems to be down (the maximum tolerable outage). If a disaster occurs, how much time can your business afford to lose? An hour? A day? A week? This depends on the nature of your particular business and your owners’ or managers’ appetite for business risk. So it’s important that IT alone does not decide what the RPO and RTO are.
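Once the business has set these two numbers, checking an existing setup against them is simple arithmetic. The sketch below assumes the worst case, in which a failure strikes just before the next backup, so the data at risk equals the backup interval. The function name `meets_objectives` and the one-hour and two-hour objectives are hypothetical; the four-hour backup cycle and six-hour restore echo the figures Dunedin Casino reported for its old system.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Compare worst-case data loss and downtime with the agreed objectives."""
    # Worst case: failure occurs just before the next backup would have run,
    # so everything since the last backup (one full interval) is lost.
    worst_case_data_loss = backup_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": restore_time <= rto,
    }


result = meets_objectives(
    backup_interval=timedelta(hours=4),
    restore_time=timedelta(hours=6),
    rpo=timedelta(hours=1),  # hypothetical objective
    rto=timedelta(hours=2),  # hypothetical objective
)
print(result)  # prints {'rpo_met': False, 'rto_met': False}
```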
Five nines: Most SMEs should strive to achieve five-nines reliability, which means systems are available 99.999 per cent of the time. Not all businesses need or can achieve five-nines reliability; perhaps four or three nines is adequate in some cases. The decimal point differences may seem like hair splitting, but they reflect significant duration or frequency of outages. Think about it this way: a system that is 99.999 per cent available to a business that operates only 40 hours per week (and most operate more hours than that) is unavailable for just over one minute per year. One that is 99.99 per cent available is unavailable for about 12 minutes per year. One that is 99.9 per cent available is unavailable for about two hours per year, and of course, management doesn’t get to decide which two.
How much does two hours of down time matter to your business, especially if you can’t pick and choose which two hours you lose? That question demonstrates the Russian roulette situation of ignoring system availability in your business plan.
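The arithmetic behind the nines is easy to check. The sketch below assumes a business operating 40 hours a week for 52 weeks a year, or 2080 operating hours; the function name `downtime_minutes_per_year` is made up for the example. Under those assumptions, five nines allows roughly 1.2 minutes of downtime a year, four nines about 12.5 minutes and three nines about 125 minutes, or just over two hours.

```python
def downtime_minutes_per_year(availability_pct, hours_per_week=40, weeks_per_year=52):
    """Expected unavailable minutes per year at a given availability percentage."""
    operating_minutes = hours_per_week * weeks_per_year * 60
    return operating_minutes * (1 - availability_pct / 100)


for pct in (99.999, 99.99, 99.9):
    print(f"{pct} per cent: {downtime_minutes_per_year(pct):.1f} minutes")
```

A business that runs around the clock has 8760 operating hours a year, so passing `hours_per_week=168` raises each figure roughly fourfold.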
To outsource or not, that is the question: What is the level of service you’ll need? Is there an in-house IT expert who has the bandwidth to manage server clustering and disk mirroring? If not, consider bringing in your solutions provider to do it for you, or even consider hosted services to support your business-critical infrastructure.
Don’t forget business continuity/disaster recovery: As clustering and mirroring are part of a healthy business continuity/disaster recovery plan, you should test your systems regularly. The frequency with which an organisation can test depends on the disaster-recovery budget, but as a benchmark, SMEs should test at least twice a year. If it is impossible to test the entire system, periodically test the most critical applications and systems.
According to Gartner, improving availability will help to reduce direct loss of revenue and loss of future revenue, revenue loss through failure to meet contractual obligations, productivity loss or overtime costs, along with a damaged reputation. Remember, your system is your business and your business is your system. Nathan Coutinho, Network World