Rethinking the worst case

How have business continuity strategies evolved in the new technology landscape of cloud and virtualisation? A panel of CIOs shared their insights on how they managed the raft of incidents impacting New Zealand enterprises in the past year.

Around the table

Back row (from left)

Craig Columbus, CIO, Russell McVeagh

Jean-Pierre Aucoin, manager of architecture and standards, Vodafone NZ

Arian de Wit, GM information and technology, National Institute of Water and Atmospheric Research (NIWA)

Andries van der Westhuizen, group IT transformation manager, Stevenson Group

Neil Gong, IS manager, New Zealand Management Academies (NZMA)

Charles Clarke, APAC technical director, Veeam Software

Andrew Binnie, global IT manager, Cunningham Lindsey

Walter Chieng, director of ICT, Saint Kentigern Trust

Don Williams, APAC regional director, Veeam

James Taylor, disaster recovery assurance manager, Air New Zealand

Sally Berlin, group IT solutions, Air New Zealand

Richard Horton, CIO, Fidelity Life Assurance

Simon Casey, CIO, Barfoot & Thompson

Front row (from left):

Campbell Such, general manager IT, Bidvest

Divina Paredes, editor, CIO New Zealand (moderator)

Seated (from left):

Ross Lockwood, general manager, property and asset services, Transfield

Christine Jull, IT leader, GE Money

Andrew Cammell, CIO, Chapman Tripp

Jonathan Bird, information technology and process manager, Designer Textiles International

David Pollard, general manager – technology, PlaceMakers

At a recent CIO roundtable discussion, a panel of ICT leaders delved into how the new environment of virtualisation and cloud technologies is impacting their business continuity programs (BCP).

A key theme that emerged is the primacy of people management, whatever the cause of disruption – nature, market forces and technology mishaps.

Here are excerpts from the discussion.

BCP at the forefront

Craig Columbus, Russell McVeagh: In terms of business continuity and disaster recovery, I’ve had unfortunate experiences in the United States on a couple of different occasions. Coming to New Zealand, I carry with me a lot of those memories, and especially a bit of understanding that it’s not about technology. Technology is a key component in making sure that your DR plans go well, but it’s often about the people and the processes as well.

The recent earthquakes in Christchurch and subsequently Wellington have been beneficial in the sense that they have helped me get the attention of key members of our team and helped them understand that there’s more to it than just the technology. We are in the midst right now of doing an entire DR/BCP [disaster recovery/business continuity program] overview end-to-end, encompassing all the business units. IT is a facilitator, but is certainly not going to be the only component [being] reviewed.

Arian de Wit, NIWA: We’re a Crown Research Institute, but we are getting more and more into delivering mission-critical services to paying customers. Top of mind for me is to make sure that those paid-for services are up 24x7, and alongside that are rising staff expectations of uptime for internal services as well. We used to be able to say email goes down between 6pm and 9pm on Wednesdays so we can patch the server. Now, that’s not good enough.


Jean-Pierre Aucoin, Vodafone: I’m originally from Canada so I did actually experience a disaster at one of the data centres years ago and I lived through the whole process of recovering from that. I must say it’s nothing to do with technology, it’s all process and procedure and we were well trained to deal with it. We recovered, thankfully, because we had the people and the knowledge and the process. But, you know, this was 15 years ago. Today, if you look at the technology, we can use it to enable us to be prepared for it.

At Vodafone, we have to maintain mission-critical systems like the network as well as several key business processes that we’ve automated for our customers. So it’s [BCP] always high on the agenda for us. We are making a decision today about which applications are mission critical, so we categorise them.


Richard Horton, Fidelity Life: Top of mind for me would be legacy hardware issues and how they impact on our DR strategy.

David Pollard, PlaceMakers: I definitely agree that it’s a much bigger issue than just the technology, but obviously technology is a key part of it.

Sally Berlin, Air New Zealand: What’s changed at Air New Zealand in the last couple of years is that business continuity and disaster recovery are at the forefront of what we do, as opposed to an afterthought. Some people might be aware that we had a very public data centre outage with IBM two to three years ago. That sparked a significant technology project where we moved into two new data centres across Auckland. But another key part of that was we developed a new role, the DR and risk assurance manager, reporting through to CIO Julia Raue.

Every project needs to factor in DR. We need to ensure that before we hand it over to production, we implement a DR test, unless the application isn’t deemed to be 24x7, which is pretty unusual these days.

James Taylor, Air New Zealand: I’m pretty happy with the state of our disaster recovery position, the state I’ve inherited. It is way ahead of where it was a number of years ago. While there’s been significant technology input into where we are today, I’d like to think one of the best pieces of that has been the fact that we regularly test our systems.

And it isn’t just the systems we’re testing, it is the people, the documentation, the alerting, the processes. It’s something that I will continue to really push for. And if you’ve got a system you’re afraid to test, then there’s even more reason to test it, because there are obviously reasons why you’re afraid. It’s far better to have that test fail and have it impact your systems at 2am on a Sunday than at 3pm on Easter Sunday… that is a true disaster.

There’s continual work to do to improve it. But it feels good that when we break things all the time, we know the people, processes and systems can handle it.

Walter Chieng, Saint Kentigern Trust: Where applicable or possible, test failover systems while doing systems maintenance. This doubles as a test for the failover system and a way of minimising downtime, if any. The other reason why I encourage this is to get the ICT team used to using/testing failover systems. One of the points that was brought up a number of times was the fear to test failover systems and the group agreed these are the very systems that should be tested.

Getting used to switching over to failover systems not only gives us the confidence that the system is there when we need it, but will also get the team used to managing systems in ‘crisis’ situations. A team that knows what it feels like will probably be more agile and able to adapt to the myriad circumstances that may confront them.

The other point is looking at BC/DR from the insurance perspective. Are we able to insure against loss of productivity, reputation, customers, etc? If it is not financially viable to insure against these ‘losses’, would it give BC/DR a higher priority?

Andrew Cammell, Chapman Tripp: Christchurch was probably the first real test of our disaster recovery capabilities, which most people at the time thought was an IT issue. But it quickly turned out it wasn’t an IT issue; it was very much about people and how they reacted and what they did. That was an interesting lesson for us.

I would agree that the focus has certainly changed over the last few years I’ve been involved with Chapman Tripp in terms of this being seen as a technology issue, to one which is very much about the business. We now have a BCP committee, made up of a management team, that has regular updates around what’s going on and a big focus on the people.

Jonathan Bird, Designer Textiles: We’re bricks and mortar and we manufacture Merino fabric for people like Icebreaker, and we have what you might call a diverse environment. We have a manufacturing facility here in Auckland where we have outsourced all of our IT infrastructure and we run virtual desktops. Outsourcing is certainly not an abrogation of responsibility, especially in an area like disaster recovery.

I’m actively involved with the people we outsource to in ensuring they’ve a good plan in place and are doing their required testing as part of our service level agreement.

We also have a factory in Vietnam, which offers constant and frightening opportunities for disaster. Power is not guaranteed. In fact nothing is guaranteed in a place like Vietnam. It’s growing so fast. They have some amazing infrastructure, but they also have some very antiquated infrastructure in place as well.

We also have international sales offices elsewhere, including sales people in the US, and we have an office about to open in Hong Kong. So continuous connectivity is also something that can be a challenge.

Christine Jull, GE Capital: It’s about process and people and technology. Every situation is different and everything that you need to do on the day is not necessarily what you would have practised. So I think the quality of the leadership team of the business, how it operates, who the customers are and where it operates are just as crucial to making the right decisions.

Our challenge at the moment is we’ve spent quite a bit of time over the last couple of years getting automated and site backup and [having] all of those things working nicely. I think the issue for us now is how we take ourselves further into a virtual desktop environment. The continuous capability that we’ve talked about before is really important.


Arian de Wit: Following up on Christine’s comment that the incident will never be exactly what you planned/practised for, we have taken that approach to its logical conclusion and called our plan a Disaster Preparedness and Response Plan. This lists our mitigation measures [current and proposed/planned], notes how often they are tested, and then records a brief but clear procedure on how a response team is convened and led when an incident occurs. It’s about five pages long, with no scenarios and no detailed procedures for handling each scenario. Customer-critical services have further documentation regarding their operational resilience measures.

Ross Lockwood, Transfield: I have gained [BC] experience from the power failure in Auckland a few years ago. And I must admit you can have all the backups you like, but there’s nothing like a natural disaster to find out just how poor they are. What happened is we learned from that; just simple things like having fibre and copper links coming into the building from completely different directions and completely different exchanges, backup file servers onsite, as well as backup in the data centres so you can stand alone if you have to.

Another factor is the generator. Something I learned was to make sure the actual power generator is one that plugs in, not one that’s permanently wired. If anything ever goes wrong, you can then just go down the road and hire another one and plug it in. In Auckland, we found people would get generators that sometimes took a day and a half to actually be wired into the building, because they had to put cable in, plus the wiring and so on.

Andries Van Der Westhuizen, Stevenson Group: Stevenson Group is a diverse business, from engineering, truck dealerships, coal mining and agriculture to concrete quarries. About two years ago we started to rebuild the whole of our IT because, like James Taylor said, we had a lot of systems that we didn’t want to test because we knew they would not come back.

For example, this year we moved quite a few people from Office 2000 to 2013, a big step. So it’s not just about business continuity and rebuild, it’s also a bit of change management. I’m also busy with a program to look at business continuity, and for us, it is [about] different drivers and different business.

Campbell Such, Bidvest: If you think of business continuity from a big picture view, one of the other key things for us is around innovation. Business continuity isn’t just about keeping the lights on, it’s also about growing a business.

We’ve put a lot of our focus on the infrastructure DR side of things over the last few years. And while IT is centralised, our branches are all responsible for their own decision making. Apart from capex approval, running the IT systems and hitting their numbers, that’s pretty much how we run our business. They’re very close to the customer and the decisions are made there.

There’s a lot of effort going into how we raise the awareness for them to make decisions to keep the business lights on in their branches. Then we focus on the back-end systems around the DR. This year I’m pleased we’ve been through the process.

I think we’ve made some good progress down the proud path of continuity around the infrastructure. Now, our next step is around the innovation side of the business and how we’re going to support the topline growth.

Andrew Binnie, Cunningham Lindsey: We’re insurance assessors so we work for insurance companies and actually go out and assess all these disasters we’ve just been talking about.

Disaster to us means two things: It means there’s been a disaster, and so our systems need to be running and responsive. But then we need to actually get out in the field and respond to that disaster as a business.

It’s really important for us to have a business continuity plan that thinks about accessibility to systems. In a disaster, we can’t necessarily rely on mobile connectivity. We’ve got a lot of tablets out in the field at the moment, so we intentionally selected a solution for our tablets that could synchronise and work on local [connectivity], rather than just being a Web-based app. If we’re out in Christchurch and none of the mobile networks are working, we can’t use our systems. So we make decisions based on how we operate. It is about process and our people and ensuring that they can actually do their jobs in a disaster situation.

Benefits and setbacks

Neil Gong, NZMA: As a typical small and medium sized organisation, resource is always the biggest constraint. We struggle to put enough resources into the ‘high probability, high impact’ areas so it’s even more difficult to invest in the ‘low probability, high impact’ tasks where DR/BCP tends to sit. Also, remember they are just risks, not even issues. Why spend money on fixing things that haven’t really happened? So normally you will find some gaps in DR/BCP between SMBs and those large organisations.

At the moment we are working on a number of projects that improve the resilience of our IT infrastructure at NZMA. This effectively raises the reliability and availability of the critical information systems such as our student management system, finance and CRM systems. We ensure the business cases for these projects are presented and discussed at the executive level and the CEOs are the project sponsors.

We clearly see the benefits of cloud services. As a smaller organisation, the cloud gives us the capability of leveraging the same services as the big players. Imagine: we can use the same CRM system as Telecom NZ and Vodafone without it costing us much, plus access the scalability and reliability we wouldn’t be able to get if we were to build our own.

There are still a number of questions that remain unanswered, however. Data security and sovereignty is one of the key concerns we have. While we actively embrace the cloud, we need to be aware of the risks and new challenges it brings.

The new environment of virtualisation and cloud allows us to re-prioritise the limited IT resources and focus on ‘value-adding and innovative’ areas rather than the ‘routine, transactional and keeping the lights on’ tasks.


Social media platforms give us new channels to connect and engage with our students and the communities, plus the possibility of using them to communicate with our staff and students during an emergency.

MOOCs [Massive Open Online Courses] are another emerging platform, enabling us to provide new learning experiences and reach potential students we couldn’t reach before. The virtual online learning environment also allows us to potentially continue to operate in a disaster while physical access to the campus is unavailable.

Lessons from the earthquake

Craig Columbus: We have good systems in place, but what we did change is how we were treating employees. We make sure, for example, that every single employee has an emergency kit at their desk, they have food, they have water, they have a blanket, they have what they need under their desk. And is that really going to be needed in a big quake? Maybe not, but it really is irrelevant because it’s a comfort factor for everyone to know that it is there if they do need it.

We’ve done things such as implementing two-way radio systems that we previously would not have implemented because we, as a business, communicate via mobile phone or via Sneakernet. Well, it turns out Sneakernet’s not so great when you’re on the 25th floor and you’ve got all these floors to go up and down, because the lifts were shut down. So we’ve put two-way radios in place because the mobile phone network can get quite congested. You may or may not be able to get calls, or even texts, through. The radios are accessible for certain personnel, who will be able to communicate outside of that.

We already had in place a Business Continuity Committee. Business continuity is not something that we adopted as a result of this, it was already ongoing. We send out regular reminders. For those people who have lived through it, I guarantee it will still be fresh in their minds. For those who didn’t experience it, we do have the regular reminders going out and we do have processes in place.

It doesn’t matter how well you’ve planned. It doesn’t matter how thick your checklists are, or how thorough you’ve been. Ten per cent of it is going to be used in a disaster and the rest you’re going to fly by the seat of your pants, because everything that you thought was going to be in place and working is probably not, and something will have gone wrong. It’s key to train people and not just to follow the list.

Your response in the event of a fire on your premises is going to be very different from your response in the event of a region-wide earthquake. The emotional impact and the psychological impact are going to be very different. Being flexible is also key.

Andrew Binnie: From an IT point of view, firstly, it was about providing our people with the ability to carry on working… You don’t want people running on slow systems at a time like this. A good core infrastructure was really important at that time for us.

Simon Casey, Barfoot & Thompson: One topic of interest in our business is once we have a DR environment and there’s a major system failure, people are thinking ‘why aren’t you cutting over to DR?’ My IT manager and I are very disciplined around saying ‘no, this is not a disaster, this is an operational failure’.

If you say you will just drop over to disaster recovery, it could take weeks to come back. We did have a recent event where we had a failure early in the morning, an infrastructure failure. We were working on it through the morning and set a target: if we couldn’t recover by midday, we were happy to lose half a day of operational activity. Fortunately, we resolved the problem by 7.30am, so there was actually no business impact. But I think it’s quite a challenge for IT and for senior IT people to decide when to use that environment during an operational failure.

Andrew Binnie: It’s important that the executive owns that decision, not just IT. So you’ve done a business impact analysis: what’s the risk of it happening, and also the impact of it happening? If you put all that through a matrix you can say ‘well, there is a low risk of it happening but massive impact’, so it’s high on the point scale. Then the business decides overall, not just IT, as to what the priorities are.

Different categories of a disaster

Craig Columbus: We need to talk about categories of what constitutes a disaster. The reason I say that is because if you have a regional event, a major earthquake that has affected every single person, home, business in that area, your clients are going to have a bit of sympathy for you and more goodwill than if you have a fire in your data centre that only affects you and you didn’t have the capability to recover from that quickly and continue operations. So there’s a goodwill conversation that comes in as well when you’re talking about the types of disasters.

Simon Casey: When we did our risk analysis for our DR project, the most fascinating finding was that the biggest risk to the business income, revenue and longevity was a single building failure that your customers would not tolerate. They’d say ‘I don’t care if your building fell down’ and they’d just swap to a competitor.

We worked out every likely scenario and I agree that when an event unfolds, it’s not the one you planned. But to build your business case you should really look at the likelihood of certain events occurring. We looked at flooding, fire, earthquake, disease, all that type of stuff. But when it comes to protecting your IT assets, it really came down to your server environment, your data network and so on. So then you go from a very broad set of risks down to a small set of assets you need to protect. And the other one was the environment that houses your computer environment. So you put your focus into that.

You need to put all the scenarios into your risk model; one or two examples are not going to do it. I think we probably did 12 to 15, and you could see that a flood could be similar to an earthquake, because we’re looking at likely outage durations and recovery times.

Sally Berlin: It is brand damage as well. It is easy to measure loss of revenue because you kind of know how much you make on an hourly, daily basis. But brand is a bit harder.

Jean-Pierre Aucoin: We had an office in Christchurch and lost the whole site [during the earthquake]. We had to get a portable cell site just to get connectivity, so we could serve our customers and then extend coverage and all that. One thing we did learn is that New Zealand is very obsessed with Eftpos; nobody carried any cash. So when Christchurch went out, every ATM network was down and nobody had any cash to buy anything. We actually flew someone from Auckland to Christchurch to deliver a suitcase full of cash just to give to our staff.

The point I’m trying to make is that when it comes to disaster recovery or business continuity, there’s always a different fit for purpose for us. We think of our customer first.

Planning for constant change

Andrew Cammell: This idea of BCP being a separate exercise is a bit like security as well. When you’re putting something new in, you’ve got to think about how you’re going to make it secure, how you’re going to protect it… It’s all just part and parcel of making changes.

One thing about the IT industry is it is constantly changing. Change management, BCP, security – they’re all things that we do all the time, forever.

Charles Clarke: If that’s the case for BCP, what’s next, what’s the next innovation do you think will shift the BCP landscape, shift DR or make us rethink backup for that sort of stuff?

Craig Columbus: We know that data, communication and contact needs to be ubiquitous, any device, anywhere, anytime. And, we have to make sure that data is protected, it’s secure for appropriate eyes only. That’s the future right there. It’s just how do we get there? That’s unfolding every day in front of us because we used to do this in very controlled environments. Now, I’ve got the world at my fingertips on my various devices. So yes, the future is going to be around making sure that data is available anywhere, any time, on any device.

Veeam sponsored the CIO roundtable, ‘Business continuity in a shifting technology landscape’, held in Auckland.


Photos by Jason Creaghan
