Business Continuity in a Windows Environment
Businesses and organisations of all sizes increasingly depend on IT systems for their operations. Ensuring that these systems, and their stored data, keep operating is a critical part of business planning. Business continuance is the process of ensuring that critical data and systems remain available even if hardware, software, or environmental problems interrupt the primary servers' normal operation.
This article describes some technologies and approaches for achieving business continuity within the Windows environment, and also discusses how to determine what should be protected and the level of protection necessary.
Deciding What to Protect
The simplest approach to business continuance is to treat all data as equal, and equally important to business operations. However, this approach leads to unnecessary cost and complexity; many areas of the Windows infrastructure already have fault resilience. For example, Active Directory's multi-master replication model means that the loss of one or more domain controllers (DCs) is survivable as long as at least one DC remains. Likewise, DNS, WINS, and most other infrastructure servers can handle the loss of one or more participants. Furthermore, these services are designed so that a failed server can often be quickly rebuilt or replaced, with its data restored via replication from its remaining peers.
In contrast, line-of-business applications and their associated data generally don't enjoy this same level of protection. Messaging, databases, resource planning, and other critical systems carry the data without which a business cannot operate, so it's critical to ensure that those data are protected.
Figure 1 illustrates the systems that are commonly protected versus those that often are not.

Figure 1: Infrastructure and Line of Business servers and their protection
Preparing an organisation for the unexpected is the role of business continuity planning (BCP), which includes all aspects of the business, from office space to switchboard responses. The specific preparations taken by the IT group to ensure continuous access to information resources are a subset of BCP known as disaster recovery planning (DRP). The key point to remember is this: until critical line-of-business data can be restored, most other business resumption efforts cannot even begin.
Stop, Evaluate, and Conduct a Simple Business Impact Analysis (BIA)
The decision as to which systems are worth protecting must be based on business requirements and costs, particularly the cost of interrupted operations. Unplanned downtime costs money; it's not uncommon for post-failure analysis to show that the direct costs of an outage are tens of thousands of pounds greater than the investment that would have been needed to protect the system.
Before you determine the technology to be implemented in a disaster recovery solution, the business needs to stop and evaluate. Unless the business understands the impact that lost data and lost services would have on its operations, getting the protection right becomes a matter of best endeavours.
To determine the most appropriate solution, Servo recommends that a business impact analysis (BIA) is completed for each server. The BIA determines what effect the loss of a specific server will have on the organisation; for example, a failure disrupting the accounts payable system may have wide-ranging consequences for cash flow, customer retention and credit rating.
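As an illustration of how BIA findings might be tabulated, the following Python sketch ranks servers by the estimated cost of an outage. The server names, hourly loss figures and outage durations are hypothetical values chosen for the example, not results from any real analysis, and a real BIA would also weigh non-financial effects such as customer retention and credit rating.

```python
from dataclasses import dataclass


@dataclass
class ServerImpact:
    """One row of a simple BIA worksheet (all figures are assumed)."""
    name: str
    hourly_loss_gbp: float        # estimated cost of one hour of downtime
    expected_outage_hours: float  # time to recover with current protection

    @property
    def outage_cost_gbp(self) -> float:
        # Estimated direct cost of a single outage of this server.
        return self.hourly_loss_gbp * self.expected_outage_hours


def rank_by_impact(servers):
    """Return the servers ordered from most to least costly outage."""
    return sorted(servers, key=lambda s: s.outage_cost_gbp, reverse=True)


# Hypothetical inventory used only to show the calculation.
servers = [
    ServerImpact("accounts-payable", 2000.0, 8.0),
    ServerImpact("file-and-print", 150.0, 8.0),
    ServerImpact("exchange", 1200.0, 12.0),
]

ranked = rank_by_impact(servers)
for server in ranked:
    print(f"{server.name}: estimated outage cost {server.outage_cost_gbp:.0f} GBP")
```

A ranking of this kind only suggests where investment may be best targeted; the figures fed into it are estimates that need to be agreed with the business.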
Findings from the analysis will help determine the strategies for offsetting the risks and provide more targeted investment in disaster recovery technology. When considering any business continuity technology, establishing some baseline expectations is often helpful. It's important to examine possible solutions in terms of two goals: data protection and data availability.
These goals can be measured using two quantitative measures:

Figure 2: Recovery Point Objective and Recovery Time Objective
Recovery Time Objective (RTO) represents the amount of time between the start of an outage and the resumption of normal business operations. The RTO for an outage that can be resolved by reloading from a tape backup includes the time necessary to locate and mount the tape, the time required to restore the data from the tape, and any time necessary to post-process the restored data before restarting the downed applications.
Recovery Point Objective (RPO) represents the point to which business data can be restored. This can be thought of as the latency between the live data and its backup. It measures how out-of-date the backed up data copy will be compared to the original and how much data will have been lost. For example, a nightly backup means that the RPO will be the time between when data was written to the tape and when the failure occurs: a failure any time on Tuesday has an RPO of Monday night.
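The two measures can be illustrated with a little arithmetic. The Python sketch below assumes a nightly backup written at 23:00 on a Monday and a failure at 15:00 on the Tuesday; the dates, times and the duration of each recovery step are assumptions made purely for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Recovery point: the span of work lost since the last good backup."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


def tape_recovery_time(locate_and_mount: timedelta,
                       restore: timedelta,
                       post_process: timedelta) -> timedelta:
    """Recovery time for a tape restore: the sum of its three stages."""
    return locate_and_mount + restore + post_process


# Assumed scenario: backup on Monday night, failure on Tuesday afternoon.
last_backup = datetime(2005, 3, 7, 23, 0)   # a Monday
failure = datetime(2005, 3, 8, 15, 0)       # the following Tuesday

lost = data_loss_window(last_backup, failure)
downtime = tape_recovery_time(
    locate_and_mount=timedelta(minutes=45),  # assumed
    restore=timedelta(hours=3),              # assumed
    post_process=timedelta(minutes=30),      # assumed
)

print(f"Data written in the last {lost} is lost")
print(f"Services resume {downtime} after the outage begins")
```

Comparing figures like these with the objectives agreed for each system shows whether the current backup regime is adequate or whether a faster technology is needed.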
Agreeing the RPO and RTO with senior management helps justify expenditure and, from an IT management perspective, helps you prioritise your focus on key systems rather than treating your computer room as one entity.
From simple backup to full site recovery:
5 key technologies
Tape backup systems are inexpensive and fairly reliable, but because backups are scheduled events they offer poor RPO and RTO. Although backups can never be eliminated, as they hold historical records, removing the business's dependency on them can greatly improve the organisation's ability to recover from lost data.
Disk to Disk to Tape (Delayed Protection):
Disk-to-disk-to-tape backup, or backup staging, is an excellent way of improving data recovery times as well as providing increased security from media damage. By backing up to low-cost disks, such as an S-ATA disk array, backups can be maintained in near-line storage, whilst consolidated copies of the backup jobs are placed on tape and sent to an off-site vault. Because disks are random-access devices, multiple backup jobs can be executed simultaneously, unlike tape devices, which can handle only a single job at a time (unless multiplexing is available). As the most recent data always exists on disk, recovery is quick and efficient, with immediate access to local backups and no need to recall and mount media. From a business-continuity perspective, this ensures all servers are better protected and allows media to be released for off-site vaulting. From a DR perspective, however, the RTO and RPO are not improved, because recovery from tape will continue to be relatively slow and data will only be protected up to the last backup.

Replication (Real Time Protection):
To reduce data loss (RPO), real-time data replication should be considered as a method of enhancing backups. Software-based replication products such as NSI Doubletake or HP Openview Storage Mirroring can perform many-to-one replication, providing real-time protection and enabling parallel backups to tape. Servo recommends the HP NAS Proliant DL380 G4 Storage Server with embedded Windows Storage Server 2003, to provide ease of management and full integration with existing Microsoft infrastructures. To provide immediate off-site data protection and improve site outage RTO, replication should be relayed to an off-site hosting centre, or to a regional or branch site using low-cost short-haul services.
Bare Metal Recovery (Server Recovery):
Software-based real-time replication protects against data loss (RPO) and, because the data is immediately accessible on disk, greatly improves the recovery time (RTO). However, unless hot standby clustered servers are provided, replicating data alone doesn't enable the business to recover services rapidly. Bare Metal Recovery (BMR) rapidly recovers a server's operating system by booting the replacement server from a bootable CD-ROM. The CD-ROM loads a cut-down version of Windows and allows a previous image of the server to be downloaded and installed at partition level, rebuilding the operating system to its last known good state. BMR can provide automated recovery of servers in a fraction of the time of a manual rebuild, greatly reducing your RTO, and is a low-cost solution when hot standby servers cannot be justified. Once a server has been rebuilt, the replication software can be reverse synchronised from the target, providing rapid recovery of data volumes at LAN speeds without loss of data. However, BMR still needs manual intervention, and although it reduces the RTO it doesn't eliminate it.
Clustering + Replication (Application Recovery):
Software replication protects against data loss, and although it improves recovery times, downtime is still measured in hours. If the required recovery time (RTO) is less than 4 hours, or the operation needs to be guaranteed around the clock, then high availability solutions need to be considered. Clustering services work on the principle of a one-to-one relationship or a standby node (N+1), where all servers are pre-configured to run the line-of-business application. The most efficient and reliable clusters are those in which the nodes access the same storage on a shared storage area network, ensuring no transactional data is lost. For continuous access, this storage is replicated to a DR site in synchronous mode, providing transactionally safe replication. However, many Windows applications don't need this level of availability. With host-based asynchronous replication, using NSI Doubletake, applications such as SQL and Exchange can be protected without investment in storage area networks, whilst still providing rapid failover to dedicated hot standby servers.
Hot Sites, Warm Sites and Cold Sites (Site Recovery):
When looking at business continuity, your recovery time objective (RTO) will assist in deciding what level of DR site is required to meet the business recovery SLA.
A DR hot site provides a fully operational computing environment that includes servers, storage, and networking equipment. Applications and data at the hot site are synchronised with the primary site and, in a disaster, operational support of IT systems can be quickly switched from the primary site to the hot site. A DR hot site provides the highest level of protection against outages by being readily available, and typically has high availability clusters deployed.
A DR warm site generally refers to a facility with all the hardware and communications equipment needed to run a business; however, the systems are not kept in a constant state of operational readiness. When a disaster is declared, applications and data must be recovered at the warm site to provide support for ongoing business operations. A DR warm site can protect against outages that last for a long time, but is not available instantaneously. Windows software-based replication is typically used in warm sites.
A DR cold site facility provides power, communications access, and the environment for hosting a computing infrastructure, but no actual hardware. Following a disaster, IT staff must re-create the primary site from scratch, requiring considerable work before being capable of hosting business applications. A DR cold site can protect against outages that last for a long time, but takes considerable time to get up and running. Contracts with third-party recovery providers are often maintained, reducing the cost of rental of standby buildings.
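The relationship between RTO and site type can be sketched as a simple lookup. In the Python example below, the 4-hour boundary is taken from the high availability guidance given earlier in this article; the 72-hour boundary between warm and cold sites is purely an assumption for illustration, and each organisation would need to set its own thresholds.

```python
def suggest_dr_site(rto_hours: float, warm_site_limit_hours: float = 72.0) -> str:
    """Suggest a DR site type for a given recovery time objective.

    rto_hours: the agreed RTO, in hours.
    warm_site_limit_hours: assumed upper bound for a warm site (not a
        figure from the article; adjust to suit the organisation).
    """
    if rto_hours <= 0:
        raise ValueError("RTO must be a positive number of hours")
    if rto_hours < 4:
        # Less than 4 hours: continuously synchronised systems are needed.
        return "hot"
    if rto_hours <= warm_site_limit_hours:
        # Hardware is in place, but applications and data must be recovered.
        return "warm"
    # Only power, communications and floor space are waiting.
    return "cold"


for rto in (1, 24, 168):
    print(f"RTO of {rto} hours: consider a {suggest_dr_site(rto)} site")
```

In practice the decision also depends on cost and on the RPO, so a lookup like this is a starting point for discussion rather than a rule.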
Testing all the elements of your business continuity plan
Whichever Windows technology is chosen to meet the business disaster recovery requirements, it is important to rehearse recovery events to ensure that procedures are correct and that systems are being protected successfully. In this article we have touched on backup, replication and clustering technologies that will assist in reducing your RPO and RTO to meet recovery SLAs. We have also touched briefly on reviewing the type of DR site needed to meet the recovery requirements. But unless DR processes are developed and tested, the technology implemented could be failing to protect your systems. Below is a diagram describing the core components of business continuity. It highlights that technology alone cannot ensure that IT systems can be recovered successfully.

Figure 3: Business Continuity is more than just technology recovery.
Summary
Envisioning scenarios with the capacity to cripple an organisation's technology assets no longer requires a tremendous stretch of the imagination. Whether a result of power failure, terrorist attack, flooding, or other natural disaster, organisations have had recent reminders of the critical importance of planning for disasters. No matter how unlikely it may seem today, every organisation must face up to the near certainty of a business-wide failure of IT systems occurring at a future date. Anticipating these events, and planning corrective courses of action, is now a prerequisite to business success.
Servo can assist organisations in developing a disaster recovery plan (DRP) by determining which critical applications need protecting through a Business Impact Analysis (BIA) and, based on the acceptable level of data loss (RPO) and system availability (RTO), determining the most appropriate technology to meet the organisation's recovery SLA. Using the latest technology from Microsoft, HP, NSI Software, Veritas and C.A., Servo is confident that a solution can be developed to meet the tightest budget, whilst still offering value for money.
