SQL Server Disaster Recovery Planning – The Basics of Risk Assessment
With SQL Server 2012 now out, I have updated this post to include information about Availability Groups. If you are using Failover Cluster Instances instead, use these steps with only the minor changes detailed in step #7 below.
When designing a Disaster Recovery plan there are three basic steps that must be done before anything else:
1) Assess your current DR situation
2) Create a plan
3) Test your plan
It is easiest to combine these into one activity: start by assessing what you have, then move on to creating a DR plan, and finally simulate and test it. Many companies do ‘paper’ testing of their plans but never actually run them in a simulated environment until it’s too late. I think that is a huge mistake. Doing things in this order lets you fix the gaps before you have spent time and money on implementation, not after. It also ensures that your plan will actually work by putting it through its paces before huge amounts of money or effort have been invested in it! These steps are explained below with some added context around how to know what your risks are and where priorities should be placed for recovery strategies.
The first 3 steps happen within an organization without involving any vendors or consultants; however, there is no substitute for professional help when doing things like writing scripts or testing failovers, so you might find yourself reaching out for assistance during those tasks down the road.
Step 1) Assess Your Current DR Situation
Assessment is probably the step most companies ignore or do poorly. How many times have you heard someone say, “We tested our DR plan and it worked 100% of the time”? What they really mean is, “We simulated it once in our lab and everything worked fine.” That is NOT good enough! What happens when there are two SQL Servers in a cluster? Even if you have a copy of your production DBs, how do you know that restoring them won’t cause corruption? You can’t make a disaster recovery plan until the risk assessment has been completed.
Risk assessments should be done periodically (at least annually) because even if nothing has changed in your SQL Server environment since your last assessment, new threats may exist due to growth/acquisitions, changes in business units, etc. You need to know what matters to your organization and why. Data sensitivity and criticality should be ranked so that restoration priorities can be put in place and corresponding DR/BC plans created and tested.
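To make the ranking idea concrete, here is a minimal sketch in Python. The database names and the 1 to 5 scoring scales are my own assumptions for illustration; your organization will have its own inventory and its own way of scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseRisk:
    name: str
    criticality: int  # assumed scale: 1 (low) to 5 (business stops without it)
    sensitivity: int  # assumed scale: 1 (public) to 5 (regulated or confidential)


def restoration_order(databases):
    """Return the databases sorted so the highest-priority restore comes first."""
    # Criticality drives the order, sensitivity breaks ties,
    # and the name keeps the result predictable when both scores match.
    return sorted(databases, key=lambda d: (-d.criticality, -d.sensitivity, d.name))


# Hypothetical inventory, for illustration only.
inventory = [
    DatabaseRisk("Reporting", criticality=2, sensitivity=3),
    DatabaseRisk("Orders", criticality=5, sensitivity=4),
    DatabaseRisk("HR", criticality=3, sensitivity=5),
    DatabaseRisk("Payments", criticality=5, sensitivity=5),
]

for rank, db in enumerate(restoration_order(inventory), start=1):
    print(rank, db.name)
```

With these made-up scores, Payments is restored first and Reporting last. The point is not the scoring formula itself but that the order is written down and agreed on before a disaster, not argued about during one.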
Some things you need to assess: What does your backup solution look like? Do you have a backup or do you rely on replication? If replication is used, how often are the databases synchronized? How long would it take for an administrator to re-point the application to another data center if all connectivity was lost? What about availability of power, cooling, networking devices & SQL Servers in the event of a major disaster such as a natural disaster or fire? Are there any legal requirements (e.g. financial or regulatory) that dictate recovery point or recovery time objectives? How is your data center configured and do you know the physical security controls (who can access it, when they can access it, etc.)?
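The backup question is one of the few above that can be partly answered by a script. The sketch below flags databases with no backup, or no recent one. It is only an illustration: the database names, the timestamps and the 24 hour threshold are assumptions, and on a real server you would probably pull the timestamps from backup history (for example the `msdb.dbo.backupset` table) rather than hard-code them.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup_by_db, now, max_age):
    """Return names of databases with no backup, or whose newest backup is older than max_age."""
    stale = []
    for name, finished_at in sorted(last_backup_by_db.items()):
        if finished_at is None or now - finished_at > max_age:
            stale.append(name)
    return stale


# Hypothetical data: time the most recent backup finished, or None if never backed up.
now = datetime(2024, 1, 15, 9, 0)
last_backup = {
    "Orders": datetime(2024, 1, 15, 8, 45),
    "Reporting": datetime(2024, 1, 12, 2, 0),
    "Archive": None,
}

print(stale_backups(last_backup, now, max_age=timedelta(hours=24)))
# ['Archive', 'Reporting']
```

A backup that exists is not the same as a backup that restores, so treat a clean result here as the start of the assessment, not the end.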
These two documents written by Mike Ruthruff and Warren Frame provide a good foundation for conducting an organizational risk assessment:
You should also check out this extensive list of questions from Brian Moran. It provides much more detail than I will about what to look at in your environment. Also there are many other sources online for how to conduct a full enterprise risk assessment e.g. here and here.
OK so now you have gone through all these steps and documented everything you learned along the way. Now onto step 2…
Step 2) Determine the RPO & RTO
What are RPO and RTO?
Recovery Time Objective (RTO) is the maximum tolerable downtime that a business process can experience. For example, if your users have to enter transactions manually because electronic submission of entries is not available, what would be an acceptable amount of time for them to go without entering data before they stop coming to work or find something else to do? Recovery Point Objective (RPO) is the maximum amount of data, usually expressed as a window of time, that the business can afford to lose. Together, RTO and RPO determine how often you need to back up and how quickly you must be able to restore in order to meet your business and compliance requirements. It should be noted that this does not give license to just back up everything all the time! You still need good practices around change control/change management so that only necessary data is being backed up.
The RPO and RTO are what you need to determine before you can start writing your DR & BC plans. The goal should be to recover at least the most recent critical transactions that occurred before the outage, since those are what the business needs in order to resume its processes in the shortest amount of time. Once you have determined all the requirements, such as how many servers will participate in a failover and how much storage capacity you’ll need for standby machines, then it’s time to move on to step 3…
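A quick back-of-the-envelope calculation helps here. The sketch below is a rough model under assumptions of my own: worst-case data loss is taken as the log backup interval plus the delay in copying backups offsite, and restore time is a fixed overhead plus database size divided by restore throughput. All of the numbers are invented; measure your own.

```python
def worst_case_data_loss_minutes(log_backup_interval_min, offsite_copy_lag_min=0):
    """Assumed model: you can lose everything since the last log backup that made it offsite."""
    return log_backup_interval_min + offsite_copy_lag_min


def estimated_restore_minutes(database_size_gb, restore_rate_gb_per_min, fixed_overhead_min=0):
    """Assumed model: fixed overhead (provisioning, decisions) plus size divided by throughput."""
    return fixed_overhead_min + database_size_gb / restore_rate_gb_per_min


def meets_objectives(rpo_min, rto_min, data_loss_min, restore_min):
    """Compare the estimates with the objectives agreed with the business."""
    return {"rpo_met": data_loss_min <= rpo_min, "rto_met": restore_min <= rto_min}


# Hypothetical figures: log backups every 15 minutes, copied offsite within 5 minutes,
# a 500 GB database restored at 5 GB per minute after 30 minutes of setup.
loss = worst_case_data_loss_minutes(15, offsite_copy_lag_min=5)
restore = estimated_restore_minutes(500, 5, fixed_overhead_min=30)

print(meets_objectives(rpo_min=30, rto_min=120, data_loss_min=loss, restore_min=restore))
# {'rpo_met': True, 'rto_met': False}
```

In this made-up case the backup schedule satisfies a 30 minute RPO, but a 130 minute restore misses a 2 hour RTO, which tells you that restoring from backup alone will not be enough for that database.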
Step 3) Determine Your Recovery Strategy
There are three main types of Disaster Recovery strategies: Cold, Warm and Hot site with various permutations within each category. It’s important to note that you can combine different DR strategies to create a DR plan which fits your organization’s needs.
There are two main components to Disaster Recovery: Business Continuity (BC) and Data Recovery. BC is the process of ensuring that your organization can continue to operate in the event of a major outage. This includes things such as having a backup power supply, alternate work locations and ensuring that critical staff are available. Data Recovery is the process of recovering your data after a disaster has occurred. This includes restoring data from backups and re-establishing communication between systems.
Step 4) Implement & Test Your DR Plan
Once you have determined all the steps in your DR plan, it’s now time to start implementing it. This includes setting up the standby systems, configuring replication between systems, testing the failover plan and updating your staff on the procedures. You should also test how well your organization responds to a disaster by simulating an outage. This will help to identify any gaps in your plan and allow you to make the necessary changes.
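When you simulate an outage, time every step. A minimal sketch of how the timings from a drill could be compared against your RTO is shown below; the step names and durations are hypothetical, and I am assuming the steps run one after another.

```python
def drill_report(step_minutes, rto_minutes):
    """Summarize a failover drill: total elapsed time, whether it fit the RTO, and the slowest step."""
    total = sum(minutes for _, minutes in step_minutes)
    slowest_name, _ = max(step_minutes, key=lambda step: step[1])
    return {
        "total_minutes": total,
        "within_rto": total <= rto_minutes,
        "slowest_step": slowest_name,
    }


# Hypothetical timings recorded during a simulated outage.
drill = [
    ("Detect outage and declare disaster", 20),
    ("Fail over databases", 15),
    ("Re-point applications", 45),
    ("Validate with business users", 30),
]

print(drill_report(drill, rto_minutes=90))
```

In this example the drill takes 110 minutes against a 90 minute RTO, and the slowest step is re-pointing the applications, not the database failover itself. That is exactly the kind of gap a paper test will never show you.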
Disaster Recovery (DR) is the process of recovering from a major outage or disaster. A disaster can be something as simple as a power outage or as catastrophic as a natural disaster such as a hurricane or tornado.
The goal of DR is to ensure that critical business processes can resume as quickly as possible after a disaster has occurred.
In order to have an effective DR plan, you need to first conduct an organizational risk assessment. This will help you to identify the critical business processes and the potential risks that could impact them. Once you have identified these risks, you need to put together a plan to mitigate them.
The next step is to determine the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO is the maximum amount of data that can be lost without impacting critical business processes. The RTO is the maximum amount of time that you can take to recover from a disaster.
Step 5) Assess Your Current State
In order to determine how well your organization would respond to a disaster, you need to first assess your current state. This includes things such as your IT infrastructure, business continuity plans and staff training. You should also test your recovery procedures to ensure that they will be effective in the event of a real disaster.
There are three main types of Disaster Recovery strategies: Cold, Warm and Hot site. A Cold site is the most basic type of DR strategy. It is a backup location with space, power and connectivity where you can set up your systems and start operations, but little or nothing is installed ahead of time, so recovery takes the longest. A Warm site is more advanced than a Cold site. It includes all the components of a Cold site plus hardware and software that are already in place, so you can start up your systems, bring the data up to date from backups or periodic copies, and resume business operations. A Hot site is the most advanced type of DR strategy. It includes all the components of a Warm site, and its systems run in parallel with the original systems on continuously synchronized data, so it can take over with minimal downtime.
You can also combine different DR strategies to create a plan that fits your organization’s needs. For example, you may want a Hot site for your most critical databases and a Cold site for systems that can tolerate a longer outage.
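One way to think about which site type fits which system is to start from its RTO and RPO. The sketch below does that with thresholds that are purely my own assumptions (one hour, one day, and so on); cost and business input will move them in practice.

```python
def suggest_site_type(rto_hours, rpo_hours):
    """Suggest a site type from the objectives. The thresholds are illustrative assumptions."""
    # Either a very short RTO or a very short RPO is enough to require the higher tier.
    if rto_hours <= 1 or rpo_hours <= 0.25:
        return "hot"
    if rto_hours <= 24 or rpo_hours <= 4:
        return "warm"
    return "cold"


# Hypothetical systems with (RTO, RPO) in hours.
systems = {
    "Payments": (0.5, 0.1),
    "HR": (8, 24),
    "Reporting": (72, 24),
}

for name, (rto, rpo) in systems.items():
    print(name, suggest_site_type(rto, rpo))
```

With these numbers Payments lands on a hot site, HR on a warm site and Reporting on a cold site, which is one example of combining strategies within a single plan.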
Step 6) Review and Update your DR Plan
Once you have implemented your DR plan, it’s important to review and update it on a regular basis. This includes things such as reviewing the risks that could impact your critical business processes, testing your recovery procedures and updating your staff on the latest procedures.
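Regular reviews are easy to promise and easy to forget, so it can help to track them. Here is a small sketch that lists which activities are overdue; the activity names, the dates and the one year interval are assumptions for illustration.

```python
from datetime import date, timedelta


def overdue_items(last_done, today, max_interval=timedelta(days=365)):
    """Return the names of activities whose last completion is older than max_interval."""
    return sorted(name for name, done in last_done.items() if today - done > max_interval)


# Hypothetical review log: when each activity was last completed.
last_done = {
    "Risk assessment": date(2023, 3, 1),
    "Failover test": date(2024, 1, 10),
    "Staff training": date(2022, 11, 5),
}

print(overdue_items(last_done, today=date(2024, 6, 1)))
# ['Risk assessment', 'Staff training']
```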
Disaster Recovery is the process of recovering from a disaster. A disaster can be anything from a natural disaster such as a hurricane or tornado, to a man-made disaster such as a data breach or power outage.
The goal of Disaster Recovery is to ensure that critical business processes can resume as quickly as possible after a disaster has occurred. There are two main components to Disaster Recovery: Business Continuity and Data Recovery.
Business Continuity is the process of ensuring that your organization can continue to operate in the event of a disaster. This includes things such as having a backup power supply, alternate work locations and ensuring that critical staff are available.
Step 7) Develop a Business Continuity Plan
In order to have an effective Business Continuity Plan, you need to first identify the critical business processes. These are the processes that are essential to your organization’s day-to-day operations. Once you have identified these processes, you need to put together a plan to mitigate the risks that could impact them.
There are three main types of risks that you need to consider: Natural disasters, Man-made disasters and System failures. You should also identify any dependencies that your critical business processes have on other systems.
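Once dependencies are written down, they also tell you the order in which things have to come back. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to derive a recovery order; the process and system names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each item lists what must be running before it can start.
depends_on = {
    "Order entry": {"Orders database", "Authentication"},
    "Orders database": {"Storage", "Network"},
    "Authentication": {"Network"},
    "Storage": set(),
    "Network": set(),
}

# static_order() yields every item after all of the items it depends on.
recovery_order = list(TopologicalSorter(depends_on).static_order())

for position, item in enumerate(recovery_order, start=1):
    print(position, item)
```

Several valid orders can exist, so the exact sequence may differ between runs, but infrastructure such as Network and Storage always comes before the database, and the database always comes before the business process that depends on it. If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is a useful finding in its own right.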
Once you have identified the risks, you need to develop a plan to address them. This includes things such as having a backup power supply, alternate work locations and ensuring that critical staff are available. You should also test your Business Continuity Plan to ensure that it will be effective in the event of a real disaster.
Now that you have an understanding of Disaster Recovery, it’s important to ensure that your organization is prepared for a major outage. By implementing a DR plan, you can minimize the impact that a disaster would have on your business.
Conclusion:
Conducting a risk assessment and determining an appropriate RPO & RTO is the most critical step in building your DR plan. Once you have all that mapped out, decide what your recovery strategies will be and how each site/application/etc. will participate in your failover strategy (warm standby vs active-active). From there, you will need to implement the plan and test it regularly. Reviewing and updating the DR plan is an ongoing process that should be done on a regular basis.
Disaster Recovery planning is an important process for any organization. By implementing a DR plan, you can minimize the impact that a disaster would have on your business.