Though data loss and work stoppages caused by IT failures are costly, many businesses still lack a disaster recovery plan. Those that do have one often fail to test it frequently, and the plan itself may need more complete protection.
For New York businesses, the need for a disaster recovery plan goes beyond financial foresight: the amended Cybersecurity Regulation, 23 NYCRR Part 500, requires covered businesses to have one and to test it at least once a year.
Disaster recovery testing involves simulating data loss and role-playing disasters to verify the effectiveness of your recovery plan. This includes testing your employees and ensuring your company can restore data and applications essential to operations.
Checklist Testing
Checklist testing evaluates a disaster recovery plan by cross-referencing it against comprehensive checklists derived from the collective knowledge of the organization. This lets businesses verify the completeness and accuracy of critical recovery procedures. However, because the approach is simple, it may overlook complex vulnerabilities that require more in-depth testing.
Tabletop Testing
This testing method relies on skilled stakeholders who talk through the disaster recovery plan and discuss potential issues. Their knowledge is valuable, and the discussion can help identify gaps and improve clarity, but this method lacks the technical testing needed to confirm how the plan will perform.
Walk-Through Testing
Walk-throughs build on tabletop testing: instead of talking through the plan, stakeholders carry out its steps. This hands-on approach ensures a unified understanding of the process, fosters familiarity with critical equipment and resources, and helps pinpoint procedural gaps or potential roadblocks. However, while effective for verifying procedural accuracy and resource availability, walk-through testing may not uncover all technical issues that could arise during a real-world disaster scenario.
Simulation Testing
Stakeholders take part in a role-play in which a specific disaster has occurred. They must work through the event, consulting the disaster recovery plan and responding accordingly. The test should include both physical and digital operations so that it mirrors a real event. Communication, access to documentation, and the effectiveness of instructions are all evaluated in this test.
Parallel Testing
Parallel testing is more costly because it requires businesses to set up a duplicate of the live production environment. In return, the test interacts directly with the system, which gives a more accurate understanding of potential weaknesses.
Full-Interruption Testing
Full-interruption testing is the most comprehensive and realistic way to assess a disaster recovery plan because it simulates a real disaster in the production environment. Due to its disruptive nature and significant impact on business operations, it should only be conducted after all less intrusive testing methods have been thoroughly carried out and validated.
Disaster Recovery Testing Scenarios
Testing your disaster recovery plan should include a variety of scenarios to ensure your business is prepared. Here are some key scenarios to consider:
Equipment Failures
Servers crash, hard drives fail, and network connections can be severed. Any of these failures can cause data loss and disrupt business operations. It’s important to test backup systems and failover mechanisms to ensure recovery is possible if equipment fails.
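One way to test a backup is to restore it to a separate location and confirm that the restored files match the originals. The Python sketch below is a minimal illustration of that idea, not a complete backup test: it assumes the backup has already been restored to a folder, and the function names `sha256_of` and `verify_restore` are hypothetical names chosen for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under source_dir with its restored copy.

    Returns a sorted list of relative paths that are missing or differ.
    An empty list means every source file was restored intact.
    """
    problems = []
    for source_file in source_dir.rglob("*"):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"changed: {relative.as_posix()}")
    return sorted(problems)
```

Calling `verify_restore(Path("data"), Path("restored"))` with your own folder names returns an empty list when every file came back intact. Note that this only compares against data that still exists, so it suits a planned test rather than a real outage.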
User Errors
Human error has long been a part of technology. For disaster recovery, the concern is protecting against accidental deletions, incorrect data entries, and misconfigurations. Testing the ability to reverse changes and restore operations is imperative.
Natural Disasters
With natural disasters, it’s not a matter of if, but when. Even for areas not prone to large storms, there is always the threat of fires and floods. To be proactive, your disaster recovery testing should evaluate your ability to relocate operations, access offsite backups, and maintain communication during a crisis.
Loss of Key Personnel
Every business has go-to employees, but it’s never a good idea to rely solely on a few people. Employees may leave their roles, and an unexpected departure can leave your organization vulnerable. Testing should swap out staff to see how your team responds when someone is absent. Documenting procedures and cross-training staff can also provide the redundancy needed to overcome an unexpected departure.
Malware Risks
Ransomware has been on the rise, and though diligence goes a long way, businesses must evaluate their ability to detect and contain malware. Testing staff on potential scams and alerting them to potential threats should be part of your general IT practices. Regular system and software updates should include looking for and patching vulnerabilities.
Best Practices For Disaster Recovery Testing
- Test Frequently
- Test a Variety of Scenarios
- Test Both Your Technology & Your People
- Document Everything
- Define Metrics (How you performed and goals to improve)
- Evaluate the Results Of Your Tests
- Review and Update Your Plan Regularly
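To make the "Define Metrics" practice concrete, a test log can record how long recovery took in each scenario and compare it with a goal. The Python sketch below is one possible way to do that. The `DrillResult` record, the `evaluate` function, and the sample times are hypothetical and made up for illustration; real goals should come from your own plan.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    scenario: str           # e.g. "Equipment failure"
    target_minutes: float   # hypothetical goal for restoring operations
    actual_minutes: float   # time measured during the test


def evaluate(results: list[DrillResult]) -> list[str]:
    """Return one line per drill stating whether it met its goal."""
    report = []
    for result in results:
        gap = result.actual_minutes - result.target_minutes
        if gap <= 0:
            report.append(f"{result.scenario}: met goal")
        else:
            report.append(f"{result.scenario}: missed goal by {gap:.0f} min")
    return report


if __name__ == "__main__":
    # Illustrative numbers only, not real test results.
    drills = [
        DrillResult("Equipment failure", target_minutes=60, actual_minutes=45),
        DrillResult("User error", target_minutes=30, actual_minutes=50),
    ]
    for line in evaluate(drills):
        print(line)
```

Keeping the results of each test in this form makes it easier to see whether recovery times improve from one test to the next.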