Business Continuity and Disaster Recovery will be the weakest point for most VCDX candidates. I know it was for me: I flopped miserably during my first VCDX attempt. So make it your strongest point and become a Disaster Recovery guru.
You may already have this addressed in your VCDX design. If so, collect all of the Recoverability and Availability requirements with your design decisions and then map these to an A3 diagram to ensure that it all fits together. Run through the scenarios you are protecting against and validate them. You will be surprised at the inconsistencies you will expose.
Here is the list of questions and topics to work through (in no particular order; I typed them as they occurred to me):
- Where do your DR requirements come from?
- Do you just make them up?
- What is a Business Impact Analysis?
- Understand the relationship between MTD, WRT, RTO, RPO and Availability
- What does five/four/three 9s of availability actually mean?
- How will that impact the cost of the solution?
- Should I protect every system?
- What should I protect against?
- How do I protect against Site Failure?
- How do I protect against Hardware or Software failure?
- How do I protect against Datastore/LUN failure?
- How do I protect against the Accidental/Malicious Deletion of Data?
- What is a Runbook?
- Can I automate it?
- Are manual Runbooks a good idea?
- How do I automate DR Failover/Failback?
- When should I execute Disaster Recovery Drills?
- How does Switchover differ from Failover?
- Does my Availability calculation include Planned Downtime or just Unplanned?
- What availability will the VMware vSphere availability/recovery mechanisms actually give me?
- What will they protect against?
- VMware vSphere HA – Host-HA, VM-HA, App-HA
- VMware SRM/vSphere Replication
- VMware vSphere Data Protection
- VMware vSphere VADP/CBT
- VMware vCenter Server Heartbeat
- Storage Replication – Asynchronous and Synchronous
- What other mechanisms are there to protect my Customer’s services and data?
- Application Clustering
- DB replication/protection mechanisms
- Do I need a third site for Witness/Quorum?
- Multi-site Data Center connectivity – Bandwidth and Latency
- Backup times, RECOVERY TIMES
- Backup/Recovery – Physical Tape, IP-based
- Tape Movement procedures
- Global Site Load Balancing
- DNS – Intranet and Internet
- Load Balancing
- How do I implement DR Automation for multi-platform solutions (e.g. vSphere, z/OS and AIX)?
- Who are the DR Automation players out there?
- Who are the Backup/Recovery experts?
- How do I design solutions with the right blend of Disaster Recovery, High Availability and Backup/Recovery?
- What are the operational considerations?
- When I implement my Disaster Recovery plan, how does it impact my customers?
- What will they see?
- How long will they be interrupted for?
- Do they have to do anything?
- Do I have to communicate with them?
- How will I communicate with them?
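Two of the questions above (what the 9s of availability actually mean, and how MTD, WRT and RTO relate) come down to simple arithmetic. Here is a minimal Python sketch, assuming a 365-day year and the commonly used definition MTD = RTO + WRT. The function names and the sample RTO/WRT values are my own illustrations, not part of any VMware tool.

```python
# Sketch only: the function names and the sample RTO/WRT values are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def max_tolerable_downtime(rto_hours: float, wrt_hours: float) -> float:
    """MTD under the common definition MTD = RTO + WRT."""
    return rto_hours + wrt_hours


for label, pct in [("three 9s", 99.9), ("four 9s", 99.99), ("five 9s", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.2f} minutes/year")

# Hypothetical example: 4 hours to restore the system, 2 hours to verify and catch up.
print(f"MTD: {max_tolerable_downtime(4, 2)} hours")
```

Three, four and five 9s work out to roughly 8.8 hours, 53 minutes and 5.3 minutes of downtime per year respectively. Whether that budget covers planned downtime as well as unplanned is exactly the kind of thing to pin down with the customer.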
Resources you may want to consider:
Validate these scenarios:
- My customer has an RPO of 60 minutes. The data being protected is 5TB in size. My design intends to use the existing LTO-2 tape library for recovery. Will I meet the customer’s RPO in the event of data corruption?
- My customer has an availability requirement of five 9s for a Tier-1 Business Critical Application. My design uses vSphere HA and vSphere Replication to meet this SLA with a manual runbook for site failover/failback. Is this a major risk to the SLA when the primary site fails?
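A back-of-the-envelope check helps with both scenarios. The sketch below is in Python and rests on assumptions that are not in the scenarios themselves: a single LTO-2 drive streaming at its native rate of about 40 MB/s, decimal terabytes, one site failure per year, and a hypothetical 120-minute manual failover. Substitute your own figures.

```python
# Sketch only: every figure below other than 5 TB and five 9s is an assumption.

MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year
LTO2_NATIVE_MB_PER_S = 40  # assumed LTO-2 native (uncompressed) drive speed


def transfer_hours(data_tb: float, mb_per_s: float = LTO2_NATIVE_MB_PER_S, drives: int = 1) -> float:
    """Hours to stream data_tb (decimal terabytes) at a sustained rate across parallel drives."""
    data_mb = data_tb * 1_000_000
    return data_mb / (mb_per_s * drives) / 3600


def availability_percent(downtime_minutes: float) -> float:
    """Availability over one year given total downtime in minutes."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# Scenario 1: 5 TB through a single LTO-2 drive.
print(f"5 TB over one drive: {transfer_hours(5):.1f} hours")

# Scenario 2: one site failure per year handled by a manual runbook.
# The 120-minute failover duration is hypothetical.
budget = MINUTES_PER_YEAR * (1 - 99.999 / 100)
print(f"Five 9s budget: {budget:.2f} minutes/year")
print(f"Availability after a 120-minute failover: {availability_percent(120):.3f}%")
```

Under these assumptions, streaming 5 TB takes well over a day. That figure bears most directly on recovery time, but it also means a full copy cannot be taken every 60 minutes, so the RPO would have to be met by some other mechanism. In the second scenario, a single manual failover of that length would use up many years' worth of a five 9s downtime budget.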