VCDX – Troubleshooting Scenario Strategy

D-day is nearly over: you have just completed the 75-minute presentation and the 30-minute design scenario.  With barely a pause, you dive into the final 15-minute “Troubleshooting Scenario”.  How should you handle this task?  What should your strategy be?


Some initial pointers:

  • Segment the whiteboard into a template (Q&A, Compute, Network, Storage, VM, VIM).
  • You are not being graded on fixing the issue; you are being marked on your troubleshooting methodology.
  • You need to demonstrate a systematic and reasoned approach to problem resolution.
  • Do not skip from silo to silo. Start in the area where you suspect the issue lies, complete the “top-down” or “bottom-up” process, and then move to the next silo.
  • Do not expect to complete every silo.  The strategy below MAY allow you to complete 2 silos.
  • 15 minutes is a very short amount of time, so move quickly and methodically through your process.
  • The panelists want you to succeed, so listen carefully to their responses for clues and hints.
  • React to their answers if something does not make sense.
  • Each scenario has additional hidden bonus points; the panelists are following a script where you need to ferret out the information by asking the correct questions.
  • Reason out loud; you cannot be graded on your thought process if you are silent.
  • Be mindful of the 15-minute timer, stick to your strategy and be wary of the “Rabbit Hole”: do not get stuck in one section, spinning your wheels and asking questions that do not improve your position.
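
The time pressure in these pointers can be made concrete with a little arithmetic. The sketch below assumes the budgets used in this post (a 15-minute scenario, 1 to 5 minutes of opening questions, and roughly 5 minutes per silo); the function and constant names are purely illustrative.

```python
TOTAL_MINUTES = 15      # length of the troubleshooting scenario
MINUTES_PER_SILO = 5    # assumed budget for one silo


def silos_completed(question_minutes: int) -> int:
    """Return how many whole silos fit after the opening questions."""
    remaining = TOTAL_MINUTES - question_minutes
    return max(remaining, 0) // MINUTES_PER_SILO


if __name__ == "__main__":
    for q in (1, 3, 5):
        print(f"{q} min of questions -> {silos_completed(q)} full silo(s)")
```

Whether the opening questions take 1 or 5 minutes, only two full silos fit, which is why it matters to start in the silo you most suspect.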

[Figure: Troubleshooting Scenario]

Questions to Consider (1-5min)

  1. What is the Problem Description?
  2. What is the Criticality of the system?
  3. Has a Service Request been opened with the Vendor(s)?
  4. Is the issue Continuous or Intermittent?
  5. Is the issue System-wide or Localised?
  6. Is the solution compliant with the VMware Compatibility Guide?
  7. Have there been any recent changes?
  8. What are the reported Alarms/Alerts/Logs?

Based upon the answers to the initial questions, you will have to pick the most likely silo where you think the problem(s) lie.

It is also worth knowing the vSphere stack as described in this blog post; in particular, you should be able to draw it on a whiteboard very quickly and use it for the silos below.

Compute & Availability (5min)

  • Extract the Compute & Availability Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate: VM -> Scheduler/Malloc, vSphere HA, vSphere DRS/DPM, Resource Pools, Reservations, Limits.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Compute & Availability stack analysis, select the next silo and move on.

Storage (5min)

  • Extract the Storage Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate: VM -> VMM -> NMP (FC/FCoE/iSCSI) -> SIOC (iSCSI/NFS) -> SDRS -> Datastore -> HBA (FC/FCoE) -> SAN SW -> Storage Array -> Spindles.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Storage stack analysis, select the next silo and move on.

Network (5min)

  • Extract the Network Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate: VM -> VMM -> vSwitch -> NIOC -> vmnic -> Access SW -> Core SW -> Router/Firewall.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Network stack analysis, select the next silo and move on.

Virtual Infrastructure Management (Datacenter) (5min)

  • Extract the vCenter Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate (Datacenter): vSphere Client/Browser -> Web Client/vCenter -> Inventory Service -> SSO Service -> Database.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Virtual Infrastructure Management (Datacenter) stack analysis, select the next silo and move on.
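
One way to rehearse the "finish one silo before starting the next" discipline is to treat each silo as an ordered list of layers and walk it in a single direction. The sketch below is only an illustration: the layer names are copied from the Storage and Network chains above (with the protocol notes in parentheses dropped), while `STACKS`, `walk` and `plan` are hypothetical names invented for this example.

```python
from typing import Iterator, List, Tuple

# Layer chains taken from the "Investigate" bullets; only two silos shown.
STACKS = {
    "Storage": ["VM", "VMM", "NMP", "SIOC", "SDRS", "Datastore",
                "HBA", "SAN SW", "Storage Array", "Spindles"],
    "Network": ["VM", "VMM", "vSwitch", "NIOC", "vmnic",
                "Access SW", "Core SW", "Router/Firewall"],
}


def walk(silo: str, direction: str = "top-down") -> List[str]:
    """Return the layers of one silo in the order they should be checked."""
    layers = STACKS[silo]
    if direction == "top-down":
        return list(layers)
    if direction == "bottom-up":
        return list(reversed(layers))
    raise ValueError(f"unknown direction: {direction}")


def plan(suspected_order: List[str],
         direction: str = "top-down") -> Iterator[Tuple[str, str]]:
    """Yield (silo, layer) pairs, completing each silo before the next."""
    for silo in suspected_order:
        for layer in walk(silo, direction):
            yield silo, layer


if __name__ == "__main__":
    # Example: the opening questions pointed at the network first.
    for silo, layer in plan(["Network", "Storage"], direction="bottom-up"):
        print(f"{silo}: check {layer}")
```

The order passed to `plan` stands in for the judgement call you make after the opening questions; the walk itself never jumps between silos.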

Common Resolution & Test methods to Consider

If you request information/tests/tool output, make sure you can explain how to access and use them and then interpret and explain the information returned.

  • Change one thing at a time, test, then either leave the change in place or roll it back before proposing the next step
  • Move “problem VM” to another Host/Datastore/vSwitch
  • Console access to “problem VM”: access application, storage, iometer, copy files, network, ping -t gateway, check event logs
  • SSH access to ESXi host: vmkping neighbour hosts/iSCSI/NAS/vmk, esxtop analysis
  • Check alarms and event logs of vCenter, reports of vCOps (if used)
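
The "change one thing at a time" rule in the list above can be sketched as a simple loop: apply a single change, test, keep it if the problem is gone, otherwise roll it back and try the next one. Everything here is hypothetical and simulated (the `Change` type, the `state` dictionary, and the host and datastore names); no real vSphere API is called.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def try_changes(changes: Iterable[Change],
                is_fixed: Callable[[], bool]) -> Optional[str]:
    """Apply one change at a time; keep the first that fixes the issue."""
    for change in changes:
        change.apply()
        if is_fixed():
            return change.name   # leave the working change in place
        change.rollback()        # undo before proposing the next step
    return None


# Simulated environment: the fault follows the datastore, not the host.
state = {"host": "esx01", "datastore": "ds01"}

changes = [
    Change("move VM to another host",
           lambda: state.update(host="esx02"),
           lambda: state.update(host="esx01")),
    Change("move VM to another datastore",
           lambda: state.update(datastore="ds02"),
           lambda: state.update(datastore="ds01")),
]

result = try_changes(changes, is_fixed=lambda: state["datastore"] != "ds01")
print(result, state)
```

Because the host move is rolled back before the datastore move is tried, the final state differs from the starting point by exactly one variable, which is what lets you explain the result to the panel.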


Published by

vcdx133

Chief Enterprise Architect and Strategist, 4xVCDX#133, NPX#8, DECM-EA.
