VCDX – Troubleshooting Scenario Strategy

D-day is nearly finished: you have just completed the 75-minute presentation and the 30-minute design scenario.  With barely a pause, you dive into the final 15-minute “Troubleshooting Scenario”.  How should you handle this task?  What should your strategy be?

List of articles in my VCDX Deep-Dive series (more than 70 posts)

Some initial pointers:

  • Segment the whiteboard into a template (Q&A, Compute, Network, Storage, VM, VIM).
  • You are not graded on fixing the issue; you are marked on your troubleshooting methodology.
  • You need to demonstrate a systematic and reasoned approach to problem resolution.
  • Do not skip from silo to silo. Start in the area where you suspect the issue lies, complete the “top-down” or “bottom-up” process for that silo, and only then move to the next one.
  • Do not expect to complete every silo.  The strategy below MAY allow you to complete 2 silos.
  • 15 minutes is a very short amount of time, so move quickly and methodically through your process.
  • The panelists want you to succeed, so listen carefully to their responses for clues and hints.
  • React to their answers if something does not make sense.
  • Each scenario has additional hidden bonus points; the panelists are following a script where you need to ferret out the information by asking the correct questions.
  • Reason out loud; you cannot be graded on your thought process if you are silent.
  • Be mindful of the 15-minute timer, stick to your strategy and be wary of the “Rabbit Hole”: do not get stuck in one section, spinning your wheels and asking questions that do not improve your position.
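
To make the time pressure concrete, here is a minimal Python sketch of the arithmetic. It assumes the budgets used in the section headings of this post (1 to 5 minutes of opening questions, then 5 minutes per silo); the function name `silos_that_fit` is purely illustrative.

```python
TOTAL_MINUTES = 15     # length of the troubleshooting scenario
MINUTES_PER_SILO = 5   # budget per silo, taken from the section headings


def silos_that_fit(question_minutes: int) -> int:
    """Return how many full 5-minute silos fit after the opening questions."""
    remaining = TOTAL_MINUTES - question_minutes
    return max(remaining, 0) // MINUTES_PER_SILO


# Try every length of the opening question phase from 1 to 5 minutes.
for minutes in range(1, 6):
    print(f"{minutes} min of questions -> {silos_that_fit(minutes)} full silo(s)")
```

Under these assumptions, any question phase between 1 and 5 minutes leaves room for two full silos and never three, which is why picking the right starting silo matters so much.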


Questions to Consider (1-5min)

  1. What is the Problem Description?
  2. What is the Criticality of the system?
  3. Has a Service Request been opened with the Vendor(s)?
  4. Is the issue Continuous or Intermittent?
  5. Is the issue System-wide or Localised?
  6. Is the solution compliant with the VMware Compatibility Guide?
  7. Have there been any recent changes?
  8. What are the reported Alarms/Alerts/Logs?

Based upon the answers to the initial questions, you will have to pick the most likely silo where you think the problem(s) lie.

It is also worth being able to describe the vSphere stack as described in this blog post; in particular being able to draw it on a whiteboard very quickly and use it for the silos below.

Compute & Availability (5min)

  • Extract the Compute & Availability Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate: VM -> Scheduler/Malloc, vSphere HA, vSphere DRS/DPM, Resource Pools, Reservations, Limits.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Compute & Availability stack analysis, select the next silo and move on.

Storage (5min)

  • Extract the Storage Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate: VM -> VMM -> NMP (FC/FCoE/iSCSI) -> SIOC (iSCSI/NFS) -> SDRS -> Datastore -> HBA (FC/FCoE) -> SAN SW -> Storage Array -> Spindles.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Storage stack analysis, select the next silo and move on.

Network (5min)

  • Extract the Network Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate: VM -> VMM -> vSwitch -> NIOC -> vmnic -> Access SW -> Core SW -> Router/Firewall.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Network stack analysis, select the next silo and move on.

Virtual Infrastructure Management (Datacenter) (5min)

  • Extract the vCenter Blueprint from the Panelists (basic block diagram – similar to the Design scenario examples).
  • Investigate (Datacenter): vSphere Client/Browser -> Web Client/vCenter -> Inventory Service -> SSO Service -> Database.
  • Resolve: If you find settings that could be misconfigured, suggest a change, explain why and continue to move through the stack.
  • Once you have finished the Virtual Infrastructure Management (Datacenter) stack analysis, select the next silo and move on.
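
The “finish one silo before starting the next” rule can be sketched in Python. This is only an illustration: the layer lists are copied from the “Investigate” bullets above (with the protocol notes in brackets dropped), while the names `STACKS`, `walk` and `plan` are my own assumptions. The Compute & Availability entry is closer to a checklist than a strict chain, but it is treated as an ordered list here for simplicity.

```python
from typing import Dict, Iterator, List, Tuple

# Layers per silo, listed from the top of the stack (the VM or client) down.
STACKS: Dict[str, List[str]] = {
    "Compute & Availability": ["VM", "Scheduler/Malloc", "vSphere HA",
                               "vSphere DRS/DPM", "Resource Pools",
                               "Reservations", "Limits"],
    "Storage": ["VM", "VMM", "NMP", "SIOC", "SDRS", "Datastore", "HBA",
                "SAN SW", "Storage Array", "Spindles"],
    "Network": ["VM", "VMM", "vSwitch", "NIOC", "vmnic", "Access SW",
                "Core SW", "Router/Firewall"],
    "VIM": ["vSphere Client/Browser", "Web Client/vCenter",
            "Inventory Service", "SSO Service", "Database"],
}


def walk(silo: str, direction: str = "top-down") -> Iterator[str]:
    """Yield the layers of one silo in the chosen direction."""
    layers = STACKS[silo]
    if direction == "top-down":
        yield from layers
    elif direction == "bottom-up":
        yield from reversed(layers)
    else:
        raise ValueError("direction must be 'top-down' or 'bottom-up'")


def plan(silos: List[str], direction: str = "top-down") -> List[Tuple[str, str]]:
    """Build the question order: each silo is completed before the next starts."""
    return [(silo, layer) for silo in silos for layer in walk(silo, direction)]


# Example: storage is the prime suspect, network is the second choice.
for silo, layer in plan(["Storage", "Network"], direction="bottom-up"):
    print(f"{silo}: ask about {layer}")
```

The point of the structure is that the order of questions is fixed before you start talking, so you never jump between silos halfway through a stack.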

Common Resolution & Test methods to Consider

If you request information/tests/tool output, make sure you can explain how to access and use them and then interpret and explain the information returned.

  • Change one thing at a time, test, leave/rollback, then propose the next step
  • Move “problem VM” to another Host/Datastore/vSwitch
  • Console access to “problem VM”: access application, storage, Iometer, copy files, network, ping -t gateway, check event logs
  • SSH access to ESXi host: vmkping neighbour hosts/iSCSI/NAS/vmk, esxtop analysis
  • Check alarms and event logs of vCenter, reports of vCOps (if used)
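
The “change one thing at a time, test, leave/rollback” rule above can be sketched as a small Python loop. Everything here is hypothetical: the `Change` class, the `try_changes` function, and the host and datastore names only simulate a scenario in which the fault lies with one datastore. It does not call any real VMware API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Change:
    description: str
    apply: Callable[[], None]      # make exactly one change
    rollback: Callable[[], None]   # undo that one change


def try_changes(changes: List[Change], issue_resolved: Callable[[], bool]) -> List[str]:
    """Apply changes one by one; keep a change that helps, roll back one that does not."""
    log = []
    for change in changes:
        change.apply()
        if issue_resolved():
            log.append(f"{change.description}: fixed, leave in place")
            break
        change.rollback()
        log.append(f"{change.description}: no effect, rolled back")
    return log


# Simulated environment: the problem follows the datastore "ds-1" (assumption).
state = {"host": "host-a", "datastore": "ds-1"}

changes = [
    Change("Move VM to another host",
           lambda: state.update(host="host-b"),
           lambda: state.update(host="host-a")),
    Change("Move VM to another datastore",
           lambda: state.update(datastore="ds-2"),
           lambda: state.update(datastore="ds-1")),
]

log = try_changes(changes, issue_resolved=lambda: state["datastore"] != "ds-1")
for entry in log:
    print(entry)
```

Because only one variable changes per step, each test result points at a single cause, which is the reasoning the panelists are listening for.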


6 thoughts on “VCDX – Troubleshooting Scenario Strategy”

  1. This can serve as a standard troubleshooting template as well for any VMware admins.
    A must read for VCDX aspirants.
    Thanks a lot for taking time out for such informative posts.

  2. Pingback: Newsletter: June 21, 2014 | Notes from MWhite

  3. I like these bite-size guides, keep up the good work. If you can expand on each, I think you have the ultimate VCAP-DCD study guide. There is a lot of thought-provoking material here to actually challenge SysAdmins trying to break into design. Josh Odgers & yourself should get together to produce a guide…

    …Keep up the good work.

  4. Pingback: VCDX – Ask, Listen, React | vcdx133.com

  5. Pingback: VCDX Troubleshooting Skills | TheSaffaGeek

  6. Pingback: VCDX: Troubleshooting Scenario - VMice - Virtual-Ice for the masses!
