How to debug a Bluescreen minidump

So, your machine is bluescreening on a semi-regular basis.  It’s annoying the @#$% out of you, but you can’t find anything in the system logs that indicates what’s causing it.  Maybe (as in my case) the computer in question is your DVR box, and sometime during the night Media Center wakes up, tries to update the program guide, and then bluescreens.  Nothing helpful is left in the logs, but you did get a minidump file.  If you got a minidump, my friend, you are in business!

  1. Make sure you have a minidump file for your bluescreen.  You should see a numbered file with the .dmp extension, dated to match the bluescreen, in C:\Windows\Minidump
  2. Download a handy free tool called BlueScreenView by Nirsoft.  It automatically deciphers a minidump file so you can verify that it matches what you saw on the bluescreen.  It won’t give you everything you need, but it will tell you whether you have the right minidump for the crash you saw.  It also shows you the codes that were thrown, so you don’t have to write them down by hand at the bluescreen.  You’ll note that BlueScreenView often reports a source for the error (ntoskrnl.exe in my case), but this is usually NOT the real root cause.  As we’ll soon find out, the module it cites is often just where the crash surfaced; the real problem tends to be a driver that module was calling into, or one that called into it.
  3. Do these one-time setup steps.  In order to make sense of the minidumps, you need some tools provided by Microsoft:
    1. Download and install the Debugging Tools for Windows pack.  Make sure you get the right version for your OS (Win7 x64, Vista 32-bit, etc.).  This pack contains the kernel debugging tools you’ll need.
    2. windbg.exe will likely be installed in C:\Program Files\Debugging Tools for Windows (x64) (or the 32-bit equivalent, depending on which version you installed)
    3. Open a command prompt as administrator and cd to the windbg.exe directory
    4. run:
      windbg.exe -IA
      windbg will start up, and inform you that it is now the registered file association handler for all dump files. Close windbg.exe
    5. Restart windbg, and go to file->Symbol File Path
    6. Enter:
      SRV*C:\Development\SymCache*http://msdl.microsoft.com/download/symbols
      You can set the local directory ('C:\Development\SymCache' in my case) to whatever you want, but everything else must be exact.  This instructs windbg to load the needed symbols from Microsoft’s symbol server (release builds don’t ship with symbols, and MS isn’t about to hand out the kernel source so you can compile your own. :))  Whenever you debug something and windbg needs symbols, it checks your cache location first; if they aren’t there, it downloads them and stores them in the cache.  So the more you debug, the more symbols you build up and the faster future debugging will go.  Exit windbg and save the settings.
  4. Open windbg.exe (again), choose File->Open Crash Dump, and open the minidump in C:\Windows\Minidump that corresponds to the bluescreen you’re trying to debug.  You might need to start windbg as administrator.
  5. Windbg will automatically start downloading symbols and doing some basic analysis.  It may look like it’s finished or just sitting there, but don’t do anything until you see its ‘diagnosis’, which usually looks like this:
    Use !analyze -v to get detailed debugging information.
    BugCheck 9F, {3, fffffa800af7f440, fffff80000b9c4d8, fffffa800745f860}
    Probably caused by : usbhub.sys
  6. But don’t take this as the final word on the crash source and send nasty letters to the usbhub.sys driver writer!  Type !analyze -v as it suggests, and you’ll likely get a more detailed analysis, like this:
    DRIVER_POWER_STATE_FAILURE (9f)
    A driver is causing an inconsistent power state.
    Arguments:
    Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
    Arg2: fffffa800af7f440, Physical Device Object of the stack
    Arg3: fffff80000b9c4d8, Functional Device Object of the stack
    Arg4: fffffa800745f860, The blocked IRP
    Debugging Details:
    ------------------
    DRVPOWERSTATE_SUBCODE: 3
    IMAGE_NAME: usbhub.sys
    DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bcc2d
    MODULE_NAME: usbhub
    FAULTING_MODULE: fffff8800767a000 usbhub
    CUSTOMER_CRASH_COUNT: 1
    DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
    BUGCHECK_STR: 0x9F
    PROCESS_NAME: System
    CURRENT_IRQL: 2


    STACK_TEXT:
    fffff800`00b9c488 fffff800`02ef3273 : 00000000`0000009f 00000000`00000003 fffffa80`0af7f440 fffff800`00b9c4d8 : nt!KeBugCheckEx
    fffff800`00b9c490 fffff800`02e9029e : fffff800`00b9c5c0 fffff800`00b9c5c0 00000000`00000001 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x292b0
    fffff800`00b9c530 fffff800`02e8fdd6 : fffff800`03034700 00000000`00146bde 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x66
    fffff800`00b9c5a0 fffff800`02e904be : 00000030`9c591870 fffff800`00b9cc18 00000000`00146bde fffff800`03002e48 : nt!KiProcessExpiredTimerList+0xc6
    fffff800`00b9cbf0 fffff800`02e8fcb7 : 00000010`31b602c1 00000010`00146bde 00000010`31b602f2 00000000`000000de : nt!KiTimerExpiration+0x1be
    fffff800`00b9cc90 fffff800`02e8ceea : fffff800`02ffee80 fffff800`0300cc40 00000000`00000002 fffff880`00000000 : nt!KiRetireDpcList+0x277
    fffff800`00b9cd40 00000000`00000000 : fffff800`00b9d000 fffff800`00b97000 fffff800`00b9cd00 00000000`00000000 : nt!KiIdleLoop+0x5a

    STACK_COMMAND: kb
    FOLLOWUP_NAME: MachineOwner
    FAILURE_BUCKET_ID: X64_0x9F_3_AiCharger_IMAGE_usbhub.sys
    BUCKET_ID: X64_0x9F_3_AiCharger_IMAGE_usbhub.sys

Now we see the whole story.  Arg1 tells us a device object has been blocking an IRP (an I/O request packet; for this bugcheck, a power request) for too long, and the stack shows how the crash was raised: a kernel timer expired (KiTimerExpiration), its DPC (deferred procedure call) ran, and that routine called KeBugCheckEx.  In other words, this looks like a watchdog: Windows gave the device stack a set amount of time to finish the power request, nothing answered, and the timer threw the bluescreen.  usbhub.sys gets the blame, most likely because the stuck request was sitting in its device stack, but since it’s a hub with many things plugged into it, odds are good that the real offender is a driver for one of the devices attached to the hub.

The ‘failure bucket’ gives it away: X64_0x9F_3_AiCharger_IMAGE_usbhub.sys names a second driver, AiCharger, alongside the usbhub.sys image.  If I look in Device Manager in Windows, I find a driver called AiCharger.sys under the USB devices.  Ah ha!  A quick Google reveals this is a driver that enables smart/high-speed USB charging of iPhone/iPod devices on my Asus motherboard.  If I go one step further, I can speculate that the bug is in the portion of the driver that is supposed to respond to sleep/wake/power events, and that somehow the call to wake up the iPhone I have plugged in isn’t getting a response.  Dang, Asus owes me a donut for doing all the work for them.
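
If you end up reading a lot of these, the handful of fields that mattered above can be pulled out of the !analyze -v text with a few regular expressions.  This is only a rough sketch: the function name is made up, and the way it splits the bucket ID on ‘_IMAGE_’ is my guess at the format based on this one example, not a documented rule.

```python
import re

# A few lines taken from the analysis output above.
SAMPLE = """\
BugCheck 9F, {3, fffffa800af7f440, fffff80000b9c4d8, fffffa800745f860}
Probably caused by : usbhub.sys
IMAGE_NAME: usbhub.sys
FAILURE_BUCKET_ID: X64_0x9F_3_AiCharger_IMAGE_usbhub.sys
"""


def summarize(text):
    """Pull the bugcheck code, its arguments, the blamed image and the bucket ID out of the text."""
    summary = {}

    # "BugCheck 9F, {3, ..., ...}" -> code and arguments, all printed in hex.
    m = re.search(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", text)
    if m:
        summary["code"] = int(m.group(1), 16)
        summary["args"] = [int(a.strip(), 16) for a in m.group(2).split(",")]

    m = re.search(r"IMAGE_NAME:\s*(\S+)", text)
    if m:
        summary["image"] = m.group(1)

    m = re.search(r"FAILURE_BUCKET_ID:\s*(\S+)", text)
    if m:
        bucket = m.group(1)
        summary["bucket"] = bucket
        # ASSUMPTION: the token just before "_IMAGE_" names the extra driver
        # involved (AiCharger here). This is inferred from one sample only.
        left, sep, _ = bucket.partition("_IMAGE_")
        if sep:
            summary["suspect"] = left.rsplit("_", 1)[-1]

    return summary


print(summarize(SAMPLE))
```

Treat the ‘suspect’ value as a hint about where to look in Device Manager, not as proof.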

So, now you know who’s really responsible.  You send a bug report to Asus with the dump results, and you uninstall the AiCharger tool (or stop leaving your iPhone plugged in overnight while the machine is asleep) until they get a fix for AiCharger.  You also find out that someone else already had the same problem.
There are many other debugging commands you can also use, and those are all outlined here.  Hopefully this will help YOU out the next time you hit some crazy bluescreen you can’t figure out, and you won’t have to re-install the OS to get rid of it.
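
If you’d rather not click through windbg every time, the same analysis can probably be driven from a script.  The sketch below only builds the command line; it assumes the console debugger kd.exe sits next to windbg.exe in the Debugging Tools folder, and it uses the debugger’s -z (dump file), -y (symbol path) and -c (initial command) options.  The helper names and the example.dmp file name are made up for illustration.

```python
from pathlib import Path

SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir):
    """Build the SRV*<local cache>*<server> string from the setup steps."""
    return f"SRV*{cache_dir}*{SYMBOL_SERVER}"


def newest_dump(folder):
    """Return the most recently modified .dmp file in folder, or None."""
    folder = Path(folder)
    if not folder.is_dir():
        return None
    dumps = sorted(folder.glob("*.dmp"), key=lambda p: p.stat().st_mtime)
    return dumps[-1] if dumps else None


def analyze_command(debugger, dump, cache_dir):
    """Argument list that opens a dump, runs !analyze -v, then quits."""
    return [
        str(debugger),
        "-z", str(dump),               # dump file to open
        "-y", symbol_path(cache_dir),  # where to find and cache symbols
        "-c", "!analyze -v; q",        # commands to run on startup
    ]


# ASSUMPTION: these paths mirror the ones used in this post; adjust to yours.
cmd = analyze_command(
    r"C:\Program Files\Debugging Tools for Windows (x64)\kd.exe",
    r"C:\Windows\Minidump\example.dmp",  # hypothetical file name
    r"C:\Development\SymCache",
)
print(cmd)
```

On the Windows box itself you would pass newest_dump(r"C:\Windows\Minidump") as the dump and hand the list to subprocess.run from an administrator prompt.  I’ve left that part out so the sketch runs anywhere.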

Protips: 99% of the time, bluescreens come from a driver and not from something in Windows itself, especially if they are repeatable.  Always get the latest drivers first.
When the crashes are wake/sleep/resume/power related and the latest driver doesn’t solve it, the second step is to open the device in Device Manager and uncheck ‘Allow the computer to turn off this device to save power’.  This keeps Windows from powering the device down, so it makes fewer calls into possibly faulty power-handling code.  Power management issues are still very common with drivers.
If you get dumps and the crashes are in different places every time or random in timing, then you might have bad memory or a bad motherboard that’s corrupting things.  Check heat sinks and temperatures, and consider swapping out the RAM or motherboard.
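
One quick way to tell ‘same driver every time’ from ‘different places every time’ is to tally the IMAGE_NAME from each dump’s analysis.  A minimal sketch, assuming you’ve already collected those names by hand or with a script; the function name, the 60% cutoff and the sample list are all made up for illustration.

```python
from collections import Counter


def blame_pattern(images, threshold=0.6):
    """Classify a list of blamed image names as consistent or scattered.

    threshold is an arbitrary cutoff (an assumption, not a rule): if one
    image accounts for at least that share of the crashes, call it consistent.
    """
    images = [name.lower() for name in images]
    counts = Counter(images)
    if not counts:
        return "no data", counts
    top, hits = counts.most_common(1)[0]
    if hits / len(images) >= threshold:
        return f"consistent: {top}", counts
    return "scattered", counts


# Hypothetical IMAGE_NAME values from four separate dumps.
verdict, counts = blame_pattern(["usbhub.sys", "usbhub.sys", "USBHUB.SYS", "ntfs.sys"])
print(verdict, dict(counts))
```

A ‘consistent’ result points toward a driver update or the power setting above; a ‘scattered’ one is your cue to start suspecting hardware.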

Other resources:
-The official Microsoft list of bluescreen failure codes with documentation on each one:
http://msdn.microsoft.com/en-us/library/ff542347%28v=VS.85%29.aspx

-Another list of the various bluescreen failure codes and their plaintext sub-code descriptions with some notes from external folks:
http://www.faultwire.com/solutions_index/fatal_error-1.html#IssueList

-Microsoft Answers forum that has really responsive and informative threads on just about every blue-screen investigation ever done.  These guys chew up minidumps all day and can help you track down just about anything that’s going on (if just searching the forum doesn’t do it for you automatically):
http://social.answers.microsoft.com/Forums/en-US/w7repair/threads

-Another Microsoft forum that seems to do a fair amount of this kind of debug work:
http://social.technet.microsoft.com/Forums/en/w7itproperf/threads
