How to debug a Bluescreen minidump

So, your machine is bluescreen-ing on a semi-regular basis.  It’s annoying the @#$% out of you, but you can’t find anything in the system logs that indicates what’s causing it.  Maybe (like in my case) the computer in question is your DVR box and sometime during the night Media Center is waking up, trying to update a program guide, and then blue-screening.  Nothing helpful is left in the logs, but you did get a minidump file.  If you get a minidump, my friend, you are in business!

  1. Make sure you have a minidump file for your bluescreen.  You should see a numbered file with the .dmp extension, timestamped with the date/time of the bluescreen, in C:\Windows\Minidump
  2. Download a handy free tool called BlueScreenView by Nirsoft.  It will automatically decipher a minidump file so you can verify that it matches what you saw on the blue-screen.  It won’t give you everything you need, but it will tell you if you have the right minidump for the crash you saw. It also shows you the codes thrown so you don’t have to write them down by hand at the bluescreen.  You’ll note that BlueScreenView often reports a source of the error (ntoskrnl.exe in my case), but this is usually NOT the real root cause.  As we’ll soon find out, the module it cites is often just the place where the crash was finally raised; the real problem tends to be in some other driver that was involved along the way.
  3. Do these one-time setup steps.  In order to make sense of the minidumps, you need some tools provided by Microsoft:
    1. Download and install the Debugging Tools for Windows pack.  Make sure you get the right version for your OS (Windows 7 x64, Vista 32-bit, etc).  This pack contains the kernel debugging tools you’ll need.
    2. windbg.exe will likely be installed in c:\program files\Debugging Tools for Windows (x64) (or the (x86) equivalent if you have a 32-bit system)
    3. Open a command prompt as administrator, CD to the windbg.exe directory
    4. run:
      windbg.exe -IA
      windbg will start up, and inform you that it is now the registered file association handler for all dump files. Close windbg.exe
    5. Restart windbg, and go to file->Symbol File Path
    6. Enter:
      SRV*C:\Development\SymCache*http://msdl.microsoft.com/download/symbols
      You can set the local directory ('C:\Development\SymCache' in my case) to whatever you want, but the rest must be exact.  This instructs windbg to load the needed symbols from Microsoft’s symbol server (release modules usually don’t ship with symbols, and MS isn’t about to hand out the kernel source so you can build your own. :)) Whenever you debug something and windbg needs symbols, it checks your cache location first; if they are not found there, it downloads them and stores them in the cache.  So the more you debug, the more symbols you build up and the faster future debugging will go.  Exit windbg and save the settings.
  4. Open windbg.exe (again), and do a file->open dump and open the minidump in c:\windows\minidump that corresponds to the bluescreen you’re trying to debug.  You might need to be administrator when starting windbg.
  5. Windbg will automatically start downloading symbols and doing some basic analysis.  It may look like it’s done/just sitting there sometimes, but don’t do anything until you see its ‘diagnosis’, which usually looks like this:
    Use !analyze -v to get detailed debugging information.
    BugCheck 9F, {3, fffffa800af7f440, fffff80000b9c4d8, fffffa800745f860}
    Probably caused by : usbhub.sys
  6. But don’t take this as the final word on the crash source and send nasty letters to the usbhub.sys driver writer!  Type !analyze -v as it suggests, and you’ll likely get a more detailed analysis, like this:
    DRIVER_POWER_STATE_FAILURE (9f)
    A driver is causing an inconsistent power state.
    Arguments:
    Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
    Arg2: fffffa800af7f440, Physical Device Object of the stack
    Arg3: fffff80000b9c4d8, Functional Device Object of the stack
    Arg4: fffffa800745f860, The blocked IRP
    Debugging Details:
    ------------------
    DRVPOWERSTATE_SUBCODE: 3
    IMAGE_NAME: usbhub.sys
    DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bcc2d
    MODULE_NAME: usbhub
    FAULTING_MODULE: fffff8800767a000 usbhub
    CUSTOMER_CRASH_COUNT: 1
    DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
    BUGCHECK_STR: 0x9F
    PROCESS_NAME: System
    CURRENT_IRQL: 2


    STACK_TEXT:
    fffff800`00b9c488 fffff800`02ef3273 : 00000000`0000009f 00000000`00000003 fffffa80`0af7f440 fffff800`00b9c4d8 : nt!KeBugCheckEx
    fffff800`00b9c490 fffff800`02e9029e : fffff800`00b9c5c0 fffff800`00b9c5c0 00000000`00000001 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x292b0
    fffff800`00b9c530 fffff800`02e8fdd6 : fffff800`03034700 00000000`00146bde 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x66
    fffff800`00b9c5a0 fffff800`02e904be : 00000030`9c591870 fffff800`00b9cc18 00000000`00146bde fffff800`03002e48 : nt!KiProcessExpiredTimerList+0xc6
    fffff800`00b9cbf0 fffff800`02e8fcb7 : 00000010`31b602c1 00000010`00146bde 00000010`31b602f2 00000000`000000de : nt!KiTimerExpiration+0x1be
    fffff800`00b9cc90 fffff800`02e8ceea : fffff800`02ffee80 fffff800`0300cc40 00000000`00000002 fffff880`00000000 : nt!KiRetireDpcList+0x277
    fffff800`00b9cd40 00000000`00000000 : fffff800`00b9d000 fffff800`00b97000 fffff800`00b9cd00 00000000`00000000 : nt!KiIdleLoop+0x5a

    STACK_COMMAND: kb
    FOLLOWUP_NAME: MachineOwner
    FAILURE_BUCKET_ID: X64_0x9F_3_AiCharger_IMAGE_usbhub.sys
    BUCKET_ID: X64_0x9F_3_AiCharger_IMAGE_usbhub.sys

Now we see the whole story.  The arguments tell us a device object in the usbhub.sys stack sat on a power IRP for too long, and the stack trace shows how the bluescreen was raised: a kernel timer expired (KiTimerExpiration / KiProcessExpiredTimerList) and the routine it fired called KeBugCheckEx.  In other words, something failed to respond in time to a power request, and a watchdog timer threw the bluescreen.  (The ‘Dpc’ in those function names is just the deferred procedure call mechanism that kernel timers run on.)  Since usbhub.sys is a hub with many things plugged into it, odds are good the real culprit is the driver for one of the devices on the hub rather than the hub driver itself.  When we look at the ‘failure bucket’ we see X64_0x9F_3_AiCharger_IMAGE_usbhub.sys, which names AiCharger alongside usbhub.sys.  If I look in my Device Manager in Windows, I find a driver called AiCharger.sys under the USB devices.  Ah ha!  A quick Google reveals this is a driver that enables smart/high-speed USB charging of iPhone/iPod devices on my Asus motherboard. If I go one step further, I can speculate that the bug is in the portion of the driver that is supposed to respond to sleep/wake/power events, and that somehow the call to wake up the iPhone I have plugged in isn’t getting a response.  Dang, Asus owes me a donut for doing all the work for them.
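
If you end up with a pile of these dumps, it can help to pull the numbers out of the summary line so you can compare crashes. Here is a minimal sketch in C, assuming the line has exactly the shape shown in step 5; the struct and function names are made up for illustration and are not part of any Microsoft tool. The four values in braces are the same Arg1 to Arg4 that !analyze -v explains.

```c
#include <stdio.h>

/* Hypothetical holder for the fields on WinDbg's "BugCheck" summary line. */
struct bugcheck {
    unsigned long long code;     /* stop code, e.g. 0x9F */
    unsigned long long args[4];  /* Arg1..Arg4 as listed by !analyze -v */
};

/* Parse a line such as
 *   "BugCheck 9F, {3, fffffa800af7f440, fffff80000b9c4d8, fffffa800745f860}"
 * All five numbers are hexadecimal.
 * Returns 1 on success, 0 if the line does not match that shape. */
int parse_bugcheck(const char *line, struct bugcheck *out)
{
    int n = sscanf(line, " BugCheck %llx, {%llx, %llx, %llx, %llx}",
                   &out->code, &out->args[0], &out->args[1],
                   &out->args[2], &out->args[3]);
    return n == 5;
}
```

For the dump above, code comes back as 0x9F and args[0] as 3, which is the "blocking an Irp for too long" sub-code.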

So, now you know who’s really responsible.  You send a bug note to Asus with the dump results, then either un-install the AiCharger tool or stop leaving your iPhone connected overnight while the machine is asleep until they get a fix for AiCharger.  You also find out that someone else already had the same problem.
There are many other debugging commands you can also use, and those are all outlined here.  Hopefully this will help YOU out the next time you hit some crazy bluescreen you can’t figure out, and you won’t be re-installing the OS to get rid of it.

Protips: Bluescreens are almost always caused by a driver and not something in the actual Windows system, especially if they are repeatable.  Always get the latest drivers first.
When the crashes are wake/sleep/resume/power related and the latest driver doesn’t solve it, a good second step is to open the device’s properties in Device Manager and uncheck ‘Allow the computer to turn off this device to save power’ on the Power Management tab.  This prevents Windows from making power calls into possibly faulty driver code.  Power management issues are still very common with drivers.
If you get dumps and the crashes are in different places every time or random in timing, then you might have bad memory or a bad motherboard that’s corrupting things.  Check heat sinks and temps, and possibly swap out the RAM or motherboard.

Other resources:
-The official Microsoft list of bluescreen failure codes with documentation on each one:
http://msdn.microsoft.com/en-us/library/ff542347%28v=VS.85%29.aspx

-Another list of the various bluescreen failure codes and their plaintext sub-code descriptions with some notes from external folks:
http://www.faultwire.com/solutions_index/fatal_error-1.html#IssueList

-Microsoft Answers forum that has really responsive and informative threads on just about every blue-screen investigation ever done.  These guys chew up minidumps all day and can help you track down just about anything that’s going on (if just searching the forum doesn’t do it for you automatically):
http://social.answers.microsoft.com/Forums/en-US/w7repair/threads

-Another Microsoft forum that seems to do a fair amount of this kind of debug work:
http://social.technet.microsoft.com/Forums/en/w7itproperf/threads

Using TWAIN driver for your Canon CanoScan LIDE 25 on Windows 7 x64

Yeah, so you automatically got the newest driver for your CanoScan when upgrading to Windows 7.  However, when you go into Photoshop CS5, you no longer see TWAIN devices listed(!).  Unfortunately, in Adobe’s infinite wisdom, they no longer install TWAIN support by default.  You need to go here:

http://www.adobe.com/support/downloads/detail.jsp?ftpID=4688

to download the ‘Photoshop CS5 Optional Plugins’ free download.

Edit: Note that this ONLY works with the 32-bit version of Photoshop CS5.  There still is no TWAIN support in 64-bit Photoshop.

Unzip the file, then copy Twain_32.8BA from the zip’s
\PSCS5OptionalPlugins_Win_en_US\Optional Plug-Ins\Win32

directory into your Photoshop CS5 32-bit plugins folder:

C:\Program Files (x86)\Adobe\Adobe Photoshop CS5\Plug-ins\

Restart Photoshop and you should see your TWAIN capture option again.

Multi-core compiling in Visual Studio

You might already know this, but this is for those of you that want to compile extra-fast on your multi-core beast.  Bet you didn’t know that by default, most versions of Visual Studio do NOT use multi-core compiling.  So, to turn it on, do this in Visual Studio:

Tools > Options > Projects and Solutions > Build and Run > maximum number of parallel project builds

Set this to the number of cores you have (or the number of cores you have minus one if you want to do things on your desktop while compiling extra-big things).

To see if it’s working, when you compile, in the compiler output window at the bottom you should see each line prefixed by a number like this:
1>blahblah
4>blahblah
3>blahblah

Those prefix numbers tell you which of the parallel builds the message is coming from (each project being built gets its own number).  I find this speeds up compile times dramatically, especially on large solutions with many projects.  Give it a try!

Why the volatile keyword probably isn’t necessary in multi-threaded programming

Interesting article from the Intel guys doing TBB.

Arch Robison just removed almost ALL the volatile keywords from Intel Threading Building Blocks.  Why?  For several reasons, but mostly because he claims that overall it slows your code, probably does not actually solve the underlying ordering problems if your code needs to be portable (a REAL concern when writing today’s games/apps for x86, Xbox, PS3, and iPhone devices!), and likely isn’t doing what you think it’s doing anyway.  Here’s a pertinent example:

Sometimes programmers think of volatile as turning off optimization of volatile accesses. That’s largely true in practice. But that’s only the volatile accesses, not the non-volatile ones. Consider this fragment:

    volatile int Ready;
    int Message[100];

    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }

It’s trying to do something very reasonable in multi-threaded programming: write a message and then send it to another thread. The other thread will wait until Ready becomes non-zero and then read Message. Try compiling this with “gcc -O2 -S” using gcc 4.0, or icc. Both will do the store to Ready first, so it can be overlapped with the computation of i/10. The reordering is not a compiler bug. It’s an aggressive optimizer doing its job.

You might think the solution is to mark all your memory references volatile. That’s just plain silly. As the earlier quotes say, it will just slow down your code. Worse yet, it might not fix the problem. Even if the compiler does not reorder the references, the hardware might. x86 hardware will not reorder it. Neither will an Itanium(TM) processor, because Itanium compilers insert memory fences for volatile stores. That’s a clever Itanium extension. But chips like Power(TM) will reorder. What you really need for ordering are memory fences, also called memory barriers.

So what’s the solution for multi-threaded programming? Use a library or language extension that implements the atomic and fence semantics. When used as intended, the operations in the library will insert the right fences. Some examples:

* POSIX threads
* Windows(TM) threads
* OpenMP
* TBB
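
To make that concrete, here is a rough sketch of the Ready/Message example rewritten with C11's <stdatomic.h>. That header is not one of the libraries listed above; it is swapped in here only because it shows the idea in a few lines of plain C. The names publish and try_consume are hypothetical, and i is assumed to be in the range 0 to 999 as in the original fragment.

```c
#include <stdatomic.h>

static atomic_int Ready;      /* flag shared between threads, starts at 0 */
static int Message[100];      /* plain data, published via Ready */

/* Writer: store the message, then publish it. The release store keeps the
 * write to Message from being reordered after the write to Ready. */
void publish(int i)
{
    Message[i / 10] = 42;
    atomic_store_explicit(&Ready, 1, memory_order_release);
}

/* Reader: returns 1 and fills *out once the message is visible, else 0.
 * The acquire load pairs with the release store in publish(). */
int try_consume(int i, int *out)
{
    if (atomic_load_explicit(&Ready, memory_order_acquire) == 0)
        return 0;
    *out = Message[i / 10];
    return 1;
}
```

The release/acquire pair is what tells both the compiler and the hardware about the ordering; on x86 it costs next to nothing, and on chips like Power it is where the fence gets inserted.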

So, when is volatile actually necessary?  It turns out there are only 3 portable cases where volatile is actually needed:

  • marking a local variable in the scope of a setjmp so that the variable does not roll back after a longjmp.
  • memory that is modified by an external agent or appears to be because of a screwy memory mapping
  • signal handler mischief
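
The first case is the easiest to show. A minimal sketch (the function name is made up for illustration): a local that is modified between the setjmp and the longjmp only has a guaranteed value afterwards if it is declared volatile.

```c
#include <setjmp.h>

/* Returns the value of a local that was modified between setjmp and longjmp.
 * Because n is volatile, the write n = 1 is guaranteed to survive the jump;
 * without volatile its value after longjmp would be indeterminate. */
int value_after_longjmp(void)
{
    jmp_buf env;
    volatile int n = 0;

    if (setjmp(env) == 0) {
        n = 1;              /* modified after setjmp ... */
        longjmp(env, 1);    /* ... then jump back into setjmp */
    }
    return n;               /* reached only on the second return */
}
```

Drop the volatile and an optimizing compiler is free to keep n in a register, in which case you may well get the stale 0 back.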

And now you know, and knowing is half the battle.

Good presentation on PS3 hack

An international group recently hacked the PS3 so that Linux could be run on the platform, and presented how it was achieved at a security conference.  Some may recall that the PS3 originally had this option (called “OtherOS” boot), but the feature was removed after Sony claimed it opened the door to cracking their games and piracy.  Sony forced out a firmware update, and no machine without the updated firmware is allowed to use their services.

At this point, the group ‘fail0verflow’ picked up the mantle for angry PS3 users who had bought the console with the goal of running Linux on it and felt cheated by Sony, who was viewed as having reneged on their promise.  After about 12 months of work, a hack was achieved.

I gleaned two interesting things from the presentation.  Firstly, successful hacks of such modern devices usually come from teams, not individuals.  While working together is obviously a logical progression when several people are trying to hack the same, much more complex device, it does seem to be a big change from the days when a single guy in his garage would ‘prove’ himself by hacking something on his own.  Secondly, these guys are very smart.  They clearly have very high levels of understanding of hardware, memory architectures, operating system concepts (loaders, ring levels, decryption, trust-chains, etc), and software stacks.  I’m almost certain they all have a Computer Science or similar background.  The days of a single guy picking up a book and a debugger and hacking the security in these consoles in his spare time seem to have come quickly to a close.  I think this trend started with the hacking of the original Xbox by a team of Computer Science grad students (which took advantage of an awesomely obscure memory wrap-around bug introduced when Microsoft switched from AMD to Intel at the last minute), and it doesn’t appear to be going back.  It appears that if you want to contribute to hacking a platform, you best get your BS/MS in CompSci or CompEng.

So, without further ado, here’s the video clip of fail0verflow talking about their efforts (along with an interesting bit at the beginning on how long it took to hack other platforms).

Battlefield 2 on Steam kicking you off because of PunkBuster?

Did you take advantage of the super Steam holiday sales?  I did, and got the complete Battlefield 2 pack for $4.99.  However, if you installed it and got it all working – you might keep getting kicked out of network games because PunkBuster reports that you have an ‘Invalid driver version/game’.  This is because of the Steam Community overlays.

  1. At your game library list in Steam, just right-click the Battlefield 2 icon
  2. Select Properties
  3. Go to the General tab
  4. UN-select the ‘Enable Steam Community In-Game’ option.
  5. Voila

Yes, annoying.  Can’t believe old games like this haven’t been re-worked by Steam to not need the now-ancient PunkBuster system.  The fact they’ve had to update PunkBuster to run as separate Windows Services in order to work shows you it’s time to see that stuff go.

Adobe registration box always keeps popping up

If you have any of the Adobe suite and continually get the registration box when starting the app (even if you’ve filled it out 10 times before or told it not to register) you’re in good company.  You just need to start the app in Administrator mode and make your registration (or hit the don’t register button); THEN the registry can be properly updated.

Thanks again UAC for messing up my life in weird, cryptic ways that not one grandmother would ever be able to figure out.  But I’ve ranted about this before.

The key is MOV EDI, 0x9C5A203A

That’s the assembly instruction you need to unlock a secret ‘debug mode’ on AMD processors since the Athlon.  While you need to be in ring 0 to execute it, it does bring up some interesting possibilities: using the special debugging mode to reverse-engineer the operation of the chip, accessing possible new features, or exposing a chink in the security armor.  So far, security problems don’t seem probable, but if the mode can cause undocumented resets/etc, there might be some.

Anyway – interesting article.  Original posting here.

Cartalk conundrum

The guys from Car Talk have weekly puzzlers, but this question wasn’t one of them; it came from a truck driver who called in.  He said (basically):

“I have big cylindrical tanks on my truck that lie sideways under the foot step.  Problem is that my gas gauge is broken.  I have a stick that I can put in vertically, so if the gas is at the 20″ mark on the stick, it’s full.  If the gas reaches the 10″ mark on the stick, the tank is clearly half full.  Where should I put the 1/4 and 3/4 marks on the stick?”

First you’d think they should be at 5″ and 15″, but that’s not right because the tank is round, which means the bottom and top have less volume per inch of height.   Then you think this is an integration problem (which it can be), but the integration becomes extremely hairy.  Then you find you can back up and use a geometric method and, when you can’t reduce any more, a numerical method to solve it.  So let’s get started!

We see that the actual volume of the cylinder is unimportant, since you can solve this problem with just the cross-section, which is a circle.  What you want is a chord across the circle such that the area between the chord and the outer wall is 1/4 the area of the circle.  So, you draw a diagram, and get started!
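
For reference, here is one standard way to write the setup down. The symbols are assumed labels rather than the ones from the original diagram: r is the tank radius, θ is the central angle subtended by the chord (the fuel surface), and h is the depth of fuel measured on the stick. The area of a circular segment gives the condition for the 1/4 mark, and the depth follows from the angle:

```latex
A_{\text{segment}} = \frac{r^2}{2}\,(\theta - \sin\theta)

\frac{r^2}{2}\,(\theta - \sin\theta) = \frac{1}{4}\,\pi r^2
\quad\Longrightarrow\quad
\theta - \sin\theta = \frac{\pi}{2}

h = r\left(1 - \cos\frac{\theta}{2}\right)
```

Notice that r^2 cancels out of the middle line, so the angle depends only on the fill fraction.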

Unfortunately, you see that the equation becomes very difficult to solve analytically – and one must resort to numerical methods to get an actual solution.  I used the Mathematica online site, but you could easily use the Newton-Raphson method as well.  Whatever way you use, you find that he should mark the 1/4 tank line 5.96027 inches from the bottom of the stick.  3/4 and 1/8th values are also shown.

This equation can quickly be used to calculate 1/8, 1/16, and all other desired fill marks by simply changing the 1/4 * pi * r^2 term to whichever fraction you’d like. In fact, you can graph it to get any value:

Ignoring negative volumes, you see that the tank’s volume compared to its theta (roughly equivalent to height) forms an S curve, so the height changes more rapidly w.r.t. volume when close to full/empty than in the middle, just like we’d expect.

So, that’s your answer.  Turns out, others have solved this since it’s a common problem with all kinds of other tanks (fuel oil, gas stations, etc).  Here and here are other solutions that verify the same process and confirm that the final equation is unsolvable analytically.

Another person pointed out that most semis have TWO tanks, one on each side, which are connected by a balancing flow connector.  So both tanks fill and empty evenly.  Even though this seems to mess up the problem, it actually does not.  In order to represent that situation, you simply multiply both sides by two (two tanks, two times the target volume), and the twos cancel each other out.  You could have ANY number of tanks connected like this and the answer is the same.

It also doesn’t matter how long the tank is (so long as the tanks are the same size if you have more than one).  Finally, the theta angle you calculate doesn’t even depend on the radius of the tank!  So if you calculate the thetas for all the fill points, you can then work out the 1/4 mark on ANY size tank.  Pretty nifty huh.