Core gone wild

What would you do if someone called you and said every time a computer finished imaging, it was missing a bunch of software?  Well, look at the logs of course, but after that is where things get interesting….

I had one computer in particular that was brought to me that was consistently throwing file errors, and the first thought was there had to be some download corruption.

But time and time again, every time the image was loaded, errors would show up, not always on the same software, but always on at least a few packages.  The hard drive was replaced first, with no improvement, then the RAM, and finally the motherboard.  No dice, the issue still persisted.  The only thing left was to focus on the CPU.

After swapping in the CPU from a known working computer, voilà! No more errors!  And more importantly, the “bad” CPU reproduced the same random issues in the known working computer.  Now, to get that CPU replaced…

Problem was, the PC OEM’s built-in hardware diagnostics said everything was fine.  Even the Intel CPU diagnostics came up clean.  In the process of installing all sorts of test utilities, I noticed a new pattern… There was no pattern.

Sometimes, while simply double-clicking a download to run an install, I’d get an extraction error, and then a minute later the same file would run fine.  Intel’s own diagnostics tool is one example; here’s the error:

So how is this randomness possible?  When you run a program, the Windows kernel schedules its threads on an available logical processor, and since this was a quad-core computer, any given run could land on any one of the 4 cores.  To test the theory, I started setting the processor affinity to control which core it was running on.  Since processor affinity is inherited by child processes, I first opened a command prompt, set the affinity there, and then ran my tests from it.
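For anyone who wants to script the same kind of test, here is a minimal sketch of how an affinity value maps to a core. An affinity mask is a bitmask where bit N selects CPU N, so CPU 3 is the value 8. Note the assumptions: the post set affinity on a command prompt and relied on inheritance, while this sketch swaps in cmd.exe's `start /affinity` switch, which takes the mask in hexadecimal. The helper names and the `Setup.exe` file name are only illustrative.

```python
def affinity_mask(core_index: int) -> int:
    """Return the bitmask that selects one logical processor (bit N = CPU N)."""
    if core_index < 0:
        raise ValueError("core_index must be non-negative")
    return 1 << core_index


def pinned_command(core_index: int, program: str) -> str:
    """Build a cmd.exe command line that runs `program` pinned to one core.

    Assumption: cmd.exe's `start /affinity` is used here, which expects the
    mask as a hexadecimal number without a 0x prefix. The empty "" is the
    window title that `start` expects before a quoted program path.
    """
    mask = affinity_mask(core_index)
    return f'start "" /wait /affinity {mask:X} "{program}"'


if __name__ == "__main__":
    # Print one command per core of a hypothetical quad-core machine.
    for core in range(4):
        print(pinned_command(core, "Setup.exe"))
```

Running each printed command in turn from a command prompt would exercise one core at a time, which is the same idea as the manual test described above.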

In classic suspenseful style, each core that I ran it on worked correctly, until the very last one was tested, and boom, the error popped up.  I now had a 100% repeatable test: every time I had the affinity set to CPU 3, it would error out 100% of the time, and with any of the other CPUs selected, there was no error 100% of the time.

So why is a CPU issue manifesting as a file extraction error?  Let’s look a little deeper at the files themselves.  Running Sysinternals Process Monitor, we see that Setup.exe is extracting files to our AppData temp folder, no surprises here:

But let’s grab these files out of the temp folder before closing the error, and then do the same with the extracted files when there is no error.  Then, using your favorite hex editor/comparison tool, look at the differences.
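If you would rather script the comparison than eyeball it in a hex editor, something along these lines should work. It is only a sketch: the function names are made up for illustration, and the two paths are whatever copies of the good and bad extracted file you saved. It reports each differing offset along with the arithmetic difference and the XOR of the two bytes, which makes a consistent offset or a single flipped bit easy to spot.

```python
import sys
from pathlib import Path


def diff_bytes(good: bytes, bad: bytes):
    """Yield (offset, good_byte, bad_byte, delta, xor) for each differing byte.

    Only the overlapping length is compared; a size mismatch is reported
    separately by compare_files().
    """
    for offset, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            yield offset, g, b, b - g, g ^ b


def compare_files(good_path: str, bad_path: str) -> int:
    """Print every differing byte between two files and return the count."""
    good = Path(good_path).read_bytes()
    bad = Path(bad_path).read_bytes()
    if len(good) != len(bad):
        print(f"size differs: {len(good)} vs {len(bad)} bytes")
    count = 0
    for offset, g, b, delta, xor in diff_bytes(good, bad):
        print(f"0x{offset:08X}: good={g:02X} bad={b:02X} "
              f"delta={delta:+d} xor={xor:02X}")
        count += 1
    print(f"{count} differing byte(s)")
    return count


if __name__ == "__main__":
    # Usage: python compare.py <good file> <bad file>
    compare_files(sys.argv[1], sys.argv[2])
```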

If you look closely, everywhere there is a difference between the good file and the bad file, the data is off by a value of 4.  From this point I can only speculate about what is going on in the CPU.  It is decompressing a file, so maybe there’s a math error, or maybe a transistor somewhere is still a 0 even though it gets set to 1; I’m not sure.  That’s another job and another career for someone else to figure out….