Recovering from a Windows Server 2003 mirrored dynamic disk failure

I’m no fan of software raid. Pretty much, ever. At my last employer, for whom I still consult, my predecessor was really into technology creep. All of the workstations used that awesome fake raid that is actually implemented in the mass storage driver, which makes it pretty useless and can actually reduce your paths to recovery from disk failure. I’ll leave out the list of arguments against software raid. It simply isn’t worth it.

I showed up to a call with a server with an 0x7b error. Of course, Microsoft has this cool feature by default where servers automatically reboot when they blue screen. So nobody knew this was the error until I showed up and tried the “don’t automatically restart on BSOD” option under the F8 startup menu. I’m used to this error from moving system images between hardware, especially with virtual machines. As it turns out, the other values inside the parentheses are actually useful. If the second value inside the parentheses is 0x00000010, then you’re likely dealing with a disk in a software raid mirror set (dynamic disk) that Windows has marked as failed, and thus won’t start from.
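As a rough illustration of reading those values, here is a minimal Python sketch that pulls the bug check code and its parameters out of a STOP line. The function names and the sample line in the usage note are made up for the example, and the 0x00000010 check simply encodes the rule of thumb from this post rather than anything taken from Microsoft's documentation.

```python
import re

# Matches text such as "STOP: 0x0000007B (0x..., 0x..., 0x..., 0x...)".
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop(line):
    """Return (bugcheck_code, [parameters]) as integers, or None if no STOP code is found."""
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


def looks_like_failed_mirror(line):
    """Heuristic from the post: STOP 0x7B whose second parameter is 0x00000010."""
    parsed = parse_stop(line)
    if parsed is None:
        return False
    code, params = parsed
    return code == 0x7B and len(params) >= 2 and params[1] == 0x10
```

For example, `looks_like_failed_mirror("*** STOP: 0x0000007B (0xF789E528, 0x00000010, 0x00000000, 0x00000000)")` returns `True` for that invented line, while any other second parameter returns `False`.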

The trick, which took me a while to nail down, is getting a boot.ini set up to boot from another disk. Since you can’t actually access this partition even in the Recovery Console, you can’t edit the boot.ini to tell it to start from the other disk. In the end, I formatted a floppy using simply ‘format A:’ on an XP desktop (would you believe this entire data center lacks a Windows server with a floppy drive?), then copied ntldr, ntdetect.com and boot.ini from another Server 2003 machine with the same service pack to this floppy. Then I changed the boot.ini to contain:

[boot loader]
timeout=60
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="DISK 0" /noexecute=optout /fastdetect /3GB
multi(0)disk(0)rdisk(1)partition(1)\WINDOWS="DISK 1" /noexecute=optout /fastdetect /3GB
multi(0)disk(0)rdisk(2)partition(1)\WINDOWS="DISK 2" /noexecute=optout /fastdetect /3GB
multi(0)disk(0)rdisk(3)partition(1)\WINDOWS="DISK 3" /noexecute=optout /fastdetect /3GB

If you’re not familiar with this file, you may want to read about ARC paths. Remember that ntldr and ntdetect.com are hidden, system and read-only by default, although it’s fine to leave these attributes unset. ‘attrib -s -h -r C:\ntldr’ will make the file accessible so you can copy it to a floppy. I have to assume that when you format a floppy from an NT based operating system it puts a bit of code in the boot sector to look for these files.
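For anyone who would rather generate that file than type it, the following is a small Python sketch that builds the same boot.ini text. In an ARC path of this form, rdisk() selects the physical disk as the BIOS orders it and partition() counts from 1, so the only thing that changes between entries is the rdisk number. The helper names `arc_path` and `build_boot_ini` are hypothetical, and the defaults just mirror the file above.

```python
def arc_path(rdisk, partition=1, windir="WINDOWS"):
    """Build a multi()-style ARC path pointing at the Windows directory on one disk."""
    return f"multi(0)disk(0)rdisk({rdisk})partition({partition})\\{windir}"


def build_boot_ini(disk_count=4, timeout=60,
                   switches="/noexecute=optout /fastdetect /3GB"):
    """Return boot.ini text with one menu entry per candidate disk."""
    lines = [
        "[boot loader]",
        f"timeout={timeout}",
        f"default={arc_path(0)}",
        "[operating systems]",
    ]
    for n in range(disk_count):
        # One entry per disk, labelled so the working one is obvious at the boot menu.
        lines.append(f'{arc_path(n)}="DISK {n}" {switches}')
    # boot.ini is a DOS-style text file, so use CRLF line endings.
    return "\r\n".join(lines) + "\r\n"
```

Calling `build_boot_ini()` with the defaults reproduces the four-entry menu shown above; the result still has to be saved as boot.ini on the floppy next to ntldr and ntdetect.com.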

I then booted from the floppy, chose ‘DISK 1’, and the system started up fine. I then pulled the failed disk (carefully guessed which disk it was by the disk order in disk management and the scsi id jumper settings) and replaced it. In disk management, right click the good disk, “remove mirror” and choose the missing disk. Then right click again, “add mirror” and choose the new disk. Drink coffee.

It’s late and I can’t figure out how to run ‘fixboot’ and ‘fixmbr’ with a disk mirror, so I’m still using the floppy disk to boot and choose either disk to start from.

23 thoughts on “Recovering from a Windows Server 2003 mirrored dynamic disk failure”

  1. btm Post author

    I used something similar, and disk management wasn’t interested in bringing any of the software mirrors online for me. Chalk up another reason this should have been a hardware raid mirror.

  2. Ed

    Devil’s advocate: I recently had a hardware RAID adapter (SAS) utterly fail to handle errors on one of the drives in a mirrored pair, and so reported IO errors back to the OS. Pretty failboat, especially for an expensive piece of hardware whose only job in life is to do exactly not that. C’est la vie, firmware updates inc.

  3. btm Post author

    Sure. I had a RAID card fail a long time ago and it sucked bringing the array back up on another controller. But compared to the number of times I’ve had disks fail under hardware raid, where recovery is a matter of swapping the dead disk out for a new one and waiting for the array to recover from the hot spare, fixing software raid is a nightmare.

  4. Alex Wetmore

    When you create a software RAID mirror through Disk Administrator it also creates a second line in the boot.ini to boot off of the mirrored drive. Did the previous administrator remove them?

    I’ve had more headaches caused by hardware RAID implementations (3ware cards specifically) than software, so I tend to run with software mirroring on my computers.

  5. btm Post author

    The second disk in the mirror didn’t have an MBR, I tried booting off this disk and didn’t get the bootloader. There wasn’t a line added to the boot.ini in the mirror for the second disk. After I added the new disk to the mirror, it didn’t have an MBR either and I couldn’t boot off it. As I mentioned, trying to use the recovery console to run fixboot/fixmbr didn’t work as the “c:” drive wasn’t available in the recovery console.

    KB 167045 notes a lot of workarounds for a failed primary disk in a mirror. It seems to put a lot of emphasis on the “fault tolerant boot floppy”, so I wonder why there’s no mention of the boot.ini being fixed? Perhaps they are assuming that the primary disk completely failed and you can no longer use the boot code on it, as opposed to Windows marking the disk failed with errors, causing the 0x7b error.

    I’ve definitely had the majority of my RAID headaches with software. Every couple of weeks I have a disk failure in a hardware raid and it’s simply a matter of performing the hot swap while the machine continues running. No having to add a floppy drive to the server, creating boot floppies, etc. I’d have to say it’s as close to magic as I can get.

    The last hardware raid controller failure I had was near a decade ago on some janky used piece of hardware. I suppose you get what you pay for.

  6. Tom H

    At my work I have exclusively software mirrors (not by choice). They are slower and lots more trouble to repair when a disk fails. Once a few weeks ago I had a customer’s hardware raid controller and scsi disks die after he didn’t turn on the a/c after a power failure.. lol! :)

    Of course both the hard disks and the raid controller went to the scrap pile. The box was 10 years old… I’m not surprised. I have often used Bryan’s method as well for repairing failed mirror disks. By the way, the boot files are not “copied” to the second mirror disk on a software mirror… it won’t mirror the boot files, just the data. That is why you have to reboot and usually create a boot disk, depending on whether the failed disk had the boot files.

    I have had excellent luck with hardware raid.. the only problem I have had was once when a cheap promise controller (those stupid ones that use a promise software raid driver with a hard disk controller card and pretend it’s a raid card) had its drivers get corrupted and cause all kinds of blue screens.

    Hence on any kind of important server I always recommend a hardware raid controller due to how much time it takes to recover a failed software mirror.

  7. Jonathan

    So you’re still booting the MBR from a floppy?

    I take it that you cannot see the System Partition from the Recovery Console so you can’t transfer a MBR using FIXBOOT.

    My understanding is this – I guess the reason you wouldn’t be able to see the partition is because you need third-party drivers for the SCSI controller, so you’d need to create a driver floppy disk with the SCSI drivers, then when booting into the Recovery Console, when it says to press F6 to load 3rd-party SCSI drivers, do so, then you should be able to run FIXBOOT?

  8. James P. Rushworth

    “I’m no fan of software raid.”

    That’s because the software was written by Microsoft.

    I use MD on Linux and Solaris (RAID1 for all disks inside the box).
    The Linux boxes usually need to be powered off to swap out the failed drive (I haven’t had a hot-swappable SATA drive fail yet) so a re-boot is required.
    All the drive failures I’ve had on Solaris boxes were hot swappable SCSI so they just kept running.

  9. btm Post author

    If you’re relatively technical, you can use a tool like virtual floppy drive and apply a boot disk image to it, then use a cd recording program like nero to burn a bootable cdrom by pointing it at the virtual floppy disk at the corresponding point in the process. Otherwise, just go buy a USB floppy drive, they’re good to have around.

  10. rooda

    hi,
    I’m currently in practically the same situation: dynamic disk mirrors, the first disk failed, and when I leave the mirror alone in the server in order to boot it, it gives me the message: error reading disk press ctrl-alt-del to reboot
    I’ll use your technique and see if it can help me get the server back
    any more suggestion?
    regards

  11. teethdood

    So I spent an entire day reading/trying various methods (went to hell and back to create floppy boot disks in 2010), none of the methods worked. So I kinda did my own thing while crossing my fingers in desperation:

    1) remove the failed primary drive, insert the mirror as primary
    2) boot up using the windows server 2003 boot CD
    3) Press R to go to the Recovery Console
    4) type “map” to list all attached drives (my single hard drive is \Device\Harddisk0)
    5) type “fixmbr \Device\Harddisk0” to recreate the MBR
    6) now when I tried to boot from the hard drive, I could select “Boot Mirror C: secondary plex” but then it failed
    7) remove the drive and stick it into another windows computer
    8) go to the drive (now listed as D:) then edit boot.ini to change the 2 values of the rdisk from 1 to 0 – for example, multi(0)disk(0)rdisk(1)partition(1) to rdisk(0)

    Voila! you can now boot things up just fine. Stick another drive in and mirror it again. Much simpler than dealing with the bootable floppy business. MS just makes things impossible to create a bootable device.
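The boot.ini change in step 8 of the comment above can also be scripted. This is only a sketch: `retarget_rdisk` is a hypothetical helper, and it assumes you have already read the file's text from the drive while it is attached to another machine (and cleared its read-only attribute before writing it back).

```python
import re

# Matches the rdisk(N) component of an ARC path.
RDISK_RE = re.compile(r"rdisk\((\d+)\)")


def retarget_rdisk(boot_ini_text, old, new):
    """Rewrite every rdisk(old) in the boot.ini text to rdisk(new), leaving other entries alone."""
    def swap(match):
        if int(match.group(1)) == old:
            return f"rdisk({new})"
        return match.group(0)
    return RDISK_RE.sub(swap, boot_ini_text)
```

With `old=1` and `new=0`, both the `default=` line and the matching entry under `[operating systems]` end up pointing at rdisk(0), which is the edit the comment describes.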

  12. wRx7m

    I just started at a company where I am the sole IT Sys Admin. The previous guy hadn’t been around for about a year and they had some consultants that really don’t do much, which is why I got a job. So thanks, I guess. All that to say, I am in no way responsible for configuring systems with software RAIDs. I have only done it when teaching myself about various features when Windows 2000 just came out.

    Anyway, I wish I had access to the internet and this page last Monday when my company’s ancient ISA server’s HDD 0 failed and caused a BSOD with this exact same error. It kept rebooting to the loading splash screen and would BSOD (so quickly you couldn’t tell) and reboot.

    I had a consultant with me and he tried a few things like booting to BartPE and then running chkdsk and sfc /scannow. Neither of which did any good because they couldn’t find errors.

    We tried restoring from an Acronis image but first I pulled HDD 1 from the server. The restore was completely unsuccessful and cost about 1.5 hours of time. I decided to try booting to the HDD 1 drive that I had pulled earlier on the off chance that the drive had not been corrupted and sure enough, it booted.

    So I am not sure what made it boot, but I am really glad it did.

    I type this as I wait for the newly re-created RAID mirror to finish resynching with my new replacement HDD. I wish this server had a PERC RAID controller.

    Fortunately, I am now allowed to go ahead with buying an appliance firewall, which will not ever have this problem. I guess they decided that not having internet connectivity for almost 5 hours is worth the 5 grand.

  13. Bruce Elniski

    My problem is correctly identifying just which of two identical physical SATA hard drives is “dynamic disk 0” and which is “dynamic disk 1”. I need to replace the suspected bad drive (dynamic disk 1 had the yellow warning icon in disk management) with a new drive. In the meantime, the mirror has rebuilt successfully (this is windows small business server 2003). As this is the second time the mirror failure has happened in the past 60 days I want to replace the intermittent bad hard drive soon with the new drive and rebuild the mirror using disk management.
    Thanks in advance for any suggestions.

  14. Graham

    The key is that the disks are dynamic.
    I have just tried the boot disk thing, but couldn’t get the disk to boot into ntldr.
    Lastly I have taken the 3 other hard disks out and just left the main mirror OS one in.
    It boots up!!!!

    2 x data disks + 2 x OS. All on IDE.

  15. HairyFool

    The trouble is I have good and bad results from all three options. I have had a Windows boot mirror recover without intervention from a disk failure; my desktop’s onboard ICH9 has lost too many disks (3 in 3 years, luckily all in warranty), all of which recovered with a replacement hard disk; and an HP hardware RAID card refused to boot after a disk failure and would not recover the RAID at the BIOS level with a new disk (a new card saved it).

    The second issue convinced me that the extra work throughput of RAID does not suit desktop drives. Curiously it was the “good” quality Deskstars that failed; the cheaper Spinpoints in the second RAID never did, although they were not worked nearly so hard.

    In terms of identifying the disk0/1 it can be fun sorting out which disk is physically in port 1 or 2, and then which one is 1st in the BIOS. I have had to resort to pulling the power on my best guess and seeing which one drops off; if wrong, let it rebuild the RAID after restoring the power and change the other.

  16. PC3XP3RT

    Two of my customers bought Dell PowerEdge SC440 Win2003 servers and both had the UN939 UCS-51 Perc 5i/R PCI-E card fail. The machines were outside the warranty period but since the same two capacitors on both cards were bulged and had brown crust on them, I hoped Dell would work with me: charge my credit card, ship out two new cards, allow me to return the dead cards for analysis, and let them decide whether they would refund part or all of the charges.

    Instead, they just said the systems were outside the warranty and suggested I look online for a local reseller — because I wanted one of the cards the next day and they said they could not get it to me that quickly (it was a Saturday afternoon when I called). Google Shopping found used cards for ~$125 and new ones for ~$200 but they all appeared to have just 30-day warranties. Based on the cost of the cards (plus ~$80 overnight shipping), the age/slowness of the servers, and the odds the capacitors on the replacement cards would blow on day 31 ;) I decided to try plugging Disk0 into the on-board SATA connector.

    I expected to get a blue screen during startup and then need to spend several hours trying the things posted here… but I lucked out — it booted right up! I was considering doing a software RAID but I don’t want to risk wiping both disks. I’ve only done Win2003 software RAID on blank disks before and I recall it needs to be converted to dynamic, etc. which would probably wipe both, right? When it comes to changing RAID controllers and mirroring a drive with data on it to another drive, I’ve only done it with hardware RAID controllers — which did not require changing the disks to dynamic, etc. Plus, it sounds like the software RAID would be a lot of work if it ever had a problem.

    For now, I think the best bet is to just run from the 1 hard drive, leave the second hard drive unplugged, monitor their data backups every night, and urge my customers to buy new servers ASAP (with decent hardware RAID controllers). Yes?

  17. KAS

    I had the same problem, broken RAID 1 with software RAID from MS, Server 2003.

    The failed drive was the main boot drive. On reboot, the computer failed to boot.

    Moved the good drive to the same connector as the failed “main” drive so the computer tried to boot from it.

    NTLDR is missing error

    In the end, I booted from the Win2k3 disk, used “R” and got to the command line.

    As expected, I had no drive letters when I used “Map” command, but I had the drives and partitions.

    I ran “fixmbr” \Device\Harddisk0

    It gave me a big scary warning about writing over my partition tables etc, and I accepted the risk.

    Rebooted and it worked, no further action required.

    What this tells me is that all of the files were on the mirror drive, but there was no boot settings on that second drive. Once it was marked as a boot drive, it found the ntldr etc and booted.

    It also tells me that I didn’t have to rewrite the Boot.ini file because I had moved the drive to the same location as the original failed drive, thus causing the good drive to be seen as \Device\Harddisk0 instead of \Device\Harddisk1 which allowed Boot.ini to work on this drive.

    Software RAID = Bad
