Softpanorama

Recovering Filesystems from corrupted RAID sets

Recovery Workstation Setup

Label The Drives!

The first step in any recovery effort is to label the existing drives so you don't lose track of which came from where, usually noted by drive bay number or SCSI ID. Note additional info if known ("C: drive, mirror 1", etc.).

One can even photocopy the drive labels and note this information on the hardcopy output.

You'll be sorry if you don't handle this early.

All recovery efforts require a workstation with the ability to read the drives in "native" mode - outside the RAID controller's meddlesome influences - and this can be done on either the failed machine itself, or on a separate recovery workstation.

We'll need the ability to read the mirror sets (usually one at a time), as well as another piece of media that can receive the recovered data. In our case we used a very large external USB hard drive.

To avoid touching the hard drives inadvertently, we chose to do all of our work using a Knoppix bootable "live CD": this runs a Linux workstation strictly from the CD-ROM, and though it's slow, it doesn't touch the existing hard drives.

We were 2,000 miles away from the recovery workstation in question, so we used the services of an onsite technician to be our eyes and hands; those who are performing this onsite won't need these extra steps, but it's instructive to see how this actually works in a pinch.

The onsite tech chose an XP system near the recovery workstation, both of which were connected to the local area network (and the XP system had internet access).

Using desktop session-sharing software (WebEx), he was able to grant remote control of his workstation - this put us on the same network as the recovery workstation.

The onsite tech booted the Knoppix Live CD: be patient, it's slow. It should acquire an IP address from the network's local DHCP server (if not, it must be set manually to enable remote access, but that setup is beyond the scope of this article).

Once up, he selected a console session providing a shell, and then performed these steps:

From the recovery workstation console
$ su -                          — become the superuser

# passwd root                   — account is locked by default
Password: hello
Again: hello

# ifconfig                      — find out this station's IP address
eth0      Link encap:Ethernet  HWaddr 00:E0:1E:FC:11:40
          inet addr:192.168.50.158  Bcast:192.168.50.255  Mask:255.255.255.0
          ...

# /etc/init.d/ssh start         — launch Secure Shell daemon

We then used PuTTY to get on the machine in question from the technician's workstation, connecting by IP address as the root user. Now we're on the recovery console remotely.

The rest of these steps will be mostly the same whether you're on the system remotely or directly.

Taking a drive inventory

Now that we're on the recovery workstation in our Knoppix root shell session, we must take stock of the attached drives before we begin any recovery.

We recommend putting just one of the failed drives in the recovery workstation at a time, as this reduces the chance of trashing the whole set with an errant command. This is certainly easiest with hot-swap drives. But one can load all at once if one is very careful.

Our first step is to determine the Linux device names for each drive: they are usually in the form of /dev/sdX, where X is a sequential letter that increments on each drive found by the system while booting. We typically find this by going through the output of the dmesg command, which reports the boot-time discovery process.

It's a lot of output, so we usually route the output to a file and then peruse it with the vi editor.

# dmesg > /tmp/dmesg.out

# vi /tmp/dmesg.out

With the file conveniently in the vi editor, we search for the SCSI configuration lines that will help us identify each drive.

/tmp/dmesg.out
...
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
        <Adaptec 29160 Ultra160 SCSI adapter>
        aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi 0:0:2:0: Direct-Access     HITACHI  DK32DJ-36MC      D4D4 PQ: 0 ANSI: 3
scsi0:A:2:0: Tagged Queuing enabled.  Depth 253
 target0:2:0: Beginning Domain Validation
 target0:2:0: wide asynchronous
 target0:2:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 100)
 target0:2:0: Ending Domain Validation
SCSI device sda: 71132959 512-byte hdwr sectors (36420 MB)
...

The first mention of SCSI is the Adaptec controller on Id=7, followed by a Hitachi drive that's presumably part of our failed mirror set. The long SCSI addresses are broken down this way:

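SCSI addressing scheme: host:channel:target:LUN. For example, scsi 0:0:2:0 is host adapter 0, channel 0, target (SCSI ID) 2, LUN 0.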

The Hitachi drive has SCSI ID 2, which we were able to correlate with its place in the original failed array. We also note that it's SCSI device sda, which means that /dev/sda addresses this drive.

Discovered disk drive information

Device    SCSI address  Drive vendor/model   Description
/dev/sda  scsi 0:0:2:0  HITACHI DK32DJ-36MC  From Bay 4 - failed C: mirror
/dev/sdb  scsi 0:0:3:0  FUJITSU MAP3367NC    scratch drive
/dev/sdc  scsi 1:0:0:0  WD 1600BEV           External USB
/dev/sdd  scsi 2:0:0:0  HITACHI DK32DJ-36MC  From Bay 3 - failed C: mirror
/dev/sde  scsi 2:0:1:0  HITACHI DK32DJ-36MC  From Bay 2 - failed D: mirror
/dev/sdf  scsi 2:0:2:0  HITACHI DK32DJ-36MC  From Bay 1 - failed D: mirror
/dev/sdg  scsi 2:0:3:0  HITACHI DK32DJ-36MC  From Bay 0 - failed E:

One should go through the whole dmesg output until every drive is accounted for. We'll note that Linux uses the SCSI driver interface even for non-SCSI devices — presumably this is a clean, consistent driver API — so even the external USB drive shows up as SCSI.
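Rather than paging through the whole file by eye, the drive-identifying lines can be filtered out first. The following is a minimal sketch, not part of the original procedure: the function name scsi_inventory is our own invention, and the patterns assume the message wording shown in the excerpt above. Other kernel versions word these lines differently, so the patterns may need adjusting to what your dmesg actually prints.

```shell
#!/bin/sh
# Sketch: pull the drive-identifying lines out of a saved dmesg file.
# ASSUMPTION: the message wording matches the excerpt in this article;
# adjust the patterns if your kernel prints these lines differently.
scsi_inventory() {
    # $1 = saved dmesg output, for example /tmp/dmesg.out
    grep -E '^scsi [0-9]+:[0-9]+:[0-9]+:[0-9]+: |^SCSI device sd[a-z]+: ' "$1"
}
```

Run as scsi_inventory /tmp/dmesg.out, this should leave one address line and one "SCSI device" line per drive, which is the raw material for the table above.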

Devices move!

IMPORTANT: this table must be manually recreated every time Knoppix boots, because we've seen the controllers discovered in a different order on subsequent reboots, and this changes the device names.

Writing to the wrong device could be very painful, or at least confusing.

The next step is to choose a device to work with. Since each half of a mirror set ostensibly contains a full copy of the data, we only need one drive to recover the whole volume. Here we're choosing /dev/sda, which represents half of the failed C: drive.

Recall that we speculate that the drive has some RAID housekeeping data at the start, followed by the partition table and the rest of the drive: now it's a matter of finding out where the boundary is.

It turns out that both the block 0 master boot record (containing the partition table) and the start-of-partition boot record have a signature that makes them relatively easy to find by scanning: the last two bytes of the 512-byte sector are 0x55 and 0xAA.

This causes plenty of false positives over a large volume, and though we could add heuristics to make it smarter, we're just really looking for the first block that looks like a partition table in the hopes that it gives us a clue.
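To make the signature test concrete, here is a small sketch that checks a single sector using only dd and od. It is not how the scanning tool itself is implemented; the function name has_boot_sig is hypothetical, and 512-byte sectors are assumed. It only reads from the device or image it is given.

```shell
#!/bin/sh
# Sketch: test one 512-byte sector for the 0x55 0xAA boot-record signature.
# Works on a device or an image file; it only reads, never writes.
has_boot_sig() {
    # $1 = device or image file, $2 = sector number (512-byte sectors)
    sig=$(dd if="$1" bs=512 skip="$2" count=1 2>/dev/null |
          od -An -tx1 -j510 -N2 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

For example, has_boot_sig /dev/sda 128 && echo "signature found" reports on one candidate sector. A match is only a hint, since ordinary data can end in the same two bytes.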

Downloading the scandrive code from our website, we ran it on the device:

# /tmp/scandrive -v /dev/sda
scandrive 1.00 - 2002-02-01 - http://www.unixwiz.net/tools/
I/O buffer: 256 sectors of 512 bytes
Device /dev/sda is open
Loop 0: scanning sector 0...
Found ptable magic at sector 128                   — partition table
Found ptable magic at sector 191                   — start of filesystem
Loop 422: scanning sector 108032...
control-C                                          — interrupt scanning

So the first partition table is at sector 128, and since these are 512-byte sectors (128 × 512 = 65,536 bytes), this suggests that the RAID housekeeping area is 64 kbytes... which just happens to be the RAID stripe size. This is very promising.
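The sector-to-byte conversion comes up again whenever an offset has to be handed to a tool that counts in bytes, so it is worth writing down once. A minimal sketch, assuming 512-byte sectors as reported for this drive (the helper name sector_to_bytes is ours):

```shell
#!/bin/sh
# Sketch: convert a sector number into a byte offset.
# ASSUMPTION: 512-byte sectors, as dmesg reported for these drives.
SECTOR_SIZE=512

sector_to_bytes() {
    echo $(( $1 * SECTOR_SIZE ))
}

sector_to_bytes 128   # prints 65536, the end of the RAID housekeeping area
sector_to_bytes 191   # prints 97792, the second signature scandrive found
```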

If this is correct, it means that if we can somehow access the drive in a way that makes sector 128 appear at sector 0, it's then a "regular" drive. In our first recovery, we did this the hard way and only discovered the much better way later.

Drive recovery, the hard way

Given our RAID drive that contains a "real" image starting 128 blocks into the drive, one approach is to copy this data to a scratch drive and do all our work there.

Careful!
You're about to write to a disk drive, so you must check, recheck, and check again to make sure you don't mix up source (if=) and destination (of=) drives.

Our above listing of available drives shows /dev/sdb as a scratch drive. Though the dd command has been typically used for this, we prefer the workalike dcfldd instead, mainly because it shows running progress and gives a clue how long it will take.

After doublechecking our parameters — carefully! — we launch the full copy to the scratch drive. The parameters are:

dcfldd
This is the command itself: it's a dd workalike (see Resources section for availability).
if=srcdrive
This specifies the Input File, which is the failed RAID mirror member.
of=dstdrive
This specifies the Output File, which is our scratch drive.
bs=512
Set the blocksize to 512 bytes each. This is actually the default value, but we like to be explicit even if only to clarify our intentions to onlookers.
skip=128
This skips the first 128 blocks — 512 bytes each — from the input device.

We run it this way:

# dcfldd if=/dev/sda of=/dev/sdb bs=512 skip=128

This can take quite a long time depending on the size of the drive, the performance of the machine, and whether the drives share a common I/O bus. The use of the dcfldd command will report regular progress.
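Before pointing this at real drives, the effect of skip= can be rehearsed on ordinary image files. The sketch below is an illustration only: it uses plain dd rather than dcfldd so that it runs on any system, the function name copy_skipping_header is hypothetical, and the same-path check is a simple string comparison that will not catch two different names for the same device.

```shell
#!/bin/sh
# Sketch: copy SRC to DST, dropping the first N 512-byte sectors of SRC.
# Plain dd stands in for dcfldd here so the sketch runs anywhere.
copy_skipping_header() {
    src=$1; dst=$2; sectors=$3
    # Crude guard: refuse the most obvious source/destination mix-up.
    if [ "$src" = "$dst" ]; then
        echo "refusing: source and destination are the same: $src" >&2
        return 1
    fi
    dd if="$src" of="$dst" bs=512 skip="$sectors" 2>/dev/null
}
```

Trying copy_skipping_header test.img out.img 128 on a scratch image first is a cheap way to confirm that the byte which was at offset 65536 in the source lands at offset 0 in the copy.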

Once the command finishes, /dev/sdb should be a "regular" drive in nearly every respect, so we check it with the fdisk command. Here we show sample output from an unrelated system (we neglected to save a copy of the fdisk output on our recovery workstation).

# fdisk /dev/sdb                   — NOTE: this is from an unrelated system
...

Command (m for help): p            — show partition table

Disk /dev/sdb: 18.2 GB, 18207375360 bytes
255 heads, 32 sectors/track, 4358 cylinders
Units = cylinders of 8160 * 512 = 4177920 bytes

Device       Boot  Start     End    Blocks   Id  System
/dev/sdb1             10      34    102000   83  Linux
/dev/sdb2             35     291   1048560   82  Linux swap
/dev/sdb3      *       1       9     36704   12  Compaq diagnostics
/dev/sdb4            292    4358  16593360    f  Win95 Ext'd (LBA)
/dev/sdb5            292    4358  16593344   83  Linux

Command (m for help): q

The Linux kernel reads the partition table from a drive at boot time, but since these partitions were created indirectly by copying a drive, the kernel won't know anything about them yet. In addition, the device name entries for each partition (/dev/sdb1 for the first partition, and so on) may not be created.

The partprobe command is used to get Linux to re-read the partition table of a drive that was modified outside the usual fdisk methods. Given the name of a device, it makes sure the kernel knows about the partitions:

# partprobe /dev/sdb

TODO: how are the device nodes created?

With the partition table in place and the device nodes available, it's time to mount the partition and see if we can get our data. We need a directory on which to mount the data, then attempt the mount itself:

# mkdir /mnt/ntfs

# mount -oro -tntfs /dev/sdb1 /mnt/ntfs

# cd /mnt/ntfs

# ls -l                        — poke around...

We believe that NTFS filesystem support in Linux is still a bit spotty, so we mount the partition readonly (the ro option): this avoids the chance of messing up the mounted filesystem with buggy NTFS support or our own mistake.

Once mounted, change to the directory and look around. Ensure that there's data and that it's the partition you expected. Extraction of the data is covered in a later section.

Drive recovery, the easy way

Though copying the data to a scratch drive works, it's slow and not always necessary — we've found a far more direct way using the Linux loopback driver. This module allows us to map a view on top of an existing drive with an offset we specify.

The offset is 128 × 512 = 65536 bytes, and /dev/loop0 is the first available loopback device:

# losetup -o 65536 /dev/loop0 /dev/sda

# fdisk /dev/loop0

Now, /dev/loop0 is in fact accessing the failed RAID member, but it simply never sees anything before the given offset: it's exactly what we wanted. If all is well, fdisk should reveal the partitions.
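Since the whole point is to avoid disturbing the failed member, it may be worth attaching the loop device read-only. The sketch below only prints the command for review instead of running it; the helper name loop_cmd is hypothetical, 512-byte sectors are assumed, and -r is the losetup option that sets up a read-only loop device.

```shell
#!/bin/sh
# Sketch: print (not run) the losetup command for a given starting sector.
# ASSUMPTIONS: 512-byte sectors; -r attaches the loop device read-only.
loop_cmd() {
    # $1 = source drive, $2 = starting sector, $3 = loop device
    echo "losetup -r -o $(( $2 * 512 )) $3 $1"
}

loop_cmd /dev/sda 128 /dev/loop0   # prints: losetup -r -o 65536 /dev/loop0 /dev/sda
```

Printing first and pasting second gives one more chance to catch a wrong device name before anything touches the drive.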

TODO: does partprobe work here too? How about creating device nodes?

Though we believe that multiple partitions under the loop device work fine (which may well require partprobe and creating per-partition device nodes under /dev/), our particular approach didn't use it.

Instead, because we knew that the drive had just one partition, and that scandrive suggested it might be at block offset 191, we just looped and mounted it directly:

# /tmp/scandrive -v /dev/sda
...
Found ptable magic at sector 128                   — partition table
Found ptable magic at sector 191                   — start of filesystem
...

# losetup -o 97792 /dev/loop0 /dev/sda      # 97792 = 191 × 512 bytes

# mkdir /mnt/ntfs

# mount -oro -tntfs /dev/loop0 /mnt/ntfs

# cd /mnt/ntfs

We then proceed to extract the data.

Extracting data from the drive

Once /mnt/ntfs/ has our mounted recovery partition, however obtained, it's time to get its data onto other media. We typically use an external USB hard drive, but it's also possible to do it over the network with either scp (secure shell copy, to a UNIX system), or with Samba to a Windows share.

FAT32 limitations

Most external USB hard drives come preformatted with a single large FAT32 filesystem, but FAT32 has a maximum file size of 4 GB, which is unsuitable for many server applications (Exchange logs are often far larger).

It may be necessary to reformat the drive with NTFS, which has no such limits; this must be done from a Windows workstation.

Extracting with rsync

We usually prefer to use the rsync program to copy data in bulk from the old to the new drive, as it allows us to restart a copy in progress. We usually put multiple recovered drives on the same external USB, so we normally create a subdirectory for each one.

Note: most administrators are used to using the --archive option with rsync, which implies a raft of other options, but the request to maintain owners, groups, and permissions doesn't always translate so well when NTFS and Linux filesystem concepts collide. Turning these options off makes the copy just about the data and not the metadata.

# mkdir /mnt/usb                 — mount the USB

# mount /dev/sdc1 /mnt/usb

# mkdir /mnt/usb/C-DRIVE

# rsync --recursive --times --verbose \
        --exclude="RECYCLER" \
        --exclude="System Volume Information" \
        --exclude="pagefile.sys" \
        /mnt/ntfs/.  /mnt/usb/C-DRIVE/.
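Because a full copy can run for hours, it can help to preview what rsync intends to transfer before committing to it. This is a sketch that wraps the same options and excludes shown above; the function name extract_volume and its "preview" argument are our own, and --dry-run is the standard rsync option that lists what would be copied without writing anything.

```shell
#!/bin/sh
# Sketch: the rsync options from this article wrapped in a function,
# with an optional preview mode that copies nothing.
extract_volume() {
    # $1 = mounted source, $2 = existing destination directory,
    # $3 = "preview" to list what would be copied (optional)
    dry=""
    [ "$3" = "preview" ] && dry="--dry-run"
    rsync --recursive --times --verbose $dry \
          --exclude="RECYCLER" \
          --exclude="System Volume Information" \
          --exclude="pagefile.sys" \
          "$1/." "$2/."
}
```

A typical session would be extract_volume /mnt/ntfs /mnt/usb/C-DRIVE preview, a look through the listing, and then the same command without the last argument. If the copy is interrupted, rerunning it picks up where it left off.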

Extraction with Samba

Samba is the excellent CIFS/NETBIOS implementation for Linux, and the smbclient client is perfectly capable of migrating data across the network to a waiting share on a nearby server.

We'll encourage the reader to check with the many Samba resources on the internet to get the various authentication options right in the context of a recovery effort, using this as a guide:

# cd /mnt/ntfs

# smbclient credentials '//myserver/myshare'

  prompt
  recurse
  mput *

This will take some time to run, and there doesn't appear to be a way to exclude things we don't care about (say, pagefile.sys), but it ought to be mostly unattended.

We have used this method only with a very small filesystem, and without the recursive option.

Pure NTFS extractions

As noted, the Linux permissions system doesn't really understand the NTFS structure found on the drive, and something is inevitably lost in the translation when extracting data using these Linux tools.

In many recovery scenarios, just getting back the data itself is plenty good, but in some others the metadata may be very important. In this case, going through Linux as an intermediary is not likely to be successful.

Instead, one might perform a raw block copy (with dcfldd) from the source hard drive to the target USB's partition which transfers the NTFS filesystem without translation. Then, when the USB drive is moved elsewhere, the filesystem is seen exactly as found on the failed RAID mirror.

This may require a bit more work to match up partition sizes and the like, and though we've not tried it ourselves, we believe it to be promising.

Variations

As we noted in the introduction, ours was one particular journey that managed to avoid several complicating factors which may well arise in other situations. During the process we noted some of these considerations and touch on them here, but they're meant more to be thinking points than providing specific direction.

RAID 0
RAID 0 is not really even RAID at all — it's a single volume with no redundancy or fault tolerance — but nevertheless can be managed by the RAID controller. Our experience is that it's treated exactly like a member of a RAID 1 mirror set, with the same RAID housekeeping data at the start of a drive. It just has no sibling half of the mirror.
RAID 10
Our RAID 1 example was easy: we needed to look at just a single drive. RAID 10 (striping plus mirroring) is not going to be so simple. We speculate that one could recreate a stripe set on a large scratch drive, using the dcfldd command to essentially string the stripes together one at a time.
This is likely to be tricky, requiring that the copies be done in exact multiples of the stripe size (often 64 kbytes). If the physical drive is not an exact multiple of the stripe size, one probably has to compute the exact number of blocks to copy to maintain proper stripe alignment.
We've never attempted this.
RAID 5
This is likely to be far more difficult, requiring custom programming to figure out. Though the RAID set almost certainly has a predictable pattern (say, alternating stripes through drive 0, 1, 2, 3, then back to 0), this has to be researched on a controller-specific basis.
We could imagine a custom version of dcfldd that knew how to skip every Nth block on output, allowing subsequent runs on each source drive to fill in the single large output volume, but this feels like slow slogging.
Drives with bad blocks
All of our work presumed that all the drives had good media but a confused RAID configuration. Adding bad media to the mix makes this far more complicated and likely requires different approaches that attempt to recover as much good data as possible while not getting stuck on the bad spots.
We believe that this Tech Tip is of limited usefulness in this circumstance.
Data from both halves of the mirror
RAID 1 is supposed to maintain identical data in both halves of the mirror, but depending on how the controller failed, we could imagine the two drives not being in perfect agreement.
With high-value data, one could perform the same recovery operation on both halves of the mirror set, copying the results to separate areas on the extraction drive. Later the two sets of data could be compared to identify discrepancies.
Full, bootable OS recovery
Our project was only really concerned with recovering a large Exchange information store and transaction logs, so things like permissions and folder attributes were unimportant to us.
But this is not always the case: if the failed volume is the root drive of a domain controller (such as an SBS 2003 C: drive), recovery with "just the files" is going to be exceptionally painful, perhaps beyond the reach of all but the most expert.
Nothing is going to make this an easy drive, but we believe that doing a full NTFS-to-NTFS raw copy to scratch media holds the most promise. When the C: partition is copied to a regular (non-RAID marked) drive, the disk controller may be able to boot the operating system enough to allow running a real bare-metal backup (such as ShadowProtect).
Then, with the full system backup available, the original drives could be re-initialized by the RAID controller to their original RAID 1 state, and the backup restored to the fresh volume.
This is also going to be very slow going, but it seems like a road worth exploring considering the unattractiveness of the alternatives.
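To give the RAID 10 speculation above some shape, here is a toy sketch of stringing stripes together from member images. It is untested against any real controller and rests on several loud assumptions: a simple round-robin layout starting with the first member listed, no per-drive housekeeping area left in the images (it would have to be stripped first), members of identical size, and one mirror half chosen per stripe member. The function name destripe is hypothetical, and plain dd stands in for dcfldd.

```shell
#!/bin/sh
# Sketch: interleave fixed-size stripes from member images into one output
# image, round-robin. ASSUMPTIONS: member order and stripe size are known,
# any RAID housekeeping area is already stripped, members are equal in size.
destripe() {
    # $1 = stripe size in bytes, $2 = output image, remaining = member images
    stripe_bytes=$1; out=$2; shift 2
    : > "$out"                              # start with an empty output file
    size=$(wc -c < "$1")                    # size of the first member
    nstripes=$(( size / stripe_bytes ))     # whole stripes only
    i=0
    while [ "$i" -lt "$nstripes" ]; do
        for member in "$@"; do
            # Append stripe number i of this member to the output.
            dd if="$member" bs="$stripe_bytes" skip="$i" count=1 2>/dev/null >> "$out"
        done
        i=$(( i + 1 ))
    done
}
```

With a 4-byte stripe and two members containing AAAABBBB and ccccdddd, the output is AAAAccccBBBBdddd. One dd per stripe is very slow on real drives, so this is a way to test a guessed layout on small samples rather than a production tool.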

First published: 2008/07/18


Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: April, 18, 2018