[TCLUG] RAID (1) recovery

Fri Mar 22 14:10:02 CST 2002

There's a lot I don't understand about RAID recovery.  Back when I
started using RAID, I remember thinking I really should put some disks
in a box and spend a few hours playing with them to get familiar with
it; I guess I should have actually done it.

Any good sources?  The HOWTO and FAQ I've found are confusing, out of
date, and thin on content.  Some of the examples are clearly wrong
(like using raidhotadd to add a whole *drive*, like hde, rather than
a partition, like hde1, to a RAID array).  They also advise doing
things that the man pages seem to say are wrong (like running mkraid
on devices with data on them; the man page says this will destroy the
data on them).

I had a RAID 1 mirror across two drives (a single large partition on
each).  This is NOT the root or boot partition.

I had some errors logged early this morning:

Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:30 gw kernel: ide1: reset: success
Mar 22 02:00:30 gw kernel: hdd: lost interrupt
Mar 22 02:00:30 gw kernel: hdd: write_intr error1: nr_sectors=1, stat=0x00
Mar 22 02:00:30 gw kernel: hdd: write_intr: status=0x00 { }
Mar 22 02:00:35 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 02:00:35 gw kernel: hdc: drive not ready for command
Mar 22 02:00:36 gw kernel: ide1: reset: success

Mar 22 03:15:52 gw kernel: hdd: lost interrupt
Mar 22 03:15:52 gw kernel: hdd: write_intr error1: nr_sectors=3, stat=0x00
Mar 22 03:15:52 gw kernel: hdd: write_intr: status=0x00 { }
Mar 22 03:15:57 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 03:15:57 gw kernel: hdc: drive not ready for command
Mar 22 03:16:27 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 03:16:27 gw kernel: hdd: status error: status=0x80 { Busy }
Mar 22 03:16:27 gw kernel: hdd: drive not ready for command
Mar 22 03:16:57 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 03:17:02 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 03:17:02 gw kernel: hdc: drive not ready for command
Mar 22 03:17:32 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 03:17:32 gw kernel: hdd: status error: status=0x00 { }
Mar 22 03:17:32 gw kernel: hdd: drive not ready for command
Mar 22 03:17:32 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 15204392
Mar 22 03:17:32 gw kernel: raid1: Disk failure on hdc1, disabling device. 
Mar 22 03:17:32 gw kernel: ^IOperation continuing on 1 devices

Then a reboot seemed to bring things partially back:

Mar 22 09:36:14 gw kernel: md0: max total readahead window set to 124k
Mar 22 09:36:14 gw kernel: md0: 1 data-disks, max readahead per data-disk: 124k
Mar 22 09:36:14 gw kernel: raid1: device hdd1 operational as mirror 1
Mar 22 09:36:14 gw kernel: raid1: device hdc1 operational as mirror 0
Mar 22 09:36:14 gw kernel: raid1: raid set md0 not clean; reconstructing mirrors
Mar 22 09:36:14 gw kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Mar 22 09:36:14 gw kernel: md: updating md0 RAID superblock on device
Mar 22 09:36:14 gw kernel: md: hdd1 [events: 00000088](write) hdd1's sb offset: 19551040
Mar 22 09:36:14 gw kernel: md: syncing RAID array md0

But only for a little while:

Mar 22 09:47:57 gw kernel: hdd: status error: status=0xff { Busy }
Mar 22 09:47:57 gw kernel: hdc: DMA disabled
Mar 22 09:47:57 gw kernel: hdd: DMA disabled
Mar 22 09:47:57 gw kernel: hdd: drive not ready for command
Mar 22 09:48:14 gw kernel: ide1: reset: success
Mar 22 09:49:13 gw kernel: hdd: irq timeout: status=0x80 { Busy }
Mar 22 09:49:48 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 09:49:48 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 09:49:48 gw kernel: hdc: drive not ready for command
Mar 22 09:50:23 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 09:50:23 gw kernel: hdd: status timeout: status=0x80 { Busy }
Mar 22 09:50:23 gw kernel: hdd: drive not ready for command
Mar 22 09:50:58 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 09:50:58 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 09:50:58 gw kernel: hdc: drive not ready for command
Mar 22 09:51:28 gw kernel: request: I/O error, dev 16:01 (hdc), sector 9139160
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139168
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139176
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139184
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139192
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139200
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139208
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139216
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139224
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139232

Note the errors on hdc this time.  Starting to be scary -- both drives
of the mirror have reported errors.  However, hdc and hdd are the two
drives on the motherboard secondary IDE (Yes, I know it's better not
to have the two drives on the same channel) so maybe it's a controller
problem of some sort.
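
A side note on reading those I/O error lines: the "dev 16:01" field
looks like a major:minor pair printed in hex.  Assuming that's right,
0x16 is 22 and the minor is 1, which agrees with the hdc1(22,1) shown
in the RAID state printout further down, so the failing requests are
on the hdc1 partition.  Here is a minimal Python sketch of that
decoding; the function name and the regular expression are my own,
for illustration only.

```python
import re

# Matches the device field in lines such as
# "end_request: I/O error, dev 16:01 (hdc), sector 15204392".
DEV_RE = re.compile(
    r"dev ([0-9a-fA-F]{2}):([0-9a-fA-F]{2}) \((\w+)\), sector (\d+)"
)

def decode_io_error(line):
    """Return (major, minor, disk, sector) from an I/O error line, or None."""
    m = DEV_RE.search(line)
    if m is None:
        return None
    # Assumption: the kernel prints major and minor as two hex digits each.
    major = int(m.group(1), 16)
    minor = int(m.group(2), 16)
    return major, minor, m.group(3), int(m.group(4))

line = ("Mar 22 03:17:32 gw kernel: end_request: I/O error, "
        "dev 16:01 (hdc), sector 15204392")
print(decode_io_error(line))
```

Running every end_request line through this and counting by
(major, minor) would show whether the hard I/O errors are all landing
on one partition.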

So I put in a new controller I had lying around and another 20 gig
drive I had lying around (warranty replacement from a previous
failure).  Partitioned the new drive to match.  I edited /etc/raidtab
to point the second drive of the mirror to the new partition (hde1).
Rebooted.  Things looked good; then I looked more carefully and
discovered that, despite the contents of /etc/raidtab, it was
reconstructing the mirror on hdc1 and hdd1, the old drives.  

I guess this is a consequence of the persistent superblock?
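
My working theory, which I have not confirmed, is that with persistent
superblocks the array membership comes from the superblocks on the
disks themselves, so editing /etc/raidtab alone does not change which
partitions get assembled.  The Python sketch below just makes the
mismatch explicit.  The raidtab text in it is hypothetical,
reconstructed from my description above rather than copied from the
real file, and the helper name is made up for the example.

```python
def raidtab_devices(text):
    """Collect the 'device' entries from raidtab-style text."""
    devices = []
    for raw in text.splitlines():
        # Drop trailing comments, then split the line on whitespace.
        words = raw.split("#", 1)[0].split()
        if len(words) == 2 and words[0] == "device":
            devices.append(words[1])
    return devices

# Hypothetical raidtab after my edit (not the real file contents).
RAIDTAB = """\
raiddev /dev/md0
    raid-level            1
    nr-raid-disks         2
    nr-spare-disks        0
    persistent-superblock 1
    chunk-size            8
    device                /dev/hdc1
    raid-disk             0
    device                /dev/hde1
    raid-disk             1
"""

# Members the kernel reported as operational at boot (from the log above).
KERNEL_MEMBERS = ["/dev/hdc1", "/dev/hdd1"]

configured = raidtab_devices(RAIDTAB)
print("only in raidtab:  ", sorted(set(configured) - set(KERNEL_MEMBERS)))
print("only in the array:", sorted(set(KERNEL_MEMBERS) - set(configured)))
```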

Then I did a raidhotadd (can't find any man pages for any of the "hot"
raid commands, either) and added hde1, the new drive, into the
mirror.  That worked, and resync worked.  I think I now have a
three-disk mirrored set, which can't be helping the write performance
any.  Tried to take out hdd1 with raidhotremove, but it said the drive
was busy.  Not sure I had the syntax right anyway.

Then (just now) I saw this in the log:

Mar 22 14:04:54 gw kernel: md: trying to remove hdd1 from md0 ... 
Mar 22 14:04:54 gw kernel: md: bug in file md.c, line 2344
Mar 22 14:04:54 gw kernel: 
Mar 22 14:04:54 gw kernel: md:^I**********************************
Mar 22 14:04:54 gw kernel: md:^I* <COMPLETE RAID STATE PRINTOUT> *
Mar 22 14:04:54 gw kernel: md:^I**********************************
Mar 22 14:04:54 gw kernel: md0: <hde1><hdd1><hdc1> array superblock:
Mar 22 14:04:54 gw kernel: md:  SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md:     L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md:     UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736b8df3 E:0000008a
Mar 22 14:04:54 gw kernel:      D  0:  DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel:      D  1:  DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel:      D  2:  DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md:     THIS:  DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: md: rdev hde1: O:hde1, SZ:19551040 F:0 DN:2 md: rdev superblock:
Mar 22 14:04:54 gw kernel: md:  SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md:     L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md:     UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736bbba1 E:0000008a
Mar 22 14:04:54 gw kernel:      D  0:  DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel:      D  1:  DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel:      D  2:  DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md:     THIS:  DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md: rdev hdd1: O:hdd1, SZ:19551040 F:0 DN:1 md: rdev superblock:
Mar 22 14:04:54 gw kernel: md:  SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md:     L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md:     UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736bbbda E:0000008a
Mar 22 14:04:54 gw kernel:      D  0:  DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel:      D  1:  DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel:      D  2:  DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md:     THIS:  DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: md: rdev hdc1: O:hdc1, SZ:19551040 F:0 DN:0 md: rdev superblock:
Mar 22 14:04:54 gw kernel: md:  SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md:     L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md:     UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736bbb98 E:0000008a
Mar 22 14:04:54 gw kernel:      D  0:  DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel:      D  1:  DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel:      D  2:  DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md:     THIS:  DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel: md:^I**********************************
Mar 22 14:04:54 gw kernel: 
Mar 22 14:04:54 gw kernel: md: cannot remove active disk hdd1 from md0 ... 

I guess this may be a consequence of my improper raidhotremove
command, rather than a serious problem, though.
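
Trying to make sense of that printout: if I'm reading the
abbreviations correctly (this is my guess, I have not checked it
against md.c), ND/RD/AD/WD/FD/SD are the total, raid, active, working,
failed, and spare disk counts, S: in each DISK entry is a bit mask of
state flags, and UT is a Unix timestamp in hex.  Under those
assumptions, a short Python sketch to decode the interesting fields:

```python
import re
from datetime import datetime, timezone

# Assumed expansions of the counter abbreviations in the printout.
COUNTERS = {"ND": "nr_disks", "RD": "raid_disks", "AD": "active_disks",
            "WD": "working_disks", "FD": "failed_disks", "SD": "spare_disks"}

# Assumed bit positions in a disk's state word (the S: field).
STATE_BITS = {0: "faulty", 1: "active", 2: "sync", 3: "removed"}

DISK_RE = re.compile(r"DISK<N:(\d+),(\w+)\((\d+),(\d+)\),R:(\d+),S:(\d+)>")

def decode_counters(text):
    """Map known counter tags such as 'RD:2' to readable names and ints."""
    found = re.findall(r"\b([A-Z]{2}):(\d+)\b", text)
    return {COUNTERS[tag]: int(val) for tag, val in found if tag in COUNTERS}

def decode_disks(text):
    """Return one dict per DISK<...> entry with its state flags spelled out."""
    disks = []
    for _num, name, _major, _minor, role, state in DISK_RE.findall(text):
        flags = [label for bit, label in STATE_BITS.items()
                 if (int(state) >> bit) & 1]
        disks.append({"name": name, "slot": int(role),
                      "flags": flags or ["none set"]})
    return disks

def decode_utime(text):
    """Interpret the UT field as a hex Unix timestamp (an assumption)."""
    m = re.search(r"\bUT:([0-9a-f]{8})\b", text)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group(1), 16), tz=timezone.utc)

# Lines copied from the array superblock section of the log above.
SAMPLE = """\
md:     L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
md:     UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736b8df3 E:0000008a
     D  0:  DISK<N:0,hdc1(22,1),R:0,S:6>
     D  1:  DISK<N:1,hdd1(22,65),R:1,S:6>
     D  2:  DISK<N:2,hde1(33,1),R:2,S:0>
"""

print(decode_counters(SAMPLE))
for disk in decode_disks(SAMPLE):
    print(disk)
print(decode_utime(SAMPLE))
```

If that reading holds, RD:2 together with SD:1 would mean md0 is still
a two-disk mirror (hdc1 and hdd1, both active and in sync) with hde1
attached as a spare, rather than a true three-way mirror.  That would
also fit the "cannot remove active disk hdd1" message.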

So now I have three disks in a mirror.  How do I go about removing one
of them?  How do I get the system to actually do what /etc/raidtab
says?  How do I move RAID disks around (for example, to put them one
per channel on the new ATA-100 controller)?
-- 
David Dyer-Bennet, dd-b at dd-b.net  /  Ghugle: the Fannish Ghod of Queries
 John Dyer-Bennet 1915-2002 Memorial Site http://john.dyer-bennet.net
        Book log: http://www.dd-b.net/dd-b/Ouroboros/booknotes/
                 Photos: http://dd-b.lighthunters.net/