There's a bunch of stuff I don't understand in RAID recovery. (I remember thinking, back when I started using it, that I really should put some disks in a box and spend some hours playing with them to get more familiar with it; guess I should have actually done it.) Any good sources? The HOWTO and FAQ I've found are confusing, out of date, and thin on content. Some of the examples are clearly wrong (like showing raidhotadd being used to add a *drive*, like hde, rather than a partition, like hde1, into a RAID array). They advise doing things that the man pages seem to say are wrong (like using mkraid on devices with data on them; the man page says this will destroy the data on them).

I had a RAID 1 mirror across two drives (single large partition on each). (Oh; this is NOT the root or boot partition.) I had some errors logged early this morning:

Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }
Mar 22 01:55:17 gw kernel: hdd: drive not ready for command
Mar 22 01:55:30 gw kernel: ide1: reset: success
Mar 22 02:00:30 gw kernel: hdd: lost interrupt
Mar 22 02:00:30 gw kernel: hdd: write_intr error1: nr_sectors=1, stat=0x00
Mar 22 02:00:30 gw kernel: hdd: write_intr: status=0x00 { }
Mar 22 02:00:35 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 02:00:35 gw kernel: hdc: drive not ready for command
Mar 22 02:00:36 gw kernel: ide1: reset: success
Mar 22 03:15:52 gw kernel: hdd: lost interrupt
Mar 22 03:15:52 gw kernel: hdd: write_intr error1: nr_sectors=3, stat=0x00
Mar 22 03:15:52 gw kernel: hdd: write_intr: status=0x00 { }
Mar 22 03:15:57 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 03:15:57 gw kernel: hdc: drive not ready for command
Mar 22 03:16:27 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 03:16:27 gw kernel: hdd: status error: status=0x80 { Busy }
Mar 22 03:16:27 gw kernel: hdd: drive not ready for command
Mar 22 03:16:57 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 03:17:02 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 03:17:02 gw kernel: hdc: drive not ready for command
Mar 22 03:17:32 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 03:17:32 gw kernel: hdd: status error: status=0x00 { }
Mar 22 03:17:32 gw kernel: hdd: drive not ready for command
Mar 22 03:17:32 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 15204392
Mar 22 03:17:32 gw kernel: raid1: Disk failure on hdc1, disabling device.
Mar 22 03:17:32 gw kernel: ^IOperation continuing on 1 devices

Then a reboot seemed to bring things partially back:

Mar 22 09:36:14 gw kernel: md0: max total readahead window set to 124k
Mar 22 09:36:14 gw kernel: md0: 1 data-disks, max readahead per data-disk: 124k
Mar 22 09:36:14 gw kernel: raid1: device hdd1 operational as mirror 1
Mar 22 09:36:14 gw kernel: raid1: device hdc1 operational as mirror 0
Mar 22 09:36:14 gw kernel: raid1: raid set md0 not clean; reconstructing mirrors
Mar 22 09:36:14 gw kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Mar 22 09:36:14 gw kernel: md: updating md0 RAID superblock on device
Mar 22 09:36:14 gw kernel: md: hdd1 [events: 00000088](write) hdd1's sb offset: 19551040
Mar 22 09:36:14 gw kernel: md: syncing RAID array md0

But only for a little while:

Mar 22 09:47:57 gw kernel: hdd: status error: status=0xff { Busy }
Mar 22 09:47:57 gw kernel: hdc: DMA disabled
Mar 22 09:47:57 gw kernel: hdd: DMA disabled
Mar 22 09:47:57 gw kernel: hdd: drive not ready for command
Mar 22 09:48:14 gw kernel: ide1: reset: success
Mar 22 09:49:13 gw kernel: hdd: irq timeout: status=0x80 { Busy }
Mar 22 09:49:48 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 09:49:48 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 09:49:48 gw kernel: hdc: drive not ready for command
Mar 22 09:50:23 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 09:50:23 gw kernel: hdd: status timeout: status=0x80 { Busy }
Mar 22 09:50:23 gw kernel: hdd: drive not ready for command
Mar 22 09:50:58 gw kernel: ide1: reset timed-out, status=0x80
Mar 22 09:50:58 gw kernel: hdc: status timeout: status=0x80 { Busy }
Mar 22 09:50:58 gw kernel: hdc: drive not ready for command
Mar 22 09:51:28 gw kernel: request: I/O error, dev 16:01 (hdc), sector 9139160
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139168
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139176
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139184
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139192
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139200
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139208
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139216
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139224
Mar 22 09:51:28 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 9139232

Note the errors on hdc this time. Starting to be scary -- both drives of the mirror have reported errors. However, hdc and hdd are the two drives on the motherboard's secondary IDE channel (yes, I know it's better not to have the two drives on the same channel), so maybe it's a controller problem of some sort.

So I put in a new controller I had lying around and another 20 gig drive I had lying around (warranty replacement from a previous failure). Partitioned the new drive to match. I edited /etc/raidtab to point the second drive of the mirror to the new partition (hde1). Rebooted.
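
(Aside, for anyone wanting to check the pattern in the logs above: here is a rough Python sketch that tallies kernel messages per device, to see whether hdc, hdd, and the ide1 channel itself are all complaining. The regular expressions are only my guess at the syslog format shown above, the function name tally_devices is made up for illustration, and the sample lines are copied from my log.)

```python
import re
from collections import Counter

# Guessed patterns for the syslog lines quoted above (assumptions, not
# taken from any kernel documentation).
DEVICE_RE = re.compile(r"kernel: (hd[a-z]|ide\d+):")
IOERR_RE = re.compile(r"I/O error, dev \S+ \((hd[a-z])\)")

def tally_devices(lines):
    """Count log lines per IDE drive or channel (hypothetical helper)."""
    per_device = Counter()
    for line in lines:
        match = DEVICE_RE.search(line) or IOERR_RE.search(line)
        if match:
            per_device[match.group(1)] += 1
    return per_device

# A few lines copied from the log above.
SAMPLE = [
    "Mar 22 01:55:17 gw kernel: hdd: status error: status=0x00 { }",
    "Mar 22 01:55:17 gw kernel: hdd: drive not ready for command",
    "Mar 22 01:55:30 gw kernel: ide1: reset: success",
    "Mar 22 02:00:35 gw kernel: hdc: status timeout: status=0x80 { Busy }",
    "Mar 22 03:17:32 gw kernel: end_request: I/O error, dev 16:01 (hdc), sector 15204392",
]

counts = tally_devices(SAMPLE)
for device, n in sorted(counts.items()):
    print(device, n)
```

Messages for both drives plus failed resets on ide1 itself are part of why I suspect the controller rather than two drives dying at once.
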
Things looked good; then I looked more carefully and discovered that, despite the contents of /etc/raidtab, it was reconstructing the mirror on hdc1 and hdd1, the old drives. I guess this is a consequence of the persistent superblock?

Then I did a raidhotadd (can't find any man pages for any of the "hot" raid commands, either) and added hde1, the new drive, into the mirror. That worked, and resync worked. I think I now have a three-disk mirrored set, which can't be helping the write performance any.

Tried to take out hdd1 with raidhotremove, but it said the drive was busy. Not sure I had the syntax right anyway. Then (just now) I saw this in the log:

Mar 22 14:04:54 gw kernel: md: trying to remove hdd1 from md0 ...
Mar 22 14:04:54 gw kernel: md: bug in file md.c, line 2344
Mar 22 14:04:54 gw kernel:
Mar 22 14:04:54 gw kernel: md:^I**********************************
Mar 22 14:04:54 gw kernel: md:^I* <COMPLETE RAID STATE PRINTOUT> *
Mar 22 14:04:54 gw kernel: md:^I**********************************
Mar 22 14:04:54 gw kernel: md0: <hde1><hdd1><hdc1> array superblock:
Mar 22 14:04:54 gw kernel: md: SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md: L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md: UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736b8df3 E:0000008a
Mar 22 14:04:54 gw kernel: D 0: DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel: D 1: DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: D 2: DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md: THIS: DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: md: rdev hde1: O:hde1, SZ:19551040 F:0 DN:2 md: rdev superblock:
Mar 22 14:04:54 gw kernel: md: SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md: L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md: UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736bbba1 E:0000008a
Mar 22 14:04:54 gw kernel: D 0: DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel: D 1: DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: D 2: DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md: THIS: DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md: rdev hdd1: O:hdd1, SZ:19551040 F:0 DN:1 md: rdev superblock:
Mar 22 14:04:54 gw kernel: md: SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md: L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md: UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736bbbda E:0000008a
Mar 22 14:04:54 gw kernel: D 0: DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel: D 1: DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: D 2: DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md: THIS: DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: md: rdev hdc1: O:hdc1, SZ:19551040 F:0 DN:0 md: rdev superblock:
Mar 22 14:04:54 gw kernel: md: SB: (V:0.90.0) ID:<be920d12.a496d728.f94654f9.f57c47e1> CT:3a8efac2
Mar 22 14:04:54 gw kernel: md: L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
Mar 22 14:04:54 gw kernel: md: UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736bbb98 E:0000008a
Mar 22 14:04:54 gw kernel: D 0: DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel: D 1: DISK<N:1,hdd1(22,65),R:1,S:6>
Mar 22 14:04:54 gw kernel: D 2: DISK<N:2,hde1(33,1),R:2,S:0>
Mar 22 14:04:54 gw kernel: md: THIS: DISK<N:0,hdc1(22,1),R:0,S:6>
Mar 22 14:04:54 gw kernel: md:^I**********************************
Mar 22 14:04:54 gw kernel:
Mar 22 14:04:54 gw kernel: md: cannot remove active disk hdd1 from md0 ...

I guess this may be a consequence of my improper raidhotremove command, rather than a serious problem, though.

So now I have three disks in a mirror. How do I go about removing one of them? How do I get the system to actually do what /etc/raidtab says? How do I move raid disks around (like to put them one per channel on the new ata-100 controller)?
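
In case it helps anyone read the state printout, here is a small Python sketch that pulls the disk counts and per-disk entries out of a dump like the one above. The field meanings are my assumptions, not something I found documented: I'm guessing ND/RD/AD/WD/FD/SD are total, raid, active, working, failed, and spare disk counts, and that the per-disk S: value is a bit mask with bit 0 = faulty, bit 1 = active, bit 2 = sync, bit 3 = removed. The function names are made up for illustration.

```python
import re

# Assumed bit positions for the per-disk "S:" state value (my guess at
# the 0.90 superblock layout; unverified).
STATE_BITS = {0: "faulty", 1: "active", 2: "sync", 3: "removed"}

DISK_RE = re.compile(r"DISK<N:(\d+),(\w+)\((\d+),(\d+)\),R:(\d+),S:(\d+)>")
COUNT_RE = re.compile(r"\b(ND|RD|AD|WD|FD|SD):(\d+)")

def decode_state(state):
    """Return the names of the flag bits set in a disk state value."""
    return [name for bit, name in sorted(STATE_BITS.items())
            if state & (1 << bit)]

def parse_state_dump(text):
    """Extract the count fields and per-disk entries from a state dump."""
    counts = {key: int(value) for key, value in COUNT_RE.findall(text)}
    disks = {}
    for number, name, _major, _minor, role, state in DISK_RE.findall(text):
        disks[name] = {
            "number": int(number),
            "role": int(role),
            "flags": decode_state(int(state)),
        }
    return counts, disks

# Lines copied from the array superblock section of the dump above.
SAMPLE = """\
md: L1 S19551040 ND:3 RD:2 md0 LO:0 CS:8192
md: UT:3c9b7b73 ST:0 AD:2 WD:3 FD:0 SD:1 CSUM:736b8df3 E:0000008a
D 0: DISK<N:0,hdc1(22,1),R:0,S:6>
D 1: DISK<N:1,hdd1(22,65),R:1,S:6>
D 2: DISK<N:2,hde1(33,1),R:2,S:0>
"""

counts, disks = parse_state_dump(SAMPLE)
print(counts)
for name, info in disks.items():
    print(name, info["role"], info["flags"] or "no flags set")
```

If that reading of the fields is right, hdc1 and hdd1 carry the active and sync flags while hde1 (S:0) carries none, and the dump reports RD:2 with SD:1. That would be worth checking before assuming hde1 really is a full third mirror.
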
--
David Dyer-Bennet, dd-b at dd-b.net / Ghugle: the Fannish Ghod of Queries
John Dyer-Bennet 1915-2002 Memorial Site http://john.dyer-bennet.net
Book log: http://www.dd-b.net/dd-b/Ouroboros/booknotes/
Photos: http://dd-b.lighthunters.net/