Description of problem: This one is weird.

Hardware:
- Promise SATA "RAID" controller TX2 (i.e. one of the proprietary software RAID controllers) running in NON-RAID MODE (i.e. as just a plain old SATA controller)
- two 120GB Seagate SATA drives
- Via C3 1GHz (non-Nehemiah, I think) processor
- 192MB system memory, PC100

After loading the sata_promise driver, the two disks appear as /dev/sda and /dev/sdb, and I can munge them normally, e.g. with fdisk. I define /dev/md0 in /etc/raidtab to span /dev/sda1 and /dev/sdb1, with each partition consuming its entire disk (a sketch of the raidtab follows below). After raidstart, I run: mkfs.ext3 /dev/md0. Then the fun begins.

If I configure this as raid0, the filesystem is created, no sweat. If I configure it as raid1, mkfs gets ~40% through the volume and the system hardcore wedges. By "hardcore" I mean completely wedged: interrupts are disabled (the keyboard stops working) and the cursor stops flashing. Upon reboot there are no errors in the log file, which of course is reasonable, since even if an oops got logged it never got synced to disk.

I took the card and the drives out, put them into an Athlon Thunderbird box with 512MB of memory (also running FC2 test 3), and used that to build the raid1 array: no errors, completed successfully. I moved the hardware BACK to the Via box, and it is now mounted correctly and appears to work, but for obvious reasons I don't trust that configuration much.

I am not sure what the cause is: the raid1 driver? The sata_promise driver? The Via C3 processor? Some weird northbridge issue, e.g. a DMA lockup or something? Some combination of the above? I can poke around at this more (and will be poking/testing and so on, since this is my disk server :) but I am filing this bug to get some direction on what I should be poking at.
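For reference, the raidtab for this setup would look something like the following (a minimal sketch; the persistent-superblock and chunk-size values are illustrative defaults, not copied from the actual file):

    raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        persistent-superblock   1
        chunk-size              64
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1

Swapping raid-level to 0 gives the raid0 configuration that works; everything else stays the same.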
As an update: I did a stress test on the RAID array after moving it back to the Via box. That is, I created the array on the Athlon, ran mkfs.ext3 on the Athlon, and then moved the card+disks back to the Via. On reboot, the kernel decided it needed to rebuild the RAID 1 array (and no, I didn't swap the disks -- it actually always does that on a fresh RAID, IIRC). I let that finish, then started copying data over. I got about 10-12GB copied before the system wedged.

So, I can confirm that the system wedges even on a "known-good" array during normal operation, so it's not a bug in mkfs or mkraid. I suspect a hardware issue, but I don't know enough about the C3 (or its associated chipset) to know where to go next.
I'm switching this to the kernel component since a hard lock is some sort of kernel problem. Also reassigning to Jeff Garzik, since he's the in-house SATA person and this sounds like it's likely related either to SATA or to the VIA C3 (and he's likely the right one for that as well).
Another experiment:

1. Put the 2 disks in the C3 box (actually, they were still there after I put them back in after creating the raid1 on the Athlon, though they were now corrupted).
2. fdisk /dev/sda, create a single partition spanning the whole disk.
3. fdisk /dev/sdb, create a single partition spanning the whole disk.
4. Open two shells.
5. Shell 1: "mkfs.ext3 /dev/sda1"; shell 2: "mkfs.ext3 /dev/sdb1".

That is, I left raid out of the picture entirely and just ran two parallel mkfs processes on different disks (see the sketch below). Result: hard lock.

So it's not raid1 at all, and it looks increasingly like a lockup in the sata_promise driver, the SATA subsystem, or some weird DMA or other chipset issue. Given that sata_promise (I assume) works for (many?) other people, and given the non-mainstream processor, my money's on the chipset side.

Here is something else to note: I now have that box logging remotely to a second box, so I can see some additional messages in syslog. Specifically:

May  4 00:18:53 alexandria kernel: longhaul: FSB:133 Mult:3.5x
May  4 00:19:13 alexandria kernel: longhaul: FSB:133 Mult:4.0x
May  4 00:19:13 alexandria kernel: longhaul: FSB:133 Mult:4.5x

This was just before the lockup. I assume "longhaul" is the C3 term for dynamic clock-multiplier stepping for power conservation; could it be related?
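For completeness, the parallel-mkfs reproducer from steps 4-5 boils down to this (a minimal sketch; the partitions are assumed to already exist from steps 2-3):

    # format both SATA disks at the same time, one mkfs per disk
    mkfs.ext3 /dev/sda1 &
    mkfs.ext3 /dev/sdb1 &
    wait

On this box, the hard lock hits before either process finishes.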
Great quivering Muscovites! I am an idiot, a moron. A nitwit, if you will, unworthy to share threespace with regular people. To wit:

May  4 00:19:13 alexandria kernel: longhaul: FSB:133 Mult:4.5x

Just as a sort of general rule, or maybe a philosophy by which one might guide his life, kind of like Zen Buddhism only geeky: if your FSB is running at 133MHz, it helps if you don't have PC100 SDRAM parts mixed in there.

And here I've been casting aspersions on your code, and entertaining notions of an elaborate conspiracy between Via, Seagate, Promise, and the Illuminati. Can you believe I write Verilog for a living this year? I don't even deserve to use a computer anymore.

Popped the 64MB PC100, leaving 128MB of PC133; mkraid /dev/md0; mkfs.ext3 completes. I will start that multi-GB copy overnight and verify that it completes and all is stable, but I suspect I just might have found the bug: defective user.
Eh... guess I spoke too soon; perhaps I'm not quite so dumb after all. The kernel still hangs during disk operations; it just takes longer now. An NFS transfer started overnight got to about 17GB and then died. A local copy started this morning got through about 20GB and then died. Well, I thought I had it, but I guess I didn't. Any suggestions appreciated.
More poking:

- A big (18GB) copy to a PATA disk off the motherboard (i.e. not on the Promise SATA card) worked just fine.
- A big repeated copy of the 18GB into multiple copies on one (and only one) SATA disk -- not in RAID mode, just plain old /dev/sda1 -- works fine; I filled up all 120GB that way.

So it doesn't look like a general PCI issue, nor a fundamental flaw in the SATA subsystem. It looks like a bug that appears only when both SATA disks are active simultaneously. Next up: copy 120GB from /dev/sda1 to /dev/sdb1, which, if the pattern holds, should result in a hang (see the sketch below).
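That next test would be something along these lines (a sketch; bs=1M is just a convenient block size, nothing the hardware requires):

    # raw copy from one SATA disk to the other, keeping both busy at once
    dd if=/dev/sda1 of=/dev/sdb1 bs=1M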
More tests. On the advice of a friend, I tried doing a mkraid, waiting for the initial mirror sync to finish, and THEN running the load test (a sketch of the wait follows below). In this case, I was able to fill the raid array (with many copies of various data files). I also managed to rm it all. In other words, it works.

Worth a summary at this point.

With PC100 installed, these failed with a hard lockup:
- mkfs.ext3 /dev/md0
- mkfs.ext3 /dev/sda1 & mkfs.ext3 /dev/sdb1
- cp -R $large_dataset /mnt/raid

After removing the PC100, these now succeed:
- mkfs.ext3 /dev/md0
- mkfs.ext3 /dev/sda1 & mkfs.ext3 /dev/sdb1
- dd if=/dev/sda1 of=/dev/sdb1 (i.e. copy one SATA disk to the other)
- dd if=/dev/hda of=/dev/sda1 (i.e. copy data from PATA to SATA)

Even with the PC100 out, this can still fail if run during the mirror resync:
- cp -R $large_dataset /mnt/raid

[So far this is all local-disk traffic; this is an NFS server, but I have been ignoring that for the moment and running with NFS disabled.]

With the PC100 out, and waiting until after the sync completes, the copy SUCCEEDS.

So, it looks like I actually had/have two separate lockups here: one was due to the PC100/PC133 mixup, but the second is an actual bug somewhere. I think this means the bug must be in one of these places:
- the raid1 system, with a hard crash of the mirror-sync task when the participating disks are under heavy load
- a possible issue with SATA support that triggers the above behavior in raid1
- a wedge in NFS (which I will test next -- i.e. do my cp of $large_dataset over NFS)

I will also attempt to cause a mirror resync, repeat the copy, and see if it crashes.
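For anyone reproducing this: the "wait for the sync to finish" step can be scripted by polling /proc/mdstat, roughly like this (a sketch; the grep pattern may need tweaking for the exact kernel's output):

    # block until the md mirror resync is done, then start the load test
    while grep -q resync /proc/mdstat; do
        sleep 60
    done
    cp -R $large_dataset /mnt/raid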
Additional tests; the problem appears to have been mostly narrowed down.

- Causing a mirror resync and generating heavy load currently does not wedge the box.
- Waiting for the initial resync to complete and then generating heavy load currently does not wedge the box.
- I filled up /dev/sda with 120GB of files, did a big old copy to /dev/sdb, removed them all, etc. No wedge.

However:

- Copying a lot of data to the RAID array over NFS sometimes wedges the box.
- Copying a lot of data from NFS back to NFS wedges the box pretty reliably (see the sketch below).

So, it looks like I have two issues here: (1) the PC133/PC100 issue above (solved); (2) an NFS-server hard lockup. This bug doesn't appear to be getting much attention, so I won't update again unless someone asks me for more info.
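The NFS reproducer, run from a client machine, is roughly the following (a sketch; the export path and client mount point are assumptions about my setup, adjust to taste):

    # mount the RAID array's export and push data at it
    mount -t nfs alexandria:/mnt/raid /mnt/test
    cp -R $large_dataset /mnt/test             # the "to NFS" case
    cp -R /mnt/test/copy1 /mnt/test/copy2      # the "NFS back to NFS" case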
Does this problem go away if you disable the cpuspeed service?
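(For reference, on FC2 that would be something along the lines of the following; a sketch, not an exact prescription:)

    # stop the frequency-scaling daemon now and keep it off across reboots
    service cpuspeed stop
    chkconfig cpuspeed off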
I have no way to test that: the configuration seems stable enough once the raid array is created, and now that it is created, it's a production server, so I can't really stress-test it like that anymore. I also disabled NFS (because a later kernel version just wedged hard and I got tired of dealing with it) and switched to CIFS, which is apparently fine in this configuration. In other words, I am not going to be able to get you any more info on this bug, so feel free to close it. :)
Fedora Core 2 has now reached end of life, and no further updates will be provided by Red Hat. The Fedora legacy project will be producing further kernel updates for security problems only. If this bug has not been fixed in the latest Fedora Core 2 update kernel, please try to reproduce it under Fedora Core 3, and reopen if necessary, changing the product version accordingly. Thank you.