Description of problem:

When I reboot, I get this e-mail:

========================================
This is an automatically generated mail message from mdadm
running on foo

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 hdb[0]
      156290816 blocks [2/1] [U_]

unused devices: <none>
========================================

This is really bizarre, because I should have 2 working devices!

# mdadm /dev/hdc --examine
/dev/hdc:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 32ea6e7a:2ce2493e:f594b64f:1a51337b
  Creation Time : Sun Mar 26 14:18:38 2006
     Raid Level : raid1
    Device Size : 156290816 (149.05 GiB 160.04 GB)
     Array Size : 156290816 (149.05 GiB 160.04 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Sat Mar 3 15:48:48 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ac4752fb - correct
         Events : 0.255274

      Number   Major   Minor   RaidDevice State
this     1      22        0        1      active sync   /dev/hdc

   0     0       3       64        0      active sync   /dev/hdb
   1     1      22        0        1      active sync   /dev/hdc

On the other hand, maybe I have only one working device!

# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Mar 26 14:18:38 2006
     Raid Level : raid1
     Array Size : 156290816 (149.05 GiB 160.04 GB)
    Device Size : 156290816 (149.05 GiB 160.04 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Mar 3 15:51:04 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 32ea6e7a:2ce2493e:f594b64f:1a51337b
         Events : 0.255284

    Number   Major   Minor   RaidDevice State
       0       3      64        0      active sync   /dev/hdb
       1       0       0        1      removed

So, which is it, one or two?

# mdadm /dev/md0 --add /dev/hdc
mdadm: re-added /dev/hdc

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdc[2] hdb[0]
      156290816 blocks [2/1] [U_]
      [>....................]  recovery =  0.1% (178304/156290816) finish=43.7min speed=59434K/sec

unused devices: <none>

I know from previous experience that when this finishes, I'll have two working devices, but /dev/hdc will be (incorrectly) marked as a spare in output from mdadm --examine /dev/hdc, but not in the output from mdadm --detail /dev/md0.

When I reboot (after a clean shutdown), the cycle will repeat with another e-mail about a DegradedArray and /dev/hdc marked as removed for no reason.

Version-Release number of selected component (if applicable):

# rpm -q mdadm
mdadm-2.5.4-2.fc6

# uname -a
Linux herbie 2.6.19-1.2911.6.4.fc6 #1 SMP Sat Feb 24 14:39:04 EST 2007 i686 i686 i386 GNU/Linux

# cat /etc/mdadm.conf
DEVICE partitions
MAILADDR root
# DEVICE /dev/md0 /dev/hdb /dev/hdc
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=32ea6e7a:2ce2493e:f594b64f:1a51337b

# cat /proc/partitions
major minor  #blocks  name

   3     0  156290904 hda
   3     1     104422 hda1
   3     2    2152678 hda2
   3     3  154031220 hda3
   3    64  156290904 hdb
  22     0  156290904 hdc
 253     0   61440000 dm-0
 253     1   10223616 dm-1
 253     2     917504 dm-2
   9     0  156290816 md0
 253     3  122978304 dm-3
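For what it's worth, a quick way to see which superblock the kernel thinks is stale is probably to compare the event counters on the two members (just a guess that this is the right check; device names as above):

# mdadm --examine /dev/hdb | grep Events
# mdadm --examine /dev/hdc | grep Events

Whichever member's counter lags behind should be the one md drops at assembly time.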
This sounds like a kernel problem, not an mdadm problem. Specifically, it sounds like the kernel is failing to update the superblock on hdc correctly: mdadm simply tells the kernel to use hdc on a hot-add operation, and the kernel is the component that actually sets the superblock state and writes it to disk. Have recent kernel updates in fc6 fixed this problem for you?
No help from recent kernel updates. I tested: while running 2.6.20-1.2933.fc6, I shut down (/sbin/shutdown -r now), and when the system came up the array was degraded again, and I had to add /dev/hdc back in. I watched the shutdown procedure, and the last line before the system restarts says that /dev/md0 is still in use. Could that have something to do with it? Could it have something to do with having swap on /dev/VolGroup00/LogVol01?
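For reference, one way to see what is still sitting on top of md0 at that point might be the following (just a guess at the right places to look; it assumes sysfs and the device-mapper tools are available):

# ls /sys/block/md0/holders
# dmsetup ls --tree

If it's the LVM physical volume holding the array open, the dm device(s) built on md0 should show up there.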
Yes, the array still being in use is likely to have an impact on things. This usually happens when you have / on an lvm device. The reason the raid device can't be shut down is that / is never unmounted (although it can be mounted read-only). Since / is never unmounted, the lvm device can't be completely shut down, and since it can't be completely shut down, neither can the raid array.

Normally, this is handled by the / filesystem going read-only, then the lvm device going read-only (and writing out a clean lvm superblock), then the raid device going read-only (and also writing a clean superblock). Also, theoretically, the IDE subsystem should do a cache flush on the drive during shutdown, but after all the above items have gone read-only. That would flush the clean superblock to the device and make startup happen normally.

The problem here may be that one of your drives has write-through caching and the other write-behind caching, and the IDE subsystem isn't flushing the cache on your drive (which could be a kernel problem if it just isn't flushing the drive cache, or an initscripts problem if they aren't properly setting everything read-only prior to the kernel's cache flush).

Try using hdparm to detect the cache settings on hdb and hdc and see if there is a difference. If there is, try setting hdc to match hdb and see if that corrects your problem. If it doesn't, try to watch the exact ordering of things getting shut down and post that here so I can reassign this either to the kernel or to initscripts depending on which one looks like it is the culprit.
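Something along these lines should show whether the two drives differ (a sketch; the exact feature-line wording varies from drive to drive, and a leading * in hdparm -I output means the feature is currently enabled):

# hdparm -I /dev/hdb | grep -i 'write cach'
# hdparm -I /dev/hdc | grep -i 'write cach'

If they don't match, hdparm -W1 /dev/hdc (or -W0) toggles write caching on hdc so it matches hdb, assuming the drive supports it.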
Created attachment 151901 [details]
hdparm -I for /dev/hdb

First off, Doug, thank you for being so helpful, I really appreciate it.

Second, I'm not sure how to tell if the one drive has write-through and the other has write-behind caching. I'm attaching hdparm -I output for hdb and hdc.

Third, independent of the unmounting bug, could it be that there's still a bug in mdadm with regard to how it creates conflicting output for mdadm /dev/hdc --examine and mdadm --detail /dev/md0?

Fourth, w.r.t. the unmounting bug, I can't figure out what's not getting killed properly. As an experiment, I put the system into runlevel 1, checked that no services were running, waited a minute, and then did a shutdown -t 5 -r now. Everything stopped "OK", and md0 was still in use even after all file systems got unmounted.
Created attachment 151902 [details]
hdparm -I for /dev/hdc
Do you have this raid device being used by the lvm subsystem?
Yup, this raid device is being used by LVM.
OK, I need to see the dmesg output from this machine starting at bootup until you've added the missing device back into the raid array.
Created attachment 152687 [details]
dmesg when hdc is omitted

Here's the entire dmesg. The drive is added back into the mirror around line 540.
Next I need the output of: ls /sys/block/dm-*/slaves
This is once I've re-added hdc... Don't know if that makes a difference.

$ ls /sys/block/dm-*/slaves
/sys/block/dm-0/slaves:
hda3

/sys/block/dm-1/slaves:
hda3

/sys/block/dm-2/slaves:
hda3

/sys/block/dm-3/slaves:
md0
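In case it helps, mapping dm-3 back to a volume name should be possible with something like this (assuming the device-mapper tools are installed; I haven't double-checked the output columns):

# ls -l /dev/mapper/
# dmsetup info -c

dm-3 is the only device-mapper node sitting on md0, so it should correspond to the logical volume carved out of the raid array.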
There is a patch in the bug that this bug depends on. Can you try out that patch and get me both the dmesg and the console output during bootup?
You mean the patch in bug 213586? I'll be happy to try it, but I'm no mkinitrd genius, so please a) give instructions on what to do to make sure I'm trying it right, and b) tell me how to back out the changes if my machine doesn't feel like booting.
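For the record, my guess at the procedure would be something like the following, but please correct me if that's wrong (the image name follows the usual Fedora convention, and -f just overwrites an existing image):

# cp /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
# mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)

Backing out would then just mean copying the .bak image back into place (or picking an older kernel entry from the grub menu).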
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them.

http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change the bug to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here to ensure this doesn't happen again:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

And if you'd like to join the bug triage team to help make things better, check out
http://fedoraproject.org/wiki/BugZappers
This bug is open for a Fedora version that is no longer maintained and will not be fixed by Fedora. Therefore we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.