Bug 230860
Summary: | DegradedArray event for no good reason
---|---
Product: | Fedora
Component: | mdadm
Version: | 6
Hardware: | All
OS: | Linux
Status: | CLOSED WONTFIX
Severity: | medium
Priority: | medium
Reporter: | Jack Tanner <ihok>
Assignee: | Doug Ledford <dledford>
CC: | triage
Whiteboard: | bzcl34nup
Doc Type: | Bug Fix
Last Closed: | 2008-05-06 19:18:14 UTC
Bug Depends On: | 236666
Attachments: | hdparm -I for /dev/hdb; hdparm -I for /dev/hdc; dmesg when hdc is omitted
Description
Jack Tanner
2007-03-03 21:02:53 UTC
This sounds like a kernel problem, not an mdadm problem (specifically, it sounds like the kernel is failing to update the superblock on hdc correctly; mdadm simply tells the kernel to use hdc on a hot-add operation, and the kernel is the component that actually sets the superblock state and writes it to disk). Have recent kernel updates in fc6 fixed this problem for you?

No help from recent kernel updates. I tested: while running 2.6.20-1.2933.fc6, I shut down (/sbin/shutdown -r now), and when the system came up the array was degraded again, and I had to add /dev/hdc back in. I watched the shutdown procedure, and the last line before the system restarts says that /dev/md0 is still in use. Could that have something to do with it? Could it have something to do with having swap on /dev/VolGroup00/LogVol01?

Yes, the array still being in use is likely to have an impact on things. This usually happens when you have / on an LVM device. The reason the raid device can't be shut down is that / is never unmounted (although it can be remounted read-only). Since / is never unmounted, the LVM device can't be completely shut down, and since it can't be completely shut down, neither can the raid array. Normally this is handled by the / filesystem going read-only, then the LVM device going read-only (and writing out a clean LVM superblock), then the raid device going read-only (and also writing a clean superblock). Also, theoretically, the IDE subsystem should do a cache flush on the drive during shutdown, after all of the above have gone read-only. That would flush the clean superblock to the device and make startup happen normally.

The problem here may be that one of your drives has write-through caching and the other write-back caching, and the IDE subsystem isn't flushing the cache on your drive (which could be a kernel problem if it just isn't flushing the drive cache, or an initscripts problem if the scripts aren't properly setting everything read-only prior to the kernel's cache flush). Try using hdparm to detect the cache settings on hdb and hdc and see if there is a difference. If there is, try setting hdc to match hdb and see if that corrects your problem. If it doesn't, try to watch the exact ordering of things getting shut down and post that here so I can reassign this either to the kernel or to initscripts, depending on which one looks like the culprit.

Created attachment 151901 [details]
hdparm -I for /dev/hdb
First off, Doug, thank you for being so helpful; I really appreciate it.

Second, I'm not sure how to tell whether one drive has write-through and the other has write-back caching. I'm attaching hdparm -I output for hdb and hdc (a quick way to compare the cache settings is sketched after the attachments).

Third, independent of the unmounting bug, could it be that there's still a bug in mdadm with regard to how it produces conflicting output for mdadm --examine /dev/hdc and mdadm --detail /dev/md0 (see the note just below)?

Fourth, with respect to the unmounting bug, I can't figure out what's not getting killed properly. As an experiment, I put the system into runlevel 1, checked that no services were running, waited a minute, and then did a shutdown -t 5 -r now. Everything stopped "OK", and md0 was still in use even after all filesystems got unmounted.
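For anyone following the third point, a minimal way to capture the two views being compared (same device names as in this report; run as root, and the output will of course vary per system) is roughly:

    mdadm --examine /dev/hdc   # superblock as recorded on the member disk itself
    mdadm --detail /dev/md0    # state of the assembled array as the kernel sees it

If --examine shows hdc as an active member while --detail reports the array degraded with hdc missing, that mismatch is the conflicting output being described.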
Created attachment 151902 [details]
hdparm -I for /dev/hdc
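As a rough sketch of the cache comparison Doug suggests (device names as in this report; hdparm's output wording varies by drive, and -W changes may not persist across reboots):

    hdparm -W /dev/hdb                            # reports whether the drive's write cache is enabled
    hdparm -W /dev/hdc
    hdparm -I /dev/hdb | grep -i 'write cache'    # the -I identification data also lists the feature
    hdparm -I /dev/hdc | grep -i 'write cache'

A drive with write caching enabled behaves as write-back; with it disabled, effectively write-through. If the two differ, something like hdparm -W1 /dev/hdc (or -W0 to disable) would set hdc to match hdb, per the suggestion above.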
Do you have this raid device being used by the lvm subsystem?

Yup, this raid device is being used by LVM.

OK, I need to see the dmesg output from this machine starting at bootup until you've added the missing device back into the raid array.

Created attachment 152687 [details]
dmesg when hdc is omitted
Here's the entire dmesg. The drive is added back into the mirror around line
540.
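For reference, the re-add step described here (and a way to save the log Doug asked for) would typically look something like the following; the device names match this report and the output file name is only an example:

    mdadm /dev/md0 --add /dev/hdc       # hot-add the dropped mirror half back into the array
    dmesg > dmesg-after-readd.txt       # capture the kernel log from boot through the re-add
    cat /proc/mdstat                    # shows resync progress while hdc is rebuilt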
Next I need the output of: ls /sys/block/dm-*/slaves

This is once I've re-added hdc... Don't know if that makes a difference.

$ ls /sys/block/dm-*/slaves
/sys/block/dm-0/slaves:
hda3

/sys/block/dm-1/slaves:
hda3

/sys/block/dm-2/slaves:
hda3

/sys/block/dm-3/slaves:
md0

There is a patch in the bug that this bug depends on. Can you try out that patch and get me both the dmesg and the console output during bootup?

You mean the patch in bug 213586? I'll be happy to try it, but I'm no mkinitrd genius, so please a) give instructions on what to do to make sure I'm trying it right, and b) tell me how to back out the changes if my machine doesn't feel like booting.

Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project, we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days from now, it will be closed as WONTFIX. If you can reproduce this bug in the latest Fedora version, please change the bug to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.

And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers

This bug is open for a Fedora version that is no longer maintained and will not be fixed by Fedora. Therefore we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.