Bug 615709
Summary: Old raid5 does not assemble under f13, and other raid problems
Product: [Fedora] Fedora
Component: mdadm
Version: 13
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: low
Reporter: Edek Pienkowski <spojenie>
Assignee: Doug Ledford <dledford>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: anton, dledford, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Doc Type: Bug Fix
Last Closed: 2011-06-29 13:21:02 UTC
Description (Edek Pienkowski, 2010-07-18 07:15:23 UTC)
You should be able to disable MD startup by adding rd_NO_MD to the kernel command line options. (Run 'man dracut' for the full list of options.)

(In reply to comment #1)
Thanks - this option was set. Dracut starts fine; the problem appears later, during the initscripts phase. I managed to disable that with AUTO -all in mdadm.conf, so it won't do any damage for now. I also tried to disable initialization of raid in initscripts via the kernel command line, but nothing seemed to work.

Can you dump the superblocks with mdadm? I think the command should be:

    mdadm -Q -E <partition>

Created attachment 433215 [details]
Dump of all raid partitions with mdadm -Q -E
Created attachment 433216 [details]
Dump of all raid partitions with mdadm -Q -E
Created attachment 433217 [details]
FWIW, last /dev/sd{a,b,c,e}7 raid5 working date
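For reference, the superblock dump requested above can be produced in one pass over all four members. This is only a sketch: the device names are the ones from this report, and the `run=echo` guard makes it a dry run that just prints the commands (set `run=''` and execute as root to actually query the disks).

```shell
# Dump the md superblock of every member partition of the raid5 set.
# Device names are the ones from this bug; adjust for your system.
run=echo   # set run='' and execute as root to really query the disks
for part in /dev/sda7 /dev/sdb7 /dev/sdc7 /dev/sde7; do
    $run mdadm -Q -E "$part"
done
```

Redirecting the loop's output into a file gives a single dump suitable for attaching to the bug.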
The raid partitions are organized as follows: there are four drives, partitioned in exactly the same way. The same partition number on all four drives forms a raid (level 0, 1, or 5). What I see is that sda7 "thinks" it is part of a healthy array, while sdb7, sdc7 and sde7 think that sda7 is removed. How it became so, I don't know. Basically, I've been using this setup for a couple of years without trouble. Recently, there were two "events":

- I made a backup of the LVs on the sd{a,b,c,e}7 raid5 and sd{a,b,c,e}6 (they form one VG) onto an external drive. What is weird, I was resizing ext3 on this external drive, and it corrupted the ext3, though I think I did it as usual: shrink another ext3 to get some free space, shrink the LV with some margin, resize2fs to enlarge a bit to get the margin back, then enlarge the LV and resize the ext3 (e2fscks were of course run as resize2fs required, and after all was done). This corrupted the enlarged ext3. This counts as an event, because it never happened to me before. It was done on F11.

- I waited till the raid rebuilt, or checked, whatever cron does, and then I booted F13 live on this box. It failed to get the raid up - unfortunately I did not save what it looked like in /proc/mdstat, but I remember that most of the volumes were wrong, such as UU__ and U_U_.

I am a bit afraid of booting F13 again before I make backups of some volumes I haven't used for about a year, but I still want to have them, and they are on raid.

Another two or three events were short electricity outages (no UPS...), but during the day, when nothing heavy was happening to the drives. I had some similar situations on other boxes a long time ago, both because of failing DIMMs - but now there is ECC, and there are no ECC events in the logs. No MCEs either. It could be the motherboard, but the external drive was AoE, not SATA.

Now F11 also cannot get sd{a,b,c,e}7 up, but the other raids are OK. To recover sd{a,b,c,e}7, should I mount this raid5 with b,c,e and then add a and rebuild the array?
And, which is more important, what went wrong? Thanks, Edek

(In reply to comment #7)
> To recover sd{a,b,c,e}7, should I mount this raid5 with b,c,e and then add a
> and rebuild an array?

Yes, get it running without sda7 and then back it up. After you get a good backup, zero out the raid superblock on sda7 and add it back to the array. You can just clear the entire partition if you're not sure how to clear the superblock.

> And, which is more important, what went wrong?

There's probably no way of knowing that.

(In reply to comment #8)
> > And, which is more important, what went wrong?
> There's probably no way of knowing that.

Ok, I guess I'll try to fix what there is. However, F13 live still has problems with most of the arrays (U_U_, if I understand correctly, means two out of four disks are ok), and that is during auto-detection; manually they can be assembled. Is there anything wrong with the other arrays (besides sda7) when looking at the superblock dumps? I'll give it one more try; if something fails like before, I'll gather more data.

Under F11 I assembled /dev/sd{b,c,e}7 and then ran mdadm -Iq /dev/sda7. It rebuilt, writing mostly to sda7. The state now is: mdadm --assemble --scan under F11 segfaults after one array. Manually the arrays can be assembled, and filesystems/LVM are clean.

Under F13 live:
- dracut does not touch raid
- initscripts fail like before (they hardly assemble anything; some arrays come up 2/4, some 1/4)
- mdadm --assemble --scan does what it is supposed to and assembles all arrays (if they have all been stopped manually). Filesystems are ok.

I can dump the superblocks in binary form, plus device sizes, if you tell me where the superblocks are.

I wonder if this is a bug in mdadm rather than a kernel bug? I'll reassign it and see what the maintainer thinks.

There have been significant improvements in mdadm's handling of hot-plugged devices recently.
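Spelled out as commands, the recovery advised above might look like the following sketch. The array name /dev/md7 is an assumption (check /proc/mdstat or mdadm.conf for the real one), and the `run=echo` guard makes this a dry run that only prints the commands; set `run=''` and execute as root to really do it.

```shell
# Recovery sketch for the sd{a,b,c,e}7 raid5, per the advice above.
# /dev/md7 is a hypothetical array name; verify yours in /proc/mdstat.
run=echo   # set run='' and execute as root to really run these
# 1. Assemble degraded, without the stale sda7, then take a good backup.
$run mdadm --assemble /dev/md7 /dev/sdb7 /dev/sdc7 /dev/sde7
# 2. Only after the backup: wipe the stale superblock on sda7 ...
$run mdadm --zero-superblock /dev/sda7
# 3. ... and re-add it, so the array rebuilds onto it.
$run mdadm --add /dev/md7 /dev/sda7
```

The ordering matters: the backup comes before the superblock is zeroed, because the re-add triggers a rebuild during which a second failure would lose the array.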
In particular, there has been a race condition in the handling of the mdadm device map file that is likely the reason f13 is doing such a poor job assembling your arrays. In fact, this race condition is *more* pronounced during initscripts bring-up than during dracut bring-up, so using rd_NO_MD on the command line actually makes the situation worse, not better. Regardless, the improved mdadm won't hit an install image until the f14 install images are cut; fixing things in already-created install images is very difficult. I have built updated mdadm packages for f12, f13, f14, and rawhide. The current, race-fixed package is mdadm-3.1.3-0.git20100804.2, so you need that version or later to have the complete fix for the race condition that is affecting you.

As for your array on sd{a,b,c,e}7, the output of the superblocks clearly indicated that the last three drives were up to date and the first was out of date. An out-of-date drive always thinks it is up to date, because once a drive has failed we don't attempt to write a superblock marking it as failed to the failed drive itself; we only update the superblocks on the remaining drives to indicate that the failed drive is failed. The fact that the other three drives all showed only three working disks instead of four, and had an events counter higher than sda7's, is how we know this. When we update the superblocks on the other three drives to mark sda7 as bad, we also increment the events counter; but because the superblock on sda7 wasn't updated, it has both the old count of working disks and the old events counter, which signals to the raid stack that it's out of date and should be kicked from the array.

The remaining array assembly problems are likely the race condition I mentioned. If you could test with the latest mdadm, I would appreciate it. However, depending on the version of the system you are running, you need to make sure you get a matching mdadm version.
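The staleness check described above can be reproduced by hand from the `-E` dumps: compare the Events counters of the members. A small sketch with hypothetical counter values (substitute the numbers from the "Events" line of your own superblock dumps):

```shell
# Events counters as read from 'mdadm -E' on two members.
# These values are hypothetical; take the real ones from your own dumps.
sda7_events=812
sdb7_events=845
# A member whose counter lags the others missed superblock updates
# (e.g. the one that marked it failed) and will be kicked at assembly.
if [ "$sda7_events" -lt "$sdb7_events" ]; then
    echo "sda7 is stale: events $sda7_events < $sdb7_events"
fi
```

In this bug's dumps, sda7's lower counter (together with its old "working disks" count) is exactly what marked it as the out-of-date member.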
Don't attempt to use an f14 or rawhide mdadm on anything other than f14 or rawhide; f13 and earlier need the f13-or-earlier mdadm packages, due to a file packaging change introduced for f14 (udev no longer ships a rule file that mdadm now ships in f14, so attempting to install the f14 package on f13 will cause a file conflict between mdadm and udev). The latest mdadm package has not yet hit the updates-testing repo, but should within another day or two.

Thanks. I'll try the new mdadm, but please give me some time. What I noticed in the meantime is that the live image on USB shows those described problems - like 2 out of 4 drives - very often, but the same software booted from a faster source has no problems at all (i.e. a system installed from this live image). I do not know what race condition it is, but timing seems to affect the result.

Hello, sorry it took so long. I updated mdadm on one f13 system; I can now boot it with AUTO +all, at least for a couple of reboots. It has "rotational" drives. I had the same problem on another machine with SSDs; I'll reboot it a couple of times to check, hopefully in the next few days.

Seems to work!

This message is a reminder that Fedora 13 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 13 is end of life.
If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.