Description of problem:
On a system running IMSM (BIOS) RAID, I tried to yum upgrade from F15 to F16. All data is sitting on the IMSM RAID drive. During the package clean-up stage of mdadm, progress stops and the system seems to have lost its system disk. Switching to console mode and trying to log in as root hangs indefinitely as well.

I am not sure exactly how to address this one: whether we can fix it in systemd/mdadm, or whether it simply has to be documented as a 'do not try this on IMSM RAID drives' kind of thing.

Version-Release number of selected component (if applicable):
Fedora 16 Beta

How reproducible:

Steps to Reproduce:
1. Set up the system to use two drives as RAID1 in the BIOS
2. Install F15
3. Install fedora-release and fedora-release-notes from F16.
4. Run 'yum update'

Actual results:

Expected results:

Additional info:
"3. Install fedora-release and fedora-release-notes from F16. 4. Run 'yum update'" note that this is not the recommended way to yum upgrade, you're meant to do it as per https://fedoraproject.org/wiki/Upgrading_Fedora_using_yum#4._Do_the_upgrade , which has a different process. likely not relevant to this bug, though. are you sure this is a systemd bug not an mdadm bug?
upgrading via yum is explicitly not part of the release criteria, so voting -1 blocker. for me this is only a blocker if it happens with an anaconda upgrade too.
Well, it used to be the way to upgrade via yum, and it should be a requirement for a release - it's a common way to upgrade. As for systemd vs mdadm, not sure; dledford requested I file the bug against systemd for now.
I had Jes file this against systemd because we are both aware of the problem already, the systemd folks are not, and we haven't root caused whether this is a systemd or mdadm bug. Making them aware of the issue gets more people thinking about it, but we are busy chasing a different issue that has a higher priority. If the systemd guys have time to look at this and can identify what happened, then that's cool.
FWIW, I did a yum upgrade on my Intel BIOS RAID-0 system yesterday; the package update step worked fine, but the system booted to a blinking cursor and I couldn't fix it. In the end I just re-installed F16 (and switched to software RAID...)
Discussed at 2011-10-07 blocker review meeting. Again, this is a RAID issue; we need evaluation from the developer before we can really be sure if it's a blocker.
Discussed at 2011-10-14 blocker review meeting, still waiting on developer input.
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
Discussed at 2011-10-21 blocker review meeting. We still have no assessment from systemd devs. Please respond ASAP; we need prompter input for blocker issues.
Sorry, I don't think any of the systemd devs ever really used BIOS raid. We would need to know what to look for, or can assist in debugging, but I don't think we can provide any real input to the issue. Is this related to restarting/stopping a service the md device depends on to be alive across reboots?
IIUTC this is about the userspace fakeraid stuff being unkillable. That really needs to be fixed in the dm code, and not in systemd.
Discussed at the 2011-10-24 QA meeting functioning as a blocker review meeting. We've been punting on this for a while, but given that it appears to be related solely to yum upgrades which are not supported by the criteria, and Jes and Doug don't seem hugely bothered about it, we're rejecting it as a blocker now. Also rejected as NTH, as when you do a yum upgrade, you pull in updates, so an update is as good a fix as anything.

--
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Another related problem is a too-small /boot partition and/or people using MD (software) RAID when doing a yum upgrade. I figured out how to move (software RAID) md0's start sector to make room to install grub2, but grub2 won't show a menu. I have to manually boot from inside grub.

The system was just yum upgraded to F16; it has a mirrored /dev/md0 with an ext3 /boot and md's with LVM for the rest of the system. I have used this type of configuration for maybe 6-8yr+ to handle the piles of drives that have failed in systems during that time. The big upside is this is the 1st (and only) system I have tried to upgrade from F15 -> F16; remote upgrades will be high risk at best...

1) To move the start of /boot (assuming md0 on partition 1 of the disks (sda1 and sdb1) for below, change my example as needed). This assumes a mirrored /dev/md0.

  # backup /boot
  tar cvzf ~/boot.tgz --exclude '*lost+found*' /boot/
  # make a note of where / is, you will need it until my #3 point is solved
  df
  # confirm the md and the partitions (PV's)
  cat /proc/mdstat
  # danger below here!!!! Do at your own risk, not mine.
  # free the partitions (md0 cannot be stopped while /boot is still mounted)
  umount /boot
  mdadm --stop /dev/md0
  fdisk /dev/sda
  # delete and re-add sda1 (delete, 'n' to add, use the default 2048 and whatever end is available, 't' (type) to fd (RAID), 'a' (active)), 'w' write
  # You will most likely get a re-read warning about needing to reboot. DON'T REBOOT!
  partprobe
  # repeat 'fdisk /dev/sdb' and partprobe
  # make a new (now smaller) md0
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # create a filesystem
  mkfs.ext3 /dev/md0
  # make sure mounted
  mount /boot
  df /boot
  # restore /boot (tar stored the paths without the leading '/', so extract relative to /)
  tar -xvzf ~/boot.tgz -C / --exclude '*lost+found*'
  # (excluded to make sure the empty slots aren't lost) (assuming it is empty for you too)
  grub2-mkconfig -o /boot/grub2/grub.cfg
  grub2-install /dev/sda
  grub2-install /dev/sdb

2) If you fail to boot and just get a grub> prompt... You need to know the root mapper path from step one above; mine would look like /dev/mapper/SysVG-RootLV (where Sys is the system name; I'm an ex-IBMer if you wonder about the naming standard).

  insmod gzio
  insmod part_msdos
  insmod ext2
  linux /vmlinuz-3.1.5-6.fc16.i686 root=/dev/mapper/SysVG-RootLV
  initrd /initramfs-3.1.5-6.fc16.i686.img
  boot

3) Problem as described in #2: at boot, grub 1.99 goes to a grub prompt vs presenting a menu. (I also repeated the grub2-install, first running 'rm /etc/grub2.cfg' and 'ln -s /boot/grub2/grub.cfg /etc'; it still doesn't help.) Ideas how to get past grub2 not showing a menu?

(743022 may be the same core issue as 737508)
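Before repartitioning, it may help to confirm that the first partition really starts too early to leave room for grub2. The following is a minimal sketch, assuming the kernel exposes the partition start sector in /sys/block/<disk>/<partition>/start; the function name part_start_ok and the SYSFS_ROOT override are illustrative only and not part of any tool, and the 2048 threshold is simply the fdisk default mentioned in the steps above.

```shell
#!/bin/sh
# Sketch only: report whether a partition starts at or beyond a given sector,
# i.e. whether there is a gap after the MBR that grub2 can use.
# SYSFS_ROOT and part_start_ok are hypothetical names used for illustration.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

part_start_ok() {
    # $1 = disk (e.g. sda), $2 = partition (e.g. sda1), $3 = minimum start sector
    disk=$1
    part=$2
    min=${3:-2048}
    start_file="$SYSFS_ROOT/block/$disk/$part/start"
    if [ ! -r "$start_file" ]; then
        echo "cannot read $start_file" >&2
        return 2
    fi
    start=$(cat "$start_file")
    if [ "$start" -ge "$min" ]; then
        echo "$part starts at sector $start: ok"
        return 0
    fi
    echo "$part starts at sector $start: below $min"
    return 1
}
```

For example, `part_start_ok sda sda1` and `part_start_ok sdb sdb1` could be run on both halves of the mirror; a partition created by older tools typically starts at sector 63 and would be reported as below the threshold.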
Jeff, It sounds like a different bug that you should file against grub2 to get the right people to look at it. Cheers, Jes
Jes this is a duplicate of 713224 right?
713224 is against Fedora 15, this is against Fedora 16 - problem needs to be fixed in both places, so no, not a dupe.
It's still the same "bug" right?
(In reply to comment #0)
> Description of problem:
> On a system running IMSM (BIOS) raid, trying to yum upgrade from F15 to F16.
> All data is sitting on the IMSM raid drive. During package clean-up state of
> mdadm, progress stops, and the system seems to have lost it's system disk.
> Switching to console mode and trying to login as root hangs indefinitely as
> well.

Sounds like mdmon dies and the kernel is waiting indefinitely to mark the metadata dirty. This also correlates with Adam's finding that raid0 seems to work a bit better.

I'm about to try this upgrade on my home systems (raid1 and a raid5), so I'll let you know what I find. I've been reluctant to reboot/touch my F15 system because systemd still arranges for the array to resync each boot, and given Lennart's comment11 I don't expect this is fixed in later Fedoras.

We hashed through some of the details months back [1], and iirc the consensus back then was to exempt rootfs mdmon from cgroup based killing and just "return to the initramfs" to manage mdmon shutdown. Unfortunately this requires coordinated updates to systemd and dracut.

[1]: http://marc.info/?t=129145213000001&r=1&w=2
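One way to look for the condition described above (an array waiting on mdmon) is to read the md sysfs attributes. This is a rough sketch, assuming that arrays managed by mdmon report a metadata_version beginning with "external:" and that a stuck array shows up with an array_state such as write-pending; the function name and the SYSFS_ROOT override are hypothetical.

```shell
#!/bin/sh
# Sketch only: list md arrays that use external metadata (the ones mdmon
# manages) together with their current array_state.
# SYSFS_ROOT and list_external_md_state are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

list_external_md_state() {
    for md in "$SYSFS_ROOT"/block/md*/md; do
        # Skip when the glob matched nothing or the attribute is missing.
        [ -r "$md/metadata_version" ] || continue
        case "$(cat "$md/metadata_version")" in
            external:*) ;;      # container or member array handled by mdmon
            *) continue ;;      # native metadata, not mdmon's concern
        esac
        name=$(basename "$(dirname "$md")")
        state=$(cat "$md/array_state" 2>/dev/null || echo unknown)
        echo "$name $state"
    done
}
```

Running `list_external_md_state` from a second console while the upgrade is in progress would print one line per array; if the system disk is already hung, the shell itself may of course not respond.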
(In reply to comment #18)
> I've been reluctant to reboot/touch my F15 system because systemd still
> arranges for the array to resync each boot, and given Lennart's comment11 I
> don't expect this is fixed in later Fedoras.

...actually the systemd enabling was added:

commit bd1a69818042e85e24ec3adaf5eb3ac30ab1d9fd
Author: Lennart Poettering <lennart>
Date:   Wed Jan 11 01:51:52 2012 +0100

    shutdown: add link to root storage daemon text

commit 7e4ab3c5a6295193d0c58d353b6430265d842f34
Author: Lennart Poettering <lennart>
Date:   Tue Jan 10 04:20:55 2012 +0100

    shutdown: exclude processes with argv[0][0] from killing

http://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons

Now we just need the dracut and mdmon updates.
It's been in mdadm for a while - but it wasn't there in what was included on the F16 install image, so the upgrade itself may still hang. dracut has the fixes in too, I just don't have the git commits handy.

Jes

commit a0963a86e12a55d501f421048bd7c09cf4d78b93
Author: Jes Sorensen <Jes.Sorensen>
Date:   Wed Jan 25 15:18:04 2012 +0100

    Spawn mdmon with --offroot if mdadm was launched with --offroot

    Acked-by: Doug Ledford <dledford>
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    Signed-off-by: NeilBrown <neilb>

commit da827518c1f062e7d49433691d33e103525f9d6a
Author: Jes Sorensen <Jes.Sorensen>
Date:   Wed Jan 25 15:18:03 2012 +0100

    Add --offroot argument to mdmon

    Acked-by: Doug Ledford <dledford>
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    Signed-off-by: NeilBrown <neilb>

commit 08ca2adffffeb3bfda3cafababfc26706a60463b
Author: Jes Sorensen <Jes.Sorensen>
Date:   Wed Jan 25 15:18:02 2012 +0100

    Add --offroot argument to mdadm

    When --offroot is specified, mdadm will change the first character of
    argv[0] to '@'. This is used to signal to systemd that mdadm was
    launched from initramfs and should not be shut down before returning
    to the initramfs.

    Acked-by: Doug Ledford <dledford>
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    Signed-off-by: NeilBrown <neilb>
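To check whether a running mdadm or mdmon process actually carries the '@' marker described in the commit message above, one could inspect the first byte of its command line. A minimal sketch, assuming the marker is visible in /proc/<pid>/cmdline; the function name is_offroot_marked and the PROC_ROOT override are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch only: succeed if the first character of a process's argv[0] is '@',
# the marker that tells systemd not to kill the process at shutdown.
# PROC_ROOT and is_offroot_marked are illustrative names.
PROC_ROOT="${PROC_ROOT:-/proc}"

is_offroot_marked() {
    # $1 = pid; returns 0 if marked, 1 if not, 2 if the pid cannot be read
    cmdline="$PROC_ROOT/$1/cmdline"
    [ -r "$cmdline" ] || return 2
    # Read exactly one byte so the NUL separators in cmdline never reach the shell.
    first=$(dd if="$cmdline" bs=1 count=1 2>/dev/null)
    [ "$first" = "@" ]
}
```

A possible use is `for p in $(pidof mdmon); do is_offroot_marked "$p" && echo "$p is marked"; done`. If nothing is reported on a system booted from a BIOS RAID root, the installed mdadm or initramfs most likely predates the commits listed above.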
Great! I'll take a look.

So on my system I reproduced the hang while strace'ing mdmon during a yum upgrade:

  pselect6(16, NULL, NULL, [8 10 11 12 15], {86400, 0}, {[TERM], 8} <unfinished ...>
  +++ killed by SIGKILL +++

Does the clean-up action start killing things after a timeout?

In any event the workaround that can go in the wiki is to disable active/clean transitions during the upgrade:

  echo 0 > /sys/block/md127/md/safe_mode_delay

This prevents the root device from hanging after mdmon is killed. However the upgrade completes with:

  Cleanup : libgcc-4.6.3-2.fc15 3147/3147
  Rpmdb checksum is invalid: dCDPT(pkg checksums)

...but I wonder if that is just a side-effect of whatever killed mdmon? Seems to have come up ok after a forced reboot, probably something wonky in the hand-off from old systemd to new?

./run/initramfs/lib/dracut/hooks/shutdown/30md-shutdown.sh does not appear to be ensuring the array is clean before rebooting. It can call "mdadm --wait-clean --scan" to do that.

Last note is that systemd still seems to arrange for mdmon to die an early death in the ultimate_send_signal() case. Any reason that routine can't use killall() to get the benefit of ignore_proc()?
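The safe_mode_delay workaround above could be wrapped in a small helper that records the previous value before writing 0. This is only a sketch; the function name disable_safe_mode and the SYSFS_ROOT override are hypothetical. It deliberately does not restore the old value afterwards: if mdmon has already been killed, re-enabling the active/clean transitions would presumably bring the hang back, and the sysfs setting does not survive the reboot that follows the upgrade anyway.

```shell
#!/bin/sh
# Sketch only: disable active/clean transitions on one md array for the
# duration of the upgrade, printing the value that was in effect before.
# SYSFS_ROOT and disable_safe_mode are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

disable_safe_mode() {
    # $1 = md device name, e.g. md127
    md=$1
    knob="$SYSFS_ROOT/block/$md/md/safe_mode_delay"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob" >&2
        return 2
    fi
    old=$(cat "$knob")
    echo 0 > "$knob"
    echo "$md: safe_mode_delay $old -> 0"
}
```

Usage would be `disable_safe_mode md127` as root immediately before starting the yum upgrade, with md127 replaced by whatever name /proc/mdstat shows for the array.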
Dan, Glad it worked - you may want to open a bugzilla against dracut to suggest they add those commands to the shutdown scripts. It may not get noticed here. Cheers, Jes
(In reply to comment #22)
> Dan,
>
> Glad it worked - you may want to open a bugzilla against dracut to suggest
> they add those commands to the shutdown scripts. It may not get noticed
> here.
>
> Cheers,
> Jes

Cloned as dracut bug 840562.
Is this fixed now? Can I close this?
No response, closing.