Bug 743022

Summary: F15->F16 yum update fails with IMSM (BIOS) raid
Product: [Fedora] Fedora
Reporter: Jes Sorensen <Jes.Sorensen>
Component: systemd
Assignee: systemd-maint
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 16
CC: amlau, awilliam, buhrt, dan.j.williams, dledford, harald, Jes.Sorensen, johannbg, kay, lpoetter, maurizio.antillon, metherid, mschmidt, notting, plautrba, redhat-bugzilla
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: RejectedBlocker RejectedNTH
Doc Type: Bug Fix
Clones: 840562
Last Closed: 2013-01-14 22:14:34 UTC

Description Jes Sorensen 2011-10-03 15:43:55 UTC
Description of problem:
On a system running IMSM (BIOS) raid, trying to yum upgrade from F15 to F16.
All data is sitting on the IMSM raid drive. During package clean-up state of
mdadm, progress stops, and the system seems to have lost its system disk.
Switching to console mode and trying to login as root hangs indefinitely as
well.

I am not sure exactly how to address this one, whether we can fix it in
systemd/mdadm or if it simply has to be documented as a 'do not try to do
this on IMSM raid drives' kind of thing.

Version-Release number of selected component (if applicable):
Fedora 16 Beta

How reproducible:


Steps to Reproduce:
1. Setup system to use two drives as raid1 in the BIOS
2. Install F15
3. Install fedora-release and fedora-release-notes from F16.
4. Run 'yum update'

Actual results:


Expected results:


Additional info:

Comment 1 Adam Williamson 2011-10-04 20:44:30 UTC
"3. Install fedora-release and fedora-release-notes from F16.
4. Run 'yum update'"

note that this is not the recommended way to yum upgrade, you're meant to do it as per https://fedoraproject.org/wiki/Upgrading_Fedora_using_yum#4._Do_the_upgrade , which has a different process. likely not relevant to this bug, though. are you sure this is a systemd bug not an mdadm bug?

Comment 2 Adam Williamson 2011-10-04 20:45:00 UTC
upgrading via yum is explicitly not part of the release criteria, so voting -1 blocker. for me this is only a blocker if it happens with an anaconda upgrade too.

Comment 3 Jes Sorensen 2011-10-05 05:41:37 UTC
Well it used to be the way to upgrade via yum, and it should be a
requirement for a release - it's a common way to upgrade.

As for systemd vs mdadm, not sure, dledford requested I file the
bug against systemd for now.

Comment 4 Doug Ledford 2011-10-07 14:41:10 UTC
I had Jes file this against systemd because we are both aware of the problem already, the systemd folks are not, and we haven't root caused whether this is a systemd or mdadm bug.  Making them aware of the issue gets more people thinking about it, but we are busy chasing a different issue that has a higher priority.  If the systemd guys have time to look at this and can identify what happened, then that's cool.

Comment 5 Adam Williamson 2011-10-07 15:57:46 UTC
FWIW, I did a yum upgrade on my Intel BIOS RAID-0 system yesterday; the package update step worked fine, but the system booted to a blinking cursor and I couldn't fix it. In the end I just re-installed F16 (and switched to software RAID...)

Comment 6 Adam Williamson 2011-10-07 18:51:42 UTC
Discussed at 2011-10-07 blocker review meeting. Again, this is a RAID issue, and we need an evaluation from the developers before we can really be sure whether it's a blocker.

Comment 7 Adam Williamson 2011-10-14 18:25:17 UTC
Discussed at 2011-10-14 blocker review meeting, still waiting on developer input.

Comment 8 Fedora Admin XMLRPC Client 2011-10-20 16:31:10 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 9 Adam Williamson 2011-10-21 18:14:49 UTC
Discussed at 2011-10-21 blocker review meeting. We still have no assessment from systemd devs. Please respond ASAP, we need prompter input for blocker issues, please.

Comment 10 Kay Sievers 2011-10-22 04:07:28 UTC
Sorry, I don't think any of the systemd devs ever really used BIOS
raid. We would need to know what to look for, or can assist in debugging,
but I don't think we can provide any real input to the issue.

Is this related to restarting/stopping a service the md device depends on
to be alive across reboots?

Comment 11 Lennart Poettering 2011-10-22 21:33:25 UTC
IIUC this is about the userspace fakeraid stuff being unkillable. That really needs to be fixed in the dm code, and not in systemd.

Comment 12 Adam Williamson 2011-10-24 16:44:48 UTC
Discussed at the 2011-10-24 QA meeting functioning as a blocker review meeting. We've been punting on this for a while, but given that it appears to be related solely to yum upgrades which are not supported by the criteria, and Jes and Doug don't seem hugely bothered about it, we're rejecting it as a blocker now. Also rejected as NTH, as when you do a yum upgrade, you pull in updates, so an update is as good a fix as anything.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 13 Jeff Buhrt 2011-12-21 20:35:32 UTC
Another related problem when doing a yum upgrade is a /boot partition that is too small and/or people using MD (software) RAID.

I figured out how to move (software RAID) md0's start sector to make room to install grub2, but grub2 won't show a menu. I have to manually boot from inside grub.
The system was just yum upgraded to F16; it has a mirrored /dev/md0 with an ext3 /boot and md's with LVM for the rest of the system. I have used this type of configuration for maybe 6-8+ years to handle the piles of drives that have failed in systems during that time. The big upside is that this is the first (and only) system I have tried to upgrade from F15 -> F16; remote upgrades will be high risk at best...

1) To move the start of /boot (assuming md0 on partition 1 of the disks (sda1 and sdb1) for below, change my example as needed):
This assumes a mirrored /dev/md0.

#backup /boot
tar cvzf  ~/boot.tgz  --exclude '*lost+found*' /boot/

# make a note of where / is, you will need it until my #3 point is solved
df

# confirm the md and the partitions (PV's)
cat /proc/mdstat

# danger below here!!!! Do at your own risk, not mine.

# free the partitions
mdadm --stop /dev/md0 

fdisk /dev/sda
# delete and re-add sda1
# ('d' to delete, 'n' to add, use the default start of 2048 and whatever end is available, 't' to set the type to fd (RAID), 'a' to mark it active, 'w' to write)

# You will most likely get a re-read warning about needing to reboot. DON'T REBOOT!
partprobe

# repeat 'fdisk /dev/sdb' and partprobe

# make a new (now smaller) md0
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# create a filesystem
mkfs.ext3 /dev/md0

# make sure mounted
mount /boot
df /boot

# restore /boot (the archive stores paths as boot/..., so extract relative to /)
tar -xvzf  ~/boot.tgz  -C /  --exclude '*lost+found*'
# (excluded to make sure the empty slots aren't lost)

(assuming it is empty for you too)
grub2-mkconfig -o /boot/grub2/grub.cfg

grub2-install /dev/sda
grub2-install /dev/sdb

2) If you fail to boot and just get a Grub> prompt...
You need to know the root mapper path. From step one above, mine would look like: /dev/mapper/SysVG-RootLV (where Sys is the system name; I'm an ex-IBMer if you wonder about the naming standard).
insmod gzio
insmod part_msdos
insmod ext2
linux /vmlinuz-3.1.5-6.fc16.i686 root=/dev/mapper/SysVG-RootLV
initrd /initramfs-3.1.5-6.fc16.i686.img
boot

3) Problem as described in #2: at boot, grub 1.99 goes to a grub prompt vs presenting a menu.

(I also repeated the grub2-install first running 'rm /etc/grub2.cfg', 'ln -s /boot/grub2/grub.cfg /etc', it still doesn't help)

Ideas how to get past grub2 not showing a menu?

(743022 may be the same core issue as 737508)

Comment 14 Jes Sorensen 2011-12-28 09:52:30 UTC
Jeff,

It sounds like a different bug that you should file against grub2 to get
the right people to look at it.

Cheers,
Jes

Comment 15 Jóhann B. Guðmundsson 2012-01-29 15:08:26 UTC
Jes, this is a duplicate of 713224, right?

Comment 16 Jes Sorensen 2012-01-30 13:48:25 UTC
713224 is against Fedora 15, this is against Fedora 16 - problem needs to be
fixed in both places, so no, not a dupe.

Comment 17 Jóhann B. Guðmundsson 2012-01-30 14:00:25 UTC
It's still the same "bug" right?

Comment 18 Dan Williams 2012-06-23 22:10:27 UTC
(In reply to comment #0)
> Description of problem:
> On a system running IMSM (BIOS) raid, trying to yum upgrade from F15 to F16.
> All data is sitting on the IMSM raid drive. During package clean-up state of
> mdadm, progress stops, and the system seems to have lost its system disk.
> Switching to console mode and trying to login as root hangs indefinitely as
> well.

Sounds like mdmon dies and the kernel is waiting indefinitely to mark the metadata dirty.  This also correlates with Adam's finding that raid0 seems to work a bit better.

I'm about to try this upgrade on my home systems (raid1 and a raid5), so I'll let you know what I find.

I've been reluctant to reboot/touch my F15 system because systemd still arranges for the array to resync each boot, and given Lennart's comment 11 I don't expect this is fixed in later Fedoras.

We hashed through some of the details months back [1], and iirc the consensus back then was to exempt rootfs mdmon from cgroup based killing and just "return to the initramfs" to manage mdmon shutdown.  Unfortunately this requires coordinated updates to systemd and dracut.

[1]: http://marc.info/?t=129145213000001&r=1&w=2

Comment 19 Dan Williams 2012-06-24 19:30:44 UTC
(In reply to comment #18)
> I've been reluctant to reboot/touch my F15 system because systemd still
> arranges for the array to resync each boot, and given Lennart's comment11 I
> don't expect this is fixed in later Fedoras.

...actually the systemd enabling was added:

commit bd1a69818042e85e24ec3adaf5eb3ac30ab1d9fd
Author: Lennart Poettering <lennart>
Date:   Wed Jan 11 01:51:52 2012 +0100

    shutdown: add link to root storage daemon text

commit 7e4ab3c5a6295193d0c58d353b6430265d842f34
Author: Lennart Poettering <lennart>
Date:   Tue Jan 10 04:20:55 2012 +0100

    shutdown: exclude processes with argv[0][0] = '@' from killing

http://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons

Now just need the Dracut and mdmon update.

Comment 20 Jes Sorensen 2012-06-24 19:42:39 UTC
It's been in mdadm for a while - but it wasn't in the version included
on the F16 install image, so the upgrade may not get through without a
hang.

dracut has the fixes in too, I just don't have the git commits handy.

Jes


commit a0963a86e12a55d501f421048bd7c09cf4d78b93
Author: Jes Sorensen <Jes.Sorensen>
Date:   Wed Jan 25 15:18:04 2012 +0100

    Spawn mdmon with --offroot if mdadm was launched with --offroot
    
    Acked-by: Doug Ledford <dledford>
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    Signed-off-by: NeilBrown <neilb>

commit da827518c1f062e7d49433691d33e103525f9d6a
Author: Jes Sorensen <Jes.Sorensen>
Date:   Wed Jan 25 15:18:03 2012 +0100

    Add --offroot argument to mdmon
    
    Acked-by: Doug Ledford <dledford>
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    Signed-off-by: NeilBrown <neilb>

commit 08ca2adffffeb3bfda3cafababfc26706a60463b
Author: Jes Sorensen <Jes.Sorensen>
Date:   Wed Jan 25 15:18:02 2012 +0100

    Add --offroot argument to mdadm
    
    When --offroot is specified, mdadm will change the first character of
    argv[0] to '@'. This is used to signal to systemd that mdadm was
    launched from initramfs and should not be shut down before returning
    to the initramfs.
    
    Acked-by: Doug Ledford <dledford>
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    Signed-off-by: NeilBrown <neilb>
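
The argv[0] convention described in the last commit message can be checked on a running system. The following is only a minimal sketch and is not part of mdadm or systemd: the function name list_offroot_pids and its optional proc-root argument are made up for illustration, and it assumes the Linux /proc/<pid>/cmdline layout, where the arguments are NUL-separated and argv[0] comes first.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs whose argv[0] begins with '@',
# the marker that --offroot sets so the shutdown code skips the process.
# $1 is the proc root (defaults to /proc); it is a parameter only so the
# function can be tried against a fake directory tree.
list_offroot_pids() {
    proc_root=${1:-/proc}
    for f in "$proc_root"/[0-9]*/cmdline; do
        [ -r "$f" ] || continue
        # argv[0] is stored first, so the first byte of the file
        # is argv[0][0]. Kernel threads have an empty cmdline.
        first=$(head -c 1 "$f" 2>/dev/null)
        if [ "$first" = "@" ]; then
            pid=${f%/cmdline}
            echo "${pid##*/}"
        fi
    done
}
```

Calling list_offroot_pids with no argument scans the live /proc; an mdmon started with --offroot would be expected to show up in the output, while an mdmon started without it would not.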

Comment 21 Dan Williams 2012-06-24 23:51:22 UTC
Great! I'll take a look.

So on my system I reproduced the hang while strace'ing mdmon during a yum upgrade:

   pselect6(16, NULL, NULL, [8 10 11 12 15], {86400, 0}, {[TERM], 8} <unfinished ...>
   +++ killed by SIGKILL +++

Does the clean-up action start killing things after a timeout?

In any event the workaround that can go in the wiki is to disable active/clean transitions during the upgrade:

   echo 0 > /sys/block/md127/md/safe_mode_delay

This prevents the root device from hanging after mdmon is killed.
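
For the wiki, the workaround could be wrapped so that the previous value is recorded and can be put back afterwards. This is a sketch under assumptions, not a tested procedure: the function names and the optional sysfs-root argument are hypothetical, only the safe_mode_delay path comes from the command above, and applying it to every md device rather than just md127 is a guess at what one would want. It has to run as root on a real system.

```shell
#!/bin/sh
# Hypothetical wrapper around the safe_mode_delay workaround.
# $1 is the sysfs block directory (defaults to /sys/block); it is a
# parameter only so the functions can be exercised on a scratch directory.

# Record each array's current safe_mode_delay, then write 0 to it.
# Prints one "old_value path" line per array for restore_safe_mode.
disable_safe_mode() {
    sys_block=${1:-/sys/block}
    for f in "$sys_block"/md*/md/safe_mode_delay; do
        [ -w "$f" ] || continue
        old=$(cat "$f")
        echo "$old $f"
        echo 0 > "$f"
    done
}

# Read "old_value path" lines on stdin and write each value back.
restore_safe_mode() {
    while read -r old f; do
        if [ -w "$f" ]; then
            echo "$old" > "$f"
        fi
    done
}

# Possible use around the upgrade:
#   saved=$(disable_safe_mode)
#   yum update
#   echo "$saved" | restore_safe_mode
```

Since the machine is normally rebooted right after the upgrade, the restore step mostly matters when the upgrade is aborted and the system keeps running.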

However the upgrade completes with:

   Cleanup    : libgcc-4.6.3-2.fc15                                    3147/3147
   Rpmdb checksum is invalid: dCDPT(pkg checksums)

...but I wonder if that is just a side-effect of whatever killed mdmon?  Seems to have come up ok after a forced reboot, probably something wonky in the hand-off from old systemd to new?

./run/initramfs/lib/dracut/hooks/shutdown/30md-shutdown.sh does not appear to be ensuring the array is clean before rebooting.  It can call "mdadm --wait-clean --scan" to do that.
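
A rough sketch of what such an addition to the shutdown hook might look like. This is not the actual content of 30md-shutdown.sh: the function name and the MDADM override variable are hypothetical, and the only command taken from the suggestion above is "mdadm --wait-clean --scan".

```shell
#!/bin/sh
# Hypothetical shutdown-hook fragment, not dracut's real script.
# MDADM may be set to point at a different binary (an assumption made
# here so the function can be tried with a stub); by default the mdadm
# found in PATH is used.
wait_md_clean() {
    mdadm_bin=${MDADM:-mdadm}
    # Ask mdadm to wait until the arrays it finds are marked clean
    # in their metadata before the reboot proceeds.
    if "$mdadm_bin" --wait-clean --scan; then
        echo "md arrays clean"
    else
        echo "mdadm --wait-clean failed; arrays may resync on next boot" >&2
        return 1
    fi
}
```

The non-zero return lets a calling hook decide whether to retry or to carry on with the reboot anyway.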

Last note is that systemd still seems to arrange for mdmon to die an early death in the ultimate_send_signal() case.  Any reason that routine can't use killall() to get the benefit of ignore_proc()?

Comment 22 Jes Sorensen 2012-06-25 08:43:04 UTC
Dan,

Glad it worked - you may want to open a bugzilla against dracut to suggest
they add those commands to the shutdown scripts. It may not get noticed
here.

Cheers,
Jes

Comment 23 Harald Hoyer 2012-07-16 15:26:02 UTC
(In reply to comment #22)
> Dan,
> 
> Glad it worked - you may want to open a bugzilla against dracut to suggest
> they add those commands to the shutdown scripts. It may not get noticed
> here.
> 
> Cheers,
> Jes

Cloned as dracut bug 840562.

Comment 24 Lennart Poettering 2012-09-14 15:11:24 UTC
Is this fixed now? Can I close this?

Comment 25 Lennart Poettering 2013-01-14 22:14:34 UTC
No response, closing.