Created attachment 586208 [details]
anaconda-ks.cfg generated by "failed" install

Description of problem:
After installing from an nfsiso repository using the default install options and clicking the reboot button, the system never finishes rebooting.

Version-Release number of selected component (if applicable):
Fedora 17 RC3

How reproducible:
2 out of 2 tries so far

Steps to Reproduce:
1. Put DVD.iso in an NFS export on a 100BaseT-connected system
2. Boot desktop install DVD with repo=nfsiso:192.168.1.178:/home/mark/f17
3. Proceed with the default options and default graphical install package set
4. After install is complete, click reboot button

Actual results:
Reboot never completes. I can switch to vc6 and see these messages:
-------
Sending SIGTERM to remaining processes...
NetworkManager[548]: <info> (eth0): DHCPv4 client pid 598 exited with status -1
NetworkManager[548]: <warn> DHCP client died abnormally
NetworkManager[548]: <info> (eth0): device state change activated -> failed (reason 'ip-config-expired') [100 120 6]
NetworkManager[548]: <warn> Activation (eth0) failed.
NetworkManager[548]: <warn> quit request received, terminating...
NetworkManager[548]: <info> (eth0): now unmanaged
NetworkManager[548]: <info> (eth0): device state change: failed -> unmanaged (reason 'removed') [120 10 36]
NetworkManager[548]: <info> (eth0): deactivating device (reason 'removed') [36]
NetworkManager[548]: <info> (eth0): cleaning up...
NetworkManager[548]: <info> (eth0): taking down device.
NetworkManager[548]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
-------
It looks like NetworkManager might be bringing down the interface before the NFS share can be unmounted.

I can boot to the installed system and everything looks OK.
I am handling the issue for RHEL (bug #821502). systemd seems unable to umount /run/install/repo, which is NFS-mounted in dracut. Here is an updates image with a patch: http://rvykydal.fedorapeople.org/updates.umountrepo.img Does it fix the problem for you? (You can check it by adding updates=http://rvykydal.fedorapeople.org/updates.umountrepo.img to the boot parameters or by using the kickstart updates command.)
That seems to have fixed it. Though, the first time I tested the updates image I also set up an additional NFS package repo through the GUI and it failed to reboot in exactly the same way. However, using NFS for just the ISO works now at least. I don't know whether this should be an f17blocker, but I'll set it anyway and let someone else decide.
Well, I realized that your case may be different from comment #1 because it is nfsiso:, not nfs: as in comment #1. The nfsiso: case mounts /run/install/isodir in dracut, so I need to update my fix. I also need to check the nfs repos added in the UI - I thought that only nfs mounts made by dracut would cause a hang in the systemd reboot.
Now I'm confused :) Your updates.img was for nfs: only? I don't know why it worked then, because I didn't test with nfs: at all. It's conceivable that I mistyped the command line the time that it worked, and it installed from the DVD instead of nfsiso. That would mean that the failure to reboot with the NFS GUI repo could have still been caused by nfsiso. If I have time in the next few hours I'll test again.
I'm still testing and will send updated updates.img soon. As for the hang, I can't reproduce it always (there might be some race with NetworkManager/network going down), so it is possible that you were just lucky.
Here is new updates.img that should fix the issue: http://rvykydal.fedorapeople.org/updates.umountrepo2.img
I can confirm this problem with F17 RC4. Unfortunately it is non-deterministic, probably some race condition. I saw the most reboot hangs when doing a netinst.iso + repo=nfsiso: install in a VM (on bare metal the hangs were rarer). I'll test again with the update from comment 6.
Fix from comment 6 works for the "netinst+repo=nfsiso: in VM" case - previously 4 out of 4 attempts broken (hung at reboot), now 4 out of 4 attempts successful.

But it still seems to be broken for "direct kernel boot + repo=nfsiso: in VM". 3 out of 3 attempts broken. I have connected a serial console and this is what it prints out in the end:

Sending SIGTERM to remaining processes...
NetworkManager[641]: <info> (eth0): DHCPv4 client pid 670 exited with status -1
NetworkManager[641]: <warn> DHCP client died abnormally
NetworkManager[641]: <info> (eth0): device state change: activated -> failed (reason 'ip-config-expired') [100 120 6]
NetworkManager[641]: <warn> Activation (eth0) failed.
NetworkManager[641]: <warn> quit request received, terminating...
NetworkManager[641]: <info> (eth0): now unmanaged
NetworkManager[641]: <info> (eth0): device state change: failed -> unmanaged (reason 'removed') [120 10 36]
NetworkManager[641]: <info> (eth0): deactivating device (reason 'removed') [36]
NetworkManager[641]: <info> (eth0): cleaning up...
NetworkManager[641]: <info> (eth0): taking down device.
NetworkManager[641]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage/boot.
Unmounted /mnt/sysimage.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/security.
Unmounted /dev/mqueue.
Unmounted /sys/kernel/config.
Unmounted /sys/kernel/debug.
Unmounted /media.
Unmounted /run/initramfs/var/lib/nfs/rpc_pipefs.
<computer hangs here>
The update from comment 6 fixed nfsiso: when booting from DVD, and fixed adding a new NFS repo from the GUI. I'm not set up to test any other install methods so that's all I can do for now.
Confirming comment 8. While it works with the fix from comment 6, it doesn't seem to work for direct kernel boot.
(In reply to comment #8)
> But it still seems to be broken for "direct kernel boot + repo=nfsiso: in
> VM". 3 out of 3 attempts broken.

1/1 worked for me, trying again
(In reply to comment #11)
> 1/1 worked for me, trying again

2/2 working
Well...the criteria don't specify that reboot at the end of install needs to work, there is an obvious 'workaround' (reboot manually), and this doesn't seem to lead to any actual breakage of any kind. I think I'm -1 blocker.
Please don't forget about automatic deployment of virtual machines. When doing a manual install, yes, it is no problem to power off the machine the hard way. When doing it in an automated way, however, this might cause a lot of problems. I believe this should be a blocker.
I can confirm this also hits kickstarted installs. I reproduced using virt-manager, where I booted from vmlinuz+initrd and provided these arguments:

repo=nfsiso:192.168.1.1:/mnt/data/iso/Fedora-17-i386-DVD.iso
updates=http://rvykydal.fedorapeople.org/updates.umountrepo2.img
ks=http://192.168.1.1:8000/minimal.ks

Broken criteria:
The installer must be able to successfully complete a scripted installation, using the installer's preferred scripting system, which duplicates the default interactive installation as closely as possible
-and-
Any installation method or process designed to run unattended must do so (there should be no prompts requiring user intervention)
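For anyone scripting the reproducer, the three boot arguments above can be assembled with a small helper. This is only a sketch: `build_boot_args` is a hypothetical name, not part of anaconda or virt-manager, and it only joins the repo=, updates= and ks= options quoted in this comment into one kernel command-line string.

```python
def build_boot_args(repo, updates=None, ks=None):
    """Join installer boot options into one kernel command-line string.

    Hypothetical helper for test scripts; it does no validation of the
    values and knows only the three options used in this bug report.
    """
    args = ["repo=" + repo]
    if updates:
        args.append("updates=" + updates)
    if ks:
        args.append("ks=" + ks)
    return " ".join(args)


if __name__ == "__main__":
    print(build_boot_args(
        "nfsiso:192.168.1.1:/mnt/data/iso/Fedora-17-i386-DVD.iso",
        updates="http://rvykydal.fedorapeople.org/updates.umountrepo2.img",
        ks="http://192.168.1.1:8000/minimal.ks",
    ))
```

The resulting string is what would go into the "Kernel arguments" field of a direct kernel boot.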
An updates.img that fixes the issue would be a sufficient fix for such large-scale automated cases, wouldn't it?
Created attachment 586440 [details] howto: direct kernel boot in virt-manager This is how you set a direct kernel boot in virt-manager. Use to reproduce this issue.
-1 blocker if nfs works and only nfsiso is broken, as "unpacking"/mounting the iso-file on the server seems to be a valid and fair workaround.
-1 blocker if people are really certain that there's an updates.img available and tested to fix the issue.
+1 blocker in all other cases, under the criteria kparal mentioned in comment #16.
When using a direct kernel boot, repo=nfs: (mounting an exploded install tree) is also affected by this bug.
FWIW, I reproduced this one try out of three in the 'install from netinst with no updates.img' case, and one try out of three in the 'install from direct kernel boot with updates.img' case. Both using VMs.
I think I'm -1 blocker here, because, while this bug is certainly annoying, the install is complete, and when reboot is forced, the system does come up properly. Of course, I am forced to wonder what the fix is in the updates.img that resolves the "netinst+repo=nfsiso: in VM" case, and if the same sort of fix could be applied in this case as well. Nevertheless, I think it is fine to work on this one for F18 (especially if F16 suffered from the same issue, as implied in Comment 15).
I'm also -1, particularly if we have been dealing with this already through f16. If we can fix it via updates.img that would be ideal. I'm not seeing a lot of consistency in the test results above either....
The current status, as Kamil reads it, is that the updates.img fixes the 'boot from netinst' case (i.e. repository is only being used for packages, not for stage2) but not the 'boot from kernel pair' case (repo is being used as the source of stage2). I have reproduced both failure cases once out of three tries: failure with a netinst boot *without* the updates.img, and failure with a kernel pair boot *with* the updates.img. I didn't yet verify that it never fails with netinst+updates; I'd have to do quite a few installs to be confident of that, given that I only seem to reproduce the failure case infrequently.
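To put a rough number on "quite a few installs": if the unfixed failure rate really were about 1 in 3 (a very loose estimate from three tries) and each install were independent, the chance of n clean reboots in a row by luck alone is (2/3)^n. The sketch below computes the smallest n that pushes this chance below a chosen threshold; the function name and the independence assumption are mine, not something established in this bug.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Smallest n such that (1 - failure_rate)**n <= 1 - confidence.

    Assumes independent installs with a constant per-install failure
    rate, which is an idealisation for a race condition like this one.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    print(clean_runs_needed(1 / 3))        # 95% threshold
    print(clean_runs_needed(1 / 3, 0.99))  # 99% threshold
```

With a 1-in-3 failure rate this gives 8 consecutive clean installs for the 95% threshold and 12 for 99%; a lower true failure rate would need correspondingly more runs.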
(In reply to comment #22)
> (especially if F16 suffered from the same issue, as implied
> in Comment 15).

Just to clarify: I don't think F16 suffered from exactly the same problem. This seems to be a new one. But in my experience there were some other race conditions that made some automated installs fail.
Progress update: Radek Vykydal was able to reproduce the issue. We also found out that RHEL 7 Alpha (20120515.n.1) is not affected by this bug (when using updates.img). We compared the composes and we suppose NetworkManager could be the culprit. Patch from bug 787314 comment 9 is in Fedora and is not in RHEL. I'm trying to create a custom anaconda compose that would contain NetworkManager-0.9.4.0-1.git20120328.fc17 [1], which is the last version available without this patch applied. Radek is trying to come up with a solution based on updates.img. [1] http://koji.fedoraproject.org/koji/buildinfo?buildID=310098
(In reply to comment #26)
> Progress update: Radek Vykydal was able to reproduce the issue. We also
> found out that RHEL 7 Alpha (20120515.n.1) is not affected by this bug (when
> using updates.img). We compared the composes and we suppose NetworkManager
> could be the culprit. Patch from bug 787314 comment 9 is in Fedora and is
> not in RHEL.

* Thu Apr 12 2012 Dan Winship <danw> - 0.9.4-4.git20120403
- Fix networked-filesystem systemd dependencies (rh #787314)

Also, the hang/nohang seems to correspond with the way NM is terminated,
for hang: see comment #8
for nohang:

Sending SIGTERM to remaining processes.........................................
NetworkManager[743]: <info> caught signal 15, shutting down normally.
NetworkManager[743]: <warn> quit request received, terminating...
NetworkManager[743]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage/boot.
Unmounted /mnt/sysimage.
Unmounted /sys/kernel/config.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/security.
Unmounted /dev/mqueue.
Unmounted /sys/kernel/debug.
Unmounted /media.
Unmounted /run/initramfs/var/lib/nfs/rpc_pipefs.
Disabling swaps.
Detaching loop devices.
...
(In reply to comment #26)
> We compared the composes and we suppose NetworkManager
> could be the culprit. Patch from bug 787314 comment 9 is in Fedora and is
> not in RHEL.

This is probably not the culprit, because even when we removed the patch, reboot still hangs. My machine hung with the following messages:

Sending SIGTERM to remaining processes...
NetworkManager[626]: <warn> Activation (eth0) failed.
NetworkManager[626]: <warn> quit request received, terminating...
NetworkManager[626]: <info> (eth0): now unmanaged
NetworkManager[626]: <info> (eth0): device state change: failed -> unmanaged (reason 'removed') [120 10 36]
NetworkManager[626]: <info> (eth0): deactivating device (reason 'removed') [36]
NetworkManager[626]: <info> (eth0): cleaning up...
NetworkManager[626]: <info> (eth0): taking down device.
NetworkManager[626]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage.
Unmounted /dev/mqueue.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/debug.
Unmounted /sys/kernel/config.
Unmounted /sys/kernel/security.
Unmounted /media.
Unmounted /run/initramfs/var/lib/nfs/rpc_pipefs.

Please note that NetworkManager is deactivating and taking down the device when the reboot hangs, but no such message appears when the machine reboots successfully. That could be the difference.
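The correlation noted above can be checked mechanically against a captured serial console log. The sketch below is a heuristic only: the two marker strings are copied from the logs pasted in this bug, while the function name and return labels are made up for illustration, and a match says nothing about the root cause.

```python
# Marker strings taken verbatim from the console logs in this bug.
HANG_MARKER = "taking down device"
CLEAN_MARKER = "caught signal 15, shutting down normally"


def classify_shutdown_log(log_text):
    """Label a shutdown log by which NetworkManager exit path it shows.

    Returns "hang-pattern" if NM took the device down on exit,
    "clean-pattern" if NM reported a normal SIGTERM shutdown,
    and "unknown" if neither marker is present.
    """
    if HANG_MARKER in log_text:
        return "hang-pattern"
    if CLEAN_MARKER in log_text:
        return "clean-pattern"
    return "unknown"


if __name__ == "__main__":
    sample = (
        "NetworkManager[626]: <info> (eth0): cleaning up...\n"
        "NetworkManager[626]: <info> (eth0): taking down device.\n"
        "NetworkManager[626]: <info> exiting (success)\n"
    )
    print(classify_shutdown_log(sample))
```

Running it over logs from a batch of installs would show whether every hang really carries the "taking down device" line.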
Please also see rhel bug #818581. Comment #12 has similar findings.
Please stop referencing bugs that are not open for the general public to access. There are already two such references in this discussion now, at the cost of transparency. Basically, you're shutting the community off like that.
I'm leaning toward -1 blocker on this now. IMO it doesn't make much sense to hold up the release for this if it's expected that it can be fixed in an updates.img, especially since it looks like a proper fix and testing could take more than a week. It's also a relatively uncommon install method, and anyone who can tame an NFSD should be able to figure out how to use updates.img (as long as we publicize that update adequately).
Discussed during the 2012-05-24 Fedora 17 final go/no-go meeting. Rejected as a blocker for Fedora 17 final because it doesn't directly violate any of the Fedora 17 release criteria (the install completes, the installed system works). Given that it should only affect a minority of users and could be fixed with an updates.img, it doesn't need to block release.
Anaconda needs to stop the NetworkManager service properly, so let's call systemctl reboot without the --force flag. Here is an updates image with a patch: http://rvykydal.fedorapeople.org/updates.reboot.img
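As an illustration of the idea (not the actual anaconda patch, which is only available in the updates image), a Python caller could build the reboot command as below. `reboot_command` and `reboot` are hypothetical names; the only real interface used is `systemctl reboot` and its `--force` option. Without `--force`, systemd goes through its normal shutdown sequence and stops services before unmounting; with it, the shutdown of running services is skipped.

```python
import subprocess


def reboot_command(force=False):
    """Build the systemctl invocation; omit --force for an orderly shutdown."""
    cmd = ["systemctl", "reboot"]
    if force:
        # Skips the normal stopping of services; this is the behaviour
        # the comment above proposes to avoid.
        cmd.append("--force")
    return cmd


def reboot(dry_run=True):
    """Run (or, with dry_run=True, only return) the non-forced reboot command."""
    cmd = reboot_command(force=False)
    if not dry_run:
        subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    print(" ".join(reboot()))  # dry run: prints the command, does not reboot
```

The `dry_run` default is there so the sketch can be tried safely outside an installer environment.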
(In reply to comment #33)
> http://rvykydal.fedorapeople.org/updates.reboot.img

Works like a charm. We almost made it into F17, what a pity!
This is still not closed, moving to F18 blockers so that we won't forget about it.
*** Bug 837628 has been marked as a duplicate of this bug. ***
(In reply to comment #36)
> *** Bug 837628 has been marked as a duplicate of this bug. ***

I think this needs more discussion. If bug 837628 is indeed a duplicate of this bug, why did putting updates.reboot.img into the images folder not fix it? Please note also that bug 837628 is referring to an unpacked image mounted over nfs, not an nfs-mounted iso (see comment 24 - these may be related bugs, but they don't seem to be the same).
(In reply to comment #37)
> (see comment 24 - these may be related bugs, but they don't seem to be the same).

Oops, my misunderstanding of comment 24. Nevertheless the point remains: if this were the same bug, putting updates.reboot.img into the images folder should have fixed it. It doesn't.
My apologies. Bug 837628 really is a duplicate of this bug. Since the file was named updates.reboot.img, I wrongly assumed that anaconda could handle names of the form updates*.img. With the file renamed to updates.img, it works properly.
Discussed at the blocker bug review meeting of 2012-08-03: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-08-03/f18-alpha-blocker-review-1.2012-08-03-17.01.log.html . Rejected as an Alpha blocker on the grounds that it does not violate the Alpha criteria. It potentially violates the Beta criteria (NFS is Beta, not Alpha), so re-proposing as a Beta blocker, but please re-test with an image that uses the new anaconda UI once one is available for testing (Fedora 18 Alpha TC1 should use the new UI), as the new UI changes may have changed this bug.
Discussed at 2012-09-26 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-09-26/f18-beta-blocker-review-1.2012-09-26-16.03.log.txt . Accepted as a blocker per criterion "The installer must be able to use the HTTP, FTP and NFS remote package source options".
It doesn't seem like anyone has actually tested this with newUI yet, let alone Beta. Could someone please test this with Beta TC1 or TC2, http://dl.fedoraproject.org/pub/alt/stage/ ? Thanks. https://bugzilla.redhat.com/show_bug.cgi?id=862996 is a failed test with TC1, but it involves PXE in addition to nfsiso, so the problem there may be PXE-related.
According to bug 853508 comment 20 nfsiso support is still broken.
Discussed at 2012-10-11 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-10-11/f18beta-blocker-review-3.1.2012-10-11-16.04.log.txt . We're rather sceptical on blocker status for this one for various reasons - we expect it's fixed, and it's probably not bad enough to block Beta even if it isn't - but Kamil will re-test and check with anaconda team whether the fix is in the F18 code or not.
The fix is in F18.

(In reply to comment #43)
> According to bug 853508 comment 20 nfsiso support is still broken.

That issue should be tracked either in bug 853508 or in a new BZ.
OK, let's close this one. nfsiso issue is tracked in bug 853508. I reported new bug 865776, because unfortunately PXE+nfs is again broken in F18.