824191 – nfsiso install hangs during reboot

Summary: nfsiso install hangs during reboot

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	anaconda
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Radek Vykydal
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	https://fedoraproject.org/wiki/Common...
Duplicates (1):	837628
Depends On:	853508
Blocks:	F18Beta, F18BetaBlocker

Reported:	2012-05-22 23:19 UTC by Mark McClelland
Modified:	2012-10-12 12:30 UTC
CC List:	15 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-10-12 12:30:36 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
anaconda-ks.cfg generated by "failed" install (1.48 KB, text/plain) 2012-05-22 23:19 UTC, Mark McClelland
howto: direct kernel boot in virt-manager (97.78 KB, image/png) 2012-05-23 19:44 UTC, Kamil Páral

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	865776	0	unspecified	CLOSED	kernel pair boot with inst.repo=nfs: hangs on reboot	2021-02-22 00:41:40 UTC

Internal Links: 865776

Description Mark McClelland 2012-05-22 23:19:41 UTC

Created attachment 586208
anaconda-ks.cfg generated by "failed" install

Description of problem:
After installing from an nfsiso repository using the default install options and clicking the reboot button, the system never finishes rebooting.

Version-Release number of selected component (if applicable):
Fedora 17 RC3

How reproducible:
2 out of 2 tries so far

Steps to Reproduce:
1. Put DVD.iso in an NFS export on a 100BaseT-connected system
2. Boot desktop install DVD with repo=nfsiso:192.168.1.178:/home/mark/f17
3. Proceed with the default options and default graphical install package set
4. After install is complete, click reboot button

Actual results:
Reboot never completes. I can switch to vc6 and see these messages:

-------
Sending SIGTERM to remaining processes...
NetworkManager[548]: <info> (eth0): DHCPv4 client pid 598 exited with status -1
NetworkManager[548]: <warn> DHCP client died abnormally
NetworkManager[548]: <info> (eth0): device state change activated -> failed (reason 'ip-config-expired') [100 120 6]
NetworkManager[548]: <warn> Activation (eth0) failed.
NetworkManager[548]: <warn> quit request received, terminating...
NetworkManager[548]: <info> (eth0): now unmanaged
NetworkManager[548]: <info> (eth0): device state change: failed -> unmanaged (reason 'removed') [120 10 36]
NetworkManager[548]: <info> (eth0): deactivating device (reason 'removed') [36]
NetworkManager[548]: <info> (eth0): cleaning up...
NetworkManager[548]: <info> (eth0): taking down device.
NetworkManager[548]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
-------

It looks like NetworkManager might be bringing down the interface before the NFS share can be unmounted.

I can boot to the installed system and everything looks OK.

Comment 1 Radek Vykydal 2012-05-23 07:46:37 UTC

I am handling the issue for RHEL (bug #821502). systemd seems unable to unmount /run/install/repo, which is NFS-mounted in dracut.
Here is an updates image with a patch:
http://rvykydal.fedorapeople.org/updates.umountrepo.img
Does it fix the problem for you?
(You can check it by adding
updates=http://rvykydal.fedorapeople.org/updates.umountrepo.img
to the boot parameters or by using the kickstart updates command.)
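
As a rough illustration of the boot-parameter route, the sketch below assembles an installer command line that carries both the repo= and updates= options. The helper name build_boot_args is hypothetical and only joins strings; the values shown are the ones quoted in this report.

```python
def build_boot_args(repo, updates_url=None):
    """Join installer boot options into one kernel command-line string.

    Hypothetical helper for illustration; it only formats the options,
    it does not validate them or talk to the installer.
    """
    args = [f"repo={repo}"]
    if updates_url:
        # updates= points the installer at an updates image to apply at startup
        args.append(f"updates={updates_url}")
    return " ".join(args)


print(build_boot_args(
    "nfsiso:192.168.1.178:/home/mark/f17",
    updates_url="http://rvykydal.fedorapeople.org/updates.umountrepo.img",
))
```

The resulting string would be appended to the installer's boot prompt or to the kernel arguments of a VM definition.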

Comment 2 Mark McClelland 2012-05-23 09:08:26 UTC

That seems to have fixed it. However, the first time I tested the updates image I also set up an additional NFS package repo through the GUI, and it failed to reboot in exactly the same way. Using NFS for just the ISO works now, at least.

I don't know whether this should be an f17blocker, but I'll set it anyway and let someone else decide.

Comment 3 Radek Vykydal 2012-05-23 09:41:21 UTC

Well, I realized that your case may be different from comment #1 because it is nfsiso:, not nfs: as in comment #1. The nfsiso: case mounts /run/install/isodir in dracut, so I need to update my fix. I also need to check the NFS repos added in the UI; I thought that only NFS mounts made by dracut would cause a hang in the systemd reboot.

Comment 4 Mark McClelland 2012-05-23 10:05:34 UTC

Now I'm confused :) Your updates.img was for nfs: only? I don't know why it worked then, because I didn't test with nfs: at all.

It's conceivable that I mistyped the command line the time that it worked, and it installed from the DVD instead of nfsiso. That would mean that the failure to reboot with the NFS GUI repo could have still been caused by nfsiso.

If I have time in the next few hours I'll test again.

Comment 5 Radek Vykydal 2012-05-23 10:51:52 UTC

I'm still testing and will send an updated updates.img soon. As for the hang, I can't always reproduce it (there might be some race with NetworkManager/network going down), so it is possible that you were just lucky.

Comment 6 Radek Vykydal 2012-05-23 11:39:58 UTC

Here is a new updates.img that should fix the issue:
http://rvykydal.fedorapeople.org/updates.umountrepo2.img

Comment 7 Kamil Páral 2012-05-23 11:54:44 UTC

I can confirm this problem with F17 RC4. Unfortunately it is non-deterministic, probably some race condition. I saw the most reboot hangs when doing a netinst.iso + repo=nfsiso: install in a VM (on bare metal the hangs were rarer).

I'll test again with the update from comment 6.

Comment 8 Kamil Páral 2012-05-23 12:58:28 UTC

The fix from comment 6 works for the "netinst+repo=nfsiso: in VM" case: previously 4 out of 4 attempts were broken (hung at reboot), now 4 out of 4 attempts are successful.

But it still seems to be broken for "direct kernel boot + repo=nfsiso: in VM": 3 out of 3 attempts broken. I connected a serial console and this is what it prints at the end:

Sending SIGTERM to remaining processes...                                      
NetworkManager[641]: <info> (eth0): DHCPv4 client pid 670 exited with status -1
NetworkManager[641]: <warn> DHCP client died abnormally
NetworkManager[641]: <info> (eth0): device state change: activated -> failed (reason 'ip-config-expired') [100 120 6]
NetworkManager[641]: <warn> Activation (eth0) failed.
NetworkManager[641]: <warn> quit request received, terminating...
NetworkManager[641]: <info> (eth0): now unmanaged
NetworkManager[641]: <info> (eth0): device state change: failed -> unmanaged (reason 'removed') [120 10 36]
NetworkManager[641]: <info> (eth0): deactivating device (reason 'removed') [36]
NetworkManager[641]: <info> (eth0): cleaning up...
NetworkManager[641]: <info> (eth0): taking down device.
NetworkManager[641]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage/boot.
Unmounted /mnt/sysimage.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/security.
Unmounted /dev/mqueue.
Unmounted /sys/kernel/config.
Unmounted /sys/kernel/debug.
Unmounted /media.
Unmounted /run/initramfs/var/lib/nfs/rpc_pipefs.
<computer hangs here>

Comment 9 Mark McClelland 2012-05-23 13:06:04 UTC

The update from comment 6 fixed nfsiso: when booting from DVD, and fixed adding a new NFS repo from the GUI. I'm not set up to test any other install methods so that's all I can do for now.

Comment 10 Martin Krizek 2012-05-23 13:29:30 UTC

Confirming comment 8. While it works with the fix from comment 6, it doesn't seem to work for direct kernel boot.

Comment 11 Radek Vykydal 2012-05-23 15:28:55 UTC

(In reply to comment #8)

> But it still seems to be broken for "direct kernel boot + repo=nfsiso: in
> VM". 3 out of 3 attempts broken. 

1/1 worked for me, trying again

Comment 12 Radek Vykydal 2012-05-23 15:53:36 UTC

(In reply to comment #11)

> 1/1 worked for me, trying again
2/2 working

Comment 13 Adam Williamson 2012-05-23 16:53:59 UTC

Well...the criteria don't specify that reboot at the end of install needs to work, there is an obvious 'workaround' (reboot manually), and this doesn't seem to lead to any actual breakage of any kind. I think I'm -1 blocker.

Comment 14 Kamil Páral 2012-05-23 17:37:12 UTC

Please don't forget about automatic deployment of virtual machines. When doing a manual install, yes, it is no problem to power off the machine the hard way. When doing it in an automated way, however, this might cause a lot of problems. I believe this should be a blocker.

Comment 16 Kamil Páral 2012-05-23 18:00:23 UTC

I can confirm this also hits kickstarted installs. I reproduced using virt-manager, where I booted from vmlinuz+initrd and provided these arguments:
repo=nfsiso:192.168.1.1:/mnt/data/iso/Fedora-17-i386-DVD.iso updates=http://rvykydal.fedorapeople.org/updates.umountrepo2.img ks=http://192.168.1.1:8000/minimal.ks

Broken criteria:
The installer must be able to successfully complete a scripted installation, using the installer's preferred scripting system, which duplicates the default interactive installation as closely as possible
-and-
Any installation method or process designed to run unattended must do so (there should be no prompts requiring user intervention)

Comment 17 Adam Williamson 2012-05-23 18:16:32 UTC

An updates.img that fixes the issue would be a sufficient fix for such large-scale automated cases, wouldn't it?

Comment 18 Kamil Páral 2012-05-23 19:44:40 UTC

Created attachment 586440
howto: direct kernel boot in virt-manager

This is how you set up a direct kernel boot in virt-manager. Use it to reproduce this issue.

Comment 19 Sandro Mathys 2012-05-23 20:42:19 UTC

-1 blocker if nfs works and only nfsiso is broken, as "unpacking"/mounting the ISO file on the server seems to be a valid and fair workaround.
-1 blocker if people are really certain that there's an updates.img available and tested to fix the issue.
+1 blocker in all other cases, under the criteria kparal mentioned in comment #16.

Comment 20 Kamil Páral 2012-05-23 20:43:10 UTC

When using a direct kernel boot, repo=nfs: (mounting an exploded install tree) is also affected by this bug.

Comment 21 Adam Williamson 2012-05-23 21:15:10 UTC

FWIW, I reproduced this one try out of three in the 'install from netinst with no updates.img' case, and one try out of three in the 'install from direct kernel boot with updates.img' case. Both using VMs.

Comment 22 Tom "spot" Callaway 2012-05-24 00:48:06 UTC

I think I'm -1 blocker here, because, while this bug is certainly annoying, the install is complete, and when reboot is forced, the system does come up properly.

Of course, I am forced to wonder what the fix is in the updates.img that resolves the "netinst+repo=nfsiso: in VM" case, and if the same sort of fix could be applied in this case as well. Nevertheless, I think it is fine to work on this one for F18 (especially if F16 suffered from the same issue, as implied in Comment 15).

Comment 23 Robyn Bergeron 2012-05-24 01:22:10 UTC

I'm also -1, particularly if we have been dealing with this already through F16. If we can fix it via updates.img, that would be ideal. I'm not seeing a lot of consistency in the test results above, either.

Comment 24 Adam Williamson 2012-05-24 01:50:40 UTC

The current status, as Kamil reads it, is that the updates.img fixes the 'boot from netinst' case (i.e. repository is only being used for packages, not for stage2) but not the 'boot from kernel pair' case (repo is being used as the source of stage2).

I have reproduced both failure cases once out of three tries: failure with a netinst boot *without* the updates.img, and failure with a kernel pair boot *with* the updates.img. I haven't yet verified that it never fails with netinst+updates; I'd have to do quite a few installs to be confident of that, given that I only seem to reproduce the failure case infrequently.

Comment 25 Kamil Páral 2012-05-24 07:26:08 UTC

(In reply to comment #22)
> (especially if F16 suffered from the same issue, as implied
> in Comment 15).

Just to clarify: I don't think F16 suffered from exactly the same problem. This seems to be a new one. But in my experience there were some other race conditions that made some automated installs fail.

Comment 26 Kamil Páral 2012-05-24 11:25:26 UTC

Progress update: Radek Vykydal was able to reproduce the issue. We also found out that RHEL 7 Alpha (20120515.n.1) is not affected by this bug (when using updates.img). We compared the composes and we suppose NetworkManager could be the culprit. Patch from bug 787314 comment 9 is in Fedora and is not in RHEL. I'm trying to create a custom anaconda compose that would contain NetworkManager-0.9.4.0-1.git20120328.fc17 [1], which is the last version available without this patch applied. Radek is trying to come up with a solution based on updates.img.

[1] http://koji.fedoraproject.org/koji/buildinfo?buildID=310098

Comment 27 Radek Vykydal 2012-05-24 11:35:41 UTC

(In reply to comment #26)
> Progress update: Radek Vykydal was able to reproduce the issue. We also
> found out that RHEL 7 Alpha (20120515.n.1) is not affected by this bug (when
> using updates.img). We compared the composes and we suppose NetworkManager
> could be the culprit. Patch from bug 787314 comment 9 is in Fedora and is
> not in RHEL.

* Thu Apr 12 2012 Dan Winship <danw> - 0.9.4-4.git20120403 - Fix networked-filesystem systemd dependencies (rh #787314) 

Also, the hang/nohang seems to correspond with the way NM is terminated,
for hang: see comment #8
for nohang:

Sending SIGTERM to remaining processes.........................................
NetworkManager[743]: <info> caught signal 15, shutting down normally.
NetworkManager[743]: <warn> quit request received, terminating...
NetworkManager[743]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage/boot.
Unmounted /mnt/sysimage.
Unmounted /sys/kernel/config.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/security.
Unmounted /dev/mqueue.
Unmounted /sys/kernel/debug.
Unmounted /media.
Unmounted /run/initramfs/var/lib/nfs/rpc_pipefs.
Disabling swaps.
Detaching loop devices.
...

Comment 28 Kamil Páral 2012-05-24 12:48:43 UTC

(In reply to comment #26)
> We compared the composes and we suppose NetworkManager
> could be the culprit. Patch from bug 787314 comment 9 is in Fedora and is
> not in RHEL.

This is probably not the culprit, because even when we removed the patch, reboot still hangs. My machine hung with the following messages:

Sending SIGTERM to remaining processes...
NetworkManager[626]: <warn> Activation (eth0) failed.
NetworkManager[626]: <warn> quit request received, terminating...
NetworkManager[626]: <info> (eth0): now unmanaged
NetworkManager[626]: <info> (eth0): device state change: failed -> unmanaged (reason 'removed') [120 10 36]
NetworkManager[626]: <info> (eth0): deactivating device (reason 'removed') [36]
NetworkManager[626]: <info> (eth0): cleaning up...
NetworkManager[626]: <info> (eth0): taking down device.
NetworkManager[626]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage.
Unmounted /dev/mqueue.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/debug.
Unmounted /sys/kernel/config.
Unmounted /sys/kernel/security.
Unmounted /media.
Unmounted /run/initramfs/var/lib/nfs/rpc_pipefs.

Please note that NetworkManager is deactivating and taking down the device when the reboot hangs, but no such message appears when the machine reboots successfully. That could be the difference.
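
The observation above can be turned into a quick check on a captured console log. This is a minimal sketch, assuming the log is available as plain text; the function name nm_took_device_down is hypothetical, and the match strings are taken from the logs quoted in this report. It only flags the correlation noted here and does not diagnose the cause.

```python
def nm_took_device_down(console_log):
    """Return True if NetworkManager logged "taking down device"
    before the "Unmounting file systems." line appears.

    In the logs in this report, that message shows up in the runs
    that hang and is absent from the run that reboots cleanly.
    """
    for line in console_log.splitlines():
        if "Unmounting file systems." in line:
            # Reached the unmount phase without seeing the message
            return False
        if "NetworkManager[" in line and "taking down device" in line:
            return True
    return False


hang_log = """Sending SIGTERM to remaining processes...
NetworkManager[626]: <info> (eth0): taking down device.
NetworkManager[626]: <info> exiting (success)
Sending SIGKILL to remaining processes...
Unmounting file systems.
"""
print(nm_took_device_down(hang_log))
```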

Comment 29 Radek Vykydal 2012-05-24 12:59:18 UTC

Please also see rhel bug #818581. Comment #12 has similar findings.

Comment 30 Sandro Mathys 2012-05-24 14:19:10 UTC

Please stop referencing bugs that are not open for the general public to access. There are already two such references in this discussion, at the cost of transparency. Basically, you're shutting the community out like that.

Comment 31 Mark McClelland 2012-05-24 14:55:03 UTC

I'm leaning toward -1 blocker on this now. IMO it doesn't make much sense to hold up the release for this if it's expected that it can be fixed in an updates.img, especially since it looks like a proper fix and testing could take more than a week. It's also a relatively uncommon install method, and anyone who can tame an NFSD should be able to figure out how to use updates.img (as long as we publicize that update adequately).

Comment 32 Tim Flink 2012-05-24 18:17:24 UTC

Discussed during the 2012-05-24 Fedora 17 final go/no-go meeting. Rejected as a blocker for Fedora 17 final because it doesn't directly violate any of the Fedora 17 release criteria (the install completes, the installed system works). Given that it should only affect a minority of users and could be fixed with an updates.img, it doesn't need to block the release.

Comment 33 Radek Vykydal 2012-05-26 22:08:35 UTC

Anaconda needs to stop the NetworkManager service properly, so let's call systemctl reboot without the --force flag.
Here is an updates image with a patch:
http://rvykydal.fedorapeople.org/updates.reboot.img
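
To make the difference concrete, here is a minimal sketch of the two invocations. It is not anaconda's actual code; the function names reboot_command and reboot are hypothetical. With --force, systemctl reboots without shutting down units first, so services such as NetworkManager are not stopped in an orderly way; without it, systemd runs the normal shutdown sequence.

```python
import subprocess


def reboot_command(force=False):
    """Build the systemctl invocation for a reboot.

    force=True adds --force, which skips the orderly shutdown of units.
    """
    cmd = ["systemctl", "reboot"]
    if force:
        cmd.append("--force")
    return cmd


def reboot(dry_run=True):
    """Request a reboot without --force.

    With dry_run=True (the default here, an assumption made for safety)
    the command is only printed, never executed.
    """
    cmd = reboot_command(force=False)
    if dry_run:
        print(" ".join(cmd))
        return cmd
    subprocess.check_call(cmd)  # would actually reboot the machine
    return cmd


reboot(dry_run=True)
```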

Comment 34 Kamil Páral 2012-05-27 15:12:35 UTC

(In reply to comment #33)
> http://rvykydal.fedorapeople.org/updates.reboot.img

Works like a charm. We almost made it into F17, what a pity!

Comment 35 Kamil Páral 2012-06-14 09:08:21 UTC

This is still not closed, moving to F18 blockers so that we won't forget about it.

Comment 36 Jesse Keating 2012-07-05 17:21:23 UTC

*** Bug 837628 has been marked as a duplicate of this bug. ***

Comment 37 bob mckay 2012-07-06 01:26:40 UTC

(In reply to comment #36)
> *** Bug 837628 has been marked as a duplicate of this bug. ***

I think this needs more discussion. If bug 837628 is indeed a duplicate of this bug, why did putting updates.reboot.img into the images folder not fix it? Please note also that bug 837628 is referring to an unpacked image mounted over nfs, not an nfs-mounted iso (see comment 24 - these may be related bugs, but they don't seem to be the same).

Comment 38 bob mckay 2012-07-06 01:40:11 UTC

(In reply to comment #37)
> (see comment 24 - these may be related bugs, but they don't seem to be the same).
Oops, my misunderstanding of comment 24. Nevertheless the point remains, if this were the same bug, putting updates.reboot.img into the images folder should have fixed it. It doesn't.

Comment 39 bob mckay 2012-07-06 09:58:34 UTC

My apologies. Bug 837628 really is a duplicate of this bug. Since the file was named updates.reboot.img, I wrongly assumed that anaconda would pick up any name of the form updates*.img. With the file renamed to updates.img, it works properly.
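
The naming pitfall can be expressed as a one-line check. This is only a sketch of the behavior reported in this comment, not anaconda's lookup code; the function name is_default_updates_image is hypothetical. It applies to an image dropped into the images folder. When the image is passed explicitly with updates=, other file names were used successfully earlier in this report.

```python
import os


def is_default_updates_image(path):
    """Return True only when the file is named exactly updates.img.

    Names such as updates.reboot.img do not match, which mirrors the
    behavior described in this comment.
    """
    return os.path.basename(path) == "updates.img"


for name in ("images/updates.img", "images/updates.reboot.img"):
    print(name, is_default_updates_image(name))
```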

Comment 40 Adam Williamson 2012-08-03 23:42:57 UTC

Discussed at the blocker bug review meeting of 2012-08-03: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-08-03/f18-alpha-blocker-review-1.2012-08-03-17.01.log.html .

Rejected as an Alpha blocker on the grounds that it does not violate the Alpha criteria. It potentially violates the Beta criteria (NFS is Beta, not Alpha), so re-proposing as a Beta blocker, but please re-test with an image that uses the new anaconda UI once one is available for testing (Fedora 18 Alpha TC1 should use the new UI), as the new UI changes may have changed this bug.

Comment 41 Adam Williamson 2012-09-26 16:33:15 UTC

Discussed at 2012-09-26 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-09-26/f18-beta-blocker-review-1.2012-09-26-16.03.log.txt . Accepted as a blocker per criterion "The installer must be able to use the HTTP, FTP and NFS remote package source options".

Comment 42 Adam Williamson 2012-10-04 18:39:10 UTC

It doesn't seem like anyone has actually tested this with newUI yet, let alone Beta. Could someone please test this with Beta TC1 or TC2, http://dl.fedoraproject.org/pub/alt/stage/ ? Thanks. https://bugzilla.redhat.com/show_bug.cgi?id=862996 is a failed test with TC1, but it involves PXE in addition to nfsiso, so the problem there may be PXE-related.

Comment 43 Kamil Páral 2012-10-08 13:35:03 UTC

According to bug 853508 comment 20 nfsiso support is still broken.

Comment 44 Adam Williamson 2012-10-11 18:19:23 UTC

Discussed at 2012-10-11 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-10-11/f18beta-blocker-review-3.1.2012-10-11-16.04.log.txt . We're rather sceptical on blocker status for this one for various reasons - we expect it's fixed, and it's probably not bad enough to block Beta even if it isn't - but Kamil will re-test and check with anaconda team whether the fix is in the F18 code or not.

Comment 45 Radek Vykydal 2012-10-12 10:21:44 UTC

The fix is in F18.

(In reply to comment #43)

> According to bug 853508 comment 20 nfsiso support is still broken.
That issue should be tracked either in bug 853508 or in a new BZ.

Comment 46 Kamil Páral 2012-10-12 12:30:36 UTC

OK, let's close this one. The nfsiso issue is tracked in bug 853508. I reported a new bug, 865776, because unfortunately PXE+nfs is broken again in F18.
