Bug 876218

Summary: pxeboot/netinst + nfsiso repo = hang on reboot
Product: [Fedora] Fedora Reporter: Kamil Páral <kparal>
Component: NetworkManagerAssignee: Dan Williams <dcbw>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 18CC: awilliam, bcl, danw, dcbw, g.kaviyarasu, harald, johannbg, jonathan, jpopelka, jreznik, lnykryn, metherid, msekleta, notting, pf.rhlists, plautrba, psimerda, robatino, sbueno, schorschi, steved, systemd-maint, vanmeeuwen+fedora, vpavlin
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AcceptedBlocker
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-01-05 07:16:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 752661    
Attachments:
Description Flags
anaconda.log
none
anaconda.packaging.log
none
anaconda.program.log
none
anaconda.storage.log
none
syslog
none
virsh dump / crash output from frozen nfsiso install
none
output from mount during inst.repo=nfs install
none
systemd debug messages with sucessful reboot
none
screenshot of strace showing that dhclient is killed by systemd using SIGTERM none

Description Kamil Páral 2012-11-13 15:21:46 UTC
Description of problem:
I followed this test case:
https://fedoraproject.org/wiki/QA:Testcase_install_repository_NFSISO_variation
using direct kernel boot (equivalent to pxeboot), instructions here:
https://fedoraproject.org/wiki/QA:Testcase_Boot_Methods_Pxeboot

After the installation is complete, the machine hangs and doesn't reboot. Serial console output says:

[terminated]
[  OK  ] Started Show Plymouth Reboot Screen.
[  OK  ] Stopped Anaconda.
[  OK  ] Stopped target Anaconda System Services.
         Stopping System Logging Service...
         Stopping Network Manager...
         Stopping Shell on tty2...
[  OK  ] Stopped Shell on tty2.
[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped Network Manager.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed Syslog Socket.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
         Stopping Load Kernel Modules...
[  OK  ] Stopped Load Kernel Modules.
[  OK  ] Stopped target Encrypted Volumes.
         Stopping Apply Kernel Variables...
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped target Swap.
[  OK  ] Stopped target Local File Systems.
         Stopping Configure read-only root support...
[  OK  ] Stopped Configure read-only root support.
         Unmounting Temporary Directory...
         Stopping Setup Virtual Console...
[  OK  ] Stopped Setup Virtual Console.
[  OK  ] Unmounted Temporary Directory.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped target Local File Systems (Pre).
         Stopping Remount Root and Kernel File Systems...
[  OK  ] Stopped Remount Root and Kernel File Systems.
         Starting Save Random Seed...
[  OK  ] Reached target Shutdown.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /sys/fs/fuse/connections.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage/boot.
Unmounted /mnt/sysimage.
Unmounted /sys/kernel/config.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/debug.
Unmounted /dev/mqueue.
Unmounted /var/lib/nfs/rpc_pipefs.
Could not unmount /run/install/repo: Device or resource busy
Could not unmount /run/install/isodir: Device or resource busy


Version-Release number of selected component (if applicable):
F18 Beta TC8
anaconda 18.28

How reproducible:
always

Steps to Reproduce:
1. use the linked test cases to pxeboot or directly boot kernel+initrd with inst.repo=nfs:server:/path/dvd.iso
2. install
3. reboot

Comment 1 Kamil Páral 2012-11-13 15:23:11 UTC
Created attachment 644170 [details]
anaconda.log

Comment 2 Kamil Páral 2012-11-13 15:23:15 UTC
Created attachment 644171 [details]
anaconda.packaging.log

Comment 3 Kamil Páral 2012-11-13 15:23:18 UTC
Created attachment 644172 [details]
anaconda.program.log

Comment 4 Kamil Páral 2012-11-13 15:23:21 UTC
Created attachment 644173 [details]
anaconda.storage.log

Comment 5 Kamil Páral 2012-11-13 15:23:25 UTC
Created attachment 644174 [details]
syslog

Comment 6 Kamil Páral 2012-11-13 15:25:32 UTC
Proposing as F18 Final blocker, because it violates:
" The installer must be able to use all supported local and remote package source options "
https://fedoraproject.org/wiki/Fedora_18_Final_Release_Criteria

Comment 7 Paul Franklin (RHlists) 2012-11-14 00:33:31 UTC
+1 Final blocker.

I have seen this every time;
I always have to power-cycle my machine.

Comment 8 Jesse Keating 2012-11-20 19:50:31 UTC
I'm not seeing this in smoke22.  I think it briefly is unable to unmount something but that doesn't seem to stop the reboot process.

Please re-test with smoke22 or later, or the beta RCs when we make them available and re-open if it still happens.

I just pushed a number of nfsiso and hdiso fixes that probably help here.

Comment 9 Kamil Páral 2012-11-21 19:57:44 UTC
If you don't mind, I'll set it to ON_QA and needinfo myself, so that we don't lose track of it.

Comment 10 Kamil Páral 2012-11-22 17:57:08 UTC
I tested with F18 Beta RC1 (anaconda 18.29.2). I can reproduce the problem 100%. Either with NFS (exploded ISO) or NFSISO, both hang on reboot.

I haven't checked whether the logs are the same, but the problem persists.

Comment 11 Jóhann B. Guðmundsson 2012-11-22 18:21:40 UTC
Proposing as Beta blocker

 The installer must be able to use the HTTP, FTP and either NFS or NFSISO remote package source options
It must be possible to install by booting the installation kernel directly, including via PXE, and correctly specifying a remote source for the installer itself, using whichever protocols are required to work for package retrieval at the current phase (Alpha, Beta, Final). This must work if the remote source is not a complete repository but contains only the files necessary for the installer itself to run.

Comment 12 Jóhann B. Guðmundsson 2012-11-22 18:26:12 UTC
The installer must be able to use the HTTP, FTP and either NFS or NFSISO remote package source options

Comment 13 Adam Williamson 2012-11-22 18:39:15 UTC
Isn't this just https://bugzilla.redhat.com/show_bug.cgi?id=824191 ? We closed that on the basis of https://bugzilla.redhat.com/show_bug.cgi?id=824191#c45 , but looks like it wasn't really fixed.

This was rejected as a blocker for F17 Final, so it seems inconsistent to take it as a blocker for F18 Beta.

See http://meetbot.fedoraproject.org/fedora-meeting-1/2012-05-24/fedora_17_final_go_nogo_meeting_round_2.2012-05-24-17.01.log.html , starting at 17:09:32 .

Comment 14 Adam Williamson 2012-11-22 20:50:59 UTC
Discussed at 2012-11-22 go/no-go meeting (http://meetbot.fedoraproject.org/fedora-meeting-1/2012-11-22/f18_beta_gono-go_meeting.2012-11-22-20.01.log.txt ) acting as a blocker review meeting. This is rejected as a blocker, following the F17 precedent, and the fact it doesn't really preclude install, just makes it inconvenient.

Comment 15 Kamil Páral 2012-11-22 21:03:03 UTC
Moving to final blocker instead.

Comment 16 Kamil Páral 2012-11-28 17:10:47 UTC
+1 Final blocker from me. The ability to restart the machine should be considered part of the installation. This breaks every automated installation (using nfsiso).

Comment 17 Adam Williamson 2012-11-29 18:23:32 UTC
Discussed at 2012-11-29 blocker review meeting: http://meetbot.fedoraproject.org/fedora-qa/2012-11-29/f18final-blocker-review-1.1.2012-11-29-17.01.log.txt . Accepted as a blocker per Beta criterion "Any installation method or process designed to run unattended must do so (there should be no prompts requiring user intervention)" in the case of a kickstarted NFSISO install (it fails the test as it will stick at the reboot and require manual intervention).

We are aware we rejected a very similar bug as blocker for F17 final. Our excuse is that we were fudging it and we shouldn't have done, and this is the correct decision.

Comment 18 Brian Lane 2012-12-06 23:10:07 UTC
With smoke4 I am not seeing exactly the same output. It stops with:

[  OK  ] Reached target Shutdown.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounted /mnt/sysimage/sys/fs/selinux.
Unmounted /mnt/sysimage/sys.
Unmounted /mnt/sysimage/proc.
Unmounted /mnt/sysimage/dev/shm.
Unmounted /mnt/sysimage/dev/pts.
Unmounted /mnt/sysimage/dev.
Unmounted /mnt/sysimage/boot.
Unmounted /mnt/sysimage.
Unmounted /sys/kernel/config.
Unmounted /dev/mqueue.
Unmounted /dev/hugepages.
Unmounted /sys/kernel/debug.

I did a virsh dump on it while frozen and will attach the dmesg output. I suspect that maybe something is taking down the network before unmounting the nfs mount, but it is hard to tell. rd.debug doesn't seem to increase the logging level on shutdown.

There are only 2 processes running, plymouth and systemd-shutdown so I'm passing this over to systemd. I have the dump file if you want me to grab any additional info from it.

Comment 19 Brian Lane 2012-12-06 23:10:58 UTC
Created attachment 659087 [details]
virsh dump / crash output from frozen nfsiso install

Comment 20 Jóhann B. Guðmundsson 2012-12-17 12:20:03 UTC
Follow 

http://www.freedesktop.org/wiki/Software/systemd/Debugging#Diagnosing_Shutdown_Problems

And allow the shutdown to complete ( timeout is insanely high like 3 or 5 minutes )

Comment 21 Steve Dickson 2012-12-19 15:47:38 UTC
(In reply to comment #19)
> Created attachment 659087 [details]
> virsh dump / crash output from frozen nfsiso install

Quick question... These two logs make me wonder if the network is
not be torn down  before things are allow to umount. 

[  436.704249] nfs: server 192.168.101.5 not responding, timed out
[  442.352170] nfs: server 192.168.101.5 not responding, timed out

Is there a way to verify the network is actually up at this point?

Comment 22 Steve Dickson 2012-12-19 18:30:34 UTC
Unfortunately its not clear to me as to how test set up a install used a nfs:<server>:<path>/dvd.iso using virt-manager. 

So where and when do I tell virt-manager to used the nfs:<server>:<path>/dvd.iso ISO?

Comment 23 Brian Lane 2012-12-19 18:56:07 UTC
Just pass inst.repo=... on the kernel cmdline, I do my tests with an ISO so you use the syslinux menu to append it. Or you can pass --extra-args="inst.repo=..." to virt-install if you are using the direct kernel+initrd boot method.

Comment 24 Steve Dickson 2012-12-19 20:13:31 UTC
Could please show me a valid virt-install command line using --extra-args="inst.repo=..." ?

Comment 25 Lennart Poettering 2012-12-19 20:40:44 UTC
How is the network configured in anaconda? If that service is orderered before remote-fs-pre.target, then the NFS mounts are not attempted before the service is up, and will be removed before the service goes done.

Comment 26 Lennart Poettering 2012-12-19 20:42:22 UTC
And could somebody please include the output of "mount" when the NFS stuff is mounted, during runtime?

Comment 27 Brian Lane 2012-12-19 21:44:38 UTC
Created attachment 666393 [details]
output from mount during inst.repo=nfs install

anaconda-dracut mounts it on /run/install/isodir, and then mounts the iso on /run/install/repo. We then use the live image from /run/install/repo/LiveOS/squashfs.img as the new /.

Comment 28 Kamil Páral 2012-12-20 10:17:17 UTC
(In reply to comment #24)
> Could please show me a valid virt-install command line using
> --extra-args="inst.repo=..." ?

Steve, please see
https://fedoraproject.org/wiki/QA:Testcase_Boot_Methods_Pxeboot
-> Setup -> Option 2 to see how to make virt-manager boot using kernel+initrd.

Of course you need to use a different inst.repo than in the picture, your inst.repo is from
https://fedoraproject.org/wiki/QA:Testcase_install_repository_NFSISO_variation

Comment 29 Steve Dickson 2012-12-20 16:39:09 UTC
Ok... I finally figured out all the assumptions those document make about  my understanding of Anaconda and virt-manager... Point being they are a bit hard to comprehend, IMHO... anyway... 


So I booted a VM, from virt-manager, with the following values
kernel path: /mnt/iso/isolinux/vmlinuz
initrd path: /mnt/iso/isolinux/initrd.img
Kernel arguments:  inst.repo=nfs:filerb:/vol/vol0/home/steved/iso/Fedora-18-Beta-x86_64-DVD.iso

The installed went faultlessly... What needs to happen for me to reproduce this hang?

Comment 30 Brian Lane 2012-12-20 22:00:10 UTC
Lennart, I guess I didn't answer the networking question. Anaconda uses some dracut scripts to set things up before the switch root, so networking is actually being brought up by dracut. We write to /tmp/net.ifaces when we need the network, and then we hook into initqueue/online to handle the actual nfs mounting when the network appears.

I'm not sure exactly how dracut is using systemd now, so I don't know what the exact ordering is.

Comment 31 Kamil Páral 2012-12-21 13:08:05 UTC
I tested with smoke10:
http://dl.fedoraproject.org/pub/alt/qa/

I used "inst.repo=nfs:... console=ttyS0 graphical" arguments. The serial output is this:

> [terminated]
> [  OK  ] Started Show Plymouth Reboot Screen.
> [  OK  ] Stopped Anaconda.
> [  OK  ] Stopped target Anaconda System Services.
>          Stopping Network Manager...
>          Stopping Shell on tty2...
>          Stopping System Logging Service...
> [  OK  ] Stopped Shell on tty2.
> [  OK  ] Stopped System Logging Service.
> [  OK  ] Stopped Network Manager.
> [  OK  ] Stopped target Basic System.
> [  OK  ] Stopped target Sockets.
> [  OK  ] Closed Syslog Socket.
> [  OK  ] Closed D-Bus System Message Bus Socket.
> [  OK  ] Stopped target System Initialization.
>          Stopping Setup Virtual Console...
> [  OK  ] Stopped Setup Virtual Console.
>          Stopping Load Kernel Modules...
> [  OK  ] Stopped Load Kernel Modules.
> [  OK  ] Stopped target Encrypted Volumes.
>          Stopping Apply Kernel Variables...
> [  OK  ] Stopped Apply Kernel Variables.
> [  OK  ] Stopped target Swap.
> [  OK  ] Stopped target Local File Systems.
>          Unmounting Temporary Directory...
>          Stopping Configure read-only root support...
> [  OK  ] Stopped Configure read-only root support.     <-- hangs a minute in here
> [  OK  ] Unmounted Temporary Directory.
> [  OK  ] Reached target Unmount All Filesystems.
> [  OK  ] Stopped target Local File Systems (Pre).
>          Stopping Remount Root and Kernel File Systems...
> [  OK  ] Stopped Remount Root and Kernel File Systems.
>          Starting Update UTMP about System Shutdown...
> [  OK  ] Reached target Shutdown.                      <-- hangs forever here (> 10 minutes)

Quite interestingly the VM screen has redrawn into a new boot screen ("Press F12 to boot menu" and similar messages printed out), but it is stuck and doesn't react to anything. I assume it's just a VM glitch. The reboot process is hanging.

Sending Ctrl+Alt+Del does not help.

If I issue "reboot -f" instead of using Anaconda's Reboot button, the system reboots OK. So it shouldn't be a kernel bug.

Then I tried to enable system debug messages by adding these boot arguments: "systemd.log_level=debug systemd.log_target=console" and - HOLY SPHERICAL COW - IT REBOOTS!

My full kernel boot line is this:
inst.repo=nfs:192.168.1.1:/mnt/data/iso/TEST/Fedora-18-smoke10-x86_64-DVD.iso console=ttyS0 graphical systemd.log_level=debug systemd.log_target=console
kernel+initrd used to boot were of course extracted from the same DVD.iso

I don't know why, but without systemd debug messages it hangs on reboot in 100% cases, but if I add them, it reboots just fine every time (4 out of 4 attempts). Can it be a race condition? With debug messages added the systemd can be slower or something.

I can't say this approach is a good workaround for the problem, but it seems to work on my computer.

Comment 32 Kamil Páral 2012-12-21 13:09:42 UTC
Created attachment 667226 [details]
systemd debug messages with sucessful reboot

These are the systemd debug messages that appears on the serial console between the time I hit the Reboot button and the system really reboots. No hang in this case (debug messages enabled).

Comment 33 Kamil Páral 2012-12-21 13:22:29 UTC
(In reply to comment #29)
> The installed went faultlessly... What needs to happen for me to reproduce
> this hang?

Steve, maybe you were confused by the fact that the VM sometimes looks like it rebooted, but in fact it has not (the boot process doesn't continue)?

Comment 34 Steve Dickson 2012-12-21 14:07:25 UTC
(In reply to comment #33)
> (In reply to comment #29)
> > The installed went faultlessly... What needs to happen for me to reproduce
> > this hang?
> 
> Steve, maybe you were confused by the fact that the VM sometimes looks like
> it rebooted, but in fact it has not (the boot process doesn't continue)?
Well I was watching the traffic go over the wire... the traffic stop and then started back up again... so I'm pretty sure the vm rebooted... I actually logged in afterwards...

Comment 35 Steve Dickson 2012-12-21 14:32:09 UTC
(In reply to comment #3)
> Created attachment 644172 [details]
> anaconda.program.log

This is a bit odd... 

15:06:30,952 INFO program: Running... /bin/mount -n -t nfs -o defaults,nolock 192.168.1.1:/mnt/data/iso/dvd.iso /mnt/install/source
15:06:31,044 ERR program: mount.nfs: mount point /run/install/source is not a directory

What is the expectation here?

Comment 36 Brian Lane 2012-12-21 17:58:52 UTC
Steve, that's going to be a different issue. Lets not clutter this bug up with it. But I think what's happening there is anaconda-dracut isn't mounting the dvd.iso because it doesn't remove the iso name before trying to mount the path. It will still use it as the source, but it won't mount stage2 from nfs.

Comment 37 Steve Dickson 2012-12-21 21:40:50 UTC
(In reply to comment #36)
> Steve, that's going to be a different issue. Lets not clutter this bug up
> with it. 
Fair enough... I was just looking for any type of failures that had N-F-S in them ;-)

Comment 38 Steve Dickson 2012-12-21 21:53:02 UTC
(In reply to comment #31)
> I tested with smoke10:
> http://dl.fedoraproject.org/pub/alt/qa/
> 
> I used "inst.repo=nfs:... console=ttyS0 graphical" arguments. The serial
> output is this:
> 
> > [terminated]
> > [  OK  ] Started Show Plymouth Reboot Screen.
> > [  OK  ] Stopped Anaconda.
> > [  OK  ] Stopped target Anaconda System Services.
> >          Stopping Network Manager...
> >          Stopping Shell on tty2...
> >          Stopping System Logging Service...
> > [  OK  ] Stopped Shell on tty2.
> > [  OK  ] Stopped System Logging Service.
> > [  OK  ] Stopped Network Manager.
> > [  OK  ] Stopped target Basic System.
> > [  OK  ] Stopped target Sockets.
> > [  OK  ] Closed Syslog Socket.
> > [  OK  ] Closed D-Bus System Message Bus Socket.
> > [  OK  ] Stopped target System Initialization.
> >          Stopping Setup Virtual Console...
> > [  OK  ] Stopped Setup Virtual Console.
> >          Stopping Load Kernel Modules...
> > [  OK  ] Stopped Load Kernel Modules.
> > [  OK  ] Stopped target Encrypted Volumes.
> >          Stopping Apply Kernel Variables...
> > [  OK  ] Stopped Apply Kernel Variables.
> > [  OK  ] Stopped target Swap.
> > [  OK  ] Stopped target Local File Systems.
> >          Unmounting Temporary Directory...
> >          Stopping Configure read-only root support...
> > [  OK  ] Stopped Configure read-only root support.     <-- hangs a minute in here
> > [  OK  ] Unmounted Temporary Directory.
> > [  OK  ] Reached target Unmount All Filesystems.
> > [  OK  ] Stopped target Local File Systems (Pre).
> >          Stopping Remount Root and Kernel File Systems...
> > [  OK  ] Stopped Remount Root and Kernel File Systems.
> >          Starting Update UTMP about System Shutdown...
> > [  OK  ] Reached target Shutdown.                      <-- hangs forever here (> 10 minutes)
Would it be possible to get a wireshark network trace of *just* the above activity. I want to see which side stops talking and was last said.

How I set this up was to start the trace on the host machine and had it listen for traffic to/from  the server. Something similar to 
    tshark -i br0 -o /tmp/data.pcap host <server> 
or
   just use the wireshark GUI and listen on the same interface and for the same server. Then simply save the trace to a .pcap file.

Comment 39 Kamil Páral 2012-12-26 13:59:36 UTC
With Anaconda 18.37.6 this also affects netinst + NFS (either graphical or via inst.repo) installations. That might be much easier to reproduce for a lot of people. The console log is:

> [terminated]
> [  OK  ] Started Show Plymouth Reboot Screen.
> [FAILED] Failed to start Restore /run/initramfs.             <-- I didn't see these 2 lines on the second run, but the
> See 'systemctl status dracut-shutdown.service' for details.  <-- result was the same (hang on reboot).
> [  OK  ] Stopped Anaconda.
> [  OK  ] Stopped target Anaconda System Services.
>          Stopping Network Manager...
>          Stopping Shell on tty2...
>          Stopping System Logging Service...
> [  OK  ] Stopped Shell on tty2.
> [  OK  ] Stopped System Logging Service.
> [  OK  ] Stopped Network Manager.
> [  OK  ] Stopped target Basic System.
> [  OK  ] Stopped target Sockets.
> [  OK  ] Closed Syslog Socket.
> [  OK  ] Closed D-Bus System Message Bus Socket.
> [  OK  ] Stopped target System Initialization.
>          Stopping Setup Virtual Console...
> [  OK  ] Stopped Setup Virtual Console.
>          Stopping Load Kernel Modules...
> [  OK  ] Stopped Load Kernel Modules.
> [  OK  ] Stopped target Encrypted Volumes.
>          Stopping Apply Kernel Variables...
> [  OK  ] Stopped Apply Kernel Variables.
> [  OK  ] Stopped target Swap.
> [  OK  ] Stopped target Local File Systems.
>          Unmounting Temporary Directory...
>          Stopping Configure read-only root support...
> [  OK  ] Stopped Configure read-only root support.
> [  OK  ] Unmounted Temporary Directory.
> [  OK  ] Reached target Unmount All Filesystems.
> [  OK  ] Stopped target Local File Systems (Pre).
>          Stopping Remount Root and Kernel File Systems...
> [  OK  ] Stopped Remount Root and Kernel File Systems.
>          Starting Update UTMP about System Shutdown...
> [  OK  ] Reached target Shutdown.
> Sending SIGTERM to remaining processes...
> Sending SIGKILL to remaining processes...
> Unmounting file systems.
> Unmounted /mnt/sysimage/sys/fs/selinux.
> Unmounted /mnt/sysimage/sys.
> Unmounted /mnt/sysimage/proc.
> Unmounted /mnt/sysimage/dev/shm.
> Unmounted /mnt/sysimage/dev/pts.
> Unmounted /mnt/sysimage/dev.
> Unmounted /mnt/sysimage/boot.
> Unmounted /mnt/sysimage.
> Unmounted /run/install/source.                              <-- Hangs here.

I'll try to get the wireshark logs on Jan 2nd, if nobody beats me to it.

Comment 40 Kamil Páral 2013-01-02 12:27:02 UTC
(In reply to comment #38)
> How I set this up was to start the trace on the host machine and had it
> listen for traffic to/from  the server. Something similar to 
>     tshark -i br0 -o /tmp/data.pcap host <server> 

The pcap file is here:
http://kparal.fedorapeople.org/rhbz/876218/data.pcap

This is the case of direct kernel boot + nfs. I logged just the packets between Reboot button and the hang, but still it is quite large.

I also received "112 packets dropped" message, I hope nothing important is missing.

Comment 41 Jaroslav Reznik 2013-01-02 12:40:08 UTC
(In reply to comment #40)
> (In reply to comment #38)
> > How I set this up was to start the trace on the host machine and had it
> > listen for traffic to/from  the server. Something similar to 
> >     tshark -i br0 -o /tmp/data.pcap host <server> 
> 
> The pcap file is here:
> http://kparal.fedorapeople.org/rhbz/876218/data.pcap
> 
> This is the case of direct kernel boot + nfs. I logged just the packets
> between Reboot button and the hang, but still it is quite large.
> 
> I also received "112 packets dropped" message, I hope nothing important is
> missing.

Thanks Kamil,
but...

Forbidden

You don't have permission to access /rhbz/876218/data.pcap on this server.

Comment 42 Kamil Páral 2013-01-02 12:45:41 UTC
(In reply to comment #41)
> You don't have permission to access /rhbz/876218/data.pcap on this server.

Fixed.

Comment 43 Lennart Poettering 2013-01-03 18:58:59 UTC
There are two options here:

a) tell systemd about the ordering between the network setup and the nfs mounts. This would require dracut/anaconda to tell us what is networking and what is an nfs mounting. 

b) anaconda just unmounts after it is done what it mounted in the first place.

For simplicity reasons I'd vote for b).

Comment 44 Brian Lane 2013-01-03 19:02:38 UTC
We can't umount, we are running from the mount that was setup by dracut, so it has to be torn down after anaconda is done.

Comment 45 Lennart Poettering 2013-01-03 19:06:31 UTC
Ah, oh. This is NFS root? Didn't know that.

Hmm, so dracut sets up network and nfsiso? Then it shut down both too when we go down again (we nowadays jump back into the initrd on shutdown for non-trivial storage setups).

Assuming that dracut already handles that bit correctly there seems to be somebody who fiddles with the network interfaces too early, instead of leaving that to dracut?

Assuming NM is the culprit here, can we somehow teach NM to stay away from the network interfaces?

Comment 46 Jaroslav Reznik 2013-01-03 21:16:25 UTC
Ok -> NetworkManager

Comment 47 Pavel Šimerda (pavlix) 2013-01-03 21:44:09 UTC
(In reply to comment #45)
> Assuming NM is the culprit here, can we somehow teach NM to stay away from
> the network interfaces?

Ok, so it's now assinged to NetworkManager but with virtually *no information* other than that that. So the only things I can do now is to read the logs, which I will do.

But if you have enough information to think that the problem is in networkmanager, I would be grateful if you could provide them.

Comment 48 Pavel Šimerda (pavlix) 2013-01-03 21:53:29 UTC
From the syslog:

15:06:23,219 INFO NetworkManager: <info> (eth0): DHCPv4 state changed preinit -> reboot
15:06:23,219 INFO NetworkManager: <info>   address 192.168.1.11
15:06:23,219 INFO NetworkManager: <info>   prefix 24 (255.255.255.0)
15:06:23,220 INFO NetworkManager: <info>   gateway 192.168.1.1
15:06:23,221 INFO NetworkManager: <info>   nameserver '192.168.1.1'

That means we have a DHCP-assigned IPv4 address on an ethernet interface. This is a usual situation, under which NetworkManager should never under any circumstances tear down the interface or change the default route unless there's another connection.

Comment 49 Brian Lane 2013-01-03 22:33:49 UTC
The problem appears to be that NM is bringing the network down during shutdown.

I can reproduce this (kinda-sorta) by running 'systemctl stop NetworkManager.service' from a shell during the install. The connection is taken down and NFS stops working. 

A bit of an overview of what's going on:

1. PXE boot of kernel and initrd
2. An iso is mounted via NFS
3. Anaconda stage2 is mounted from the iso's LiveOS/squashfs.img
4. dracut (via dmsquash-live) uses that as the new /

So on shutdown the network needs to stay up until everything has been unmounted, otherwise you're pulling the rug out from under the running system.

Comment 50 Dan Winship 2013-01-03 22:39:44 UTC
The syslog output doesn't show what NM is doing at shutdown. In fact, it doesn't show what anything is doing at shutdown, it just ends at 15:10:30. In anaconda.log, the timestamps jump from 15:06:36 to 10:06:48, presumably at the point when anaconda adjusts the timezone... Is there some systemd/anaconda-log-gathering-script bug that's causing it to not get the stuff that was syslogged after the time change?

Comment 51 Pavel Šimerda (pavlix) 2013-01-03 22:41:52 UTC
(In reply to comment #49)
> The problem appears to be that NM is bringing the network down during
> shutdown.

Networkmanager is bringing some network connections down on shutdown and it is currently not possible to avoid. But I understood that in the above cases, you have only one connection which is ethernet with DHCP.

And in that case, networkmanager should never bring it down without reason. Exitting networkmanager is not considered a reason, in this case.

Comment 52 Dan Winship 2013-01-03 23:11:22 UTC
(In reply to comment #50)
> Is there some
> systemd/anaconda-log-gathering-script bug that's causing it to not get the
> stuff that was syslogged after the time change?

ah, no, it's simpler than that:

         Stopping System Logging Service...
         Stopping Network Manager...

it's stopping syslogd before almost everything else, so their shutdown doesn't get logged...

Comment 53 Pavel Šimerda (pavlix) 2013-01-03 23:39:50 UTC
Just recording current information:

1) It is possible to reproduce the behavior with 'systemctl stop NetworkManager.service' in the installation environment.

2) The problem is specific to the installation environment, so there must be something different there.

The configuration looks the same. It doesn't seem to be a problem of using a different version. looks like something in the installation environment.

That means this is most probably not a bug in networkmanager, nor in systemd, but let's gather more information.

Comment 54 Pavel Šimerda (pavlix) 2013-01-03 23:54:53 UTC
When debugging this, I found out that a bug that was fixed in F17 is back in F18:

https://bugzilla.redhat.com/show_bug.cgi?id=815243

Comment 55 Pavel Šimerda (pavlix) 2013-01-04 00:02:37 UTC
When debugging this, filed an upstream bug report, that is probably *not* related:

https://bugzilla.gnome.org/show_bug.cgi?id=691102

Comment 56 Pavel Šimerda (pavlix) 2013-01-04 09:10:36 UTC
More information:

Testing shows that NetworkManager takes the interface down because dhclient exits with an error. Whether it should set the interface down (in the kernel sense), is for discussion in the NetworkManager project, but in such cases, NetworkManager should definitely remove the IP address and routes and the network communication will *not* continue.

This is the correct behavior of NetworkManager when dhclient exits with error.

So, the question is, why does dhclient exit. So far the tests suggest that dhclient is running before the user types systemctl stop NetworkManager.service and that it exits immediately after.

Comment 57 Pavel Šimerda (pavlix) 2013-01-04 10:08:21 UTC
The respective lines in syslog on 'systemctl stop NetworkManager' are:

09:45:06,836 INFO NetworkManager: <info> (eth0): DHCPv4 client pid 742 exited with status -1
09:45:06,836 WARNING NetworkManager: <warn> DHCP client died abnormally

The source code says:

»·······if (!WIFEXITED (status)) {
»·······»·······new_state = DHC_ABEND;
»·······»·······nm_log_warn (LOGD_DHCP, "DHCP client died abnormally");
»·······}

strace says that it recieved SIGTERM from pid 1, which is systemd. It is killed by systemd immediately after requesting the stopping of the service.

I don't know why systemd does that, and I don't know why this doesn't happen in a running f18. But until there is a documented reason to assing this bug report to NetworkManager, I'm moving it back to systemd.

And this is how I think reassignment should happen. With information.

Comment 58 Pavel Šimerda (pavlix) 2013-01-04 10:12:28 UTC
Created attachment 672351 [details]
screenshot of strace showing that dhclient is killed by systemd using SIGTERM

Comment 59 Pavel Šimerda (pavlix) 2013-01-04 10:20:00 UTC
I would like to give credit to Brian C. Lane and Kamil Páral who did most of the testing for me.

Comment 60 Lennart Poettering 2013-01-04 13:21:28 UTC
Well, we will shut down both NM and dhclient when going down, but neither should deconfigure the network interface NFSISO is mounted from while doing that.

I mean, we have a bit of a chicken and egg problem here: the stuff that manages the network connection runs from storage that is on the network. Of course this will go boom, on shutdown if the network managing code removes the network interface when going down.

We had a similar problem with fancier storage systems where the management tool for the storage was run from the storage itself. And we decreed there that that it is no longer allowed and hence storage tools for the root fs need to run from the initrd and can never be run from the root fs, ever.

This is similar for networking. But we have the big benefit that interfaces can survive easily without their managing processes around (NM or DHCP). So, let's solve it that way: allow NM/DHCP from NFSISO zu manage the network interface as much as it wants, however when NM/DHCP is shutdown it has to leave the interface up and configured.

hence, reassigning to NM.

(BTW, this in no way specific to anaconda, this is a general issue of booting from the network.)

Comment 61 Martin 2013-01-04 13:49:50 UTC
Lennart, why systemd doesn't stop NM and dhclient _after_ switching back to initrd and unmounting all network resources? I don't see a reason why NM shoudn't cleanup interfaces configured by itself on exit. Same idea is in systemd and it's behaviour - switch to initrd and unmount all mounts (including rootfs) before halt.

Comment 62 Kay Sievers 2013-01-04 14:31:30 UTC
NM does not run from the initrd but from the rootfs. We must kill
everything that runs from the real root to be able to go back to
the initrd. All tools running pin the filesystem, and need to be
killed or re-executed before anything is like that is possible.

This is not a problem we could solve at the systemd level, NM and
its tool should leave the interfaces alone, or NM needs to be run
from the initrd, so systemd can leave it running. At that point I see no
other sensible options to solve that problem.

Comment 63 Martin 2013-01-04 15:16:44 UTC
Ok, I forgot that NM inside initrd is planned feature for F19. Anyway, NM should keep configured what was configured outside of NM, especially before startup. But killing dhclient from systemd is kind of tricky, NM can handle that on it's own.

Comment 64 Dan Winship 2013-01-04 15:18:56 UTC
Yes, I don't see why systemd is killing dhclient, since it doesn't start/manage it, and it hasn't gotten to the "sending SIGTERM to everyone who's left" part of shutdown yet.

Comment 65 Jaroslav Reznik 2013-01-04 15:21:49 UTC
Well, for F18, we need ANY suitable workaround (on systemd/nm side - doesn't matter for now ;-). And of course, long term F19 solution is that chicken/egg one -> NM in initrd.

Comment 66 Kay Sievers 2013-01-04 15:25:26 UTC
(In reply to comment #63)
> Ok, I forgot that NM inside initrd is planned feature for F19.

It might be a feature for F19, but I would be surprised, if the NM from
the initrd is kept running through the entire lifetime of the system. That
would be the only case where the do-not-kill logic would apply.

Things running from the initrd, like rootfs raid storage daemons, are considered "kernel threads" to the system, which is unlikely a useful setup for NM.

> Anyway, NM
> should keep configured what was configured outside of NM, especially before
> startup. But killing dhclient from systemd is kind of tricky, NM can handle
> that on it's own.

We have to kill everything, as it might pin deleted files like libraries,
which prevents the system from even mounting the rootfs read/only.

NM should definitely leave things alone, which it hasn't brought up at boot time.

It could be, if needed, that also dhclient needs changes to leave things as they
are if it is killed. I don't know enough about it.

Comment 67 Martin 2013-01-04 15:45:06 UTC
Let's stop NM and let NM stop dhclient gracefully. And if dhclient keeps running _after_ NM exits, then kill it.

Comment 68 Harald Hoyer 2013-01-04 16:14:22 UTC
$ man systemd.kill
OPTIONS
       KillMode=process

BUT, NM should never ever _consider_ to shut down the interface.

Comment 69 Dan Winship 2013-01-04 16:17:26 UTC
(In reply to comment #66)
> NM should definitely leave things alone, which it hasn't brought up at boot
> time.

If you want NM to ignore the interface entirely, set NM_CONTROLLED=false in its ifcfg file. But in this case, the ifcfg file says this interface is supposed to be up, so NM wants it to be up. When dhclient exits with an error, NM assumes that that means that the interface is no longer usable, so it brings it down to get things back into a clean state before bringing it back up again. Except that since it's been asked to exit, it doesn't get a chance to bring it back up.

Comment 70 Kay Sievers 2013-01-04 16:31:40 UTC
(In reply to comment #69)
> If you want NM to ignore the interface entirely, set NM_CONTROLLED=false in
> its ifcfg file. But in this case, the ifcfg file says this interface is
> supposed to be up, so NM wants it to be up.

I don't know for how many years we need to discuss this same stuff
over and over. Stuff that runs before NM is active needs to be left
alone, what's so hard here.

What NM is doing here is like udisks would power down all disks at exit.
What a great idea. I give up now.

Comment 71 Bill Nottingham 2013-01-04 16:43:04 UTC
I think most everything is behaving correctly here.

systemd kills dhclient & NM on shutdown, as it needs to.

NM handles dhclient spontaneously exiting with an error, as it should.

Wouldn't the simplest bit be to fix the ordering somehow so that NM is shut down *before* dhclient is? This could be done either with systemd ordering, or with an explicit kill in anaconda or anaconda's dracut module.

Comment 72 Brian Lane 2013-01-04 16:45:20 UTC
I booted my PXE test with rd.break and edited NetworkManager.service to add KillMode=process

the system now shuts down without hanging. So, I think the simple solution is best, unless there is some dramatic side-effect to using KillMode=process in other situations.

Comment 73 Bill Nottingham 2013-01-04 16:48:20 UTC
(In reply to comment #72)
> I booted my PXE test with rd.break and edited NetworkManager.service to add
> KillMode=process
> 
> the system now shuts down without hanging. So, I think the simple solution
> is best, unless there is some dramatic side-effect to using KillMode=process
> in other situations.

If you just want this behavior when run in anaconda, the anaconda dracut module could set up an override in /run like:

/run/systemd/system/NetworkManager.service:
.include /lib/systemd/system/NetworkManager.service
[Service]
KillMode=process

Comment 74 Martin 2013-01-04 16:50:56 UTC
(In reply to comment #70)
> I don't know for how many years we need to discuss this same stuff
> over and over. Stuff that runs before NM is active needs to be left
> alone, what's so hard here.
This leads to a question: Is dhclient in this particular case started by systemd/network-scripts/whatever or by NM?

Still, "EasyFix" proposed in comment #67 should kill this F18Blocker. And if NM in initrd will be accepted as F19 feature and so it replace network scripts then this chicken-egg problem will be gone.

Comment 75 Dan Winship 2013-01-04 16:51:54 UTC
(In reply to comment #73)
> > the system now shuts down without hanging. So, I think the simple solution
> > is best, unless there is some dramatic side-effect to using KillMode=process
> > in other situations.
> 
> If you just want this behavior when run in anaconda

No, this seems to be generically correct; when killed cleanly, NM will take care of shutting down any daemons it started.

(dhclient is started by NM, not by systemd)

Comment 76 Martin 2013-01-04 16:53:19 UTC
(In reply to comment #72)
> I booted my PXE test with rd.break and edited NetworkManager.service to add
> KillMode=process
> 
> the system now shuts down without hanging. So, I think the simple solution
> is best, unless there is some dramatic side-effect to using KillMode=process
> in other situations.

Great.

Comment 77 Brian Lane 2013-01-04 16:55:42 UTC
(In reply to comment #73)

> If you just want this behavior when run in anaconda, the anaconda dracut
> module could set up an override in /run like:

True, but Anaconda is trying (yes, really!) to keep things as close to a normal system as possible. I'd rather see NM update their .service file.

Comment 78 Kay Sievers 2013-01-04 16:59:50 UTC
(In reply to comment #71)
> I think most everything is behaving correctly here.
> 
> NM handles dhclient spontaneously exiting with an error, as it should.

I can't agree here, and wouldn't call this correct.

If NM wants to be "smart" and manage everything it finds, it needs to be
smart *enough* to also find out that it kills the very storage it itself
runs from.

If it isn't that smart, it should do the only reasonable thing, and leave
the stuff alone it doesn't know about.

Comment 79 Jaroslav Reznik 2013-01-04 17:01:03 UTC
(In reply to comment #77)
> (In reply to comment #73)
> 
> > If you just want this behavior when run in anaconda, the anaconda dracut
> > module could set up an override in /run like:
> 
> True, but Anaconda is trying (yes, really!) to keep things as close to a
> normal system as possible. I'd rather see NM update their .service file.

I'd prefer the less intrusive change as we are quite late in release cycle - touching NM .service file could be quite opposite. NM guys, how do you feel about it?

Comment 80 Harald Hoyer 2013-01-04 17:01:29 UTC
(In reply to comment #71)
> I think most everything is behaving correctly here.
> 
> systemd kills dhclient & NM on shutdown, as it needs to.
> 
> NM handles dhclient spontaneously exiting with an error, as it should.
> 
> Wouldn't the simplest bit be to fix the ordering somehow so that NM is shut
> down *before* dhclient is? This could be done either with systemd ordering,
> or with an explicit kill in anaconda or anaconda's dracut module.

So, it's ok to kill the whole system (network root), if dhclient exits with !=0 ???

Comment 81 Bill Nottingham 2013-01-04 17:08:25 UTC
(In reply to comment #80)
> > Wouldn't the simplest bit be to fix the ordering somehow so that NM is shut
> > down *before* dhclient is? This could be done either with systemd ordering,
> > or with an explicit kill in anaconda or anaconda's dracut module.
> 
> So, it's ok to kill the whole system (network root), if dhclient exits with
> !=0 ???

NM should handle dhclient exiting. If dhclient assigned the IP to the interface, that likely means deconfiguring it. If dhclient merely assigned a lease to an existing IP on the interface, you can argue it doesn't mean deconfiguring the IP.

Comment 82 Dan Winship 2013-01-04 17:09:24 UTC
(In reply to comment #78)
> > NM handles dhclient spontaneously exiting with an error, as it should.
> 
> I can't agree here, and wouldn't call this correct.

If dhclient tells NM that the lease is about to expire, should NM ignore it and continue working with an IP address that might get reassigned to another host?

If NM notices that carrier was lost, and that when it comes back up, the router refuses to speak to it until it does 802.1X authentication again, should NM ignore that, and let anaconda send packets into /dev/null?

The problem here is not that NM is taking corrective action, it's that NM is exiting midway through that corrective action. If you "killall dhclient" in the middle of an install without killing NM too, NM will quickly restore things to normal and the install will continue.

(In reply to comment #79)
> I'd prefer the less intrusive change as we are quite late in release cycle -
> touching NM .service file could be quite opposite. NM guys, how do you feel
> about it?

I'm doing a build now. KillMode=process really is the right thing here. (Even if you think NM shouldn't be taking down the interface under any circumstances, it's *still* the case that NM should have been KillMode=process anyway.)

Comment 83 Fedora Update System 2013-01-04 17:34:09 UTC
NetworkManager-0.9.7.0-12.git20121004.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/NetworkManager-0.9.7.0-12.git20121004.fc18

Comment 84 Pavel Šimerda (pavlix) 2013-01-04 18:12:06 UTC
(In reply to comment #60)
> Well, we will shut down both NM and dhclient when going down,

dhclient is managed by NM and killing it is a mistake.

> but neither
> should deconfigure the network interface NFSISO is mounted from while doing
> that.

There is no reason to discuss the behavior of dhclient and NetworkManager in circumstances that are wrong from the beginning. The dhclient process is managed by NetworkManager, no other process should ever send signals to it in production environment.

> I mean, we have a bit of a chicken and egg problem here

There is no chicken and egg problem. NetworkManager starts dhclient. NetworkManager will stop it. Intervening in the process from outside is wrong and the behavior is undefined.

> however when NM/DHCP is shutdown it has to leave the interface up and
> configured.

This is the current behavior.

> hence, reassigning to NM.

As you clearly ask NM to behave in the way it already behaves, I have no choice but reassign it back to the wrongly behaving component that kills a subprocess before stopping the service.

> (BTW, this in no way specific to anaconda, this is a general issue of
> booting from the network.)

Please read my explanations of the tests made with Brian and Kamil. You should understand the problem before taking action.

Comment 85 Pavel Šimerda (pavlix) 2013-01-04 18:17:57 UTC
(In reply to comment #82)
> I'm doing a build now. KillMode=process really is the right thing here.
> (Even if you think NM shouldn't be taking down the interface under any
> circumstances, it's *still* the case that NM should have been
> KillMode=process anyway.)

If this setting is what we need for NetworkManager's unit file, I will reassign myself from systemd to NetworkManager and will do the systemd guys' work to explain correctly why I'm doing that.

But I would like to have an explanation, why it works in a running system and not in the installer. What's the difference.

Comment 86 Pavel Šimerda (pavlix) 2013-01-04 18:23:21 UTC
(In reply to comment #85)
> If this setting is what we need for NetworkManager's unit file, I will
> reassign myself from systemd to NetworkManager and will do the systemd guys'
> work to explain correctly why I'm doing that.

So, I'm fulfilling my promise.

Once again, this his how reassignment should work (with explanation):

Dan Winship provided information that systemd defaults to killing the whole cgroup at once. This is the wrong way, at least for network management daemon, as it manages its processes itself.

I appreciate his knowledge of systemd and thank him from providing a unit file fix.

> But I would like to have an explanation, why it works in a running system
> and not in the installer. What's the difference.

This still applies, though.

Comment 87 Martin 2013-01-04 19:43:00 UTC
Thanks Harald for providing "KillMode=process" solution and Brian for verifying it.

Comment 88 Pavel Šimerda (pavlix) 2013-01-04 19:55:29 UTC
(In reply to comment #68)
> $ man systemd.kill
> OPTIONS
>        KillMode=process

I missed your comment. Thank you for your explanation.

> BUT, NM should never ever _consider_ to shut down the interface.

You may or may not be true, depending on what you mean by shutting down. NetworkManager is expected to configure the interface according to administrator's instructions and dynamic configuration protocols (e.g. DHCP).

There are situations where setting the interface down (unsetting kernel IFF_UP) is desirable and there are situation when it is not. Just as well there are situations when removing dynamically assinged addresses is desirable and when this is not.

This is unrelated to this particular bug report, but I would be happy if you could send me your ideas via e-mail (or via an upstream or downstream bugreport, if you see fit).

I would like to either implement any of your ideas or document why it won't be implemented. We are working hard on NetworkManager's transition to the server and initrd world. Any suggestions are welcome.

(In reply to comment #87)
> Thanks Harald for providing "KillMode=process" solution and Brian for
> verifying it.

I second this and I'm sorry for overseeing the comment.

(In reply to comment #80)
> 
> So, it's ok to kill the whole system (network root), if dhclient exits with
> !=0 ???

AFAIK it is not ok for dhclient to exit with non-zero status while managing a successful lease. Correct me if I'm wrong and we'll fix the behavior then.

Please consider failed dhclient as an equivalent to a removed ethernet cable. The DHCP standard doesn't include any exceptions for NFS.

Comment 89 Kay Sievers 2013-01-04 20:02:39 UTC
(In reply to comment #88)
> > So, it's ok to kill the whole system (network root), if dhclient exits with
> > !=0 ???
> 
> AFAIK it is not ok for dhclient to exit with non-zero status while managing
> a successful lease. Correct me if I'm wrong and we'll fix the behavior then.

It is surely not ok to _ever_ tear down any interface that provides the
backing store for the entire system.

A failing dhclient causes no harm, could happen any time, for whatever
weird reason. A "killall dhclient" should not crash the system.

This "cleanup" by NM does kill the system and provides no benefit
at all in this case. The interface just needs to be left alone.
 
> Please consider failed dhclient as an equivalent to a removed ethernet
> cable. The DHCP standard doesn't include any exceptions for NFS.

I don't think so. The running dhclient just updates the DHCP lease on
the server once a day or so, it has absolutely nothing to do with the
link state or the actual connection. It is not needed at all for operations.

This is a very fragile NM behaviour which should still be fixed, regardless of the more reliable killing order now, which makes the problem less visible.

Comment 90 Martin 2013-01-04 20:49:35 UTC
Verified fix with NetworkManager-0.9.7.0-12.git20121004.fc18

Compare strace output:
[root@Martix-T520 martix]# strace -p $(pidof dhclient)
Process 13644 attached
select(22, [5 6], [], NULL, {35416, 525775}) = ? ERESTARTNOHAND (Interrupted by signal)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
+++ killed by SIGTERM +++

"meanwhile installed NetworkManager-0.9.7.0-12.git20121004.fc18"

[root@Martix-T520 martix]# strace -p $(pidof dhclient)
Process 13731 attached
select(22, [5 6], [], NULL, {42435, 155219}) = ? ERESTARTNOHAND (Interrupted by signal)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=13718, si_uid=0} ---
+++ killed by SIGTERM +++

Verify if dhclient is really being killed by NM, not another process:

[root@Martix-T520 martix]# ps aux|grep Network
root     13820  3.4  0.2 354864  7992 ?        Ssl  21:40   0:00 /usr/sbin/NetworkManager --no-daemon
root     13892  0.0  0.0 109192   888 pts/1    S+   21:41   0:00 grep --color=auto Network
[root@Martix-T520 martix]# strace -p $(pidof dhclient)
Process 13846 attached
select(22, [5 6], [], NULL, {37986, 688828}) = ? ERESTARTNOHAND (Interrupted by signal)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=13820, si_uid=0} ---
+++ killed by SIGTERM +++

Comment 91 Pavel Šimerda (pavlix) 2013-01-04 22:03:45 UTC
(In reply to comment #89)
> It is surely not ok to _ever_ tear down any interface that provides the
> backing store for the entire system.

This dogma has already been refuted by the possibility of a disconnected cable.

> A failing dhclient causes no harm,

It may make your connection invalid when you no longer have your IP address. You can be for example filtered out by a firewall that only allows valid IP addresses.

> could happen any time, for whatever weird reason.

But it doesn't for whatever reason but for bad integration between NM and systemd.

> A "killall dhclient" should not crash the system.

It is too easy to choose one use case. Unfortunately, we, the networking people, have to consider the big picture, not only one special use case of yours.

> This "cleanup" by NM does kill the system and provides no benefit
> at all in this case.

That's it. You can afford thinking only in specific cases. I cannot.

> This is a very fragile NM behaviour which should still be fixed, regardless
> of the more reliable killing order now, which makes the problem less visible.

Then please file a bug report against NetworkManager (better upstream) and find time to explain your case thoroughly. Include other cases. Remember that we need to be RFC compliant unless there is a *very* good reason not to. Try to work with the big picture, not just one use case. Then please Cc me and/or people from the NM team you believe more and we will see what we can do.

Comment 92 Pavel Šimerda (pavlix) 2013-01-04 22:17:35 UTC
(In reply to comment #89)
> It is surely not ok to _ever_ tear down any interface that provides the
> backing store for the entire system.

And please don't repeat the mistake of using „tear down“ when talking about kernel network configuration and/or concepts of network configuration. This term is not precisely defined.

You may be talking about:

1) Setting the interface down

In my opinion, NetworkManager shoud never ever set ethernet interface down unless explicitly asked to do so.

2) Removing an address

In my opinion, even NFS doesn't justify breaking the DHCP standard. That means, under some circumstances, connectivity can be lost.

3) Removing a route

Same as #2.

How should NetworkManager behave in various situations is up to a discussion. But we want to support internetworking properly and not do rushed decision because someone just *thinks* the problem is in NetworkManager.

Comment 93 Adam Williamson 2013-01-05 02:58:36 UTC
Tested with tc4 then smoke14 (which has the fix): reproduced the problem with tc4, smoke14 worked. Looks good.

Comment 94 Fedora Update System 2013-01-05 07:16:39 UTC
NetworkManager-0.9.7.0-12.git20121004.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 95 Kamil Páral 2013-01-07 10:21:34 UTC
netinst+nfs: PASS
netinst+nfsiso: PASS
pxeboot+nfs: PASS
pxeboot+nfsiso: PASS

Thanks everyone.

Comment 96 Schoschi Decker 2013-01-26 21:30:09 UTC
Just saw the exact same behavior, as documented in original description, Fedora 17 (x86_64) pxe boot/installed in oVirt Node 2.5.5/under control of Ovirt Engine 3.1 environment.  Not sure if this affirms the situation or confuses it?

last line reported...
"Could not unmount /run/install/repo: Device or resource busy"

Comment 97 Kamil Páral 2013-01-28 08:37:30 UTC
Schoschi, this report was related to Fedora 18. For Fedora 17 issue please see:
https://fedoraproject.org/wiki/Common_F17_bugs#nfsiso-reboot-hang

Comment 98 Adam Williamson 2013-01-28 21:25:08 UTC
Right. It's known to be fixed in F18, broken in F17.