Bug 811242

Summary: PXE and repo=nfs or repo=nfsiso freezes installer (nfs4)
Product: [Fedora] Fedora Reporter: Kamil Páral <kparal>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 23CC: amcnabb, anaconda-maint-list, awilliam, bfields, dcbw, ddumas, g.kaviyarasu, jonathan, jsmith.fedora, k.georgiou, robatino, satellitgo, steved, tflink, vanmeeuwen+fedora, wwoods
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-22 12:31:00 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Description Flags
X.log none

Description Kamil Páral 2012-04-10 10:08:14 EDT
Description of problem:
I boot from PXE. I tried any of these boot arguments:

* repo=nfsiso:nfs.englab.brq.redhat.com:/pub/fedora/fedora-alt/stage/17-Beta.RC4/Fedora/i386/iso/Fedora-17-Beta-i386-DVD.iso

* repo=nfs:nfs.englab.brq.redhat.com:/pub/fedora/linux/development/17/i386/os/

* root=live:nfs:nfs.englab.brq.redhat.com:/pub/fedora/fedora-alt/stage/17-Beta.RC4/Fedora/i386/os/LiveOS/squashfs.img

The symptoms are always the same. Somewhere during the installation setup anaconda freezes. Usually it occurs during partition setup (while creating partitions), but it also occurs after providing root password and clicking Next, or clicking Next on the very first screen. The whole screen is frozen, only cursor can be moved around (spinning circle). When I switch to TTYs, I see no traceback or error. I can't execute any command, if I try to run e.g. 'ls', the prompt just hangs and provides no output. After switching back to TTY7 the whole system freezes and cursor nor any keys no longer work.

I have this problem confirmed by 4 different people on 4 different machines (two bare-metal and two VMs). All of them booted from our company PXE.

If I use repo=http or root=live:http everything works well. If I boot from optical media instead of PXE, everything works well for all protocols. If I use direct kernel boot in VM (not using PXE), everything works well for all protocols (but I have to admit I use different NFS server in this case, because I can't reach the original one in my VMs).

The only broken use case seems to be PXE boot and one of the boot arguments listed above.

There can be some problem with company's NFS server. But it serves well in all other cases - for F16 installation, for repo=nfs when booted from optical media, etc. I have no idea what might cause this. But it seems as if the root image (squashfs.img) was disconnected during installation setup, if fetched over nfs, and then everything simply hanged.

Version-Release number of selected component (if applicable):
F17 Beta RC4

How reproducible:
100% reproducible in my case, but some of my colleagues succeeded in 1 in several installation attempts

Steps to Reproduce:
1. Boot from PXE
2. use repo=nfs or repo=nfsiso or root=live:nfs
3. finish installation
Actual results:
Most of installation attempts freezes during setup (not actual installation, just the setup, like providing password or creating partitions). Maybe related to any I/O.

Additional info:
All members of Anaconda team located in Brno should be able to reproduce it, since they have access to this very PXE server.
Comment 1 Kamil Páral 2012-04-10 10:10:20 EDT
Proposing as Beta blocker, even though we still haven't accepted the PXE criteria.
Comment 2 Kamil Páral 2012-04-10 10:24:18 EDT
I'll try now to find out whether it's a problem of the single NFS server we use in Brno (maybe overloaded or something) or whether it happens even with other NFS servers.
Comment 3 Kamil Páral 2012-04-10 10:35:28 EDT
This will get even more complicated.

I tried to boot from company PXE and use my own NFS server for repo=nfsiso option. It also freezes. So this is not a problem of a single overloaded/misconfigured NFS server. However the same (my own) NFS server works when I use direct kernel boot in VM and pass the same repo=nfsiso option.

I thought there is no difference between direct kernel boot in VM and PXE boot, once you start the Linux kernel. But it seems to be the only distinction I have found. PXE+nfs broken, direct kernel boot in VM+(the same) nfs works.
Comment 4 Kamil Páral 2012-04-10 11:37:59 EDT
I tested older composes. Beta RC1 and earlier don't work because of NetworkManager problem during boot. Beta RC2 has the same issues as Beta RC4, it freezes during installation setup.

I also found out why sometimes I was able to perform an installation and sometimes not. It depends on timing. It seems that installer will get stuck after 60 seconds. If you are quick enough to make to the package installation phase, you're good (probably). If you are too slow, it will freeze.

So if you want to reproduce it, just wait 60-90 seconds at the first screen and then hit next, it will freeze.

Martin Sivak suggested there might be some problem with network during boot, and the connection/process will wait for another one or more minutes until it gives up and quits, which makes the root image unavailable. My current experience matches his suspicion.
Comment 5 Adam Williamson 2012-04-10 13:18:12 EDT
Can you get a log out during the time when it's working, so we can maybe see what problem the network is having?

I think I'm -1 blocker on this one...it's definitely a bug, but nothing is completely broken; you can do PXE/NFS if you're fast, you can do PXE if you use something other than NFS, you can do NFS if you use something other than PXE...just feels like it's not broken 'enough' to hold beta yet again. But it's a fairly weak -1.

Fedora Bugzappers volunteer triage team
Comment 6 Tim Flink 2012-04-10 13:21:05 EDT
I'm wondering if this is too much of a corner case to be a beta blocker.
Installation from NFS works, installation from PXE works but you can't put them

Since F17 is still pre-release and this could be fixed with an update (once
it's pushed to stable), I'm -1 beta blocker, +1 beta NTH for a well tested fix.
Comment 7 Kamil Páral 2012-04-11 05:09:41 EDT
(In reply to comment #5)
> Can you get a log out during the time when it's working, so we can maybe see
> what problem the network is having?

Man, that was harder that I would thought. Maybe the timeout is much shorter. I attach the logs I gathered before system got stuck.

I also tried to run ping indefinitely on that system and the network is not disconnected. The root image somehow gets disconnected after short timeout, but the network is running fine, ping continues to ping (until I quit it and attempt to do another action, then it freezes).

Martin Sivak advised me to install statically-linked busybox into ramdisk and run it, so that I can access the logs even after the root image is disconnected. But I can't get to it today and this is all Greek to me, so I can't really promise I can provide these "after-freeze" logs.
Comment 8 Kamil Páral 2012-04-11 05:10:05 EDT
Created attachment 576714 [details]
Comment 9 Kamil Páral 2012-04-11 05:10:34 EDT
Created attachment 576716 [details]
Comment 10 Kamil Páral 2012-04-11 05:10:37 EDT
Created attachment 576717 [details]
Comment 11 Kamil Páral 2012-04-11 05:10:43 EDT
Created attachment 576718 [details]
Comment 12 Kamil Páral 2012-04-11 05:10:48 EDT
Created attachment 576719 [details]
Comment 13 Jared Smith 2012-04-11 13:35:28 EDT
I'm inclined to vote -1 to blocker on this one, as I don't see what Beta criteria this hits.
Comment 14 Jared Smith 2012-04-11 13:51:51 EDT
For the record, I'm -1 on beta blocker.
Comment 15 Robyn Bergeron 2012-04-11 14:14:11 EDT
I am -1 beta blocker from the cornercase angle; I'm leaning towards -1 beta NTH as well, mostly because I don't want to breathe on anything, though as Tim pointed out, well-tested might sway me elsewhere.
Comment 16 Dennis Gilmore 2012-04-11 15:37:03 EDT
/me is -1 blocker also
Comment 17 Adam Williamson 2012-04-11 16:15:13 EDT
Rejected as a Beta blocker.

Fedora Bugzappers volunteer triage team
Comment 18 Kamil Páral 2012-04-12 05:06:41 EDT
Proposing as Final blocker. We should have relevant criteria by then.
Comment 19 Will Woods 2012-04-19 12:46:15 EDT
When was the last time this was tested? Does it work in Beta?
Comment 20 Kamil Páral 2012-04-19 17:23:52 EDT
Beta RC4 = Beta GOLD. It is broken with Beta.
Comment 21 Kamil Páral 2012-04-19 17:37:23 EDT
We now have release criterion for PXE:

"It must be possible to install by booting the installation kernel directly, including via PXE, and correctly specifying a remote source for the installer itself, using whichever protocols are required to work for package retrieval at the current phase (Alpha, Beta, Final). This must work if the remote source is not a complete repository but contains only the files necessary for the installer itself to run."
Comment 22 Adam Williamson 2012-04-20 15:06:11 EDT
Discussed at 2012-04-20 blocker review meeting - http://meetbot.fedoraproject.org/fedora-bugzappers/2012-04-20/fedora-bugzappers.2012-04-20-17.01.log.txt . Accepted as a blocker per criterion cited in comment #21.
Comment 23 Will Woods 2012-05-01 13:59:08 EDT
Works fine here. I set up a virt guest and NFS-exported a copy of a F17 tree, then booted with:

What happens if you use the IP of the NFS server rather than the hostname?
Comment 24 Will Woods 2012-05-01 14:12:50 EDT
Ah, now I can reproduce the bug. Yay.

When this sort of lockup happens it's usually because NetworkManager (or something else) has messed with the network interface. And since our entire root filesystem (including NetworkManager) is being read from that network interface, if the interface goes away we can't bring it back up.

But I swear this was working before. Maybe some of my upstream dracut changes didn't make it into Beta as expected? I'll investigate further.
Comment 25 Will Woods 2012-05-03 16:03:52 EDT
Okay, this is a tough one.

As far as I can tell, anaconda/dracut is doing everything right to hand the device over to NetworkManager. If I run the install via kickstart, it completes successfully. 

However, if I do the install interactively and wait a while on the first anaconda screen, *usually* something seems to twiddle the interface, because the filesystem stops responding.

Maybe something is messing with NetworkManager? Maybe there's some kind of activity timeout? It's very strange that using the same repo and the same ifcfg file will work for a kickstart install but hang (eventually) otherwise.
Comment 26 Dan Williams 2012-05-03 17:02:07 EDT
We don't expect NM to bounce the network at all here, but it sounds like it might be.  We'd need to get some logs from /var for this.  I don't think NM output goes to syslog here anymore, I think there's an NM specific logfile in the install environment that we want.  If we can get that *and* the syslog logfile (since that's where stuff like DHCP info is dumped) then we can get a clue as to what's going on.

Will asked about using gdb to trap where NM might be taking down the device, so we can stop it there and grab the logfile before the whole machine freezes.  The function you want for that is nm_device_deactivate().  We don't expect that function to be called at all on the active device that NM gets handed from Dracut.  To get a lot more info out of NM, you can set the "level=debug" option in the [logging] section of /etc/NetworkManager/NetworkManager.conf before NM starts, or we could use dbus-send to bump the log level if the anaconda environment includes dbus-send.
Comment 27 Adam Williamson 2012-05-04 18:17:38 EDT
This was discussed at the blocker review meeting today: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-05-04/f17-final-blocker-review-4.2012-05-04-17.00.log.html . The meeting conclusion was "There has been some movement on this bug but no fix has been proposed yet. There is a request for logs but the request is a little vague - ask for clarification on exactly which logs are being requested". I happen to know the answer to that at least as far as comment #26 goes: NM logs to syslog, so when you set level=debug in /etc/NetworkManager/NetworkManager.conf [logging] section, you just get a lot more output in /var/log/messages , and that's what would be useful.

Fedora Bugzappers volunteer triage team
Comment 28 Will Woods 2012-05-07 12:29:32 EDT
This almost certainly isn't a NM problem.

I've been reproducing the problem using a virt guest, and here's what I've found:

1) With NM running with level=debug, there are *no* syslog messages that are anywhere *near* the eventual hang. Not NM, not kernel, nothing.

2) The interface never actually goes down. The system responds to pings at 0.2 second intervals for the entire duration of the test - before and after the root filesystem hangs - and an existing ssh connection continues to respond.

3) The install completes successfully* if automated via kickstart - or if you work quickly by hand. It seems to be some kind of timeout / autosuspend thing.
* Except that it fails to reboot at the end - see bug 818581.
Comment 29 Adam Williamson 2012-05-08 15:58:54 EDT
<steved> adamw: also it would be good to get a system trace (echo t > /proc/sysrq-trigger)
<steved> adamw: what NFS protocol versions is being used? If its v4 you might want force v3 just to see if the problem goes away...

Fedora Bugzappers volunteer triage team
Comment 30 Steve Dickson 2012-05-08 16:01:50 EDT
Also getting a system trace (echo t > /proc/sysrq-trigger) of the system when its hug would be good as well....
Comment 31 Will Woods 2012-05-08 16:53:01 EDT
We're not specifying v3, so it's defaulting to v4.

Here's a (rough guide to setting up a) reproducer, assuming you have a DVD image and libvirt:

virtip= # ip of virbr0
nfsdir=/srv/nfs/test # or wherever

# make NFS export
mkdir $nfsdir
cd $nfsdir
cp $dvd .
exportfs -o all_squash $virtip/24:$nfsdir

# grab kernel/initrd
mkdir tmp
mount -o ro,loop *.iso tmp
cp tmp/images/pxeboot/* .
umount tmp

# create/run virt guest
virt-install --name nfstest --ram 768 --disk disk.img,size=10 --network network=default,model=virtio --boot "kernel=$nfsdir/vmlinuz,initrd=$nfsdir/initrd.img,kernel_args=repo=nfs:$virtip:$nfsdir ip=dhcp"

[Swap to the shell on tty2; 'ls' should work. Wait a few minutes and try 'ps aux' and e.g. 'ls /lib' - these should hang.]

# clean up virt guest:
virsh destroy nfstest; virsh undefine nfstest
Comment 32 Kamil Páral 2012-05-09 05:24:34 EDT
If this problem ends up being too hard to resolve for F17 Final, I think there should be a simple workaround - just copy the LiveOS/squashfs.img the same way you do it with http, instead of mounting it remotely.
Comment 33 Steve Dickson 2012-05-09 06:33:19 EDT
When everything hangs is it possible to get a system trace (echo t > /proc/sysrq-trigger) of what is running and where they hung?

Also, is the NFS mount a loopback mount or is it going over a network to a remote server.... If its the latter would it be possible to get an network traces of any traffic?
Comment 34 Will Woods 2012-05-10 19:45:40 EDT
The NFS mount goes over the virtual network to the virt host, as described in comment #31. 

Once the system hangs, I stop getting syslog output over my ssh pipe (although the connection doesn't seem to break?).

Actually, once the system hangs, nothing more seems to be written to syslog - even though it's in /tmp?

I had it run with "while :; do dmesg | tail; sleep 1; done" and there are no new kernel messages in the minutes leading to the hang.

I ran with "while :; do logger "uptime: $(</proc/uptime)"; sleep 1; done"; the hang happens *reliably* about 180 seconds (+/-5) after switch_root / syslog start.

Since we can't get data out of syslog after that, I can't get the output from the system trace.

I'll try forcing nfsv3 in a bit. After that I'm basically out of ideas.
Comment 35 Will Woods 2012-05-10 21:05:54 EDT
Okay, it's not always 180 seconds - sometimes it's 120 seconds? 

On the other hand: adding "nfsvers=3" to the nfs options seems to fix the problem! i.e. boot with a repo=nfs... arg like so:


So, any ideas what nfs4-specific things might have a timeout after a couple of minutes?

We *can* use nfs3 by default, of course, but I'd rather *fix* the problem, not work around it... nevertheless, I'll work on a patch to use nfs3 by default instead of nfs4, just in case.
Comment 36 Orion Poplawski 2012-05-11 16:11:35 EDT
Just a thought...  I've seen issues where nfs mounts go awry after the clock was jumped, and I've seen the clock jump during an install at some point after boot.
Comment 37 Will Woods 2012-05-11 17:31:31 EDT
Okay, so the patch to do nfs3 by default is pretty easy. Doing the fallback to nfs2 if nfs3 fails is harder.

How much do we need to care about nfs2 in This Modern Age?
Comment 38 Steve Dickson 2012-05-12 07:28:36 EDT
(In reply to comment #35)
> Okay, it's not always 180 seconds - sometimes it's 120 seconds? 
> On the other hand: adding "nfsvers=3" to the nfs options seems to fix the
> problem! i.e. boot with a repo=nfs... arg like so:
>   repo=nfs:nfsvers=3:<SERVER>:<PATH>
> So, any ideas what nfs4-specific things might have a timeout after a couple of
> minutes?
Without seeing where the processes are hung or seeing any traffic its kinda tough to tell what's happening... Would it be possible to run wireshark trace
on the server side?
Comment 39 Will Woods 2012-05-14 17:26:17 EDT
Here's the requested wireshark trace: 

These files are unwieldy and it took me quite a while to get it uploaded anywhere, but that should be a pretty complete trace. The machine hangs right at the end.

If that doesn't help, you could probably use the script in comment #31 to reproduce the problem locally on any virt-capable machine. 

Let me know if you need any more info or help reproducing.
Comment 40 Brian Lane 2012-05-14 20:53:36 EDT
A workaround for this is in anaconda-17.27-1 (uses nfs3 instead of 4)
Comment 41 Will Woods 2012-05-15 15:35:43 EDT
Assuming the workaround holds up to testing, I'd suggest this bug be dropped from the blocker list and reassigned to an NFS component so we can continue to try to fix nfsv4.
Comment 42 Kamil Páral 2012-05-16 06:38:26 EDT
With TC6 i686 (anaconda-17.27) I have successfully completed PXE installation with repo=nfsiso:. The workaround seems to be working fine.
Comment 43 Kamil Páral 2012-05-16 07:01:23 EDT
repo=nfs: also seems to work well with TC6 i686.
Comment 44 Adam Williamson 2012-05-16 13:01:30 EDT
I re-vote -1 blocker with the workaround in TC6.

Fedora Bugzappers volunteer triage team
Comment 45 Jared Smith 2012-05-16 13:10:07 EDT
I'll re-vote -1 blocker as well, considering the workaround in the TC6 image.
Comment 46 Kamil Páral 2012-05-16 14:41:45 EDT
The workaround seems to work fine, I'm OK with dropping the blocker status.
Comment 47 Adam Williamson 2012-05-16 18:07:49 EDT
OK, that's solid confirmation for dropping blocker status.

Fedora Bugzappers volunteer triage team
Comment 48 Adam Williamson 2013-05-29 15:57:01 EDT
I'm setting this back to ASSIGNED. We're still carrying this workaround as of F19; it kinda obfuscates the behaviour of repo=nfs , it's a PITA to explain in documentation :)

As per comments #35 and #41, anaconda team believes this workaround shouldn't be necessary, and we should be able to fix nfs-utils such that using an NFSv4 mount as the source for the installer image should work, whereupon they can drop the workaround and the behaviour of inst.repo=nfs can be nice and predictable and explicable again. Will provided the wireshark trace that was the last thing Steve requested. So Steve, can you take a look at this? Thanks.
Comment 49 Steve Dickson 2014-12-13 13:47:25 EST
(In reply to Will Woods from comment #39)
> Here's the requested wireshark trace: 
>   http://wwoods.fedorapeople.org/logs/nfs4-hang/anaconda-nfsv4.pcap
> These files are unwieldy and it took me quite a while to get it uploaded
> anywhere, but that should be a pretty complete trace. The machine hangs
> right at the end.
> If that doesn't help, you could probably use the script in comment #31 to
> reproduce the problem locally on any virt-capable machine. 
> Let me know if you need any more info or help reproducing.


You don't have permission to access /logs/nfs4-hang/anaconda-nfsv4.pcap on this

what do I need to do?
Comment 50 Will Woods 2014-12-17 14:42:37 EST
You need to wait for me to fix the permissions on the file, apparently.
Sorry about that! Should work now:

Comment 51 Jan Kurik 2015-07-15 11:10:09 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could affect also pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
Comment 52 Steve Dickson 2016-08-22 12:31:00 EDT
I'm going to assume this is working now since its 
been over a year with no updates. Please feel free
to reopen it if this not the case
Comment 53 Kamil Páral 2016-08-22 12:41:13 EDT
It depends whether anaconda is still carrying the workaround or not. I don't know.