Description of problem:
I boot from PXE. I tried each of these boot arguments:
* repo=nfsiso:nfs.englab.brq.redhat.com:/pub/fedora/fedora-alt/stage/17-Beta.RC4/Fedora/i386/iso/Fedora-17-Beta-i386-DVD.iso
* repo=nfs:nfs.englab.brq.redhat.com:/pub/fedora/linux/development/17/i386/os/
* root=live:nfs:nfs.englab.brq.redhat.com:/pub/fedora/fedora-alt/stage/17-Beta.RC4/Fedora/i386/os/LiveOS/squashfs.img

The symptoms are always the same: somewhere during installation setup, anaconda freezes. It usually happens during partition setup (while creating partitions), but it can also happen after providing the root password and clicking Next, or even when clicking Next on the very first screen. The whole screen is frozen; only the cursor can be moved around (spinning circle). When I switch to the TTYs, I see no traceback or error, but I can't execute any command - if I try to run e.g. 'ls', the prompt just hangs and produces no output. After switching back to TTY7 the whole system freezes completely; neither the cursor nor any keys work anymore.

I have this problem confirmed by 4 different people on 4 different machines (two bare metal and two VMs). All of them booted from our company PXE. If I use repo=http or root=live:http, everything works well. If I boot from optical media instead of PXE, everything works well for all protocols. If I use direct kernel boot in a VM (not using PXE), everything works well for all protocols (but I have to admit I use a different NFS server in that case, because I can't reach the original one from my VMs). The only broken use case seems to be PXE boot combined with one of the boot arguments listed above.

There could be some problem with the company's NFS server, but it serves fine in all other cases - for F16 installation, for repo=nfs when booted from optical media, etc. I have no idea what might cause this, but it looks as if the root image (squashfs.img), when fetched over NFS, gets disconnected during installation setup, and then everything simply hangs.
Version-Release number of selected component (if applicable):
F17 Beta RC4

How reproducible:
100% reproducible in my case, but some of my colleagues succeeded in one of several installation attempts.

Steps to Reproduce:
1. Boot from PXE
2. Use repo=nfs or repo=nfsiso or root=live:nfs
3. Finish installation

Actual results:
Most installation attempts freeze during setup (not the actual installation, just the setup, like providing a password or creating partitions). Possibly related to any I/O.

Additional info:
All members of the Anaconda team located in Brno should be able to reproduce it, since they have access to this very PXE server.
Proposing as Beta blocker, even though we still haven't accepted the PXE criteria.
I'll now try to find out whether it's a problem with the single NFS server we use in Brno (maybe overloaded or something) or whether it happens with other NFS servers as well.
This gets even more complicated. I tried to boot from the company PXE and use my own NFS server for the repo=nfsiso option. It also freezes, so this is not a problem with a single overloaded/misconfigured NFS server. However, the same (my own) NFS server works when I use direct kernel boot in a VM and pass the same repo=nfsiso option. I thought there was no difference between direct kernel boot in a VM and PXE boot once the Linux kernel is started, but it seems to be the only distinction I have found: PXE + NFS is broken, direct kernel boot in a VM + (the same) NFS works.
I tested older composes. Beta RC1 and earlier don't work because of a NetworkManager problem during boot. Beta RC2 has the same issues as Beta RC4; it freezes during installation setup. I also found out why I was sometimes able to perform an installation and sometimes not: it depends on timing. It seems the installer gets stuck after 60 seconds. If you are quick enough to make it to the package installation phase, you're good (probably). If you are too slow, it will freeze. So if you want to reproduce it, just wait 60-90 seconds at the first screen and then hit Next; it will freeze. Martin Sivak suggested there might be some problem with the network during boot, and that the connection/process waits for another minute or more until it gives up and quits, which makes the root image unavailable. My current experience matches his suspicion.
Can you get a log out during the time when it's working, so we can maybe see what problem the network is having? I think I'm -1 blocker on this one...it's definitely a bug, but nothing is completely broken; you can do PXE/NFS if you're fast, you can do PXE if you use something other than NFS, you can do NFS if you use something other than PXE...just feels like it's not broken 'enough' to hold beta yet again. But it's a fairly weak -1. -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
I'm wondering if this is too much of a corner case to be a beta blocker. Installation from NFS works, installation from PXE works but you can't put them together. Since F17 is still pre-release and this could be fixed with an update (once it's pushed to stable), I'm -1 beta blocker, +1 beta NTH for a well tested fix.
(In reply to comment #5)
> Can you get a log out during the time when it's working, so we can maybe see
> what problem the network is having?

Man, that was harder than I would have thought. Maybe the timeout is much shorter. I attach the logs I gathered before the system got stuck. I also tried to run ping indefinitely on that system, and the network is not disconnected. The root image somehow gets disconnected after a short timeout, but the network keeps running fine; ping continues to ping (until I quit it and attempt another action, at which point it freezes). Martin Sivak advised me to install a statically linked busybox into the ramdisk and run it, so that I can access the logs even after the root image is disconnected. But I can't get to it today, and this is all Greek to me, so I can't really promise I can provide these "after-freeze" logs.
Created attachment 576714 [details] anaconda.log
Created attachment 576716 [details] program.log
Created attachment 576717 [details] storage.log
Created attachment 576718 [details] syslog
Created attachment 576719 [details] X.log
I'm inclined to vote -1 to blocker on this one, as I don't see what Beta criteria this hits.
For the record, I'm -1 on beta blocker.
I am -1 beta blocker from the cornercase angle; I'm leaning towards -1 beta NTH as well, mostly because I don't want to breathe on anything, though as Tim pointed out, well-tested might sway me elsewhere.
/me is -1 blocker also
Rejected as a Beta blocker.
Proposing as Final blocker. We should have relevant criteria by then.
When was the last time this was tested? Does it work in Beta?
Beta RC4 = Beta GOLD. It is broken with Beta.
We now have release criterion for PXE: "It must be possible to install by booting the installation kernel directly, including via PXE, and correctly specifying a remote source for the installer itself, using whichever protocols are required to work for package retrieval at the current phase (Alpha, Beta, Final). This must work if the remote source is not a complete repository but contains only the files necessary for the installer itself to run." https://fedoraproject.org/wiki/Fedora_17_Beta_Release_Criteria
Discussed at 2012-04-20 blocker review meeting - http://meetbot.fedoraproject.org/fedora-bugzappers/2012-04-20/fedora-bugzappers.2012-04-20-17.01.log.txt . Accepted as a blocker per criterion cited in comment #21.
Works fine here. I set up a virt guest and NFS-exported a copy of a F17 tree, then booted with: repo=nfs:192.168.122.1:/srv/pungi/test/17/x86_64/os What happens if you use the IP of the NFS server rather than the hostname?
Ah, now I can reproduce the bug. Yay. When this sort of lockup happens it's usually because NetworkManager (or something else) has messed with the network interface. And since our entire root filesystem (including NetworkManager) is being read from that network interface, if the interface goes away we can't bring it back up. But I swear this was working before. Maybe some of my upstream dracut changes didn't make it into Beta as expected? I'll investigate further.
Okay, this is a tough one. As far as I can tell, anaconda/dracut is doing everything right to hand the device over to NetworkManager. If I run the install via kickstart, it completes successfully. However, if I do the install interactively and wait a while on the first anaconda screen, *usually* something seems to twiddle the interface, because the filesystem stops responding. Maybe something is messing with NetworkManager? Maybe there's some kind of activity timeout? It's very strange that using the same repo and the same ifcfg file will work for a kickstart install but hang (eventually) otherwise.
We don't expect NM to bounce the network at all here, but it sounds like it might be. We'd need to get some logs from /var for this. I don't think NM output goes to syslog here anymore, I think there's an NM specific logfile in the install environment that we want. If we can get that *and* the syslog logfile (since that's where stuff like DHCP info is dumped) then we can get a clue as to what's going on. Will asked about using gdb to trap where NM might be taking down the device, so we can stop it there and grab the logfile before the whole machine freezes. The function you want for that is nm_device_deactivate(). We don't expect that function to be called at all on the active device that NM gets handed from Dracut. To get a lot more info out of NM, you can set the "level=debug" option in the [logging] section of /etc/NetworkManager/NetworkManager.conf before NM starts, or we could use dbus-send to bump the log level if the anaconda environment includes dbus-send.
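For reference, the logging tweak described above is a one-line change to the [logging] section of /etc/NetworkManager/NetworkManager.conf (a minimal sketch; only the relevant section is shown, any other sections in the stock file stay as they are):

```ini
[logging]
level=debug
```

This has to be in place before NM starts, so in the install environment it would need to be edited in the initramfs or very early from a shell on tty2.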
This was discussed at the blocker review meeting today: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-05-04/f17-final-blocker-review-4.2012-05-04-17.00.log.html . The meeting conclusion was "There has been some movement on this bug but no fix has been proposed yet. There is a request for logs but the request is a little vague - ask for clarification on exactly which logs are being requested". I happen to know the answer to that at least as far as comment #26 goes: NM logs to syslog, so when you set level=debug in /etc/NetworkManager/NetworkManager.conf [logging] section, you just get a lot more output in /var/log/messages , and that's what would be useful.
This almost certainly isn't a NM problem. I've been reproducing the problem using a virt guest, and here's what I've found: 1) With NM running with level=debug, there are *no* syslog messages that are anywhere *near* the eventual hang. Not NM, not kernel, nothing. 2) The interface never actually goes down. The system responds to pings at 0.2 second intervals for the entire duration of the test - before and after the root filesystem hangs - and an existing ssh connection continues to respond. 3) The install completes successfully* if automated via kickstart - or if you work quickly by hand. It seems to be some kind of timeout / autosuspend thing. * Except that it fails to reboot at the end - see bug 818581.
<steved> adamw: also it would be good to get a system trace (echo t > /proc/sysrq-trigger) <steved> adamw: what NFS protocol versions is being used? If its v4 you might want force v3 just to see if the problem goes away...
Also, getting a system trace (echo t > /proc/sysrq-trigger) of the system when it's hung would be good as well.
We're not specifying v3, so it's defaulting to v4.

Here's a (rough guide to setting up a) reproducer, assuming you have a DVD image and libvirt:

virtip=192.168.122.1  # ip of virbr0
nfsdir=/srv/nfs/test  # or wherever
dvd=/path/to/Fedora-17-*-DVD.iso

# make NFS export
mkdir $nfsdir
cd $nfsdir
cp $dvd .
exportfs -o all_squash $virtip/24:$nfsdir

# grab kernel/initrd
mkdir tmp
mount -o ro,loop *.iso tmp
cp tmp/images/pxeboot/* .
umount tmp

# create/run virt guest
virt-install --name nfstest --ram 768 --disk disk.img,size=10 \
  --network network=default,model=virtio \
  --boot "kernel=$nfsdir/vmlinuz,initrd=$nfsdir/initrd.img,kernel_args=repo=nfs:$virtip:$nfsdir ip=dhcp"

[Swap to the shell on tty2; 'ls' should work. Wait a few minutes and try 'ps aux' and e.g. 'ls /lib' - these should hang.]

# clean up virt guest:
virsh destroy nfstest; virsh undefine nfstest
If this problem ends up being too hard to resolve for F17 Final, I think there should be a simple workaround - just copy the LiveOS/squashfs.img the same way you do it with http, instead of mounting it remotely.
When everything hangs, is it possible to get a system trace (echo t > /proc/sysrq-trigger) of what is running and where it hung? Also, is the NFS mount a loopback mount, or is it going over a network to a remote server? If it's the latter, would it be possible to get a network trace of the traffic?
The NFS mount goes over the virtual network to the virt host, as described in comment #31. Once the system hangs, I stop getting syslog output over my ssh pipe (although the connection doesn't seem to break?). Actually, once the system hangs, nothing more seems to be written to syslog - even though it's in /tmp? I had it run with "while :; do dmesg | tail; sleep 1; done" and there are no new kernel messages in the minutes leading to the hang. I ran with "while :; do logger "uptime: $(</proc/uptime)"; sleep 1; done"; the hang happens *reliably* about 180 seconds (+/-5) after switch_root / syslog start. Since we can't get data out of syslog after that, I can't get the output from the system trace. I'll try forcing nfsv3 in a bit. After that I'm basically out of ideas.
Okay, it's not always 180 seconds - sometimes it's 120 seconds? On the other hand: adding "nfsvers=3" to the nfs options seems to fix the problem! i.e. boot with a repo=nfs... arg like so: repo=nfs:nfsvers=3:<SERVER>:<PATH> So, any ideas what nfs4-specific things might have a timeout after a couple of minutes? We *can* use nfs3 by default, of course, but I'd rather *fix* the problem, not work around it... nevertheless, I'll work on a patch to use nfs3 by default instead of nfs4, just in case.
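To make the option placement explicit for anyone else testing this, the nfsvers=3 option goes in the options field of the repo=nfs:[<options>:]<server>:<path> form. A minimal sketch (the server name and export path below are made up):

```shell
# Hypothetical server name and export path, for illustration only:
SERVER=nfs.example.com
NFSPATH=/srv/nfs/f17/os

# The NFS options field sits between "nfs:" and the server name,
# i.e. repo=nfs:[<options>:]<server>:<path> -- here we force protocol v3:
BOOTARG="repo=nfs:nfsvers=3:${SERVER}:${NFSPATH}"
echo "$BOOTARG"
# -> repo=nfs:nfsvers=3:nfs.example.com:/srv/nfs/f17/os
```

Multiple options would be comma-separated in that same field, the same way they'd appear in /etc/fstab.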
Just a thought... I've seen issues where nfs mounts go awry after the clock was jumped, and I've seen the clock jump during an install at some point after boot.
Okay, so the patch to do nfs3 by default is pretty easy. Doing the fallback to nfs2 if nfs3 fails is harder. How much do we need to care about nfs2 in This Modern Age?
(In reply to comment #35) > Okay, it's not always 180 seconds - sometimes it's 120 seconds? > > On the other hand: adding "nfsvers=3" to the nfs options seems to fix the > problem! i.e. boot with a repo=nfs... arg like so: > > repo=nfs:nfsvers=3:<SERVER>:<PATH> > > So, any ideas what nfs4-specific things might have a timeout after a couple of > minutes? Without seeing where the processes are hung or seeing any traffic, it's kinda tough to tell what's happening... Would it be possible to run a wireshark trace on the server side?
Here's the requested wireshark trace: http://wwoods.fedorapeople.org/logs/nfs4-hang/anaconda-nfsv4.pcap These files are unwieldy and it took me quite a while to get it uploaded anywhere, but that should be a pretty complete trace. The machine hangs right at the end. If that doesn't help, you could probably use the script in comment #31 to reproduce the problem locally on any virt-capable machine. Let me know if you need any more info or help reproducing.
A workaround for this is in anaconda-17.27-1 (uses nfs3 instead of 4)
Assuming the workaround holds up to testing, I'd suggest this bug be dropped from the blocker list and reassigned to an NFS component so we can continue to try to fix nfsv4.
With TC6 i686 (anaconda-17.27) I have successfully completed PXE installation with repo=nfsiso:. The workaround seems to be working fine.
repo=nfs: also seems to work well with TC6 i686.
I re-vote -1 blocker with the workaround in TC6.
I'll re-vote -1 blocker as well, considering the workaround in the TC6 image.
The workaround seems to work fine, I'm OK with dropping the blocker status.
OK, that's solid confirmation for dropping blocker status.
I'm setting this back to ASSIGNED. We're still carrying this workaround as of F19; it kinda obfuscates the behaviour of repo=nfs, and it's a PITA to explain in documentation :) As per comments #35 and #41, the anaconda team believes this workaround shouldn't be necessary: we should be able to fix nfs-utils such that using an NFSv4 mount as the source for the installer image works, whereupon they can drop the workaround and the behaviour of inst.repo=nfs can be nice and predictable and explicable again. Will provided the wireshark trace that was the last thing Steve requested. So Steve, can you take a look at this? Thanks.
(In reply to Will Woods from comment #39)
> Here's the requested wireshark trace:
>
> http://wwoods.fedorapeople.org/logs/nfs4-hang/anaconda-nfsv4.pcap
>
> These files are unwieldy and it took me quite a while to get it uploaded
> anywhere, but that should be a pretty complete trace. The machine hangs
> right at the end.
>
> If that doesn't help, you could probably use the script in comment #31 to
> reproduce the problem locally on any virt-capable machine.
>
> Let me know if you need any more info or help reproducing.

Trying to download it, I get:

  Forbidden
  You don't have permission to access /logs/nfs4-hang/anaconda-nfsv4.pcap on this server.

What do I need to do?
You need to wait for me to fix the permissions on the file, apparently. Sorry about that! Should work now: http://wwoods.fedorapeople.org/logs/nfs4-hang/anaconda-nfsv4.pcap http://wwoods.fedorapeople.org/logs/nfs4-hang/anaconda-nfsv4-2.pcap
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle. Changing version to '23'. (As we did not run this process for some time, it could affect also pre-Fedora 23 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23
I'm going to assume this is working now, since it's been over a year with no updates. Please feel free to reopen it if this is not the case.
It depends whether anaconda is still carrying the workaround or not. I don't know.