Description of problem:
A diskless client runs fine for 6 hours, but then suddenly freezes completely.

Version-Release number of selected component (if applicable):
kernel-4.1.5-100.fc21.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Boot
2. Wait 6 hours

Actual results:
Freeze

Expected results:
No freeze

Additional info:
"Pluto" is a diskless HTPC which runs MythTV, both the backend and a frontend. Recently we upgraded it from Fedora 19 (yes, an upgrade was overdue) to Fedora 21. Since then it works fine for 6 hours after boot, and then it freezes.

I've tried running "top" in batch mode over SSH to see what is happening. I can't see any strange processes. But it is notable that the penultimate run says the machine has been up 5:59, and the last one says 6:00. These numbers have repeated in several attempts, so the machine works for 6 hours to within a few seconds.

I have enabled "netconsole" on the host, forwarding to a remote syslog. A few minutes after the freeze, messages saying "nfs: server mimmi not responding, still trying" start to appear, later followed by the "timed out" variant of the message.

I have collected the traffic using "tcpdump". Up until the time of the freeze, there is nothing out of the ordinary as far as I can tell. But then suddenly, no more packets come from the host. There are no NFS packets sent for the server to reply to, so the problem does not seem to be on the NFS server side. Also, there are no replies to ping packets, nor to ARPs later on. If it weren't for the syslog packets from netconsole coming through, I would have thought the network configuration on the host broke at that time.

Stopping the MythTV processes didn't make any difference. The host froze just the same.

Does anyone have any idea what might be going on?

I attach logs from one test run, where I was running tcpdump and ping from the server, and top and ping via ssh on the host. All of these were started around 5-6 minutes before the crash.
Client (HTPC) that freezes: pluto/172.17.0.5
Server for NFS (and other services): mimmi/172.17.0.1

Boot parameters used at this particular run (I've been trying different options, without any obvious effect):
initrd=initramfs-4.1.5-100.fc21.x86_64.img root=nfs4:mimmi:/remote/pluto rd.nfs.domain=uddeborg rw acpi_enforce_resources=lax LANG=sv_SE.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=sv-latin1 loglevel=8
Created attachment 1066043: Data saved by 30 minutes run of: tcpdump -i em1 -w "#pluto-tcpdump-$TS" host pluto
Created attachment 1066044: Output from: ssh -n pluto top -b -d 15
Created attachment 1066045: Output from ping from server to HTPC: ping -i 15 -c 120 -D pluto
Created attachment 1066046: Output from ping from HTPC to server: ssh -n pluto ping -i 15 -D mimmi
I tried the last kernel we used from F19, 3.14.27-100.fc19.x86_64. With that, the machine continues to run just fine for longer than 6 hours. Userspace is otherwise unchanged, so we are right now running F21 userspace on an F19 kernel. (I haven't noticed any obvious issues with the combination.) As far as I can tell, whatever is causing this problem was introduced somewhere between 3.14.27-100.fc19 and 4.1.5-100.fc21. Would some kind of bisection be feasible? I imagine it would be a bit complicated as both different kernel versions and different Fedora packagings with various patch sets would be involved.
I've been working on bisection scripts for the kernel but 3.14 is too old to use them unfortunately. Can you narrow down which package broke it using koji builds? We may be able to bisect if you can get a tighter window as well. This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1232528
I hadn't realised Koji builds were kept that long. Since they are, I'll certainly try to narrow it down. Meanwhile I've tried the 3.17.4-301.fc21 kernel. It's the original F21 kernel, and is thus still available from the "fedora" channel. It freezes after 6 hours; "BAD" in the bisection. So it breaks somewhere between 3.14.27-100.fc19 and 3.17.4-301.fc21. To narrow this down further, is it the kernel version number I should use for ordering the releases when bisecting? In calendar time, 3.14.27-100.fc19 was built after 3.17.4-301.fc21. But it's the "3.14.27" and "3.17.4" parts that are most important, right? Since the oldest F21 kernel is bad, I will have to use F20 kernels to get older versions. There isn't really that much difference between the builds for different branches, right? So that testing is still useful?
After having tried a number of kernels (with a turnaround of 6 hours for each test) I realized the problem doesn't actually lie in the kernel proper. When I reinstalled 3.14.27-100.fc19, the new installation hangs too. The difference is, of course, that the new installation's initramfs image was created using the F21 tools, while the original was created on F19. Now I'll read up on how initramfs images are made. Then I'll do a kind of bisection to find which difference between the two images is the important one. And see where this bugzilla really belongs. (Maybe back at my setup.)
It's not a bug, it is by design. The newer dhclient-script sets valid_lft for the interface when it is brought up. Sorry!
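For later readers: valid_lft is the valid lifetime of an address as shown by "ip addr". When it counts down to zero without being refreshed, the kernel removes the address from the interface, which matches the host going silent on the network. The following is a minimal sketch of how the remaining lifetime could be read out. The helper name show_lifetimes, the client interface name em1, and the sample line are assumptions made for illustration; they are not taken from the attached logs.

```shell
#!/bin/sh
# Print "<interface> <address> <valid_lft>" for each IPv4 address line read
# from stdin. Expects the one-line-per-address format of `ip -4 -o addr show`.
show_lifetimes() {
    awk '{
        addr = ""; lft = ""
        for (i = 1; i < NF; i++) {
            if ($i == "inet")      addr = $(i + 1)   # address follows "inet"
            if ($i == "valid_lft") lft  = $(i + 1)   # lifetime follows "valid_lft"
        }
        if (addr != "") print $2, addr, lft
    }'
}

# Illustrative sample line (made up for this sketch, not from the real host).
sample='2: em1    inet 172.17.0.5/24 brd 172.17.0.255 scope global dynamic em1\       valid_lft 21540sec preferred_lft 21540sec'
printf '%s\n' "$sample" | show_lifetimes

# On the live host one would instead run something like:
#   ip -4 -o addr show dev em1 | show_lifetimes
```

A statically configured address normally shows "forever" here; a finite, decreasing value with no DHCP client running to renew it means the address will eventually disappear.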
In case anyone comes by later: bug 1121258 is about having these parameters documented.
(In reply to Göran Uddeborg from comment #9)
> It's not a bug, it is by design. The newer dhclient-script sets valid_lft
> for the interface when it is brought up. Sorry!

Still seems like a bug to have a system freeze when it renews a dhcp lease.
Maybe I should clarify for George, and any others who may read this report after the fact. Dhclient DOES update the lifetime parameters when it renews the DHCP lease. But I wasn't running dhclient on this box. I had given it a statically assigned IP address in dhcpd.conf. During boot it would set this address, and then never change it. That worked fine since it "owned" the address; it only needed DHCP to find out the address at boot. There was no need to spend cycles on a running dhclient. That is, until the new initramfs image set the lifetime parameters of the interface. Then the address stopped working once the lifetime expired. I fixed the problem by changing the timeout parameters for this and a few other statically assigned IPs in my dhcpd.conf. (I'm thus still not running dhclient on the box.)
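As a rough illustration of that kind of fix, a host declaration in ISC dhcpd.conf can carry its own lease times. This is only a sketch: the MAC address is a placeholder, and the one-year value is an assumption, since the report does not say which values were actually chosen.

```
# Hypothetical dhcpd.conf fragment; values are assumptions, not the
# reporter's actual configuration.
host pluto {
    hardware ethernet 00:11:22:33:44:55;   # placeholder MAC address
    fixed-address 172.17.0.5;
    default-lease-time 31536000;           # one year, in seconds
    max-lease-time 31536000;
}
```

With a lease this long, the lifetime that the initramfs sets on the interface should not run out during normal uptime, even though nothing on the client renews it.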