Bug 1256042 - Diskless HTPC freezes after exactly 6 hours
Summary: Diskless HTPC freezes after exactly 6 hours
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-08-23 15:10 UTC by Göran Uddeborg
Modified: 2016-12-18 16:53 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-09-22 09:18:27 UTC
Type: Bug
Embargoed:


Attachments
Data saved by 30 minutes run of: tcpdump -i em1 -w "#pluto-tcpdump-$TS" host pluto (399.04 KB, application/x-xz)
2015-08-23 15:14 UTC, Göran Uddeborg
Output from: ssh -n pluto top -b -d 15 (3.35 KB, application/x-xz)
2015-08-23 15:15 UTC, Göran Uddeborg
Output from ping from server to HTPC: ping -i 15 -c 120 -D pluto (1.09 KB, application/x-xz)
2015-08-23 15:16 UTC, Göran Uddeborg
Output from ping from HTPC to server: ssh -n pluto ping -i 15 -D mimmi (364 bytes, application/x-xz)
2015-08-23 15:18 UTC, Göran Uddeborg

Description Göran Uddeborg 2015-08-23 15:10:30 UTC
Description of problem:
A diskless client runs fine for 6 hours, then suddenly freezes completely.

Version-Release number of selected component (if applicable):
kernel-4.1.5-100.fc21.x86_64


How reproducible:
Every time


Steps to Reproduce:
1. Boot
2. Wait 6 hours


Actual results:
Freeze


Expected results:
No freeze


Additional info:
"Pluto" is a diskless HTPC which runs MythTV, both the backend and a frontend.  Recently we upgraded it from Fedora 19 (yes, an upgrade was overdue) to Fedora 21.  Since then, it runs fine for 6 hours after boot and then freezes.

I've tried running "top" in batch mode over SSH to see what is happening.  I can't see any strange processes.  Notably, though, the penultimate run says the machine has been up 5:59, and the last one says 6:00.  These numbers have repeated over several attempts, so to within a few seconds the machine runs for exactly 6 hours.

I have enabled "netconsole" on the host, forwarding to a remote syslog.  A few minutes after the freeze, messages start arriving saying "nfs: server mimmi not responding, still trying", later followed by the "timed out" variant of the message.

I have collected the traffic using "tcpdump".  Up until the time of the freeze there is nothing out of the ordinary, as far as I can tell.  But then, suddenly, no more packets come from the host.  There are no NFS packets sent for the server to reply to, so the problem does not seem to be on the NFS server side.  Also, there are no replies to ping packets, nor to ARP requests later on.

If it weren't for the syslog packets from netconsole still coming through, I would have thought the network configuration on the host had broken at that point.

Stopping the MythTV processes didn't make any difference.  The host froze just the same.

Does anyone have any idea what might be going on?

I attach logs from one test run, where I was running tcpdump and ping from the server, and top and ping via ssh from the host.  All of these were started around 5-6 minutes before the crash.

Client (HTPC) that freezes: pluto/172.17.0.5
Server for NFS (and other services): mimmi/172.17.0.1

Boot parameters used at this particular run (I've been trying different options, without any obvious effect):
initrd=initramfs-4.1.5-100.fc21.x86_64.img root=nfs4:mimmi:/remote/pluto rd.nfs.domain=uddeborg rw acpi_enforce_resources=lax LANG=sv_SE.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=sv-latin1 loglevel=8

Comment 1 Göran Uddeborg 2015-08-23 15:14:14 UTC
Created attachment 1066043 [details]
Data saved by 30 minutes run of:  tcpdump -i em1 -w "#pluto-tcpdump-$TS" host pluto

Comment 2 Göran Uddeborg 2015-08-23 15:15:49 UTC
Created attachment 1066044 [details]
Output from: ssh -n pluto top -b -d 15

Comment 3 Göran Uddeborg 2015-08-23 15:16:44 UTC
Created attachment 1066045 [details]
Output from ping from server to HTPC: ping -i 15 -c 120 -D pluto

Comment 4 Göran Uddeborg 2015-08-23 15:18:02 UTC
Created attachment 1066046 [details]
Output from ping from HTPC to server: ssh -n pluto ping -i 15 -D mimmi

Comment 5 Göran Uddeborg 2015-09-14 13:31:13 UTC
I tried the last kernel we used from F19, 3.14.27-100.fc19.x86_64.  With that, the machine continues to run just fine for longer than 6 hours.  Userspace is otherwise unchanged, so we are right now running F21 userspace on an F19 kernel.  (I haven't noticed any obvious issues with the combination.)

As far as I can tell, whatever is causing this problem was introduced somewhere between 3.14.27-100.fc19 and 4.1.5-100.fc21.

Would some kind of bisection be feasible?  I imagine it would be a bit complicated as both different kernel versions and different Fedora packagings with various patch sets would be involved.

Comment 6 Laura Abbott 2015-09-14 22:42:25 UTC
I've been working on bisection scripts for the kernel but 3.14 is too old to use them unfortunately. Can you narrow down which package broke it using koji builds? We may be able to bisect if you can get a tighter window as well. This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1232528

Comment 7 Göran Uddeborg 2015-09-15 19:36:01 UTC
I hadn't realised Koji builds were kept that long.  Since they are, I'll certainly try to narrow it down.

Meanwhile I've tried the 3.17.4-301.fc21 kernel.  It's the original F21 kernel, and is thus still available from the "fedora" channel.  It freezes after 6 hours; "BAD" in the bisection.  So it breaks somewhere between 3.14.27-100.fc19 and 3.17.4-301.fc21.

To narrow this down further, is it the kernel version number I should use for ordering the releases when bisecting?  In calendar time, 3.14.27-100.fc19 was built after 3.17.4-301.fc21.   But it's the "3.14.27" and "3.17.4" parts that are most important, right?

Since the oldest F21 kernel is bad, I will have to use F20 kernels to get older versions.  There isn't really that much difference between the builds for different branches, right?  So that testing is still useful?

Comment 8 Göran Uddeborg 2015-09-18 18:19:38 UTC
After having tried a number of kernels (with a turnaround of 6 hours per test) I realized the problem doesn't actually lie in the kernel proper.  When I reinstalled 3.14.27-100.fc19, the new installation hangs too.  The difference is, of course, that the new installation's initramfs image was created using the F21 tools, while the original was made on F19.

Now I'll read up on how initramfs images are made.  Then I'll do a kind of bisection to find which difference between the two images is the important one, and see where this bugzilla really belongs.  (Maybe back at my setup.)

Comment 9 Göran Uddeborg 2015-09-22 09:18:27 UTC
It's not a bug, it is by design.  The newer dhclient-script sets valid_lft for the interface when it is brought up.  Sorry!

Comment 10 Göran Uddeborg 2015-09-22 09:32:37 UTC
In case anyone comes by later: bug 1121258 is about having these parameters documented.
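For readers unfamiliar with these lifetime parameters, a minimal illustration of where they are visible.  The example output line and the 21600 s figure are inferred from the 6-hour interval described above, not taken from a captured log:

```shell
# Address lifetimes are shown by iproute2.  An address configured by
# dhclient-script with a finite lease carries a finite valid_lft, e.g.
#     inet 172.17.0.5/24 ... valid_lft 21600sec preferred_lft 21600sec
# (21600 s = 6 h, matching the freeze interval reported here; the exact
# value is an assumption).  With nothing running to renew it, the kernel
# removes the address when valid_lft reaches zero.  A manually added
# address shows "valid_lft forever" and never expires.
ip addr show
```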

Comment 11 George Toft 2016-12-18 15:48:52 UTC
(In reply to Göran Uddeborg from comment #9)
> It's not a bug, it is by design.  The newer dhclient-script sets valid_lft
> for the interface when it is brought up.  Sorry!

Still seems like a bug to have a system freeze when it renews a dhcp lease.

Comment 12 Göran Uddeborg 2016-12-18 16:53:23 UTC
Maybe I should clarify for George, and any others who may read this report after the fact.

Dhclient DOES update the lifetime parameters when it renews the DHCP lease.  But I wasn't running dhclient on this box.  I had given it a statically assigned IP address in dhcpd.conf.  During boot it would set this address and then never change it.  That worked fine since it "owned" the address; it only needed DHCP to find it out in the beginning.  No need to spend cycles on a running dhclient.

That is, until the new initramfs image started setting the lifetime parameters of the interface.  Then the address stopped working once it expired.  I fixed the problem by changing the timeout parameters for this and a few other statically assigned IPs in my dhcpd.conf.  (I'm thus still not running dhclient on the box.)
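A hedged sketch of the kind of dhcpd.conf change described; the group layout and MAC address are hypothetical, and only the host name and fixed address come from this report.  DHCP represents an effectively infinite lease as 0xffffffff (4294967295) seconds:

```
# Hypothetical dhcpd.conf fragment: give statically assigned hosts an
# effectively infinite lease, so the valid_lft set at boot never runs out.
# The MAC address below is made up for illustration.
group {
    default-lease-time 4294967295;
    max-lease-time 4294967295;
    host pluto {
        hardware ethernet 00:11:22:33:44:55;
        fixed-address 172.17.0.5;
    }
}
```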

