911668 – BUG: soft lockup - CPU#1 stuck for 23s! with Intel SpeedStep & NFSv4.1 client

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	18
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	nfs-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-02-15 15:25 UTC by Anthony Messina
Modified:	2013-04-08 12:50 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-04-08 12:50:51 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
dmesg output (70.14 KB, text/plain) 2013-02-15 15:25 UTC, Anthony Messina
Diff between commit c5f5e9c5d2 and 512e4b291c0 from upstream kernel (43.80 KB, patch) 2013-03-04 13:35 UTC, Trond Myklebust

Description Anthony Messina 2013-02-15 15:25:31 UTC

Created attachment 697862 [details]
dmesg output

Using any of kernel-3.7.6-201.fc18.x86_64, kernel-3.7.7-201.fc18.x86_64, or kernel-3.7.8-101.fc17.x86_64, I am seeing soft lockups on an x86_64 computer with an Intel(R) Pentium(R) D CPU 2.80GHz processor when Intel SpeedStep is enabled in the BIOS.

[   56.152006] BUG: soft lockup - CPU#1 stuck for 23s! [10.77.79.2-mana:916]
[   56.152078] Modules linked in: cts rpcsec_gss_krb5 nfsv4 auth_rpcgss nfs lockd dns_resolver fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm sunrpc snd_page_alloc snd_timer snd iTCO_wdt iTCO_vendor_support e1000e soundcore microcode lpc_ich mfd_core i2c_i801 e1000 serio_raw dcdbas nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core
[   56.152078] CPU 1 
[   56.152078] Pid: 916, comm: 10.77.79.2-mana Not tainted 3.7.8-101.fc17.x86_64 #1 Dell Inc.                 Dell DM061                   /0WG864
[   56.152078] RIP: 0010:[<ffffffffa043f56d>]  [<ffffffffa043f56d>] nfs4_run_state_manager+0x32d/0x700 [nfsv4]

I've attached the full dmesg output that I could grab before I was unable to do anything else.  In this state, the computer can only be restarted by holding down the power button.

I've disabled Intel SpeedStep and, for a few hours now, this computer has not locked up.

Comment 1 Anthony Messina 2013-02-16 15:57:47 UTC

As an update, the SpeedStep setting doesn't actually do anything--I was just getting lucky.  In order for this computer to boot and mount NFSv4.1 mountpoints successfully, I have to disable the SMP multiple core support in the BIOS.

On the client, I tried with both the following kernels after a completely fresh install of F18:
kernel-3.7.8-202.fc18.x86_64
kernel-3.7.7-201.fc18.x86_64

#/etc/fstab
host:/     /mnt/host nfs rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
host:/home /home     nfs rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0

On the server:
~]# uname -r
3.7.8-102.fc17.x86_64

~]# cat /proc/fs/nfsd/versions 
-2 -3 +4 +4.1
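
The versions line above reads as NFSv2 and NFSv3 disabled, NFSv4 and NFSv4.1 enabled. A minimal Python sketch for reading that format, assuming each token is a version number prefixed with "+" (enabled) or "-" (disabled); the helper name parse_nfsd_versions is hypothetical and only for illustration:

```python
def parse_nfsd_versions(text):
    """Split a /proc/fs/nfsd/versions line into (enabled, disabled) lists."""
    enabled, disabled = [], []
    for token in text.split():
        if token.startswith("+"):
            enabled.append(token[1:])
        elif token.startswith("-"):
            disabled.append(token[1:])
    return enabled, disabled


if __name__ == "__main__":
    # Sample taken from the server output quoted in this comment.
    enabled, disabled = parse_nfsd_versions("-2 -3 +4 +4.1")
    print("enabled:", enabled)
    print("disabled:", disabled)
```

On a live server the input would come from reading /proc/fs/nfsd/versions instead of the hard-coded sample.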

Comment 2 Anthony Messina 2013-02-16 16:32:16 UTC

The same issue occurs using kernel-3.8.0-0.rc7.git3.1.fc19.x86_64.rpm on the client.

Comment 3 Anthony Messina 2013-02-26 21:09:41 UTC

This still occurs with 3.7.9-205.fc18.x86_64 on an "Intel(R) Pentium(R) D CPU 2.80GHz".

One subjective bit of information to add in trying to figure this out...  If the number of NFSv4 or NFSv4.1 mountpoints in /etc/fstab is greater than the number of processors (two in this case), it fails *every* time with output similar to that below.  Since I have commented out the last entry, the machine seems to work ok.  If I uncomment the last entry, I get the failure again.

Furthermore, I see the same issue on another computer using an "Intel(R) Atom(TM) CPU D525   @ 1.80GHz" (with two processors listed in /proc/cpuinfo) and the same results when I have more than two NFSv4 mountpoints defined.  It did not seem to matter whether it was NFSv4 or NFSv4.1.

### /etc/fstab
ds.example.com:/       /mnt/ds         nfs     rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
ds.example.com:/home   /home           nfs     rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
#example.com:/koji     /mnt/koji       nfs     ro,minorversion=1,sec=krb5p,x-systemd-automount 0 0
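
The pattern described above (more active NFS entries than processors) is an observed correlation, not a confirmed cause. A rough Python sketch for checking a machine against it, assuming the standard whitespace-separated fstab layout with the filesystem type in the third field; count_nfs_mounts and SAMPLE are hypothetical names used only here:

```python
import os


def count_nfs_mounts(fstab_text):
    """Count active (uncommented) fstab entries whose type is nfs or nfs4."""
    count = 0
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            count += 1
    return count


# The fstab entries quoted in this comment, third entry commented out.
SAMPLE = """\
ds.example.com:/       /mnt/ds         nfs     rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
ds.example.com:/home   /home           nfs     rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
#example.com:/koji     /mnt/koji       nfs     ro,minorversion=1,sec=krb5p,x-systemd-automount 0 0
"""

if __name__ == "__main__":
    mounts = count_nfs_mounts(SAMPLE)
    cpus = os.cpu_count() or 1
    print(f"{mounts} active NFS mounts, {cpus} CPUs")
    if mounts > cpus:
        print("more NFS mounts than CPUs: matches the failing pattern reported here")
```

With the third entry commented out the count is two, which equals the two processors on the affected machines; uncommenting it gives three.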

### The dmesg output:

[   25.816657] SELinux: initialized (dev 0:34, type nfs4), uses genfs_contexts
[   52.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
[   52.134001] Modules linked in: cts rpcsec_gss_krb5 nfsv4 auth_rpcgss nfs lockd dns_resolver fscache ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_idt iTCO_wdt snd_hda_intel snd_hda_codec iTCO_vendor_support snd_hwdep lpc_ich snd_seq snd_seq_device snd_pcm mfd_core snd_page_alloc snd_timer snd i2c_i801 e1000e serio_raw soundcore dcdbas microcode sunrpc nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core
[   52.134001] CPU 0 
[   52.134001] Pid: 786, comm: 10.77.79.2-mana Not tainted 3.7.9-205.fc18.x86_64 #1 Dell Inc.                 Dell DM061                   /0WG864
[   52.134001] RIP: 0010:[<ffffffffa03ec2cf>]  [<ffffffffa03ec2cf>] nfs4_run_state_manager+0x8f/0x700 [nfsv4]
[   52.134001] RSP: 0018:ffff8801312e3e78  EFLAGS: 00000246
[   52.134001] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880134b8fa30
[   52.134001] RDX: 00000000000000d8 RSI: ffff880123d7d1b0 RDI: ffffffffa019d098
[   52.134001] RBP: ffff8801312e3eb8 R08: ffff880123dc9910 R09: 0123d7d110000000
[   52.134001] R10: febe2877e5f44400 R11: 0000000000000001 R12: ffff880123d7d110
[   52.134001] R13: ffff8801312e3df8 R14: ffffffff8108ea03 R15: ffff8801312e3e08
[   52.134001] FS:  0000000000000000(0000) GS:ffff88013bc00000(0000) knlGS:0000000000000000
[   52.134001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   52.134001] CR2: 00007fc4798b6000 CR3: 0000000001c0b000 CR4: 00000000000007f0
[   52.134001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   52.134001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   52.134001] Process 10.77.79.2-mana (pid: 786, threadinfo ffff8801312e2000, task ffff880130861720)
[   52.134001] Stack:
[   52.134001]  0000000000000000 ffff880132f1f998 ffff880123d7d000 ffff880132f1f998
[   52.134001]  ffff880123d7d000 ffffffffa03ec240 0000000000000000 0000000000000000
[   52.134001]  ffff8801312e3f48 ffffffff81081e30 0000000100000000 0000000000000000
[   52.134001] Call Trace:
[   52.134001]  [<ffffffffa03ec240>] ? nfs4_do_reclaim+0x520/0x520 [nfsv4]
[   52.134001]  [<ffffffff81081e30>] kthread+0xc0/0xd0
[   52.134001]  [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_flush_tlb_others+0x50/0xe0
[   52.134001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[   52.134001]  [<ffffffff8163fdec>] ret_from_fork+0x7c/0xb0
[   52.134001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[   52.134001] Code: 06 19 c0 85 c0 0f 85 67 01 00 00 f0 0f ba b3 10 01 00 00 0b 19 c0 85 c0 0f 1f 40 00 0f 85 e4 01 00 00 f0 0f ba b3 10 01 00 00 07 <19> c0 85 c0 0f 85 a3 02 00 00 48 8b 83 10 01 00 00 a8 08 0f 85 
[   80.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
[   80.134001] Modules linked in: cts rpcsec_gss_krb5 nfsv4 auth_rpcgss nfs lockd dns_resolver fscache ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_idt iTCO_wdt snd_hda_intel snd_hda_codec iTCO_vendor_support snd_hwdep lpc_ich snd_seq snd_seq_device snd_pcm mfd_core snd_page_alloc snd_timer snd i2c_i801 e1000e serio_raw soundcore dcdbas microcode sunrpc nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core
[   80.134001] CPU 0 
[   80.134001] Pid: 786, comm: 10.77.79.2-mana Not tainted 3.7.9-205.fc18.x86_64 #1 Dell Inc.                 Dell DM061                   /0WG864
[   80.134001] RIP: 0010:[<ffffffffa03ec289>]  [<ffffffffa03ec289>] nfs4_run_state_manager+0x49/0x700 [nfsv4]
[   80.134001] RSP: 0018:ffff8801312e3e78  EFLAGS: 00000246
[   80.134001] RAX: 0000000000000101 RBX: 0000000000000000 RCX: ffff880134b8fa30
[   80.134001] RDX: 0000000000000052 RSI: ffff880123d7d1b0 RDI: ffffffffa019d098
[   80.134001] RBP: ffff8801312e3eb8 R08: ffff880123dc9910 R09: 0123d7d110000000
[   80.134001] R10: febe2877e5f44400 R11: 00000000005ba862 R12: ffff880123d7d110
[   80.134001] R13: 0123d7d110000000 R14: ffffffff8108ea03 R15: ffff8801312e3e08
[   80.134001] FS:  0000000000000000(0000) GS:ffff88013bc00000(0000) knlGS:0000000000000000
[   80.134001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   80.134001] CR2: 00007fc4798b6000 CR3: 0000000001c0b000 CR4: 00000000000007f0
[   80.134001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.134001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   80.134001] Process 10.77.79.2-mana (pid: 786, threadinfo ffff8801312e2000, task ffff880130861720)
[   80.134001] Stack:
[   80.134001]  0000000000000000 ffff880132f1f998 ffff880123d7d000 ffff880132f1f998
[   80.134001]  ffff880123d7d000 ffffffffa03ec240 0000000000000000 0000000000000000
[   80.134001]  ffff8801312e3f48 ffffffff81081e30 0000000100000000 0000000000000000
[   80.134001] Call Trace:
[   80.134001]  [<ffffffffa03ec240>] ? nfs4_do_reclaim+0x520/0x520 [nfsv4]
[   80.134001]  [<ffffffff81081e30>] kthread+0xc0/0xd0
[   80.134001]  [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_flush_tlb_others+0x50/0xe0
[   80.134001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[   80.134001]  [<ffffffff8163fdec>] ret_from_fork+0x7c/0xb0
[   80.134001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[   80.134001] Code: 01 00 00 48 83 ec 18 e8 56 67 c7 e0 48 8b 83 10 01 00 00 f6 c4 04 0f 85 d6 00 00 00 48 8b 83 10 01 00 00 a8 04 0f 85 17 01 00 00 <f0> 0f ba b3 10 01 00 00 01 19 c0 85 c0 0f 85 4c 01 00 00 f0 0f 
[   85.671001] INFO: rcu_sched self-detected stall on CPU { 0}  (t=60001 jiffies)
[   85.671001] sending NMI to all CPUs:
[   85.671001] NMI backtrace for cpu 0
[   85.671001] CPU 0 
[   85.671001] Pid: 786, comm: 10.77.79.2-mana Not tainted 3.7.9-205.fc18.x86_64 #1 Dell Inc.                 Dell DM061                   /0WG864
[   85.671001] RIP: 0010:[<ffffffff812f5900>]  [<ffffffff812f5900>] kasprintf+0x40/0x40
[   85.671001] RSP: 0018:ffff88013bc03df0  EFLAGS: 00000082
[   85.671001] RAX: 0000000000000000 RBX: 0000000000002710 RCX: 0000000000000077
[   85.671001] RDX: 0000000000000c00 RSI: 0000000000000080 RDI: ffffffff81cdb360
[   85.671001] RBP: ffff88013bc03e08 R08: 20676e69646e6573 R09: 000000000000044b
[   85.671001] R10: 61206f7420494d4e R11: 3a73555043206c6c R12: 0000000000000000
[   85.671001] R13: ffff8801312e2000 R14: ffffffff81c3c600 R15: ffffffff81c3c600
[   85.671001] FS:  0000000000000000(0000) GS:ffff88013bc00000(0000) knlGS:0000000000000000
[   85.671001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   85.671001] CR2: 00007fc4798b6000 CR3: 0000000001c0b000 CR4: 00000000000007f0
[   85.671001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   85.671001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   85.671001] Process 10.77.79.2-mana (pid: 786, threadinfo ffff8801312e2000, task ffff880130861720)
[   85.671001] Stack:
[   85.671001]  ffffffff81039b00 0000000000000000 ffff88013bc0ecc0 ffff88013bc03e68
[   85.671001]  ffffffff810f5470 ffff880130861720 0000000000000000 0000000000000096
[   85.671001]  0000000000000000 ffff88013bc03e58 ffff880130861720 0000000000000000
[   85.671001] Call Trace:
[   85.671001]  <IRQ> 

[   85.671001]  [<ffffffff81039b00>] ? arch_trigger_all_cpu_backtrace+0x80/0xa0
[   85.671001]  [<ffffffff810f5470>] rcu_check_callbacks+0x2c0/0x650
[   85.671001]  [<ffffffff8106fbf8>] update_process_times+0x48/0x90
[   85.671001]  [<ffffffff810b63ae>] tick_sched_timer+0x6e/0xe0
[   85.671001]  [<ffffffff810862d3>] __run_hrtimer+0x73/0x1d0
[   85.671001]  [<ffffffff810b6340>] ? tick_nohz_handler+0x110/0x110
[   85.671001]  [<ffffffff81086bf7>] hrtimer_interrupt+0xf7/0x230
[   85.671001]  [<ffffffff81641a29>] smp_apic_timer_interrupt+0x69/0x99
[   85.671001]  [<ffffffff8164095d>] apic_timer_interrupt+0x6d/0x80
[   85.671001]  <EOI> 

[   85.671001]  [<ffffffffa019d098>] ? rpc_wake_up+0x68/0x80 [sunrpc]
[   85.671001]  [<ffffffffa03ec2f3>] ? nfs4_run_state_manager+0xb3/0x700 [nfsv4]
[   85.671001]  [<ffffffffa03ec556>] ? nfs4_run_state_manager+0x316/0x700 [nfsv4]
[   85.671001]  [<ffffffffa03ec240>] ? nfs4_do_reclaim+0x520/0x520 [nfsv4]
[   85.671001]  [<ffffffff81081e30>] kthread+0xc0/0xd0
[   85.671001]  [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_flush_tlb_others+0x50/0xe0
[   85.671001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[   85.671001]  [<ffffffff8163fdec>] ret_from_fork+0x7c/0xb0
[   85.671001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[   85.671001] Code: 89 4d e8 4c 89 45 f0 48 89 45 c0 48 8d 45 d0 4c 89 4d f8 c7 45 b8 10 00 00 00 48 89 45 c8 e8 38 ff ff ff c9 c3 66 0f 1f 44 00 00 <8d> 4e 3f 85 f6 55 0f 49 ce 48 89 e5 c1 f9 06 85 c9 7e 61 48 83 
[   85.671274] NMI backtrace for cpu 1
[   85.671281] CPU 1 
[   85.671288] Pid: 0, comm: swapper/1 Not tainted 3.7.9-205.fc18.x86_64 #1 Dell Inc.                 Dell DM061                   /0WG864
[   85.671293] RIP: 0010:[<ffffffff8101caa1>]  [<ffffffff8101caa1>] mwait_idle+0x91/0x1e0
[   85.671306] RSP: 0018:ffff880135b2bee8  EFLAGS: 00000246
[   85.671310] RAX: 0000000000000000 RBX: ffff880135b2bfd8 RCX: 0000000000000000
[   85.671313] RDX: 0000000000000000 RSI: ffff880135b2bfd8 RDI: 0000000000000096
[   85.671316] RBP: ffff880135b2bef8 R08: 0000000000000000 R09: 0000000000000000
[   85.671318] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[   85.671321] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   85.671325] FS:  0000000000000000(0000) GS:ffff88013bc40000(0000) knlGS:0000000000000000
[   85.671328] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   85.671331] CR2: 00007f0fd5698558 CR3: 0000000130cf2000 CR4: 00000000000007e0
[   85.671334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   85.671338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   85.671341] Process swapper/1 (pid: 0, threadinfo ffff880135b2a000, task ffff880135b21720)
[   85.671343] Stack:
[   85.671346]  ffff880135b2bfd8 ffffffff81cdc650 ffff880135b2bf28 ffffffff8101d4fe
[   85.671353]  ffff880135b2bf18 b6d619229820a8b8 0000000000000000 0000000000000000
[   85.671359]  ffff880135b2bf48 ffffffff816265c4 ff7efefefefffefe daf66c6017f47e20
[   85.671365] Call Trace:
[   85.671375]  [<ffffffff8101d4fe>] cpu_idle+0xfe/0x120
[   85.671386]  [<ffffffff816265c4>] start_secondary+0x23e/0x240
[   85.671390] Code: 34 25 f0 c6 00 00 48 89 d1 48 8d 86 38 e0 ff ff 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 0f 85 36 01 00 00 31 c0 fb 0f 01 c9 <65> 44 8b 24 25 34 b0 00 00 0f 1f 44 00 00 65 44 8b 24 25 34 b0 
[   85.672004] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=60002 jiffies)
[  112.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
[  112.134001] Modules linked in: cts rpcsec_gss_krb5 nfsv4 auth_rpcgss nfs lockd dns_resolver fscache ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_idt iTCO_wdt snd_hda_intel snd_hda_codec iTCO_vendor_support snd_hwdep lpc_ich snd_seq snd_seq_device snd_pcm mfd_core snd_page_alloc snd_timer snd i2c_i801 e1000e serio_raw soundcore dcdbas microcode sunrpc nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core
[  112.134001] CPU 0 
[  112.134001] Pid: 786, comm: 10.77.79.2-mana Not tainted 3.7.9-205.fc18.x86_64 #1 Dell Inc.                 Dell DM061                   /0WG864
[  112.134001] RIP: 0010:[<ffffffffa03ec52f>]  [<ffffffffa03ec52f>] nfs4_run_state_manager+0x2ef/0x700 [nfsv4]
[  112.134001] RSP: 0018:ffff8801312e3e78  EFLAGS: 00000246
[  112.134001] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880134b8fa30
[  112.134001] RDX: 0000000000000081 RSI: ffff880123d7d1b0 RDI: ffffffffa019d098
[  112.134001] RBP: ffff8801312e3eb8 R08: ffff880123dc9910 R09: 0123d7d110000000
[  112.134001] R10: febe2877e5f44400 R11: 0000000000000001 R12: ffff880123d7d110
[  112.134001] R13: 0000000000000001 R14: ffffffff8108ea03 R15: ffff8801312e3e08
[  112.134001] FS:  0000000000000000(0000) GS:ffff88013bc00000(0000) knlGS:0000000000000000
[  112.134001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  112.134001] CR2: 00007fc4798b6000 CR3: 0000000001c0b000 CR4: 00000000000007f0
[  112.134001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  112.134001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  112.134001] Process 10.77.79.2-mana (pid: 786, threadinfo ffff8801312e2000, task ffff880130861720)
[  112.134001] Stack:
[  112.134001]  0000000000000000 ffff880132f1f998 ffff880123d7d000 ffff880132f1f998
[  112.134001]  ffff880123d7d000 ffffffffa03ec240 0000000000000000 0000000000000000
[  112.134001]  ffff8801312e3f48 ffffffff81081e30 0000000100000000 0000000000000000
[  112.134001] Call Trace:
[  112.134001]  [<ffffffffa03ec240>] ? nfs4_do_reclaim+0x520/0x520 [nfsv4]
[  112.134001]  [<ffffffff81081e30>] kthread+0xc0/0xd0
[  112.134001]  [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_flush_tlb_others+0x50/0xe0
[  112.134001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[  112.134001]  [<ffffffff8163fdec>] ret_from_fork+0x7c/0xb0
[  112.134001]  [<ffffffff81081d70>] ? kthread_create_on_node+0x120/0x120
[  112.134001] Code: 05 87 58 dd ff 40 0f 84 68 fe ff ff 48 8b 93 a8 00 00 00 48 c7 c6 a0 ae 3f a0 48 c7 c7 10 e0 3f a0 e8 20 12 24 e1 e9 49 fe ff ff <48> 8b bb c0 02 00 00 e8 65 ea ff ff f0 0f ba b3 10 01 00 00 05 
[  113.843883] systemd-readahead[236]: Failed to read event: Value too large for defined data type
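
In the output above, the same thread (10.77.79.2-mana, pid 786) is reported stuck on CPU#0 at roughly 52s, 80s and 112s, each time inside nfs4_run_state_manager. A small Python sketch for pulling those events out of a longer dmesg capture, assuming the "BUG: soft lockup" line format shown in this report; the regular expression and the name find_soft_lockups are illustrative, not part of any kernel tooling:

```python
import re

# Matches lines such as:
# [   52.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(dmesg_text):
    """Return one dict per soft lockup report found in dmesg_text."""
    events = []
    for line in dmesg_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            events.append({
                "timestamp": float(m.group("ts")),
                "cpu": int(m.group("cpu")),
                "seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
    return events


# The three soft lockup lines from the dmesg output in this comment.
SAMPLE = """\
[   52.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
[   80.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
[  112.134001] BUG: soft lockup - CPU#0 stuck for 22s! [10.77.79.2-mana:786]
"""

if __name__ == "__main__":
    for event in find_soft_lockups(SAMPLE):
        print(event)
```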

Comment 4 Trond Myklebust 2013-03-04 13:33:43 UTC

A number of state recovery deadlocks were identified in the course of the
last 2 weeks, and have been fixed in the 3.9-rc1 kernel.

Do you have the ability to compile your own kernels? If so, it would be
interesting if you could apply the following patch to the latest Linux-3.8
based Fedora kernel, and see if it suffices to fix the problem.
If not, then we will know that this is something that has not yet been
fixed in upstream.

Comment 5 Trond Myklebust 2013-03-04 13:35:20 UTC

Created attachment 704987 [details]
Diff between commit c5f5e9c5d2 and 512e4b291c0 from upstream kernel

Comment 6 Anthony Messina 2013-03-04 13:44:37 UTC

Trond, with *much* effort, I probably could rebuild my own kernel.  However, I will not be able to update that kernel on my production server which services these machines (3.7.9-104.fc17.x86_64).  The clients are all 3.8.1-201.fc18.x86_64 now and if I did rebuild, I would be able to update one of those for testing.

Would that be helpful, or do you think both the server and client would need the updates?

Comment 7 Trond Myklebust 2013-03-04 13:51:02 UTC

Ah... The latest Fedora kernel is based on 3.8.1? OK, that actually contains
3 of the patches from the above diff (more to come in 3.8.3).
Can we therefore maybe start by looking at reproducing the problem with your
3.8.1 based clients?

Comment 8 Anthony Messina 2013-03-04 13:59:18 UTC

(In reply to comment #7)
> Ah... The latest Fedora kernel is based on 3.8.1? OK, that actually contains
> 3 of the patches from the above diff (more to come in 3.8.3).
> Can we therefore maybe start by looking at reproducing the problem with your
> 3.8.1 based clients?

I think that would be possible.  Since 3.8.2 is now stable, and appears to contain some more related patches according to https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.8.2, how long do you think a Koji build of 3.8.2 for F18 will take?

Maybe later today? *wink*

Comment 9 Trond Myklebust 2013-03-04 14:10:42 UTC

There are no client changes between 3.8.1 and 3.8.2. The next set of client
fixes will only appear in 3.8.3...

Comment 10 Anthony Messina 2013-03-04 14:14:32 UTC

Ok, I'll do all the testing I can today with the following fstab layout.  I'll likely report back this evening.

### /etc/fstab
ds.example.com:/       /mnt/ds         nfs     rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
ds.example.com:/home   /home           nfs     rw,minorversion=1,sec=krb5p,x-systemd.automount 0 0
example.com:/koji     /mnt/koji       nfs     ro,minorversion=1,sec=krb5p,x-systemd-automount 0 0

The servers are both 3.7.9-104.fc17.x86_64, and the client is 3.8.1-201.fc18.x86_64

Comment 11 Anthony Messina 2013-03-05 02:00:56 UTC

Reporting back some good news...

So far today, I have not been able to reproduce this issue with the servers running 3.7.9-104.fc17.x86_64, and the client running 3.8.1-201.fc18.x86_64 (Intel(R) Pentium(R) D CPU 2.80GHz) and the fstab configuration as in comment #10.

The testing has been through several soft restarts as well as cold reboots.

I'll continue to test things out as I soon upgrade to kernel-3.8.2-201 (http://koji.fedoraproject.org/koji/buildinfo?buildID=399946) and beyond in the next week or so.

Comment 12 Anthony Messina 2013-03-12 09:15:04 UTC

So far, so good.  With clients running 3.8.2-206.fc18.x86_64, one server running 3.8.2-105.fc17.x86_64, and the other server running 3.7.9-104.fc17.x86_64, I do not see this issue.

According to Trond in Comment #7 and Comment #9, more client fixes will come in kernel-3.8.3, which isn't yet available.  I'd like to wait until then, retest and report back.

Of course, I will be keeping my fingers crossed, as this seems to ensure success during a server reboot ;)

Comment 13 Anthony Messina 2013-03-15 11:03:50 UTC

Unfortunately, I encountered this issue again on the client originally described in the bug report "Intel(R) Pentium(R) D CPU 2.80GHz", running 3.8.2-206.fc18.x86_64.

Comment 14 Anthony Messina 2013-03-16 23:04:23 UTC

The issue continues with the "Intel(R) Pentium(R) D CPU 2.80GHz" client, running 3.8.3-201.fc18.x86_64 where the ds.example.com server is running 3.8.3-101.fc17.x86_64 and the example.com server is running 3.7.9-104.fc17.x86_64.

Again, if I disable the third NFS mount shown in Comment #10, this issue does not occur at all.

Comment 15 Anthony Messina 2013-04-03 11:59:47 UTC

Updates:

I have now upgraded all of my machines to Fedora 18 and I am not able to reproduce this issue at this time.  The "Intel(R) Pentium(R) D CPU 2.80GHz" client is running 3.8.5-201.fc18.x86_64 and has NFSv4.1 mounts to one server running 3.8.5-201.fc18.x86_64, and the other running 3.8.4-202.fc18.x86_64 (not running upgraded kernel due to bug 947539 occurring on this machine).

All servers and clients are now using nfs-utils-1.2.7-6.fc18.  I *think* this problem may have been related in some strange way to bug 909882, because I have not been able to reproduce the issue since nfs-utils-1.2.7-5.fc18 and I do use the crossmnt option on my servers.

If there are no concerns raised with this bug that the devs want to track, I think it can be closed from my perspective.

Thank you for your help.
