1185519 – INFO: rcu_sched self-detected stall on CPU with nfsd

Bug 1185519 - INFO: rcu_sched self-detected stall on CPU with nfsd

Summary: INFO: rcu_sched self-detected stall on CPU with nfsd

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	21
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	nfs-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2015-01-24 07:26 UTC by Anthony Messina
Modified:	2015-03-18 10:25 UTC (History)
CC List:	8 users (show)
Fixed In Version:	kernel-4.0.0-0.rc3.git0.1.fc22
Clone Of:
Clones:	1303013 (view as bug list)
Environment:
Last Closed:	2015-03-14 09:14:59 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Anthony Messina 2015-01-24 07:26:33 UTC

Getting this kernel lockup on my NFS server, NFSv4.2/KRB5.  After this occurs, I am unable to "systemctl reboot" and have to do a hard shut down.

Using:
kernel-3.18.3-201.fc21
nfs-utils-1.3.1-6.0.fc21
gssproxy-0.3.1-4.fc21.x86_64

kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u16:1:28347]
kernel: Modules linked in: binfmt_misc cts rpcsec_gss_krb5 vhost_net vhost macvtap macvlan tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl x86_pkg_temp_thermal raid10 coretemp kvm_intel kvm bnx2 iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel lpc_ich mfd_core wmi shpchp ipmi_si ipmi_msghandler dcdbas i2c_i801 ie31200_edac edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc raid1 mgag200 i2c_algo_bit drm_kms_helper
kernel:  ttm drm mpt2sas raid_class scsi_transport_sas
kernel: CPU: 1 PID: 28347 Comm: kworker/u16:1 Tainted: G             L 3.18.3-201.fc21.x86_64 #1
kernel: Hardware name: Dell Inc. PowerEdge R210 II/09T7VV, BIOS 2.2.3 10/25/2012
kernel: Workqueue: nfsd4 laundromat_main [nfsd]
kernel: task: ffff8801351944b0 ti: ffff8800046cc000 task.ti: ffff8800046cc000
kernel: RIP: 0010:[<ffffffff8138f410>]  [<ffffffff8138f410>] _atomic_dec_and_lock+0x0/0x70
kernel: RSP: 0018:ffff8800046cfcd0  EFLAGS: 00000296
kernel: RAX: ffff8800046cfd10 RBX: 0000000000000000 RCX: dead000000200200
kernel: RDX: ffff8800046cfd10 RSI: 0000000000000586 RDI: ffff8800046cfcc0
kernel: RBP: ffff8800046cfcf8 R08: ffff8800046cfd10 R09: 0000000180210001
kernel: R10: ffff88023fc5ac90 R11: ffff8801a3c28d90 R12: ffff88023fc5ac90
kernel: R13: ffff8801a3c28d90 R14: 0000000000000000 R15: ffff8800046cfcc8
kernel: FS:  0000000000000000(0000) GS:ffff88023fc40000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f0b59d46410 CR3: 0000000001c11000 CR4: 00000000000427e0
kernel: Stack:
kernel:  ffffffffa018020a 0000000000000246 ffff8800046cfd10 ffff8800aced6800
kernel:  ffff8800046cfcc0 ffff8800046cfd48 ffffffffa0180557 ffff88006d9e4d80
kernel:  ffff8800046cfd10 ffff8800046cfd10 0000000015d50db1 ffff8800aced6800
kernel: Call Trace:
kernel:  [<ffffffffa018020a>] ? nfs4_put_stid+0x2a/0xe0 [nfsd]
kernel:  [<ffffffffa0180557>] __destroy_client+0xe7/0x160 [nfsd]
kernel:  [<ffffffffa01805f6>] expire_client+0x26/0x30 [nfsd]
kernel:  [<ffffffffa018099e>] laundromat_main+0x15e/0x490 [nfsd]
kernel:  [<ffffffff810b240d>] process_one_work+0x14d/0x400
kernel:  [<ffffffff810b2d9b>] worker_thread+0x6b/0x4a0
kernel:  [<ffffffff810b2d30>] ? rescuer_thread+0x2a0/0x2a0
kernel:  [<ffffffff810b7fea>] kthread+0xea/0x100
kernel:  [<ffffffff8109bf28>] ? do_exit+0x848/0xab0
kernel:  [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0
kernel:  [<ffffffff817646fc>] ret_from_fork+0x7c/0xb0
kernel:  [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0
kernel: Code: 83 c2 01 48 89 df 48 63 d2 e8 fd 2e 01 00 48 63 35 1a a7 99 00 89 c2 39 c6 77 e0 5b 41 5c 5d c3 90 90 90 90 90 90 90 90 90 90 90 <55> 48 89 e5 41 54 53 8b 0f 83 f9 01 74 3a 8d 51 ff 89 c8 f0 0f


kernel: INFO: rcu_sched self-detected stall on CPU { 1}  (t=420006 jiffies g=4708124 c=4708123 q=0)
kernel: Task dump for CPU 1:
kernel: kworker/u16:1   R  running task        0 28347      2 0x00000888
kernel: Workqueue: nfsd4 laundromat_main [nfsd]
kernel:  ffff8801351944b0 0000000015d50db1 ffff88023fc43d58 ffffffff810c4f58
kernel:  0000000000000001 ffffffff81c58a80 ffff88023fc43d78 ffffffff810c85bd
kernel:  ffff88023fc43d78 0000000000000002 ffff88023fc43da8 ffffffff810f6951
kernel: Call Trace:
kernel:  <IRQ>  [<ffffffff810c4f58>] sched_show_task+0xa8/0x110
kernel:  [<ffffffff810c85bd>] dump_cpu_task+0x3d/0x50
kernel:  [<ffffffff810f6951>] rcu_dump_cpu_stacks+0x91/0xd0
kernel:  [<ffffffff810fa647>] rcu_check_callbacks+0x497/0x710
kernel:  [<ffffffff810ff93b>] update_process_times+0x4b/0x80
kernel:  [<ffffffff8110f455>] tick_sched_handle.isra.19+0x25/0x60
kernel:  [<ffffffff8110f5c5>] tick_sched_timer+0x45/0x80
kernel:  [<ffffffff81100664>] __run_hrtimer+0x74/0x1d0
kernel:  [<ffffffff8110f580>] ? tick_sched_do_timer+0x60/0x60
kernel:  [<ffffffff81100a7f>] hrtimer_interrupt+0x10f/0x260
kernel:  [<ffffffff8104c09e>] local_apic_timer_interrupt+0x3e/0x70
kernel:  [<ffffffff81767553>] smp_apic_timer_interrupt+0x43/0x60
kernel:  [<ffffffff8176563d>] apic_timer_interrupt+0x6d/0x80
kernel:  <EOI>  [<ffffffff81763e32>] ? _raw_spin_unlock_irqrestore+0x12/0x20
kernel:  [<ffffffff810da178>] __wake_up+0x48/0x60
kernel:  [<ffffffffa01802b5>] nfs4_put_stid+0xd5/0xe0 [nfsd]
kernel:  [<ffffffffa0180557>] __destroy_client+0xe7/0x160 [nfsd]
kernel:  [<ffffffffa01805f6>] expire_client+0x26/0x30 [nfsd]
kernel:  [<ffffffffa018099e>] laundromat_main+0x15e/0x490 [nfsd]
kernel:  [<ffffffff810b240d>] process_one_work+0x14d/0x400
kernel:  [<ffffffff810b2d9b>] worker_thread+0x6b/0x4a0
kernel:  [<ffffffff810b2d30>] ? rescuer_thread+0x2a0/0x2a0
kernel:  [<ffffffff810b7fea>] kthread+0xea/0x100
kernel:  [<ffffffff8109bf28>] ? do_exit+0x848/0xab0
kernel:  [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0
kernel:  [<ffffffff817646fc>] ret_from_fork+0x7c/0xb0
kernel:  [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0

Comment 1 Anthony Messina 2015-02-21 03:34:16 UTC

The following occurred on the NFS server, after an NFS client had a similar and needed to be hard-rebooted.  Upon client reboot, when attempting to mount the same NFS shares.  Client and Server now running 3.18.7-200.fc21.x86_64.

kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [nfsd:1708]
kernel: Modules linked in: cts rpcsec_gss_krb5 vhost_net vhost macvtap macvlan tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw raid10 intel_rapl x86_pkg_temp_thermal bnx2 coretemp iTCO_wdt iTCO_vendor_support lpc_ich dcdbas kvm_intel mfd_core ie31200_edac edac_core kvm ipmi_si ipmi_msghandler crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel wmi shpchp i2c_i801 nfsd auth_rpcgss nfs_acl lockd grace sunrpc raid1 mgag200 i2c_algo_bit drm_kms_helper ttm mpt2sas
kernel:  drm raid_class scsi_transport_sas
kernel: CPU: 2 PID: 1708 Comm: nfsd Not tainted 3.18.7-200.fc21.x86_64 #1
kernel: Hardware name: Dell Inc. PowerEdge R210 II/09T7VV, BIOS 2.2.3 10/25/2012
kernel: task: ffff8802212b89d0 ti: ffff8800aee90000 task.ti: ffff8800aee90000
kernel: RIP: 0010:[<ffffffff81764a92>]  [<ffffffff81764a92>] _raw_spin_unlock_irqrestore+0x12/0x20
kernel: RSP: 0018:ffff8800aee93bf8  EFLAGS: 00000292
kernel: RAX: ffffffffa0195710 RBX: 00ff880100000001 RCX: 0000000000000000
kernel: RDX: ffffffffa0195728 RSI: 0000000000000292 RDI: 0000000000000292
kernel: RBP: ffff8800aee93bf8 R08: 0000000000000000 R09: 0000000180210019
kernel: R10: ffff88023fc9ac90 R11: ffff8801c9da1c18 R12: 00ffffff00000001
kernel: R13: ffff88011f9542b8 R14: ffffffff811f508d R15: ffff8800aee93c08
kernel: FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f1f06af0fa8 CR3: 0000000001c11000 CR4: 00000000000427e0
kernel: Stack:
kernel:  ffff8800aee93c38 ffffffff810da1c8 0000000000000246 ffff8800aee93c30
kernel:  ffffffffa017d556 ffff8800aee93c80 ffff88007bade078 ffff880231c44600
kernel:  ffff8800aee93c68 ffffffffa017d2b5 ffffffffa017d557 ffff8800aee93c80
kernel: Call Trace:
kernel:  [<ffffffff810da1c8>] __wake_up+0x48/0x60
kernel:  [<ffffffffa017d556>] ? __destroy_client+0xe6/0x160 [nfsd]
kernel:  [<ffffffffa017d2b5>] nfs4_put_stid+0xd5/0xe0 [nfsd]
kernel:  [<ffffffffa017d557>] ? __destroy_client+0xe7/0x160 [nfsd]
kernel:  [<ffffffffa017d557>] __destroy_client+0xe7/0x160 [nfsd]
kernel:  [<ffffffffa017d5f6>] expire_client+0x26/0x30 [nfsd]
kernel:  [<ffffffffa017e908>] nfsd4_create_session+0x778/0x800 [nfsd]
kernel:  [<ffffffffa016f4af>] nfsd4_proc_compound+0x3ef/0x700 [nfsd]
kernel:  [<ffffffffa015beb3>] nfsd_dispatch+0xc3/0x210 [nfsd]
kernel:  [<ffffffffa00f7f32>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
kernel:  [<ffffffffa00f6e31>] svc_process_common+0x401/0x680 [sunrpc]
kernel:  [<ffffffffa00f71b7>] svc_process+0x107/0x170 [sunrpc]
kernel:  [<ffffffffa015b807>] nfsd+0x117/0x190 [nfsd]
kernel:  [<ffffffffa015b6f0>] ? nfsd_destroy+0x90/0x90 [nfsd]
kernel:  [<ffffffff810b801a>] kthread+0xea/0x100
kernel:  [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0
kernel:  [<ffffffff8176537c>] ret_from_fork+0x7c/0xb0
kernel:  [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0
kernel: Code: 41 83 47 08 01 fb 66 66 90 66 66 90 eb cb 31 c0 eb cf e8 32 45 93 ff 90 90 66 66 66 66 90 55 48 89 e5 66 83 07 01 48 89 f7 57 9d <66> 66 90 66 90 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 
kernel: INFO: rcu_sched self-detected stall on CPU { 2}  (t=60000 jiffies g=4858111 c=4858110 q=0)
kernel: Task dump for CPU 2:
kernel: nfsd            R  running task        0  1708      2 0x00000888
kernel:  ffff8802212b89d0 00000000f8ec8e32 ffff88023fc83d58 ffffffff810c4f88
kernel:  0000000000000002 ffffffff81c58a80 ffff88023fc83d78 ffffffff810c860d
kernel:  ffff88023fc83d78 0000000000000003 ffff88023fc83da8 ffffffff810f6a31
kernel: Call Trace:
kernel:  <IRQ>  [<ffffffff810c4f88>] sched_show_task+0xa8/0x110
kernel:  [<ffffffff810c860d>] dump_cpu_task+0x3d/0x50
kernel:  [<ffffffff810f6a31>] rcu_dump_cpu_stacks+0x91/0xd0
kernel:  [<ffffffff810fa727>] rcu_check_callbacks+0x497/0x710
kernel:  [<ffffffff810ffa4b>] update_process_times+0x4b/0x80
kernel:  [<ffffffff8110f595>] tick_sched_handle.isra.19+0x25/0x60
kernel:  [<ffffffff8110f705>] tick_sched_timer+0x45/0x80
kernel:  [<ffffffff81100774>] __run_hrtimer+0x74/0x1d0
kernel:  [<ffffffff8110f6c0>] ? tick_sched_do_timer+0x60/0x60
kernel:  [<ffffffff81100b8f>] hrtimer_interrupt+0x10f/0x260
kernel:  [<ffffffff8104c0ce>] local_apic_timer_interrupt+0x3e/0x70
kernel:  [<ffffffff817681d3>] smp_apic_timer_interrupt+0x43/0x60
kernel:  [<ffffffff817662bd>] apic_timer_interrupt+0x6d/0x80
kernel:  <EOI>  [<ffffffff81764a92>] ? _raw_spin_unlock_irqrestore+0x12/0x20
kernel:  [<ffffffff810da1c8>] __wake_up+0x48/0x60
kernel:  [<ffffffffa017d1e4>] ? nfs4_put_stid+0x4/0xe0 [nfsd]
kernel:  [<ffffffffa017d2b5>] nfs4_put_stid+0xd5/0xe0 [nfsd]
kernel:  [<ffffffffa017d1e5>] ? nfs4_put_stid+0x5/0xe0 [nfsd]
kernel:  [<ffffffffa017d557>] __destroy_client+0xe7/0x160 [nfsd]
kernel:  [<ffffffffa017d5f6>] expire_client+0x26/0x30 [nfsd]
kernel:  [<ffffffffa017e908>] nfsd4_create_session+0x778/0x800 [nfsd]
kernel:  [<ffffffffa016f4af>] nfsd4_proc_compound+0x3ef/0x700 [nfsd]
kernel:  [<ffffffffa015beb3>] nfsd_dispatch+0xc3/0x210 [nfsd]
kernel:  [<ffffffffa00f7f32>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
kernel:  [<ffffffffa00f6e31>] svc_process_common+0x401/0x680 [sunrpc]
kernel:  [<ffffffffa00f71b7>] svc_process+0x107/0x170 [sunrpc]
kernel:  [<ffffffffa015b807>] nfsd+0x117/0x190 [nfsd]
kernel:  [<ffffffffa015b6f0>] ? nfsd_destroy+0x90/0x90 [nfsd]
kernel:  [<ffffffff810b801a>] kthread+0xea/0x100
kernel:  [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0
kernel:  [<ffffffff8176537c>] ret_from_fork+0x7c/0xb0
kernel:  [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0

Comment 2 J. Bruce Fields 2015-02-23 16:58:37 UTC

In both the original report and comment #1 the laundromat thread is stuck in __destroy_client.

It's a little uncertain exactly where: in the original report we're in _atomic_dec_and_lock, and the "? nfs4_put_stid" suggests this could be one of the puts of a delegation stateid.

In comment #1 we're probably again in one of those two nfs4_put_stid's, but I don't understand how you hang in _raw_spin_unlock_irqrestore in __wake_up.

Maybe the state is corrupted somehow (e.g. maybe we're trying to free an already-freed delegation).

Looking at the disassembled kernel might help figure out exactly where the hangs are.

It would of course be useful to know if this becomes reliably reproduceable.  Then we could at least bisect, or maybe add some debugging code.

Comment 3 Benjamin Coddington 2015-02-23 17:32:42 UTC

Hey Bruce.. I was able to reproduce it last week by simulating callback channel failures using iptables nf_conntrack timeouts to block the callback.  I'm at cthon today, so I don't have the exact setup on-hand.  Let me see if I can get my local setup to blow up the same way.

The process I used was to tune down nf_conntrack's expiry on the client -- first make sure there's ESTABLISHED,RELATED rules for NFS, then:

echo 10 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
echo 10 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait

Then take a read delegation, wait for the nf_conntrack tables to time out, then write to the file on the server to force a recall.

I'll try to get that setup today and get a vmcore.

Comment 4 Benjamin Coddington 2015-02-25 16:35:26 UTC

This is the likely fix:

http://marc.info/?l=linux-nfs&m=142487180905534&w=2

Here's the discussion:

http://marc.info/?l=linux-nfs&m=142482299023511&w=2

I'll test this early next week.

Comment 5 Anthony Messina 2015-03-01 22:19:33 UTC

(In reply to Benjamin Coddington from comment #4)
> This is the likely fix:
> 
> http://marc.info/?l=linux-nfs&m=142487180905534&w=2

I'd love to see this patch hit Fedora 21's 3.18.8 kernel release, if possible.

Comment 6 Josh Boyer 2015-03-02 19:21:22 UTC

I've added the referenced patch today.  It isn't in the initial 3.18.8 build, but will be in any subsequent builds we do.

Comment 7 Fedora Update System 2015-03-04 01:45:00 UTC

kernel-4.0.0-0.rc2.git0.1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/kernel-4.0.0-0.rc2.git0.1.fc22

Comment 8 Fedora Update System 2015-03-04 21:06:46 UTC

Package kernel-4.0.0-0.rc2.git0.1.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-4.0.0-0.rc2.git0.1.fc22'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-3060/kernel-4.0.0-0.rc2.git0.1.fc22
then log in and leave karma (feedback).

Comment 9 Fedora Update System 2015-03-09 21:13:06 UTC

kernel-3.18.9-200.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-3.18.9-200.fc21

Comment 10 Fedora Update System 2015-03-09 21:13:46 UTC

kernel-3.18.9-100.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.18.9-100.fc20

Comment 11 Fedora Update System 2015-03-10 13:25:53 UTC

kernel-4.0.0-0.rc3.git0.1.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/kernel-4.0.0-0.rc3.git0.1.fc22

Comment 12 Fedora Update System 2015-03-14 09:14:59 UTC

kernel-3.18.9-100.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 13 Fedora Update System 2015-03-14 09:16:35 UTC

kernel-3.18.9-200.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2015-03-18 10:25:05 UTC

kernel-4.0.0-0.rc3.git0.1.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.