Getting this kernel lockup on my NFS server, NFSv4.2/KRB5. After this occurs, I am unable to "systemctl reboot" and have to do a hard shut down. Using: kernel-3.18.3-201.fc21 nfs-utils-1.3.1-6.0.fc21 gssproxy-0.3.1-4.fc21.x86_64 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u16:1:28347] kernel: Modules linked in: binfmt_misc cts rpcsec_gss_krb5 vhost_net vhost macvtap macvlan tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl x86_pkg_temp_thermal raid10 coretemp kvm_intel kvm bnx2 iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel lpc_ich mfd_core wmi shpchp ipmi_si ipmi_msghandler dcdbas i2c_i801 ie31200_edac edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc raid1 mgag200 i2c_algo_bit drm_kms_helper kernel: ttm drm mpt2sas raid_class scsi_transport_sas kernel: CPU: 1 PID: 28347 Comm: kworker/u16:1 Tainted: G L 3.18.3-201.fc21.x86_64 #1 kernel: Hardware name: Dell Inc. PowerEdge R210 II/09T7VV, BIOS 2.2.3 10/25/2012 kernel: Workqueue: nfsd4 laundromat_main [nfsd] kernel: task: ffff8801351944b0 ti: ffff8800046cc000 task.ti: ffff8800046cc000 kernel: RIP: 0010:[<ffffffff8138f410>] [<ffffffff8138f410>] _atomic_dec_and_lock+0x0/0x70 kernel: RSP: 0018:ffff8800046cfcd0 EFLAGS: 00000296 kernel: RAX: ffff8800046cfd10 RBX: 0000000000000000 RCX: dead000000200200 kernel: RDX: ffff8800046cfd10 RSI: 0000000000000586 RDI: ffff8800046cfcc0 kernel: RBP: ffff8800046cfcf8 R08: ffff8800046cfd10 R09: 0000000180210001 kernel: R10: ffff88023fc5ac90 R11: ffff8801a3c28d90 R12: ffff88023fc5ac90 kernel: R13: ffff8801a3c28d90 R14: 0000000000000000 R15: ffff8800046cfcc8 kernel: FS: 0000000000000000(0000) GS:ffff88023fc40000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007f0b59d46410 CR3: 0000000001c11000 CR4: 00000000000427e0 kernel: Stack: kernel: ffffffffa018020a 0000000000000246 ffff8800046cfd10 ffff8800aced6800 kernel: ffff8800046cfcc0 ffff8800046cfd48 ffffffffa0180557 ffff88006d9e4d80 kernel: ffff8800046cfd10 ffff8800046cfd10 0000000015d50db1 ffff8800aced6800 kernel: Call Trace: kernel: [<ffffffffa018020a>] ? nfs4_put_stid+0x2a/0xe0 [nfsd] kernel: [<ffffffffa0180557>] __destroy_client+0xe7/0x160 [nfsd] kernel: [<ffffffffa01805f6>] expire_client+0x26/0x30 [nfsd] kernel: [<ffffffffa018099e>] laundromat_main+0x15e/0x490 [nfsd] kernel: [<ffffffff810b240d>] process_one_work+0x14d/0x400 kernel: [<ffffffff810b2d9b>] worker_thread+0x6b/0x4a0 kernel: [<ffffffff810b2d30>] ? rescuer_thread+0x2a0/0x2a0 kernel: [<ffffffff810b7fea>] kthread+0xea/0x100 kernel: [<ffffffff8109bf28>] ? do_exit+0x848/0xab0 kernel: [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0 kernel: [<ffffffff817646fc>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0 kernel: Code: 83 c2 01 48 89 df 48 63 d2 e8 fd 2e 01 00 48 63 35 1a a7 99 00 89 c2 39 c6 77 e0 5b 41 5c 5d c3 90 90 90 90 90 90 90 90 90 90 90 <55> 48 89 e5 41 54 53 8b 0f 83 f9 01 74 3a 8d 51 ff 89 c8 f0 0f kernel: INFO: rcu_sched self-detected stall on CPU { 1} (t=420006 jiffies g=4708124 c=4708123 q=0) kernel: Task dump for CPU 1: kernel: kworker/u16:1 R running task 0 28347 2 0x00000888 kernel: Workqueue: nfsd4 laundromat_main [nfsd] kernel: ffff8801351944b0 0000000015d50db1 ffff88023fc43d58 ffffffff810c4f58 kernel: 0000000000000001 ffffffff81c58a80 ffff88023fc43d78 ffffffff810c85bd kernel: ffff88023fc43d78 0000000000000002 ffff88023fc43da8 ffffffff810f6951 kernel: Call Trace: kernel: <IRQ> [<ffffffff810c4f58>] sched_show_task+0xa8/0x110 kernel: [<ffffffff810c85bd>] dump_cpu_task+0x3d/0x50 kernel: [<ffffffff810f6951>] rcu_dump_cpu_stacks+0x91/0xd0 kernel: [<ffffffff810fa647>] rcu_check_callbacks+0x497/0x710 kernel: [<ffffffff810ff93b>] update_process_times+0x4b/0x80 kernel: [<ffffffff8110f455>] tick_sched_handle.isra.19+0x25/0x60 kernel: [<ffffffff8110f5c5>] tick_sched_timer+0x45/0x80 kernel: [<ffffffff81100664>] __run_hrtimer+0x74/0x1d0 kernel: [<ffffffff8110f580>] ? tick_sched_do_timer+0x60/0x60 kernel: [<ffffffff81100a7f>] hrtimer_interrupt+0x10f/0x260 kernel: [<ffffffff8104c09e>] local_apic_timer_interrupt+0x3e/0x70 kernel: [<ffffffff81767553>] smp_apic_timer_interrupt+0x43/0x60 kernel: [<ffffffff8176563d>] apic_timer_interrupt+0x6d/0x80 kernel: <EOI> [<ffffffff81763e32>] ? _raw_spin_unlock_irqrestore+0x12/0x20 kernel: [<ffffffff810da178>] __wake_up+0x48/0x60 kernel: [<ffffffffa01802b5>] nfs4_put_stid+0xd5/0xe0 [nfsd] kernel: [<ffffffffa0180557>] __destroy_client+0xe7/0x160 [nfsd] kernel: [<ffffffffa01805f6>] expire_client+0x26/0x30 [nfsd] kernel: [<ffffffffa018099e>] laundromat_main+0x15e/0x490 [nfsd] kernel: [<ffffffff810b240d>] process_one_work+0x14d/0x400 kernel: [<ffffffff810b2d9b>] worker_thread+0x6b/0x4a0 kernel: [<ffffffff810b2d30>] ? rescuer_thread+0x2a0/0x2a0 kernel: [<ffffffff810b7fea>] kthread+0xea/0x100 kernel: [<ffffffff8109bf28>] ? do_exit+0x848/0xab0 kernel: [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0 kernel: [<ffffffff817646fc>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b7f00>] ? kthread_create_on_node+0x1b0/0x1b0
The following occurred on the NFS server, after an NFS client had a similar and needed to be hard-rebooted. Upon client reboot, when attempting to mount the same NFS shares. Client and Server now running 3.18.7-200.fc21.x86_64. kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [nfsd:1708] kernel: Modules linked in: cts rpcsec_gss_krb5 vhost_net vhost macvtap macvlan tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw raid10 intel_rapl x86_pkg_temp_thermal bnx2 coretemp iTCO_wdt iTCO_vendor_support lpc_ich dcdbas kvm_intel mfd_core ie31200_edac edac_core kvm ipmi_si ipmi_msghandler crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel wmi shpchp i2c_i801 nfsd auth_rpcgss nfs_acl lockd grace sunrpc raid1 mgag200 i2c_algo_bit drm_kms_helper ttm mpt2sas kernel: drm raid_class scsi_transport_sas kernel: CPU: 2 PID: 1708 Comm: nfsd Not tainted 3.18.7-200.fc21.x86_64 #1 kernel: Hardware name: Dell Inc. PowerEdge R210 II/09T7VV, BIOS 2.2.3 10/25/2012 kernel: task: ffff8802212b89d0 ti: ffff8800aee90000 task.ti: ffff8800aee90000 kernel: RIP: 0010:[<ffffffff81764a92>] [<ffffffff81764a92>] _raw_spin_unlock_irqrestore+0x12/0x20 kernel: RSP: 0018:ffff8800aee93bf8 EFLAGS: 00000292 kernel: RAX: ffffffffa0195710 RBX: 00ff880100000001 RCX: 0000000000000000 kernel: RDX: ffffffffa0195728 RSI: 0000000000000292 RDI: 0000000000000292 kernel: RBP: ffff8800aee93bf8 R08: 0000000000000000 R09: 0000000180210019 kernel: R10: ffff88023fc9ac90 R11: ffff8801c9da1c18 R12: 00ffffff00000001 kernel: R13: ffff88011f9542b8 R14: ffffffff811f508d R15: ffff8800aee93c08 kernel: FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007f1f06af0fa8 CR3: 0000000001c11000 CR4: 00000000000427e0 kernel: Stack: kernel: ffff8800aee93c38 ffffffff810da1c8 0000000000000246 ffff8800aee93c30 kernel: ffffffffa017d556 ffff8800aee93c80 ffff88007bade078 ffff880231c44600 kernel: ffff8800aee93c68 ffffffffa017d2b5 ffffffffa017d557 ffff8800aee93c80 kernel: Call Trace: kernel: [<ffffffff810da1c8>] __wake_up+0x48/0x60 kernel: [<ffffffffa017d556>] ? __destroy_client+0xe6/0x160 [nfsd] kernel: [<ffffffffa017d2b5>] nfs4_put_stid+0xd5/0xe0 [nfsd] kernel: [<ffffffffa017d557>] ? __destroy_client+0xe7/0x160 [nfsd] kernel: [<ffffffffa017d557>] __destroy_client+0xe7/0x160 [nfsd] kernel: [<ffffffffa017d5f6>] expire_client+0x26/0x30 [nfsd] kernel: [<ffffffffa017e908>] nfsd4_create_session+0x778/0x800 [nfsd] kernel: [<ffffffffa016f4af>] nfsd4_proc_compound+0x3ef/0x700 [nfsd] kernel: [<ffffffffa015beb3>] nfsd_dispatch+0xc3/0x210 [nfsd] kernel: [<ffffffffa00f7f32>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc] kernel: [<ffffffffa00f6e31>] svc_process_common+0x401/0x680 [sunrpc] kernel: [<ffffffffa00f71b7>] svc_process+0x107/0x170 [sunrpc] kernel: [<ffffffffa015b807>] nfsd+0x117/0x190 [nfsd] kernel: [<ffffffffa015b6f0>] ? nfsd_destroy+0x90/0x90 [nfsd] kernel: [<ffffffff810b801a>] kthread+0xea/0x100 kernel: [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0 kernel: [<ffffffff8176537c>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0 kernel: Code: 41 83 47 08 01 fb 66 66 90 66 66 90 eb cb 31 c0 eb cf e8 32 45 93 ff 90 90 66 66 66 66 90 55 48 89 e5 66 83 07 01 48 89 f7 57 9d <66> 66 90 66 90 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 kernel: INFO: rcu_sched self-detected stall on CPU { 2} (t=60000 jiffies g=4858111 c=4858110 q=0) kernel: Task dump for CPU 2: kernel: nfsd R running task 0 1708 2 0x00000888 kernel: ffff8802212b89d0 00000000f8ec8e32 ffff88023fc83d58 ffffffff810c4f88 kernel: 0000000000000002 ffffffff81c58a80 ffff88023fc83d78 ffffffff810c860d kernel: ffff88023fc83d78 0000000000000003 ffff88023fc83da8 ffffffff810f6a31 kernel: Call Trace: kernel: <IRQ> [<ffffffff810c4f88>] sched_show_task+0xa8/0x110 kernel: [<ffffffff810c860d>] dump_cpu_task+0x3d/0x50 kernel: [<ffffffff810f6a31>] rcu_dump_cpu_stacks+0x91/0xd0 kernel: [<ffffffff810fa727>] rcu_check_callbacks+0x497/0x710 kernel: [<ffffffff810ffa4b>] update_process_times+0x4b/0x80 kernel: [<ffffffff8110f595>] tick_sched_handle.isra.19+0x25/0x60 kernel: [<ffffffff8110f705>] tick_sched_timer+0x45/0x80 kernel: [<ffffffff81100774>] __run_hrtimer+0x74/0x1d0 kernel: [<ffffffff8110f6c0>] ? tick_sched_do_timer+0x60/0x60 kernel: [<ffffffff81100b8f>] hrtimer_interrupt+0x10f/0x260 kernel: [<ffffffff8104c0ce>] local_apic_timer_interrupt+0x3e/0x70 kernel: [<ffffffff817681d3>] smp_apic_timer_interrupt+0x43/0x60 kernel: [<ffffffff817662bd>] apic_timer_interrupt+0x6d/0x80 kernel: <EOI> [<ffffffff81764a92>] ? _raw_spin_unlock_irqrestore+0x12/0x20 kernel: [<ffffffff810da1c8>] __wake_up+0x48/0x60 kernel: [<ffffffffa017d1e4>] ? nfs4_put_stid+0x4/0xe0 [nfsd] kernel: [<ffffffffa017d2b5>] nfs4_put_stid+0xd5/0xe0 [nfsd] kernel: [<ffffffffa017d1e5>] ? nfs4_put_stid+0x5/0xe0 [nfsd] kernel: [<ffffffffa017d557>] __destroy_client+0xe7/0x160 [nfsd] kernel: [<ffffffffa017d5f6>] expire_client+0x26/0x30 [nfsd] kernel: [<ffffffffa017e908>] nfsd4_create_session+0x778/0x800 [nfsd] kernel: [<ffffffffa016f4af>] nfsd4_proc_compound+0x3ef/0x700 [nfsd] kernel: [<ffffffffa015beb3>] nfsd_dispatch+0xc3/0x210 [nfsd] kernel: [<ffffffffa00f7f32>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc] kernel: [<ffffffffa00f6e31>] svc_process_common+0x401/0x680 [sunrpc] kernel: [<ffffffffa00f71b7>] svc_process+0x107/0x170 [sunrpc] kernel: [<ffffffffa015b807>] nfsd+0x117/0x190 [nfsd] kernel: [<ffffffffa015b6f0>] ? nfsd_destroy+0x90/0x90 [nfsd] kernel: [<ffffffff810b801a>] kthread+0xea/0x100 kernel: [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0 kernel: [<ffffffff8176537c>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b7f30>] ? kthread_create_on_node+0x1b0/0x1b0
In both the original report and comment #1 the laundromat thread is stuck in __destroy_client. It's a little uncertain exactly where: in the original report we're in _atomic_dec_and_lock, and the "? nfs4_put_stid" suggests this could be one of the puts of a delegation stateid. In comment #1 we're probably again in one of those two nfs4_put_stid's, but I don't understand how you hang in _raw_spin_unlock_irqrestore in __wake_up. Maybe the state is corrupted somehow (e.g. maybe we're trying to free an already-freed delegation). Looking at the disassembled kernel might help figure out exactly where the hangs are. It would of course be useful to know if this becomes reliably reproduceable. Then we could at least bisect, or maybe add some debugging code.
Hey Bruce.. I was able to reproduce it last week by simulating callback channel failures using iptables nf_conntrack timeouts to block the callback. I'm at cthon today, so I don't have the exact setup on-hand. Let me see if I can get my local setup to blow up the same way. The process I used was to tune down nf_conntrack's expiry on the client -- first make sure there's ESTABLISHED,RELATED rules for NFS, then: echo 10 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established echo 10 > /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait Then take a read delegation, wait for the nf_conntrack tables to time out, then write to the file on the server to force a recall. I'll try to get that setup today and get a vmcore.
This is the likely fix: http://marc.info/?l=linux-nfs&m=142487180905534&w=2 Here's the discussion: http://marc.info/?l=linux-nfs&m=142482299023511&w=2 I'll test this early next week.
(In reply to Benjamin Coddington from comment #4) > This is the likely fix: > > http://marc.info/?l=linux-nfs&m=142487180905534&w=2 I'd love to see this patch hit Fedora 21's 3.18.8 kernel release, if possible.
I've added the referenced patch today. It isn't in the initial 3.18.8 build, but will be in any subsequent builds we do.
kernel-4.0.0-0.rc2.git0.1.fc22 has been submitted as an update for Fedora 22. https://admin.fedoraproject.org/updates/kernel-4.0.0-0.rc2.git0.1.fc22
Package kernel-4.0.0-0.rc2.git0.1.fc22: * should fix your issue, * was pushed to the Fedora 22 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing kernel-4.0.0-0.rc2.git0.1.fc22' as soon as you are able to, then reboot. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2015-3060/kernel-4.0.0-0.rc2.git0.1.fc22 then log in and leave karma (feedback).
kernel-3.18.9-200.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/kernel-3.18.9-200.fc21
kernel-3.18.9-100.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/kernel-3.18.9-100.fc20
kernel-4.0.0-0.rc3.git0.1.fc22 has been submitted as an update for Fedora 22. https://admin.fedoraproject.org/updates/kernel-4.0.0-0.rc3.git0.1.fc22
kernel-3.18.9-100.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.
kernel-3.18.9-200.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.
kernel-4.0.0-0.rc3.git0.1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.