Bug 1368577
Summary: kernel crash after a few hours/days with NFS 4.1 and 4.2 enabled

Product: Red Hat Enterprise Linux 7
Component: kernel
kernel sub component: NFS
Reporter: Rik Theys <rik.theys>
Assignee: J. Bruce Fields <bfields>
QA Contact: JianHong Yin <jiyin>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: bcodding, bfields, dhoward, eguan, fs-qe, jiyin, jlayton, swhiteho, xzhou, yoyang
Version: 7.2
Keywords: ZStream
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Linux
Fixed In Version: kernel-3.10.0-517.el7
Clone: 1390672 (view as bug list)
Last Closed: 2017-08-02 00:43:46 UTC
Type: Bug
Bug Blocks: 1390672

Attachments:
Created attachment 1192291 [details]
netconsole.log.xz - part 2/2
Second part of the netconsole.log.xz file.
I think this is actually several different bugs. The first WARN is interesting, but harmless.

The second one in rpc_exit_task is interesting. Perhaps we have a race of some sort when tearing down a backchannel transport? It seems like some of the RPC tasks might not have gotten torn down correctly.

The third stack trace is probably the problem that was fixed by 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream, or maybe a few of the other patches that went in around that time. I think Bruce has backported that, so you may want to try a more recent rhel7 kernel, as that's a clear memory corruptor.

The remaining softlockups are probably fallout from the problem that caused the 3rd stack trace.

Hi,

Thanks for your prompt reply.

(In reply to Jeff Layton from comment #3)
> I think this is actually several different bugs. The first WARN is
> interesting, but harmless.
>
> The second one in rpc_exit_task is interesting. Perhaps we have a race of
> some sort when tearing down a backchannel transport? It seems like some of
> the RPC tasks might not have gotten torn down correctly.
>
> The third stack trace is probably the problem that was fixed by
> 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream, or maybe a few of the
> other patches that went in around that time. I think Bruce has backported
> that, so you may want to try a more recent rhel7 kernel, as that's a clear
> memory corruptor.

3.10.0-327.28.3.el7 seems to be the latest available kernel, and compared to 28.2 it only contains a security fix? Where can I find more recent kernels?

> The remaining softlockups are probably fallout from the problem that caused
> the 3rd stack trace.

Any idea on the following messages?

receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8800bae57000 xid 570c89f3

We are seeing lots of them (but they don't seem to crash the box on their own).
The only thing I could find online about these messages is about exporting an XFS filesystem, and that bug should be fixed in the RHEL 7.2 kernel.

I have another netconsole capture from the same system and this includes a final PANIC message:

[243229.909545] Kernel panic - not syncing: Fatal exception
[243229.919826] drm_kms_helper: panic occurred, switching back to text console
[243229.920805] ------------[ cut here ]------------
[243229.921780] WARNING: at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5f/0x70()
[243229.922772] Modules linked in: netconsole fuse rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio intel_powerclamp coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw ixgbe gf128mul bnx2 glue_helper ablk_helper iTCO_wdt cryptd dm_service_time iTCO_vendor_support ptp pps_core mdio dca lpc_ich dcdbas mfd_core sg i7core_edac shpchp edac_core pcspkr ipmi_devintf ipmi_si ipmi_msghandler wmi acpi_power_meter acpi_cpufreq binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm crc_t10dif drm crct10dif_generic mpt2sas ata_piix crct10dif_pclmul scsi_transport_fc serio_raw raid_class libata i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[243229.933053] CPU: 0 PID: 3331 Comm: nfsd Tainted: G D W I ------------ 3.10.0-327.28.2.el7.x86_64 #1
[243229.934239] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[243229.935409] 0000000000000000 000000004a89d18c ffff88121b203d98 ffffffff81636553
[243229.936576] ffff88121b203dd0 ffffffff8107b200 0000000000000001 ffff88121b2167c0
[243229.937723] 000000010e7a9b5e ffff88091ba167c0 0000000000000000 ffff88121b203de0
[243229.938846] Call Trace:
[243229.939924] <IRQ> [<ffffffff81636553>] dump_stack+0x19/0x1b
[243229.941007] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[243229.942057] [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[243229.943076] [<ffffffff8104684f>] native_smp_send_reschedule+0x5f/0x70
[243229.944084] [<ffffffff810c650d>] trigger_load_balance+0x16d/0x200
[243229.945074] [<ffffffff810b7703>] scheduler_tick+0x103/0x150
[243229.946052] [<ffffffff810e08a0>] ? tick_sched_handle.isra.14+0x60/0x60
[243229.947033] [<ffffffff8108e926>] update_process_times+0x66/0x80
[243229.948104] [<ffffffff810e0865>] tick_sched_handle.isra.14+0x25/0x60
[243229.949075] [<ffffffff810e08e1>] tick_sched_timer+0x41/0x70
[243229.950040] [<ffffffff810a9d82>] __hrtimer_run_queues+0xd2/0x260
[243229.951004] [<ffffffff810aa320>] hrtimer_interrupt+0xb0/0x1e0
[243229.951974] [<ffffffff8164855c>] ? call_softirq+0x1c/0x30
[243229.952941] [<ffffffff810495c7>] local_apic_timer_interrupt+0x37/0x60
[243229.953915] [<ffffffff816491cf>] smp_apic_timer_interrupt+0x3f/0x60
[243229.954889] [<ffffffff8164789d>] apic_timer_interrupt+0x6d/0x80
[243229.955858] <EOI> [<ffffffffa0143853>] ? __drm_modeset_lock_all+0x33/0x110 [drm]
[243229.956868] [<ffffffff8162feb2>] ? panic+0x1a3/0x1e7
[243229.957854] [<ffffffff8163f55b>] oops_end+0x12b/0x150
[243229.958834] [<ffffffff8162f688>] no_context+0x280/0x2a3
[243229.959813] [<ffffffff8162f71e>] __bad_area_nosemaphore+0x73/0x1ca
[243229.960798] [<ffffffff8162f888>] bad_area_nosemaphore+0x13/0x15
[243229.961781] [<ffffffff816422be>] __do_page_fault+0x2de/0x450
[243229.962760] [<ffffffff81642453>] do_page_fault+0x23/0x80
[243229.963736] [<ffffffff8163e748>] page_fault+0x28/0x30
[243229.964718] [<ffffffffa0411478>] ? find_openstateowner_str_locked.isra.60+0x38/0x80 [nfsd]
[243229.965715] [<ffffffffa041149d>] ? find_openstateowner_str_locked.isra.60+0x5d/0x80 [nfsd]
[243229.966695] [<ffffffffa0416051>] nfsd4_process_open1+0x101/0x360 [nfsd]
[243229.967687] [<ffffffffa0405cd9>] nfsd4_open+0xe9/0x850 [nfsd]
[243229.968672] [<ffffffffa0406917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[243229.969653] [<ffffffffa03f212b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[243229.970635] [<ffffffffa02d8183>] svc_process_common+0x453/0x6f0 [sunrpc]
[243229.971605] [<ffffffffa02d8523>] svc_process+0x103/0x170 [sunrpc]
[243229.972544] [<ffffffffa03f1ab7>] nfsd+0xe7/0x150 [nfsd]
[243229.973456] [<ffffffffa03f19d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[243229.974344] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[243229.975200] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[243229.976038] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[243229.976849] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[243229.977638] ---[ end trace ea0c6327415cf4d3 ]---

I will compress and upload the entire capture.

Regards,
Rik

Created attachment 1192454 [details]
netconsole log of another crash - part 1/2
Created attachment 1192455 [details]
netconsole log of another crash - part 2/2
(In reply to Jeff Layton from comment #3)
> The third stack trace is probably the problem that was fixed by
> 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream, or maybe a few of the
> other patches that went in around that time. I think Bruce has backported
> that, so you may want to try a more recent rhel7 kernel, as that's a clear
> memory corruptor.

That patch was already in 3.10.0-325.

(In reply to Jeff Layton from comment #3)
> I think this is actually several different bugs. The first WARN is
> interesting, but harmless.

Well, it means we've got no read open at a point when we definitely should have one....

I wonder if this might be Oleg's 5cc1fb2a093e and/or feb9dad52092. Those aren't backported yet. I can build an rpm with those and some other more recent upstream bugfixes, though I'm not sure how to get that to Rik for testing.

See also bug 1369600. Possibly not strictly a dup, since you may be dealing with more than one problem here.

Rik, do you mind if I open this bug to the public?

(In reply to J. Bruce Fields from comment #9)
> Rik, do you mind if I open this bug to the public?

No problem, you can open it to the public.

Rik

Ahh, sorry, I missed the comments on this earlier this week...

The "unrecognized reply" just means that the kernel got a reply for an XID that it couldn't find. Often that means that the reply stream is misaligned in some fashion.

Hmm... maybe when we get something like that, we ought to close the connection? Getting a reply for a bogus XID would seem to be an indicator that something is out of whack there.

(In reply to J. Bruce Fields from comment #8)
> I wonder if this might be Oleg's 5cc1fb2a093e and/or feb9dad52092. Those
> aren't backported yet. I can build an rpm with those and some other more
> recent upstream bugfixes, though I'm not sure how to get that to Rik for
> testing.

Oh right. I had forgotten about that -- yeah, that does seem like it could be what's needed.

Hi,

We've disabled NFS 4.1 and 4.2 on the server (so only 4.0 is used now), and the system has been up for 9 days now (which is a new record for this system).

We do still see a lot (16104243 on Aug 27) of the "receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88120a983000 xid 0598188f" messages. Should the commits mentioned (5cc1fb2a093e and/or feb9dad52092) fix this issue? I will see if they apply cleanly to the 3.10.0-327.28.2.el7.x86_64 kernel.

Regards,
Rik

Hi,

It seems the two patches depend on other commits for the 3.10.0-327.28.3.el7.x86_64 kernel. I was unable to quickly determine which commits are probably missing.

Rik

(In reply to Rik Theys from comment #14)
> We've disabled NFS 4.1 and 4.2 on the server (so only 4.0 is used now), and
> the system has been up for 9 days now (which is a new record for this
> system).

I stand corrected. NFS 4.1 and 4.2 were only disabled last Friday, so the system has only been running without those versions for 3 days.

The system was not rebooted after disabling 4.1 and 4.2. Could that explain why the receive_cb_reply messages keep coming?

Rik

(In reply to Rik Theys from comment #16)
> The system was not rebooted after disabling 4.1 and 4.2. Could that explain
> why the receive_cb_reply messages keep coming?

You'd need to at least restart the nfs server for the change to take effect.

Also, I'm not sure off the top of my head whether 4.1/4.2 clients will deal well with the case of a server that reboots lacking support for the protocol version they were using before, so unmounting from the client and then remounting may be required.

(In reply to J. Bruce Fields from comment #17)
> You'd need to at least restart the nfs server for the change to take
> effect.
>
> Also, I'm not sure off the top of my head whether 4.1/4.2 clients will deal
> well with the case of a server that reboots lacking support for the protocol
> version they were using before, so unmounting from the client and then
> remounting may be required.

Yes, just checked this, and clients just error out with "protocol not supported" on any attempt to use the mountpoint.

You can also check /proc/fs/nfsd/versions to see if the change really took effect.

Hi,

(In reply to J. Bruce Fields from comment #18)
> Yes, just checked this, and clients just error out with "protocol not
> supported" on any attempt to use the mountpoint.
>
> You can also check /proc/fs/nfsd/versions to see if the change really took
> effect.

You're right. That file shows 4.1 and 4.2 are still enabled. Strange that the box hasn't crashed then (up for 23 days now).

Could this bug be related to the following (client side) bug?
https://bugzilla.redhat.com/show_bug.cgi?id=1371237

Regards,
Rik

(In reply to Rik Theys from comment #19)
> Could this bug be related to the following (client side) bug?
> https://bugzilla.redhat.com/show_bug.cgi?id=1371237

If any of your clients fall into the range affected by that bug, then, sure, I can imagine that might make it more likely to hit some race in the server. Of course, no behavior from the client, however bad, should cause the server to crash, so there's some server bug here regardless.

(In reply to J. Bruce Fields from comment #8)
> I wonder if this might be Oleg's 5cc1fb2a093e and/or feb9dad52092. Those
> aren't backported yet. I can build an rpm with those and some other more
> recent upstream bugfixes, though I'm not sure how to get that to Rik for
> testing.

And it sounds like Rik's having trouble reproducing. But I think we should apply those two patches regardless, as they may help, and are for serious problems that others managed to hit:

feb9dad5209280085d5b0c094fa67e7a8d75c81a "nfsd: Always lock state exclusively."
5cc1fb2a093e254b656c64ff24b0b76bed1d34d9 "nfsd: Extend the mutex holding region around in nfsd4_process_open2()"

And while we're at it, rhel7 is missing three more upstream fixes for state-related races that we should also apply:

885848186fbc2d1d8fb6d2fdc2156638ae289a46 "nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock"
42691398be08bd1fe99326911a0aa31f2c041d53 "nfsd: Fix race between FREE_STATEID and LOCK"
dd257933fa4b9fea66a1195f8a15111029810abc "nfsd: don't return an unhashed lock stateid after taking mutex"

Can we have a QE ack for this one?

Bruce, do you have a way to test this?

Hi,

Will all the required patches be part of RHEL 7.3, which is now in beta?

Regards,
Rik

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Patch(es) available on kernel-3.10.0-517.el7

*** Bug 1380594 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842
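The "unrecognized reply" behavior discussed in the comments comes down to XID matching: the server remembers each outstanding backchannel call by its RPC transaction ID, and a reply whose XID matches no pending call is logged by receive_cb_reply. The sketch below illustrates that idea only; the class and method names are hypothetical, not the kernel's sunrpc code.

```python
# Illustrative sketch of XID matching behind the receive_cb_reply message.
# Names (BackchannelTransport, send_call, receive_reply) are hypothetical.

class BackchannelTransport:
    def __init__(self):
        self.pending = {}  # xid -> description of an outstanding callback

    def send_call(self, xid, call):
        # Remember the call so a future reply can be matched by XID.
        self.pending[xid] = call

    def receive_reply(self, xid):
        call = self.pending.pop(xid, None)
        if call is None:
            # No outstanding call with this XID: the reply stream may be
            # misaligned. The kernel logs "Got unrecognized reply" here;
            # one proposal in the discussion above is to also close the
            # connection at this point.
            return f"receive_cb_reply: Got unrecognized reply: xid {xid:08x}"
        return f"matched reply for xid {xid:08x}: {call}"

xprt = BackchannelTransport()
xprt.send_call(0x570c89f2, "CB_RECALL")
print(xprt.receive_reply(0x570c89f2))  # matched
print(xprt.receive_reply(0x570c89f3))  # unknown XID -> "unrecognized reply"
```

A flood of these messages, as seen on this server, would then mean replies keep arriving for calls the server no longer (or never) had outstanding.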
Created attachment 1192290 [details]
netconsole.log.xz - part 1/2

Description of problem:

We upgraded an NFS file server to version 7 and enabled NFS 4.1 and 4.2 in /etc/sysconfig/nfs. Since the upgrade, the system keeps on crashing after a few hours/days of uptime. After a few hours it starts to log errors (see attachment) and seems to completely crash after a few hours/days. The attached log was captured using a netconsole to another system.

The log shows the following error at first:

[57229.919808] ------------[ cut here ]------------
[57229.919883] WARNING: at fs/nfsd/nfs4state.c:3853 nfsd4_process_open2+0xb72/0xf70 [nfsd]()
[57229.919963] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[57229.921319] CPU: 0 PID: 3380 Comm: nfsd Tainted: G I ------------ 3.10.0-327.28.2.el7.x86_64 #1
[57229.921378] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[57229.921424] 0000000000000000 00000000bceb2a6a ffff88120a1dfc28 ffffffff81636553
[57229.921505] ffff88120a1dfc60 ffffffff8107b200 ffff880e0b94d168 ffff8808f5a0e3e0
[57229.921566] ffff880b6106b878 ffff8803f0c68c00 0000000000000000 ffff88120a1dfc70
[57229.921625] Call Trace:
[57229.923339] [<ffffffff81636553>] dump_stack+0x19/0x1b
[57229.925046] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[57229.926723] [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[57229.928376] [<ffffffffa03dfe22>] nfsd4_process_open2+0xb72/0xf70 [nfsd]
[57229.930057] [<ffffffffa03cf14a>] nfsd4_open+0x55a/0x850 [nfsd]
[57229.931762] [<ffffffffa03cf917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[57229.933480] [<ffffffffa03bb12b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[57229.935152] [<ffffffffa0379183>] svc_process_common+0x453/0x6f0 [sunrpc]
[57229.936840] [<ffffffffa0379523>] svc_process+0x103/0x170 [sunrpc]
[57229.938501] [<ffffffffa03baab7>] nfsd+0xe7/0x150 [nfsd]
[57229.940171] [<ffffffffa03ba9d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[57229.941420] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[57229.942617] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[57229.943664] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[57229.944714] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[57229.945734] ---[ end trace 3c885cce7e01b16c ]---

A while later it starts spewing the following error:

[339281.304267] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8800bae57000 xid 570c89f3

The log shows this error 7760511 times.

Afterwards, the following warning is logged (281 times):

[340168.473391] ------------[ cut here ]------------
[340168.473416] WARNING: at net/sunrpc/sched.c:692 rpc_exit_task+0x85/0x90 [sunrpc]()
[340168.473457] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[340168.473471] CPU: 0 PID: 11768 Comm: kworker/0:1H Tainted: G W I ------------ 3.10.0-327.28.2.el7.x86_64 #1
[340168.473495] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[340168.473505] Workqueue: rpciod rpc_async_schedule [sunrpc]
[340168.473509] 0000000000000000 000000001abd1d5d ffff880a7523fd40 ffffffff81636553
[340168.473511] ffff880a7523fd78 ffffffff8107b200 ffff8811e1c55b00 ffff8812031ba600
[340168.473514] ffffffffa0372e00 ffffffffa0372e00 0000000000000020 ffff880a7523fd88
[340168.473515] Call Trace:
[340168.473523] [<ffffffff81636553>] dump_stack+0x19/0x1b
[340168.473528] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[340168.473558] [<ffffffffa0372e00>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[340168.473584] [<ffffffffa0372e00>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[340168.473586] [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[340168.473595] [<ffffffffa0372e85>] rpc_exit_task+0x85/0x90 [sunrpc]
[340168.473620] [<ffffffffa0373b84>] __rpc_execute+0x94/0x420 [sunrpc]
[340168.473628] [<ffffffffa0373f36>] rpc_async_schedule+0x26/0x30 [sunrpc]
[340168.473632] [<ffffffff8109d63b>] process_one_work+0x17b/0x470
[340168.473634] [<ffffffff8109e40b>] worker_thread+0x11b/0x400
[340168.473636] [<ffffffff8109e2f0>] ? rescuer_thread+0x400/0x400
[340168.473640] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[340168.473652] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[340168.473663] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[340168.473665] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[340168.473667] ---[ end trace 3c885cce7e01b16d ]---
[341612.361857] ------------[ cut here ]------------

After that it starts to log the following warning:

[505731.859397] ------------[ cut here ]------------
[505731.860230] WARNING: at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
[505731.860935] list_del corruption. prev->next should be ffff880b07bbc1b8, but was ffff880b07bbfde0
[505731.861652] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[505731.870718] CPU: 0 PID: 3377 Comm: nfsd Tainted: G W I ------------ 3.10.0-327.28.2.el7.x86_64 #1
[505731.871944] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[505731.873192] 0000000000000246 000000002b736c75 ffff88120509bc18 ffffffff81636553
[505731.874451] ffff88120509bc50 ffffffff8107b200 ffff880b07bbc1b8 ffff880b07bbc338
[505731.875595] 0000000002000000 ffff8808f5814000 ffff880a4c86d600 ffff88120509bcb8
[505731.877006] Call Trace:
[505731.878272] [<ffffffff81636553>] dump_stack+0x19/0x1b
[505731.879526] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[505731.880668] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[505731.882039] [<ffffffff81286410>] ? security_inode_getattr+0x20/0x30
[505731.883518] [<ffffffff8130cd11>] __list_del_entry+0xa1/0xd0
[505731.885030] [<ffffffffa03da9ab>] nfs4_unhash_openowner+0x1b/0x40 [nfsd]
[505731.886501] [<ffffffffa03d9d53>] nfs4_put_stateowner+0x33/0x60 [nfsd]
[505731.887999] [<ffffffffa03e024d>] nfsd4_cleanup_open_state+0x2d/0x70 [nfsd]
[505731.889422] [<ffffffffa03cef17>] nfsd4_open+0x327/0x850 [nfsd]
[505731.890919] [<ffffffffa03cf917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[505731.892148] [<ffffffffa03bb12b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[505731.893828] [<ffffffffa0379183>] svc_process_common+0x453/0x6f0 [sunrpc]
[505731.895387] [<ffffffffa0379523>] svc_process+0x103/0x170 [sunrpc]
[505731.896959] [<ffffffffa03baab7>] nfsd+0xe7/0x150 [nfsd]
[505731.898198] [<ffffffffa03ba9d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[505731.899744] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[505731.901179] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505731.902593] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[505731.904019] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505731.905475] ---[ end trace 3c885cce7e01b286 ]---

Immediately afterwards it starts logging CPU soft lockups such as the following. It logs a lot of them and at some point the log stops and the server is completely frozen and unresponsive.

[505778.541238] BUG: soft lockup - CPU#0 stuck for 22s! [nfsd:3355]
[505778.542735] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[505778.556995] CPU: 0 PID: 3355 Comm: nfsd Tainted: G W I ------------ 3.10.0-327.28.2.el7.x86_64 #1
[505778.558664] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[505778.560329] task: ffff880908976780 ti: ffff8811f4dcc000 task.ti: ffff8811f4dcc000
[505778.561977] RIP: 0010:[<ffffffffa03da470>] [<ffffffffa03da470>] find_openstateowner_str_locked.isra.60+0x30/0x80 [nfsd]
[505778.563656] RSP: 0018:ffff8811f4dcfcc8 EFLAGS: 00000216
[505778.565279] RAX: 00000000fffffffc RBX: 0000000000000000 RCX: 000000000000000b
[505778.566882] RDX: 000000006ce1ca60 RSI: ffff880e286040cc RDI: ffff880a6fe7c480
[505778.568457] RBP: ffff8811f4dcfce0 R08: 0000000000000059 R09: 00000000fffffffc
[505778.570006] R10: 0000000000000000 R11: ffff8808f7b543c0 R12: ffff880ffeb43100
[505778.571527] R13: ffff8811f4dcfcb8 R14: ffff8810a26b2d80 R15: ffff8811f4dcfc48
[505778.573019] FS: 0000000000000000(0000) GS:ffff88121b200000(0000) knlGS:0000000000000000
[505778.574494] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[505778.575941] CR2: 00007fd3e0f2d838 CR3: 000000000194a000 CR4: 00000000000007f0
[505778.577380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[505778.578868] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[505778.580271] Stack:
[505778.581659] ffff8808f7b543e0 ffff8808f7b55068 0000000000000061 ffff8811f4dcfd28
[505778.583084] ffffffffa03df051 0000000000000000 ffff880f0729f340 ffff8808f7b543e0
[505778.584511] ffff8808f7b55068 ffff8808f7b55068 ffff8808f6d60000 ffff881209161a00
[505778.585942] Call Trace:
[505778.587372] [<ffffffffa03df051>] nfsd4_process_open1+0x101/0x360 [nfsd]
[505778.588828] [<ffffffffa03cecd9>] nfsd4_open+0xe9/0x850 [nfsd]
[505778.590279] [<ffffffffa03cf917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[505778.591729] [<ffffffffa03bb12b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[505778.593185] [<ffffffffa0379183>] svc_process_common+0x453/0x6f0 [sunrpc]
[505778.594641] [<ffffffffa0379523>] svc_process+0x103/0x170 [sunrpc]
[505778.596085] [<ffffffffa03baab7>] nfsd+0xe7/0x150 [nfsd]
[505778.597524] [<ffffffffa03ba9d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[505778.598965] [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[505778.600396] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505778.601840] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[505778.603278] [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505778.604722] Code: 55 89 ff 48 c1 e7 04 48 89 e5 41 55 49 89 f5 41 54 4c 8d 24 3a 53 49 8b 1c 24 49 39 dc 75 14 eb 4a 66 2e 0f 1f 84 00 00 00 00 00 <48> 8b 1b 49 39 dc 74 38 80 bb 78 01 00 00 00 74 ef 8b 53 38 41

Version-Release number of selected component (if applicable):
kernel-3.10.0-327.28.2.el7.x86_64

How reproducible:

Steps to Reproduce:
1. Enable NFS 4.1 and 4.2 by setting the following variable in /etc/sysconfig/nfs:
   RPCNFSDARGS="-N2 -N3 -V4 -V4.1 -V4.2 -U -s"
2. Have a lot of clients (approx 150) use the server intensively. The clients are mostly Fedora 23/24 and are using NFS 4.2. A smaller set of clients are running CentOS 7.2 and are using NFS 4.1.
3.

Actual results:
Server crashes after a while

Expected results:
Server keeps on running

Additional info:
I had to split the xz-compressed netconsole log into two attachments as it's 34M and bugzilla only allows up to 20000KB.
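As discussed in the comments, changing RPCNFSDARGS only takes effect after restarting the NFS server, and /proc/fs/nfsd/versions shows which versions are actually enabled (entries prefixed with "+" are enabled, "-" disabled). A small sketch of checking that file; the sample string below is illustrative, not taken from the affected server:

```python
# Sketch: parse /proc/fs/nfsd/versions to verify which NFS versions nfsd
# is actually serving, e.g. after editing RPCNFSDARGS and restarting nfsd.
# Tokens look like "+4.1" (enabled) or "-4.2" (disabled).

def enabled_versions(versions_text):
    """Return the set of NFS versions the server reports as enabled."""
    return {tok[1:] for tok in versions_text.split() if tok.startswith("+")}

# Illustrative sample: server still advertising 4.1 and 4.2 despite the
# sysconfig change (the situation Rik hit before restarting nfsd).
sample = "-2 -3 +4 +4.1 +4.2"
print(sorted(enabled_versions(sample)))  # ['4', '4.1', '4.2']

# Against a live server you would read the real file instead:
# with open("/proc/fs/nfsd/versions") as f:
#     print(sorted(enabled_versions(f.read())))
```

If "4.1" or "4.2" still shows up with a "+" after a restart, the RPCNFSDARGS change did not take effect.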