Bug 1368577

Summary: kernel crash after a few hours/days with NFS 4.1 and 4.2 enabled
Product: Red Hat Enterprise Linux 7
Reporter: Rik Theys <rik.theys>
Component: kernel
Assignee: J. Bruce Fields <bfields>
kernel sub component: NFS
QA Contact: JianHong Yin <jiyin>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: bcodding, bfields, dhoward, eguan, fs-qe, jiyin, jlayton, swhiteho, xzhou, yoyang
Version: 7.2
Keywords: ZStream
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.10.0-517.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1390672 (view as bug list)
Environment:
Last Closed: 2017-08-02 00:43:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1390672    
Attachments:
netconsole.log.xz - part 1/2 (flags: none)
netconsole.log.xz - part 2/2 (flags: none)
netconsole log of another crash - part 1/2 (flags: none)
netconsole log of another crash - part 2/2 (flags: none)

Description Rik Theys 2016-08-19 20:22:56 UTC
Created attachment 1192290 [details]
netconsole.log.xz - part 1/2

Description of problem:

We upgraded an NFS file server to version 7 and enabled NFS 4.1 and 4.2 in /etc/sysconfig/nfs.

Since the upgrade, the system keeps crashing after a few hours to days of uptime: after a few hours it starts logging errors (see attachment), and after a few more hours or days it crashes completely.

The attached log was captured using a netconsole to another system.
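
For reference, netconsole is configured via kernel module parameters; a capture setup of this kind looks roughly like the following (the interface, IPs, MAC address and ports below are placeholders, not the actual values used here):

modprobe netconsole netconsole=6665@10.0.0.5/em1,6666@10.0.0.6/00:11:22:33:44:55
# on the receiving host:
nc -u -l 6666 | tee netconsole.log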

The log shows the following error at first:

[57229.919808] ------------[ cut here ]------------
[57229.919883] WARNING: at fs/nfsd/nfs4state.c:3853 nfsd4_process_open2+0xb72/0xf70 [nfsd]()
[57229.919963] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[57229.921319] CPU: 0 PID: 3380 Comm: nfsd Tainted: G          I    ------------   3.10.0-327.28.2.el7.x86_64 #1
[57229.921378] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[57229.921424]  0000000000000000 00000000bceb2a6a ffff88120a1dfc28 ffffffff81636553
[57229.921505]  ffff88120a1dfc60 ffffffff8107b200 ffff880e0b94d168 ffff8808f5a0e3e0
[57229.921566]  ffff880b6106b878 ffff8803f0c68c00 0000000000000000 ffff88120a1dfc70
[57229.921625] Call Trace:
[57229.923339]  [<ffffffff81636553>] dump_stack+0x19/0x1b
[57229.925046]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[57229.926723]  [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[57229.928376]  [<ffffffffa03dfe22>] nfsd4_process_open2+0xb72/0xf70 [nfsd]
[57229.930057]  [<ffffffffa03cf14a>] nfsd4_open+0x55a/0x850 [nfsd]
[57229.931762]  [<ffffffffa03cf917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[57229.933480]  [<ffffffffa03bb12b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[57229.935152]  [<ffffffffa0379183>] svc_process_common+0x453/0x6f0 [sunrpc]
[57229.936840]  [<ffffffffa0379523>] svc_process+0x103/0x170 [sunrpc]
[57229.938501]  [<ffffffffa03baab7>] nfsd+0xe7/0x150 [nfsd]
[57229.940171]  [<ffffffffa03ba9d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[57229.941420]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[57229.942617]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[57229.943664]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[57229.944714]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[57229.945734] ---[ end trace 3c885cce7e01b16c ]---

A while later it starts spewing the following error:

[339281.304267] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8800bae57000 xid 570c89f3

The log shows this error 7760511 times.
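(For reference, a count like that can be pulled straight from the netconsole capture, e.g. with something like "grep -c 'Got unrecognized reply' netconsole.log".)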

Afterwards, the following warning is logged (281 times):

[340168.473391] ------------[ cut here ]------------
[340168.473416] WARNING: at net/sunrpc/sched.c:692 rpc_exit_task+0x85/0x90 [sunrpc]()
[340168.473457] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[340168.473471] CPU: 0 PID: 11768 Comm: kworker/0:1H Tainted: G        W I    ------------   3.10.0-327.28.2.el7.x86_64 #1
[340168.473495] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[340168.473505] Workqueue: rpciod rpc_async_schedule [sunrpc]
[340168.473509]  0000000000000000 000000001abd1d5d ffff880a7523fd40 ffffffff81636553
[340168.473511]  ffff880a7523fd78 ffffffff8107b200 ffff8811e1c55b00 ffff8812031ba600
[340168.473514]  ffffffffa0372e00 ffffffffa0372e00 0000000000000020 ffff880a7523fd88
[340168.473515] Call Trace:
[340168.473523]  [<ffffffff81636553>] dump_stack+0x19/0x1b
[340168.473528]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[340168.473558]  [<ffffffffa0372e00>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[340168.473584]  [<ffffffffa0372e00>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[340168.473586]  [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[340168.473595]  [<ffffffffa0372e85>] rpc_exit_task+0x85/0x90 [sunrpc]
[340168.473620]  [<ffffffffa0373b84>] __rpc_execute+0x94/0x420 [sunrpc]
[340168.473628]  [<ffffffffa0373f36>] rpc_async_schedule+0x26/0x30 [sunrpc]
[340168.473632]  [<ffffffff8109d63b>] process_one_work+0x17b/0x470
[340168.473634]  [<ffffffff8109e40b>] worker_thread+0x11b/0x400
[340168.473636]  [<ffffffff8109e2f0>] ? rescuer_thread+0x400/0x400
[340168.473640]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[340168.473652]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[340168.473663]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[340168.473665]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[340168.473667] ---[ end trace 3c885cce7e01b16d ]---
[341612.361857] ------------[ cut here ]------------

After that it starts to log the following warning:

[505731.859397] ------------[ cut here ]------------
[505731.860230] WARNING: at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
[505731.860935] list_del corruption. prev->next should be ffff880b07bbc1b8, but was ffff880b07bbfde0
[505731.861652] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[505731.870718] CPU: 0 PID: 3377 Comm: nfsd Tainted: G        W I    ------------   3.10.0-327.28.2.el7.x86_64 #1
[505731.871944] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[505731.873192]  0000000000000246 000000002b736c75 ffff88120509bc18 ffffffff81636553
[505731.874451]  ffff88120509bc50 ffffffff8107b200 ffff880b07bbc1b8 ffff880b07bbc338
[505731.875595]  0000000002000000 ffff8808f5814000 ffff880a4c86d600 ffff88120509bcb8
[505731.877006] Call Trace:
[505731.878272]  [<ffffffff81636553>] dump_stack+0x19/0x1b
[505731.879526]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[505731.880668]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[505731.882039]  [<ffffffff81286410>] ? security_inode_getattr+0x20/0x30
[505731.883518]  [<ffffffff8130cd11>] __list_del_entry+0xa1/0xd0
[505731.885030]  [<ffffffffa03da9ab>] nfs4_unhash_openowner+0x1b/0x40 [nfsd]
[505731.886501]  [<ffffffffa03d9d53>] nfs4_put_stateowner+0x33/0x60 [nfsd]
[505731.887999]  [<ffffffffa03e024d>] nfsd4_cleanup_open_state+0x2d/0x70 [nfsd]
[505731.889422]  [<ffffffffa03cef17>] nfsd4_open+0x327/0x850 [nfsd]
[505731.890919]  [<ffffffffa03cf917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[505731.892148]  [<ffffffffa03bb12b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[505731.893828]  [<ffffffffa0379183>] svc_process_common+0x453/0x6f0 [sunrpc]
[505731.895387]  [<ffffffffa0379523>] svc_process+0x103/0x170 [sunrpc]
[505731.896959]  [<ffffffffa03baab7>] nfsd+0xe7/0x150 [nfsd]
[505731.898198]  [<ffffffffa03ba9d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[505731.899744]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[505731.901179]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505731.902593]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[505731.904019]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505731.905475] ---[ end trace 3c885cce7e01b286 ]---


Immediately afterwards it starts logging CPU soft lockups such as the one below. It logs a lot of them, and at some point the log stops and the server is completely frozen and unresponsive.

[505778.541238] BUG: soft lockup - CPU#0 stuck for 22s! [nfsd:3355]
[505778.542735] Modules linked in: netconsole ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio dcdbas iTCO_wdt intel_powerclamp iTCO_vendor_support bnx2 ixgbe coretemp kvm_intel kvm dm_service_time ptp pps_core mdio crc32_pclmul crc32c_intel ipmi_devintf ghash_clmulni_intel aesni_intel dca lrw gf128mul glue_helper ablk_helper shpchp pcspkr cryptd lpc_ich sg mfd_core i7core_edac edac_core wmi acpi_cpufreq ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm drm crc_t10dif mpt2sas ata_piix crct10dif_generic serio_raw crct10dif_pclmul libata scsi_transport_fc raid_class i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[505778.556995] CPU: 0 PID: 3355 Comm: nfsd Tainted: G        W I    ------------   3.10.0-327.28.2.el7.x86_64 #1
[505778.558664] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[505778.560329] task: ffff880908976780 ti: ffff8811f4dcc000 task.ti: ffff8811f4dcc000
[505778.561977] RIP: 0010:[<ffffffffa03da470>]  [<ffffffffa03da470>] find_openstateowner_str_locked.isra.60+0x30/0x80 [nfsd]
[505778.563656] RSP: 0018:ffff8811f4dcfcc8  EFLAGS: 00000216
[505778.565279] RAX: 00000000fffffffc RBX: 0000000000000000 RCX: 000000000000000b
[505778.566882] RDX: 000000006ce1ca60 RSI: ffff880e286040cc RDI: ffff880a6fe7c480
[505778.568457] RBP: ffff8811f4dcfce0 R08: 0000000000000059 R09: 00000000fffffffc
[505778.570006] R10: 0000000000000000 R11: ffff8808f7b543c0 R12: ffff880ffeb43100
[505778.571527] R13: ffff8811f4dcfcb8 R14: ffff8810a26b2d80 R15: ffff8811f4dcfc48
[505778.573019] FS:  0000000000000000(0000) GS:ffff88121b200000(0000) knlGS:0000000000000000
[505778.574494] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[505778.575941] CR2: 00007fd3e0f2d838 CR3: 000000000194a000 CR4: 00000000000007f0
[505778.577380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[505778.578868] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[505778.580271] Stack:
[505778.581659]  ffff8808f7b543e0 ffff8808f7b55068 0000000000000061 ffff8811f4dcfd28
[505778.583084]  ffffffffa03df051 0000000000000000 ffff880f0729f340 ffff8808f7b543e0
[505778.584511]  ffff8808f7b55068 ffff8808f7b55068 ffff8808f6d60000 ffff881209161a00
[505778.585942] Call Trace:
[505778.587372]  [<ffffffffa03df051>] nfsd4_process_open1+0x101/0x360 [nfsd]
[505778.588828]  [<ffffffffa03cecd9>] nfsd4_open+0xe9/0x850 [nfsd]
[505778.590279]  [<ffffffffa03cf917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[505778.591729]  [<ffffffffa03bb12b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[505778.593185]  [<ffffffffa0379183>] svc_process_common+0x453/0x6f0 [sunrpc]
[505778.594641]  [<ffffffffa0379523>] svc_process+0x103/0x170 [sunrpc]
[505778.596085]  [<ffffffffa03baab7>] nfsd+0xe7/0x150 [nfsd]
[505778.597524]  [<ffffffffa03ba9d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[505778.598965]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[505778.600396]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505778.601840]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[505778.603278]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[505778.604722] Code: 55 89 ff 48 c1 e7 04 48 89 e5 41 55 49 89 f5 41 54 4c 8d 24 3a 53 49 8b 1c 24 49 39 dc 75 14 eb 4a 66 2e 0f 1f 84 00 00 00 00 00 <48> 8b 1b 49 39 dc 74 38 80 bb 78 01 00 00 00 74 ef 8b 53 38 41 


Version-Release number of selected component (if applicable):
kernel-3.10.0-327.28.2.el7.x86_64

How reproducible:


Steps to Reproduce:
1. Enable NFS 4.1 and 4.2 by setting the following variable in /etc/sysconfig/nfs
RPCNFSDARGS="-N2 -N3 -V4 -V4.1 -V4.2 -U -s"
2. Have a lot of clients (approx. 150) use the server intensively. The clients are mostly Fedora 23/24 and use NFS 4.2; a smaller set of clients run CentOS 7.2 and use NFS 4.1.
3.

Actual results:
Server crashes after a while

Expected results:
Server keeps on running


Additional info:
I had to split the xz-compressed netconsole log into two attachments as it's 34M and bugzilla only allows up to 20000KB.
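The parts can simply be concatenated again before decompressing, e.g. something like

cat netconsole.log.xz.part1 netconsole.log.xz.part2 > netconsole.log.xz
xz -d netconsole.log.xz

(the part filenames above are only illustrative; use whatever names the attachments are saved under).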

Comment 1 Rik Theys 2016-08-19 20:24:45 UTC
Created attachment 1192291 [details]
netconsole.log.xz - part 2/2

Second part of the netconsole.log.xz file.

Comment 3 Jeff Layton 2016-08-19 23:16:41 UTC
I think this is actually several different bugs. The first WARN is interesting, but harmless.

The second one in rpc_exit_task is interesting. Perhaps we have a race of some sort when tearing down a backchannel transport? It seems like some of the RPC tasks might not have gotten torn down correctly.

The third stack trace is probably the problem that was fixed by 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream, or maybe a few of the other patches that went in around that time. I think Bruce has backported that, so you may want to try a more recent rhel7 kernel, as that's a clear memory corruptor.

The remaining softlockups are probably fallout from the problem that caused the 3rd stack trace.

Comment 4 Rik Theys 2016-08-20 09:54:37 UTC
Hi,

Thanks for your prompt reply.

(In reply to Jeff Layton from comment #3)
> I think this is actually several different bugs. The first WARN is
> interesting, but harmless.
> 
> The second one in rpc_exit_task is interesting. Perhaps we have a race of
> some sort when tearing down a backchannel transport? It seems like some of
> the RPC tasks might not have gotten torn down correctly.
> 
> The third stack trace is probably the problem that was fixed by
> 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream, or maybe a few of the
> other patches that went in around that time. I think Bruce has backported
> that, so you may want to try a more recent rhel7 kernel, as that's a clear
> memory corruptor.

3.10.0-327.28.3.el7 seems to be the latest available kernel, and compared to 28.2 it only contains a security fix. Where can I find more recent kernels?

> The remaining softlockups are probably fallout from the problem that caused
> the 3rd stack trace.

Any idea on the following messages:
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8800bae57000 xid 570c89f3

We are seeing lots of them (but they don't seem to crash the box on their own). The only thing I could find online about these messages concerns exporting an XFS filesystem, and that bug should already be fixed in the RHEL 7.2 kernel.

I have another netconsole capture from the same system and this includes a final PANIC message:

[243229.909545] Kernel panic - not syncing: Fatal exception
[243229.919826] drm_kms_helper: panic occurred, switching back to text console
[243229.920805] ------------[ cut here ]------------
[243229.921780] WARNING: at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5f/0x70()
[243229.922772] Modules linked in: netconsole fuse rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache 8021q garp stp mrp llc xt_set xt_multiport ip_set_hash_ip ip_set nfnetlink bonding ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter dm_snapshot dm_bufio intel_powerclamp coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw ixgbe gf128mul bnx2 glue_helper ablk_helper iTCO_wdt cryptd dm_service_time iTCO_vendor_support ptp pps_core mdio dca lpc_ich dcdbas mfd_core sg i7core_edac shpchp edac_core pcspkr ipmi_devintf ipmi_si ipmi_msghandler wmi acpi_power_meter acpi_cpufreq binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 sd_mod sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm crc_t10dif drm crct10dif_generic mpt2sas ata_piix crct10dif_pclmul scsi_transport_fc serio_raw raid_class libata i2c_core scsi_tgt scsi_transport_sas crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[243229.933053] CPU: 0 PID: 3331 Comm: nfsd Tainted: G      D W I    ------------   3.10.0-327.28.2.el7.x86_64 #1
[243229.934239] Hardware name: Dell Inc. PowerEdge R710/0MD99X, BIOS 6.3.0 07/24/2012
[243229.935409]  0000000000000000 000000004a89d18c ffff88121b203d98 ffffffff81636553
[243229.936576]  ffff88121b203dd0 ffffffff8107b200 0000000000000001 ffff88121b2167c0
[243229.937723]  000000010e7a9b5e ffff88091ba167c0 0000000000000000 ffff88121b203de0
[243229.938846] Call Trace:
[243229.939924]  <IRQ>  [<ffffffff81636553>] dump_stack+0x19/0x1b
[243229.941007]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[243229.942057]  [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
[243229.943076]  [<ffffffff8104684f>] native_smp_send_reschedule+0x5f/0x70
[243229.944084]  [<ffffffff810c650d>] trigger_load_balance+0x16d/0x200
[243229.945074]  [<ffffffff810b7703>] scheduler_tick+0x103/0x150
[243229.946052]  [<ffffffff810e08a0>] ? tick_sched_handle.isra.14+0x60/0x60
[243229.947033]  [<ffffffff8108e926>] update_process_times+0x66/0x80
[243229.948104]  [<ffffffff810e0865>] tick_sched_handle.isra.14+0x25/0x60
[243229.949075]  [<ffffffff810e08e1>] tick_sched_timer+0x41/0x70
[243229.950040]  [<ffffffff810a9d82>] __hrtimer_run_queues+0xd2/0x260
[243229.951004]  [<ffffffff810aa320>] hrtimer_interrupt+0xb0/0x1e0
[243229.951974]  [<ffffffff8164855c>] ? call_softirq+0x1c/0x30
[243229.952941]  [<ffffffff810495c7>] local_apic_timer_interrupt+0x37/0x60
[243229.953915]  [<ffffffff816491cf>] smp_apic_timer_interrupt+0x3f/0x60
[243229.954889]  [<ffffffff8164789d>] apic_timer_interrupt+0x6d/0x80
[243229.955858]  <EOI>  [<ffffffffa0143853>] ? __drm_modeset_lock_all+0x33/0x110 [drm]
[243229.956868]  [<ffffffff8162feb2>] ? panic+0x1a3/0x1e7
[243229.957854]  [<ffffffff8163f55b>] oops_end+0x12b/0x150
[243229.958834]  [<ffffffff8162f688>] no_context+0x280/0x2a3
[243229.959813]  [<ffffffff8162f71e>] __bad_area_nosemaphore+0x73/0x1ca
[243229.960798]  [<ffffffff8162f888>] bad_area_nosemaphore+0x13/0x15
[243229.961781]  [<ffffffff816422be>] __do_page_fault+0x2de/0x450
[243229.962760]  [<ffffffff81642453>] do_page_fault+0x23/0x80
[243229.963736]  [<ffffffff8163e748>] page_fault+0x28/0x30
[243229.964718]  [<ffffffffa0411478>] ? find_openstateowner_str_locked.isra.60+0x38/0x80 [nfsd]
[243229.965715]  [<ffffffffa041149d>] ? find_openstateowner_str_locked.isra.60+0x5d/0x80 [nfsd]
[243229.966695]  [<ffffffffa0416051>] nfsd4_process_open1+0x101/0x360 [nfsd]
[243229.967687]  [<ffffffffa0405cd9>] nfsd4_open+0xe9/0x850 [nfsd]
[243229.968672]  [<ffffffffa0406917>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[243229.969653]  [<ffffffffa03f212b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[243229.970635]  [<ffffffffa02d8183>] svc_process_common+0x453/0x6f0 [sunrpc]
[243229.971605]  [<ffffffffa02d8523>] svc_process+0x103/0x170 [sunrpc]
[243229.972544]  [<ffffffffa03f1ab7>] nfsd+0xe7/0x150 [nfsd]
[243229.973456]  [<ffffffffa03f19d0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[243229.974344]  [<ffffffff810a5b2f>] kthread+0xcf/0xe0
[243229.975200]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[243229.976038]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
[243229.976849]  [<ffffffff810a5a60>] ? kthread_create_on_node+0x140/0x140
[243229.977638] ---[ end trace ea0c6327415cf4d3 ]---

I will compress and upload the entire capture.

Regards,

Rik

Comment 5 Rik Theys 2016-08-20 09:58:58 UTC
Created attachment 1192454 [details]
netconsole log of another crash - part 1/2

Comment 6 Rik Theys 2016-08-20 09:59:47 UTC
Created attachment 1192455 [details]
netconsole log of another crash - part 2/2

Comment 7 J. Bruce Fields 2016-08-26 20:16:39 UTC
(In reply to Jeff Layton from comment #3)
> The third stack trace is probably the problem that was fixed by
> 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream, or maybe a few of the
> other patches that went in around that time. I think Bruce has backported
> that, so you may want to try a more recent rhel7 kernel, as that's a clear
> memory corruptor.

That patch was already in 3.10.0-325.

Comment 8 J. Bruce Fields 2016-08-26 20:26:37 UTC
(In reply to Jeff Layton from comment #3)
> I think this is actually several different bugs. The first WARN is
> interesting, but harmless.

Well, it means we've got no read open at a point when we definitely should have one....

I wonder if this might be Oleg's 5cc1fb2a093e and/or feb9dad52092.  Those aren't backported yet.  I can build an rpm with those and some other more recent upstream bugfixes, though I'm not sure how to get that to Rik for testing.

Comment 9 J. Bruce Fields 2016-08-26 20:34:29 UTC
See also bug 1369600.  Possibly not strictly a dup since you may be dealing with more than one problem here.

Rik, do you mind if I open this bug to the public?

Comment 11 Rik Theys 2016-08-27 11:04:45 UTC
(In reply to J. Bruce Fields from comment #9)

> 
> Rik, do you mind if I open this bug to the public?

No problem, you can open it to the public.

Rik

Comment 12 Jeff Layton 2016-08-27 11:55:33 UTC
Ahh sorry I missed the comments on this earlier this week...

The "unrecognized reply" just means that the kernel got a reply for an XID that it couldn't find. Often that means that the reply stream is misaligned in some fashion.

Hmm...maybe when we get something like that, we ought to close the connection? Getting a reply for a bogus XID would seem to be an indicator that something is out of whack there.

Comment 13 Jeff Layton 2016-08-27 11:58:26 UTC
(In reply to J. Bruce Fields from comment #8)
> (In reply to Jeff Layton from comment #3)
> > I think this is actually several different bugs. The first WARN is
> > interesting, but harmless.
> 
> Well, it means we've got no read open at a point when we definitely should
> have one....
> 
> I wonder if this might be Oleg's 5cc1fb2a093e and/or feb9dad52092.  Those
> aren't backported yet.  I can build an rpm with those and some other more
> recent upstream bugfixes, though I'm not sure how to get that to Rik for
> testing.

Oh right. I had forgotten about that -- yeah that does seem like it could be what's needed.

Comment 14 Rik Theys 2016-08-29 07:59:26 UTC
Hi,

We've disabled NFS 4.1 and 4.2 on the server (so only 4.0 is used now), and the system has been up for 9 days now (which is a new record for this system).

We do still see a lot (16104243 on Aug 27) of the "receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88120a983000 xid 0598188f" messages.

Should the commits mentioned (5cc1fb2a093e and/or feb9dad52092) fix this issue? I will see if they apply cleanly to the 3.10.0-327.28.2.el7.x86_64 kernel.
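
Roughly, the plan is to export the patches from an upstream clone and dry-run them against the RHEL kernel source, something like

git format-patch -1 feb9dad52092
git format-patch -1 5cc1fb2a093e
patch -p1 --dry-run < 0001-nfsd-Always-lock-state-exclusively.patch

(the patch filename is just whatever git format-patch produces).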

Regards,

Rik

Comment 15 Rik Theys 2016-08-29 08:11:01 UTC
Hi,

It seems the two patches depend on other commits that are missing from the 3.10.0-327.28.3.el7.x86_64 kernel.

I was unable to quickly determine which commits are probably missing.

Rik

Comment 16 Rik Theys 2016-08-29 08:23:28 UTC
(In reply to Rik Theys from comment #14)
> We've disabled NFS 4.1 and 4.2 on the server (so only 4.0 is used now), and
> the system has been up for 9 days now (which is a new record for this
> system).

I stand corrected. NFS 4.1 and 4.2 were only disabled last Friday, so the system has only been running without those versions for 3 days.

The system was not rebooted after disabling 4.1 and 4.2. Could that explain why the receive_cb_reply messages keep coming?

Rik

Comment 17 J. Bruce Fields 2016-09-06 14:40:09 UTC
(In reply to Rik Theys from comment #16)
> (In reply to Rik Theys from comment #14)
> > We've disabled NFS 4.1 and 4.2 on the server (so only 4.0 is used now), and
> > the system has been up for 9 days now (which is a new record for this
> > system).
> 
> I stand corrected. The 4.1 and 4.2 was only disabled last Friday so the
> system has only been running without those versions for 3 days.
> 
> The system was not rebooted after disabling 4.1 and 4.2. Could that explain
> why the receive_cb_reply messages keep coming?

You'd need to at least restart the nfs server for the change to take effect.
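(On RHEL 7 that is typically something like "systemctl restart nfs-server"; the exact unit name can vary with the setup.)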

Also, I'm not sure off the top of my head whether 4.1/4.2 clients will deal well with the case of a server that reboots lacking support for the protocol version they were using before, so unmounting from the client and then remounting may be required.

Comment 18 J. Bruce Fields 2016-09-09 21:13:51 UTC
(In reply to J. Bruce Fields from comment #17)
> You'd need to at least restart the nfs server for the change to take
> effect.
> 
> Also, I'm not sure off the top of my head whether 4.1/4.2 clients will deal
> well with the case of a server that reboots lacking support for the protocol
> version they were using before, so unmounting from the client and then
> remounting may be required.

Yes, just checked this, and clients just error out with "protocol not supported" on any attempt to use the mountpoint.

You can also check /proc/fs/nfsd/versions to see if the change really took effect.
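
For example (the output shown is only illustrative and depends on the RPCNFSDARGS in use):

# cat /proc/fs/nfsd/versions
-2 -3 +4 +4.1 +4.2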

Comment 19 Rik Theys 2016-09-12 07:04:40 UTC
Hi,

(In reply to J. Bruce Fields from comment #18)
> (In reply to J. Bruce Fields from comment #17)
> > You'd need to at least restart the nfs server for the change to take
> > effect.
> > 
> > Also, I'm not sure off the top of my head whether 4.1/4.2 clients will deal
> > well with the case of a server that reboots lacking support for the protocol
> > version they were using before, so unmounting from the client and then
> > remounting may be required.
> 
> Yes, just checked this, and clients just error out with "protocol not
> supported" on any attempt to use the mountpoint.
> 
> You can also check /proc/fs/nfsd/versions to see if the change really took
> effect.

You're right. That file shows 4.1 and 4.2 are still enabled. Strange that the box hasn't crashed then (up for 23 days now).

Could this bug be related to the following (client side) bug?
https://bugzilla.redhat.com/show_bug.cgi?id=1371237

Regards,

Rik

Comment 20 J. Bruce Fields 2016-09-28 19:58:39 UTC
(In reply to Rik Theys from comment #19)
> Could this bug be related to the following (client side) bug?
> https://bugzilla.redhat.com/show_bug.cgi?id=1371237

If any of your clients fall into the range affected by that bug, then, sure, I can imagine that might make it more likely to hit some race in the server.

Of course, no behavior from the client, however bad, should cause the server to crash, so there's some server bug here regardless.

Comment 21 J. Bruce Fields 2016-09-29 20:05:41 UTC
(In reply to J. Bruce Fields from comment #8)
> I wonder if this might be Oleg's 5cc1fb2a093e and/or feb9dad52092.  Those
> aren't backported yet.  I can build an rpm with those and some other more
> recent upstream bugfixes, though I'm not sure how to get that to Rik for
> testing.

And it sounds like Rik's having trouble reproducing.

But I think we should apply those two patches regardless, as they may help, and are for serious problems that others managed to hit:

feb9dad5209280085d5b0c094fa67e7a8d75c81a "nfsd: Always lock state exclusively."
5cc1fb2a093e254b656c64ff24b0b76bed1d34d9 "nfsd: Extend the mutex holding region around in nfsd4_process_open2()"

And while we're at it, rhel7 is missing three more upstream fixes for state-related races that we should also apply:

885848186fbc2d1d8fb6d2fdc2156638ae289a46 "nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock"
42691398be08bd1fe99326911a0aa31f2c041d53 "nfsd: Fix race between FREE_STATEID and LOCK"
dd257933fa4b9fea66a1195f8a15111029810abc "nfsd: don't return an unhashed lock stateid after taking mutex"
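
(A quick way to check whether a given kernel build already carries one of these is to grep its RPM changelog, e.g. something like "rpm -q --changelog kernel | grep -i nfsd"; the exact changelog wording depends on how the backport is recorded.)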

Comment 24 Steve Whitehouse 2016-10-03 10:58:00 UTC
Can we have a QE ack for this one? Bruce, do you have a way to test this?

Comment 25 Rik Theys 2016-10-06 14:40:47 UTC
Hi,

Will all the required patches be part of RHEL 7.3, which is now in beta?

Regards,

Rik

Comment 27 Rafael Aquini 2016-11-01 00:29:46 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 29 Rafael Aquini 2016-11-01 14:37:29 UTC
Patch(es) available on kernel-3.10.0-517.el7

Comment 34 Steve Dickson 2017-04-08 16:02:55 UTC
*** Bug 1380594 has been marked as a duplicate of this bug. ***

Comment 44 errata-xmlrpc 2017-08-02 00:43:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842

Comment 45 errata-xmlrpc 2017-08-02 01:38:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842