Bug 1233284

Summary:	RHEL7: repeated NFS4 server untainted kernel panic with RIP locks_in_grace called from nfsd4_process_open2, xfs used as export for diskless NFS clients
Product:	Red Hat Enterprise Linux 7	Reporter:	Dave Wysochanski <dwysocha>
Component:	kernel	Assignee:	J. Bruce Fields <bfields>
kernel sub component:	NFS	QA Contact:	JianHong Yin <jiyin>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	high	CC:	bfields, cww, eguan, fsorenso, fs-qe, jiyin, jlayton, plambri, smayhew, steved, swhiteho, tlavigne, vaggarwa
Version:	7.1
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	kernel-3.10.0-325.el7	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-11-19 22:42:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1133060

Description Dave Wysochanski 2015-06-18 15:18:23 UTC

Description of problem:
* Prior to the panic, a number of messages indicate that an invalid response was received on an NFS4 callback channel (`receive_cb_reply: Got unrecognized reply`), along with `client 1.2.3.4 testing state ID with incorrect client ID`
~~~
[ 6588.914463] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880231a91800 xid 8aa8279c
[ 6644.081946] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880231a91800 xid 8aa82811
[ 6662.894826] NFSD: client 1.2.3.4 testing state ID with incorrect client ID
[ 6663.557121] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8801abcb5000 xid e823de78
[ 6678.944874] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88007869f800 xid 5c262126
[ 6695.660586] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8801564e3800 xid e88fabdd
~~~
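To judge how widespread the callback noise is, the messages can be tallied per backchannel transport. The following is only a rough sketch: the regular expression is derived from the message format shown above, and the function name `count_unrecognized_replies` is made up for illustration.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
LOG = """\
[ 6588.914463] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880231a91800 xid 8aa8279c
[ 6644.081946] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880231a91800 xid 8aa82811
[ 6662.894826] NFSD: client 1.2.3.4 testing state ID with incorrect client ID
[ 6663.557121] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8801abcb5000 xid e823de78
"""

# Pattern inferred from the log lines; not taken from the kernel source.
PATTERN = re.compile(
    r"receive_cb_reply: Got unrecognized reply: "
    r"calldir (?P<calldir>0x[0-9a-f]+) "
    r"xpt_bc_xprt (?P<xprt>[0-9a-f]+) "
    r"xid (?P<xid>[0-9a-f]+)"
)


def count_unrecognized_replies(text):
    """Return a Counter mapping backchannel transport address to message count."""
    counts = Counter()
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[match.group("xprt")] += 1
    return counts


counts = count_unrecognized_replies(LOG)
for xprt, n in counts.most_common():
    print(xprt, n)
```

Feeding the full `dmesg` output through the same function would show whether the messages cluster on a few transports or are spread across many clients.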

* Next, a warning occurs indicating that `__list_add` has detected a corrupted list, called from `hash_delegation_locked`
~~~
[ 7866.957210] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880080868000 xid 84e07778
[ 7878.099182] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8801394eb000 xid b31652cb
[ 7882.883142] ------------[ cut here ]------------
[ 7882.883156] WARNING: at lib/list_debug.c:29 __list_add+0x65/0xc0()
[ 7882.883157] list_add corruption. next->prev should be prev (ffff880230a5f068), but was ffff8801abc65868. (next=ffff8801abc65868).
[ 7882.883159] Modules linked in: nfsv3 nfs fscache binfmt_misc ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables ext4 mbcache jbd2 coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ppdev serio_raw vmw_balloon vmw_vmci i2c_piix4 parport_pc parport shpchp nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc_t10dif crct10dif_common vmwgfx
[ 7882.883231]  drm_kms_helper ttm ahci ata_piix libahci drm i2c_core vmxnet3 libata vmw_pvscsi floppy dm_mirror dm_region_hash dm_log dm_mod
[ 7882.883254] CPU: 3 PID: 2773 Comm: nfsd Not tainted 3.10.0-229.4.2.el7.x86_64 #1
[ 7882.883255] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
[ 7882.883257]  ffff8800b5757bd8 00000000385971d4 ffff8800b5757b90 ffffffff816042d6
[ 7882.883259]  ffff8800b5757bc8 ffffffff8106e28b ffff880147a637c0 ffff8801abc65868
[ 7882.883261]  ffff880230a5f068 ffff8800b54443e0 ffff8801bee93a50 ffff8800b5757c30
[ 7882.883263] Call Trace:
[ 7882.883270]  [<ffffffff816042d6>] dump_stack+0x19/0x1b
[ 7882.883275]  [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
[ 7882.883277]  [<ffffffff8106e32c>] warn_slowpath_fmt+0x5c/0x80
[ 7882.883281]  [<ffffffff812d623b>] ? idr_alloc_cyclic+0x2b/0x60
[ 7882.883283]  [<ffffffff812ed685>] __list_add+0x65/0xc0
[ 7882.883294]  [<ffffffffa033278a>] hash_delegation_locked+0x3a/0x40 [nfsd]
[ 7882.883300]  [<ffffffffa0338b29>] nfsd4_process_open2+0x979/0xfc0 [nfsd]
[ 7882.883306]  [<ffffffffa0327a8a>] nfsd4_open+0x55a/0x850 [nfsd]
[ 7882.883311]  [<ffffffffa0328257>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[ 7882.883316]  [<ffffffffa0313e1b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[ 7882.883328]  [<ffffffffa02d9b33>] svc_process_common+0x453/0x6f0 [sunrpc]
[ 7882.883336]  [<ffffffffa02d9ed3>] svc_process+0x103/0x170 [sunrpc]
[ 7882.883340]  [<ffffffffa03137a7>] nfsd+0xe7/0x150 [nfsd]
[ 7882.883345]  [<ffffffffa03136c0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 7882.883347]  [<ffffffff8109726f>] kthread+0xcf/0xe0
[ 7882.883349]  [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
[ 7882.883352]  [<ffffffff816140bc>] ret_from_fork+0x7c/0xb0
[ 7882.883353]  [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
[ 7882.883355] ---[ end trace 2c830e49e095cf84 ]---
~~~

* Finally, a kernel panic occurs on the NFS server, with RIP `locks_in_grace`
~~~
[ 7893.172876] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880231a96000 xid 7cff4b0e
[ 7894.631260] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff8801abc65000 xid f47a468f
[ 7894.799450] general protection fault: 0000 [#1] SMP
[ 7894.799519] Modules linked in: fuse btrfs zlib_deflate raid6_pq xor vfat msdos fat nfsv3 nfs fscache binfmt_misc ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables ext4 mbcache jbd2 coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ppdev serio_raw vmw_balloon vmw_vmci i2c_piix4 parport_pc parport shpchp nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom ata_generic
[ 7894.799778]  pata_acpi sd_mod crc_t10dif crct10dif_common vmwgfx drm_kms_helper ttm ahci ata_piix libahci drm i2c_core vmxnet3 libata vmw_pvscsi floppy dm_mirror dm_region_hash dm_log dm_mod
[ 7894.799843] CPU: 3 PID: 2763 Comm: nfsd Tainted: G        W   --------------   3.10.0-229.4.2.el7.x86_64 #1
[ 7894.799870] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014
[ 7894.799899] task: ffff8800bb382d80 ti: ffff88023342c000 task.ti: ffff88023342c000
[ 7894.799920] RIP: 0010:[<ffffffffa0298620>]  [<ffffffffa0298620>] locks_in_grace+0x30/0x50 [lockd]
[ 7894.799953] RSP: 0018:ffff88023342fc78  EFLAGS: 00010202
[ 7894.799968] RAX: 002fffff0002002c RBX: ffff880182cb6000 RCX: 0000000000000001
[ 7894.799988] RDX: 002fffff000200d4 RSI: 0000000000000000 RDI: ffffea0004cea100
[ 7894.800008] RBP: ffff88023342fd28 R08: ffff880230a5f000 R09: 0000000000000000
[ 7894.800028] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 7894.800047] R13: 0000000000000001 R14: ffff8802315303e0 R15: ffff8801bee93a50
[ 7894.800071] FS:  0000000000000000(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
[ 7894.800093] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7894.800110] CR2: 00007f671823b7b0 CR3: 00000000bb199000 CR4: 00000000001407e0
[ 7894.800165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7894.800216] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7894.800237] Stack:
[ 7894.800245]  ffffffffa0338999 ffff880200000001 ffff880222d84a50 ffff880230a5f000
[ 7894.800270]  ffff880231531068 ffff880231531000 0000000000000008 0000000000000000
[ 7894.800294]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 7894.800318] Call Trace:
[ 7894.800350]  [<ffffffffa0338999>] ? nfsd4_process_open2+0x7e9/0xfc0 [nfsd]
[ 7894.800375]  [<ffffffffa0327a8a>] nfsd4_open+0x55a/0x850 [nfsd]
[ 7894.800396]  [<ffffffffa0328257>] nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[ 7894.800419]  [<ffffffffa0313e1b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[ 7894.800448]  [<ffffffffa02d9b33>] svc_process_common+0x453/0x6f0 [sunrpc]
[ 7894.800473]  [<ffffffffa02d9ed3>] svc_process+0x103/0x170 [sunrpc]
[ 7894.800494]  [<ffffffffa03137a7>] nfsd+0xe7/0x150 [nfsd]
[ 7894.800513]  [<ffffffffa03136c0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 7894.800534]  [<ffffffff8109726f>] kthread+0xcf/0xe0
[ 7894.800550]  [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
[ 7894.800571]  [<ffffffff816140bc>] ret_from_fork+0x7c/0xb0
[ 7894.800587]  [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
[ 7894.800606] Code: 8b 05 45 87 00 00 85 c0 48 8b 97 e8 0c 00 00 74 28 3b 02 77 24 83 e8 01 48 98 48 8b 44 c2 18 48 85 c0 74 17 48 8d 90 a8 00 00 00 <48> 39 90 a8 00 00 00 0f 95 c0 0f b6 c0 c3 0f 0b 55 48 89 e5 e8
[ 7894.800710] RIP  [<ffffffffa0298620>] locks_in_grace+0x30/0x50 [lockd]
[ 7894.800735]  RSP <ffff88023342fc78>
~~~
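One possible reading of the register dump, offered as a sketch rather than a confirmed analysis: the faulting instruction bytes `<48> 39 90 a8 00 00 00` decode to a compare against memory at `RAX+0xa8`, and the value in RAX does not look like a valid kernel pointer. The check below assumes 48-bit virtual addressing, where an address is canonical only if bits 63 through 47 are all equal. The helper name `is_canonical_x86_64` is made up for illustration.

```python
def is_canonical_x86_64(addr, va_bits=48):
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    Assumes va_bits of virtual address space (48 for 4-level paging).
    """
    top = addr >> (va_bits - 1)                 # bits 63..47 when va_bits is 48
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


# Register values copied from the oops above.
RAX = 0x002FFFFF0002002C
RDX = 0x002FFFFF000200D4
RSP = 0xFFFF88023342FC78

print(hex(RDX - RAX))                 # offset used by the faulting compare
print(is_canonical_x86_64(RAX))       # the dereferenced base address
print(is_canonical_x86_64(RSP))       # an ordinary kernel stack address
```

Under that assumption RAX is non-canonical, which would be consistent with the oops being reported as a general protection fault and not as a page fault, and with `locks_in_grace` having followed a pointer out of already-corrupted memory.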


Version-Release number of selected component (if applicable):
* Red Hat Enterprise Linux 7 (NFS server)
  * seen on 3.10.0-229.4.2.el7
* NFS4
* xfs in use as exported filesystem
* exported filesystem used for diskless NFS clients with read-only root


How reproducible:
Seen multiple times by the same customer.
The customer claims this has only happened when using xfs.  Previously the exported filesystem was using ext4.

Steps to Reproduce:
TBD


Additional info:
We have a vmcore and other data from the customer case, which I will attach to this bug.

Comment 2 J. Bruce Fields 2015-06-19 16:30:56 UTC

I can't find a kernel-3.10.0-229.4.2.el7 tag in any of the usual repositories.  Where's it from?

Anyway, so the "unrecognized reply" messages mean we got a reply with an xid that doesn't match any on the &xprt->recv list.  That means no such request was transmitted (by xprt_transmit()).  Or it was removed by xprt_release or xprt_complete_rqst.  I *think* this could happen just because we gave up waiting for the reply.  (In which case maybe that printk should be a dprintk.)  But it's a sign there may be lots of delegation recalls going on.
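The xid matching described above can be pictured with a toy model. This is not the sunrpc code: the `Transport` class and its methods are hypothetical stand-ins, with a dictionary playing the role of the `&xprt->recv` list, and `CB_RECALL` is used only as an example request label.

```python
class Transport:
    """Toy model of an RPC transport's list of outstanding requests."""

    def __init__(self):
        # xid -> request description; stands in for the xprt->recv list.
        self.recv = {}

    def transmit(self, xid, request):
        # A request can only be matched once it has been transmitted.
        self.recv[xid] = request

    def release(self, xid):
        # Models giving up on a request, for example after a timeout.
        self.recv.pop(xid, None)

    def receive_reply(self, xid):
        # Complete the matching request, or report an unrecognized reply.
        request = self.recv.pop(xid, None)
        if request is None:
            return f"Got unrecognized reply: xid {xid:08x}"
        return f"completed {request}"


xprt = Transport()
xprt.transmit(0x8AA8279C, "CB_RECALL")
xprt.release(0x8AA8279C)                 # server stops waiting for the reply
late = xprt.receive_reply(0x8AA8279C)    # the client's reply arrives afterwards
print(late)
```

In this model a late reply is harmless by itself, which fits the suggestion that the message might be better as a dprintk.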

"testing state ID with incorrect client ID" means the server thinks a TEST_STATEID op was sent for a stateid associated with a client different from the client associated with the session over which the TEST_STATEID was sent.  Perhaps this could be the result of some confusion in the server's data structures but the most straightforward explanation would be just that that's really what the client did (perhaps as a result of a bug in client recovery code?)  Again, maybe this should be a dprintk.

So the list corruption warning is the first really interesting thing.  We tried to add a new delegation to either the per-file or per-client list and found the list was corrupted (the list head still points to the first item, but that item points back to itself?).
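That reading matches the warning text, where the "but was" address and the "next=" address are both ffff8801abc65868. A small model of the debug check shows how such a state produces exactly that shape of message. This is a sketch only: `Node` and `checked_list_add` are hypothetical stand-ins for `struct list_head` and the checked `__list_add`, and how the entry came to point at itself is an assumption made for the example, not something established from the vmcore.

```python
class Node:
    """Minimal stand-in for the kernel's struct list_head."""

    def __init__(self, name):
        self.name = name
        self.next = self   # like INIT_LIST_HEAD: an empty entry points to itself
        self.prev = self


def checked_list_add(new, prev, next_):
    """Insert new between prev and next_, collecting warnings on mismatch."""
    warnings = []
    if next_.prev is not prev:
        warnings.append(
            f"list_add corruption. next->prev should be prev ({prev.name}), "
            f"but was {next_.prev.name}. (next={next_.name})."
        )
    if prev.next is not next_:
        warnings.append(
            f"list_add corruption. prev->next should be next ({next_.name}), "
            f"but was {prev.next.name}. (prev={prev.name})."
        )
    # Like the kernel check modelled here, warn and then link anyway.
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new
    return warnings


head = Node("head")
first = Node("first")
healthy = checked_list_add(first, head, head.next)   # no warnings expected

# Assumed corruption: "first" is re-initialised while still linked, so
# head.next still reaches it but it now points only at itself.
first.next = first
first.prev = first

new = Node("new")
warnings = checked_list_add(new, head, head.next)
for line in warnings:
    print(line)
```

An entry that is unhashed or re-initialised twice while the list head still references it would leave the list in this state, which is one way such a warning could arise.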

Comment 3 Dave Wysochanski 2015-06-19 17:02:26 UTC

(In reply to J. Bruce Fields from comment #2)
> I can't find a kernel-3.10.0-229.4.2.el7 tag in any of the usual
> repositories.  Where's it from?
> 
It's one of our CVE kernels.  There's only one change past the 229.7.1.el7 kernel and it is to fix a pipe corruption:

* Fri May 15 2015 Phillip Lougher <plougher> [3.10.0-229.7.2.el7]
- [fs] pipe: fix pipe corruption and iovec overrun on partial copy (Seth Jennings) [1202861 1198843] {CVE-2015-1805}

* Fri May 15 2015 Phillip Lougher <plougher> [3.10.0-229.7.1.el7]

Comment 4 Dave Wysochanski 2015-06-19 17:04:33 UTC

Sorry - wrong version:
* Fri Apr 24 2015 Phillip Lougher <plougher> [3.10.0-229.4.2.el7]
- [x86] crypto: aesni - fix memory usage in GCM decryption (Kurt Stutsman) [1213331 1212178] {CVE-2015-3331}

* Tue Apr 14 2015 Phillip Lougher <plougher> [3.10.0-229.4.1.el7]

Comment 7 J. Bruce Fields 2015-06-19 18:23:14 UTC

Christoph also did some callback-related fixes upstream: see the three commits 8287f009bd95a5e548059dba62a67727bb9549cd..4bd9e9b77fc6787c45b8bb439f6511aa3478606c.  I don't yet see any explanation there for these symptoms, but there might be something.  And the last ("skip CB_NULL probes...") might at least mitigate the problem by cutting down on the number of callbacks.

Comment 10 J. Bruce Fields 2015-09-14 20:19:30 UTC

I suspect there are actually two different problems.

The oops will probably be fixed by

  e85687393f3e "nfsd: ensure that the ol stateid hash reference is only put once"
  3fcbbd244ed1 "nfsd: ensure that delegation stateid hash references are only put once"

We've had other reports of list corruption upstream and in Fedora, so I think it's critical to backport those now.

The "unrecognized reply" warnings are less critical, but we should look into those too.  Would the customer be willing to test backported callback fixes?

Comment 17 J. Bruce Fields 2015-10-13 18:48:04 UTC

I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1271366 for the "unrecognized reply" messages.  (Which are probably unrelated to the cause of the actual crash here.)

Comment 19 Rafael Aquini 2015-10-19 10:23:48 UTC

Patch(es) available on kernel-3.10.0-325.el7

Comment 25 J. Bruce Fields 2015-11-12 19:50:00 UTC

*** Bug 1277610 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2015-11-19 22:42:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html