Bug 2181403

Summary:	[RHEL 9] BUG nfsd_file: Objects remaining in nfsd_file on __kmem_cache_shutdown()
Product:	Red Hat Enterprise Linux 9	Reporter:	Zhi Li <yieli>
Component:	kernel	Assignee:	Jeff Layton <jlayton>
kernel sub component:	NFS	QA Contact:	Zhi Li <yieli>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	unspecified
Priority:	unspecified	CC:	chuck.lever, jiyin, nfs-team, xzhou, yoyang
Version:	9.2	Keywords:	Triaged
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-05-15 14:38:18 UTC	Type:	Bug

Description Zhi Li 2023-03-24 03:08:49 UTC

Description of problem:
The following panic was triggered while running the NFS regression tests against RHEL 9.

[ 3090.983152] BUG nfsd_file (Tainted: GF          OE    --------  --- ): Objects remaining in nfsd_file on __kmem_cache_shutdown() 
[ 3090.986217] ----------------------------------------------------------------------------- 
[ 3090.986217]  
[ 3090.988741] Slab 0x00000000d6efcf19 objects=39 used=1 fp=0x00000000cc8fb7f9 flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff) 
[ 3090.992082] CPU: 0 PID: 64612 Comm: rpc.nfsd Kdump: loaded Tainted: GF          OE    --------  ---  5.14.0-284.2.1.el9_2.x86_64 #1 
[ 3090.994975] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 
[ 3090.996372] Call Trace: 
[ 3090.997018]  <TASK> 
[ 3090.997599]  dump_stack_lvl+0x34/0x48 
[ 3090.998604]  slab_err.cold+0x53/0x67 
[ 3090.999580]  ? cpumask_next+0x1f/0x30 
[ 3091.000609]  __kmem_cache_shutdown+0x16e/0x320 
[ 3091.001826]  kmem_cache_destroy+0x51/0x160 
[ 3091.002945]  nfsd_file_cache_shutdown+0xa0/0x170 [nfsd] 
[ 3091.004343]  nfsd_put+0x123/0x140 [nfsd] 
[ 3091.005366]  nfsd_svc+0x15c/0x190 [nfsd] 
[ 3091.006423]  write_threads+0x95/0x100 [nfsd] 
[ 3091.007561]  ? _copy_from_user+0x3a/0x60 
[ 3091.008581]  ? simple_transaction_get+0xc4/0xf0 
[ 3091.009720]  ? write_pool_threads+0x230/0x230 [nfsd] 
[ 3091.010974]  nfsctl_transaction_write+0x43/0x80 [nfsd] 
[ 3091.012263]  vfs_write+0xb2/0x280 
[ 3091.013200]  ksys_write+0x5f/0xe0 
[ 3091.014023]  do_syscall_64+0x59/0x90 
[ 3091.014907]  ? exit_to_user_mode_prepare+0xb6/0x100 
[ 3091.016176]  ? syscall_exit_to_user_mode+0x12/0x30 
[ 3091.017470]  ? do_syscall_64+0x69/0x90 
[ 3091.018419]  ? exc_page_fault+0x62/0x150 
[ 3091.019461]  entry_SYSCALL_64_after_hwframe+0x63/0xcd 
[ 3091.020761] RIP: 0033:0x7fb87a73eb97 
[ 3091.021720] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 
[ 3091.026547] RSP: 002b:00007fff9006c9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 
[ 3091.028529] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb87a73eb97 
[ 3091.030249] RDX: 0000000000000002 RSI: 00005569f5b03c20 RDI: 0000000000000003 
[ 3091.032049] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fff9006c850 
[ 3091.033924] R10: 0000000000000000 R11: 0000000000000246 R12: 00005569f5b03c20 
[ 3091.035644] R13: 00007fb87a9596c0 R14: 00007fff9006ca80 R15: 00000000ffffffff 
[ 3091.037384]  </TASK> 
[ 3091.038004] Object 0x00000000d069a2e4 @offset=104 

<... snip ...>
[ 3091.314495] ------------[ cut here ]------------ 
[ 3091.315428] kernel BUG at lib/list_debug.c:23! 
[ 3091.316330] invalid opcode: 0000 [#1] PREEMPT SMP PTI 
[ 3091.317391] CPU: 1 PID: 64623 Comm: rpc.nfsd Kdump: loaded Tainted: GF   B   W  OE    --------  ---  5.14.0-284.2.1.el9_2.x86_64 #1 
[ 3091.319636] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 
[ 3091.320756] RIP: 0010:__list_add_valid.cold+0xf/0x3f 
[ 3091.321719] Code: 48 c7 c6 9d fa a4 92 48 89 ef e8 05 70 01 00 48 c7 c0 ea ff ff ff e9 40 3a a8 ff 4c 89 c1 48 c7 c7 08 01 a5 92 e8 44 f1 fe ff <0f> 0b 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 b8 01 a5 92 e8 2d f1 fe 
[ 3091.325255] RSP: 0018:ffffb7c8c1a63cd8 EFLAGS: 00010246 
[ 3091.326294] RAX: 0000000000000075 RBX: ffff8bcec5bf2868 RCX: 0000000000000000 
[ 3091.327602] RDX: 0000000000000000 RSI: ffff8bcefbd198a0 RDI: ffff8bcefbd198a0 
[ 3091.328980] RBP: ffff8bcec5bf2368 R08: 0000000000000000 R09: 00000000ffff7fff 
[ 3091.330319] R10: ffffb7c8c1a63b80 R11: ffffffff933e9608 R12: 0000000000000000 
[ 3091.331682] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffc0a5ed4c 
[ 3091.333100] FS:  00007f275f5cc740(0000) GS:ffff8bcefbd00000(0000) knlGS:0000000000000000 
[ 3091.334673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[ 3091.335738] CR2: 0000562676743ac8 CR3: 00000001025d8006 CR4: 00000000007706e0 
[ 3091.337129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 3091.338531] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 
[ 3091.339930] PKRU: 55555554 
[ 3091.340466] Call Trace: 
[ 3091.340992]  <TASK> 
[ 3091.341417]  kmem_cache_create_usercopy+0x1a5/0x2c0 
[ 3091.342413]  kmem_cache_create+0x12/0x20 
[ 3091.343194]  nfsd_file_cache_init+0x75/0x180 [nfsd] 
[ 3091.344149]  nfsd_startup_net+0x62/0x220 [nfsd] 
[ 3091.345080]  ? nfsd_create_serv+0x37/0x1f0 [nfsd] 
[ 3091.345953]  nfsd_svc+0xfb/0x190 [nfsd] 
[ 3091.346644]  write_threads+0x95/0x100 [nfsd] 
[ 3091.347502]  ? _copy_from_user+0x3a/0x60 
[ 3091.348282]  ? simple_transaction_get+0xc4/0xf0 
[ 3091.349174]  ? write_pool_threads+0x230/0x230 [nfsd] 
[ 3091.350172]  nfsctl_transaction_write+0x43/0x80 [nfsd] 
[ 3091.351210]  vfs_write+0xb2/0x280 
[ 3091.351891]  ksys_write+0x5f/0xe0 
[ 3091.352555]  do_syscall_64+0x59/0x90 
[ 3091.353291]  ? exit_to_user_mode_prepare+0xb6/0x100 
[ 3091.354228]  ? syscall_exit_to_user_mode+0x12/0x30 
[ 3091.355067]  ? do_syscall_64+0x69/0x90 
[ 3091.355767]  ? syscall_exit_to_user_mode+0x12/0x30 
[ 3091.356708]  ? do_syscall_64+0x69/0x90 
[ 3091.357459]  ? exc_page_fault+0x62/0x150 
[ 3091.358233]  entry_SYSCALL_64_after_hwframe+0x63/0xcd 
[ 3091.359249] RIP: 0033:0x7f275f33eb97 
[ 3091.359968] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 

Version-Release number of selected component (if applicable):
5.14.0-284.2.1.el9_2.x86_64 + debug

How reproducible:
reliable (2/5)

Steps to Reproduce:
clone https://beaker.engineering.redhat.com/jobs/7661119

http://lab-02.hosts.prod.psi.bos.redhat.com/beaker/logs/recipes/13609+/13609829/console.log


Actual results:
kernel BUG at lib/list_debug.c:23! 

Expected results:
No panic

Additional info:
beaker job:
https://beaker.engineering.redhat.com/jobs/7661118
console log:
http://lab-02.hosts.prod.upshift.rdu2.redhat.com/beaker/logs/recipes/13609+/13609827/console.log

Comment 8 Jeff Layton 2023-03-24 16:52:25 UTC

(In reply to JianHong Yin from comment #6)

> '''
> nfs]$ cat regression/bz1227851-open-loop-on-NFSv4/nfsd_read_bad_stateid.stp
> global c = 0
> probe module("nfsd").function("nfsd4_read").return
> {
> # BAD_STATEID == 10025;
>         if (c == 0) {
>                 $return = 0x29270000;
>                 exit()
>         }
> }
> '''
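
The constant 0x29270000 in the script is presumably BAD_STATEID (10025, i.e. 0x2729) as it appears in a big-endian status word when read on a little-endian x86_64 host. The following is a small userspace sketch of that conversion; the helper name is hypothetical and is not part of nfsd or systemtap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: lay out a value in big-endian (network) byte
 * order, then read those four bytes back the way a little-endian CPU
 * would load them as a 32-bit integer.
 */
static uint32_t be32_as_seen_on_little_endian(uint32_t host_value)
{
    /* Big-endian layout: most significant byte first. */
    uint8_t wire[4] = {
        (uint8_t)(host_value >> 24),
        (uint8_t)(host_value >> 16),
        (uint8_t)(host_value >> 8),
        (uint8_t)host_value,
    };

    /* Little-endian load: the first byte is the least significant. */
    return (uint32_t)wire[0]
         | (uint32_t)wire[1] << 8
         | (uint32_t)wire[2] << 16
         | (uint32_t)wire[3] << 24;
}
```

With this helper, be32_as_seen_on_little_endian(10025) evaluates to 0x29270000, which matches the value the script writes into $return.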

(cc'ing Chuck from Oracle, who is the upstream nfsd maintainer)

Oh! That script looks unsafe, and could cause just the symptoms you're seeing here. This is the bottom bit of nfsd4_read:

-------------------8<---------------------
        /* check stateid */
        status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
                                        &read->rd_stateid, RD_STATE,
                                        &read->rd_nf, NULL);

        read->rd_rqstp = rqstp;
        read->rd_fhp = &cstate->current_fh;
        return status;
}
-------------------8<---------------------

It calls nfs4_preprocess_stateid_op which, if successful, takes a reference to an nfsd_file and fills out rd_nf with a pointer to it. The expectation is that if nfs4_preprocess_stateid_op fails, then rd_nf will not be filled out. The script only overrides the return code, however, and does not release the reference held in rd_nf. That leaves outstanding references to those nfsd_files and causes this warning when we go to tear down the cache.
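
To make the leak concrete, here is a minimal userspace sketch in C. It is not the nfsd code: every demo_* name is hypothetical, and only the rd_nf field name is borrowed from the snippet above. It assumes a caller that drops the reference only when the status is zero, since a failing check is expected to leave rd_nf unset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an nfsd_file: just a reference count. */
struct demo_file {
    int refcount;
};

/* Hypothetical stand-in for the read arguments. */
struct demo_read {
    struct demo_file *rd_nf;
};

/* Models the stateid check: on success, take a reference and
 * publish it through rd_nf. */
static int demo_preprocess(struct demo_file *f, struct demo_read *read)
{
    f->refcount++;
    read->rd_nf = f;
    return 0;
}

/* Models the release step: drop the reference if one was taken. */
static void demo_release(struct demo_read *read)
{
    if (read->rd_nf) {
        read->rd_nf->refcount--;
        read->rd_nf = NULL;
    }
}

/*
 * Models one read operation. A non-zero override_status mimics the
 * probe rewriting the return value after the reference was taken.
 * Returns the reference count left on the file afterwards.
 */
static int demo_process(struct demo_file *f, int override_status)
{
    struct demo_read read = { .rd_nf = NULL };
    int status = demo_preprocess(f, &read);

    if (override_status)
        status = override_status;   /* reference is already held */

    if (status == 0)
        demo_release(&read);        /* only the success path releases */

    return f->refcount;
}
```

Starting from a zeroed demo_file, demo_process(&f, 0) leaves the count at 0, while demo_process(&f, 10025) leaves one reference outstanding. An object in that state is what a cache teardown would then find still in use.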

The nfsd code is not completely blameless here, however. In the NFSv4 compound processing, the op_release function is not called if op_func returns an error! It looks like that might also cause a memory leak in the layoutget code if it hits an error in an inopportune place.

So while I suspect this systemtap script is the cause of the problem you're seeing, it sort of points out some potential memory leaks in other places. Let's turn this bug into one to fix that structural issue, and make sure we call op_release regardless of success or failure of op_func. We'll need to audit all of the op_release functions and make sure they're safe to call even when op_func fails, but there aren't that many of them.
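
The structural fix described here can be sketched in userspace C as follows. This is an illustration of the pattern only, not the posted patch: the sketch_* names are hypothetical, and the ordering is simplified, since the release is shown directly after the operation and any reply encoding is left out. The key assumptions are that the release function runs whether or not the operation succeeded, and that it is a no-op when nothing was acquired.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-operation state. */
struct sketch_op {
    int held;   /* 1 while the op holds a resource */
    int fail;   /* non-zero: status reported after acquiring */
};

/* Models op_func: acquires a resource, then reports its status. */
static int sketch_op_func(struct sketch_op *op, int *outstanding)
{
    op->held = 1;
    (*outstanding)++;
    return op->fail;
}

/* Models op_release: safe to call even if op_func failed or never
 * acquired anything. */
static void sketch_op_release(struct sketch_op *op, int *outstanding)
{
    if (!op->held)
        return;
    op->held = 0;
    (*outstanding)--;
}

/*
 * Runs the ops in order and stops at the first error, but always
 * releases the op that just ran. Returns the number of resources
 * still outstanding at the end.
 */
static int sketch_run_compound(struct sketch_op *ops, size_t n)
{
    int outstanding = 0;

    for (size_t i = 0; i < n; i++) {
        int status = sketch_op_func(&ops[i], &outstanding);

        sketch_op_release(&ops[i], &outstanding);  /* success or failure */
        if (status)
            break;
    }
    return outstanding;
}
```

With three ops where the second one fails, sketch_run_compound returns 0: the failing op is still released, and the third op never runs, so nothing is left behind.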

That should also allow this systemtap script to work as expected.

Comment 9 Jeff Layton 2023-03-27 10:34:48 UTC

Patch posted to linux-nfs mailing list:

https://lore.kernel.org/linux-nfs/20230327102137.15412-1-jlayton@kernel.org/T/#u

Comment 10 Jeff Layton 2023-05-15 14:38:18 UTC


*** This bug has been marked as a duplicate of bug 2183621 ***