Bug 2183056

Summary: [RHEL 9] BUG: KASAN: use-after-free in unix_gid_show+0x2c4/0x340 [sunrpc]
Product: Red Hat Enterprise Linux 9
Reporter: Zhi Li <yieli>
Component: kernel
Assignee: Jeff Layton <jlayton>
kernel sub component: NFS
QA Contact: Zhi Li <yieli>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: jiyin, jlayton, nfs-team, xzhou, yoyang
Version: 9.2
Keywords: Triaged
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-05-15 14:34:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Zhi Li 2023-03-30 10:22:52 UTC
Description of problem:
Encountered the following KASAN report in dmesg on kernel-5.14.0-293.el9 while running the NFS regression tests.

[ 2249.033301] BUG: KASAN: use-after-free in unix_gid_show+0x2c4/0x340 [sunrpc] 
[ 2249.033516] Read of size 4 at addr ffff88819cda49cc by task grep/17885 
[ 2249.033523]  
[ 2249.033527] CPU: 3 PID: 17885 Comm: grep Kdump: loaded Not tainted 5.14.0-293.el9.x86_64+debug #1 
[ 2249.033537] Hardware name: Intel Corporation Amberlake Client platform/AmberLake Y 42 LPDDR3 RVP3, BIOS KBLSE2R1.R00.X158.P01.1906111053 06/11/2019 
[ 2249.033544] Call Trace: 
[ 2249.033547]  <TASK> 
[ 2249.033553]  ? unix_gid_show+0x2c4/0x340 [sunrpc] 
[ 2249.033762]  dump_stack_lvl+0x57/0x81 
[ 2249.033777]  print_address_description.constprop.0+0x1f/0x1e0 
[ 2249.033793]  ? unix_gid_show+0x2c4/0x340 [sunrpc] 
[ 2249.034002]  print_report.cold+0x5c/0x24b 
[ 2249.034020]  kasan_report+0xc9/0x100 
[ 2249.034036]  ? unix_gid_show+0x2c4/0x340 [sunrpc] 
[ 2249.034237]  unix_gid_show+0x2c4/0x340 [sunrpc] 
[ 2249.034443]  ? unix_gid_upcall+0x10/0x10 [sunrpc] 
[ 2249.034644]  c_show+0x155/0x550 [sunrpc] 
[ 2249.034855]  ? cache_check+0x7f0/0x7f0 [sunrpc] 
[ 2249.035056]  ? cache_seq_start_rcu+0x43/0x310 [sunrpc] 
[ 2249.035165]  ? cache_seq_start_rcu+0x5/0x310 [sunrpc] 
[ 2249.035303]  seq_read_iter+0x995/0x1040 
[ 2249.035319]  seq_read+0x233/0x370 
[ 2249.035325]  ? seq_read_iter+0x1040/0x1040 
[ 2249.035347]  ? inode_security+0x54/0xf0 
[ 2249.035365]  proc_reg_read+0x1a9/0x280 
[ 2249.035378]  vfs_read+0x169/0x4c0 
[ 2249.035394]  ksys_read+0xf9/0x1d0 
[ 2249.035404]  ? __ia32_sys_pwrite64+0x1e0/0x1e0 
[ 2249.035412]  ? ktime_get_coarse_real_ts64+0x130/0x170 
[ 2249.035435]  do_syscall_64+0x59/0x90 
[ 2249.035446]  ? do_syscall_64+0x69/0x90 
[ 2249.035454]  ? lockdep_hardirqs_on+0x79/0x100 
[ 2249.035464]  ? do_syscall_64+0x69/0x90 
[ 2249.035484]  ? asm_exc_page_fault+0x22/0x30 
[ 2249.035493]  ? lockdep_hardirqs_on+0x79/0x100 
[ 2249.035502]  entry_SYSCALL_64_after_hwframe+0x63/0xcd 
[ 2249.035510] RIP: 0033:0x7f9b0713eaf2 
[ 2249.035516] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ca 0c 08 00 e8 65 ea 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 
[ 2249.035521] RSP: 002b:00007ffc7836e1e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 
[ 2249.035531] RAX: ffffffffffffffda RBX: 0000000000018000 RCX: 00007f9b0713eaf2 
[ 2249.035538] RDX: 0000000000018000 RSI: 000055e795ffa012 RDI: 0000000000000003 
[ 2249.035542] RBP: 000055e795ffa012 R08: 0000000000019000 R09: 0000000000000013 
[ 2249.035547] R10: 0000000000001000 R11: 0000000000000246 R12: 00007ffc7836e2b0 
[ 2249.035553] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003 
[ 2249.035575]  </TASK> 


Version-Release number of selected component (if applicable):
kernel-5.14.0-293.el9

How reproducible:
Intermittent: reproduced in about 2 of 10 runs

Steps to Reproduce:
1. Clone the Beaker job https://beaker.engineering.redhat.com/jobs/7683784

   console log:
   https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/03/76837/7683784/13644856/console.log


Actual results:
BUG: KASAN: use-after-free in unix_gid_show+0x2c4/0x340 [sunrpc] 

Expected results:
No dmesg failure

Comment 1 Jeff Layton 2023-03-30 12:07:19 UTC
The fact that this fired on the grouplist and not on the unix_gid object itself is a significant hint, I think. The unix_gid object is RCU freed, but the grouplist is not, so a reader that is still inside its RCU read-side critical section can dereference a grouplist that has already been freed. I think we need to ensure that the grouplist is only freed after the RCU grace period.
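
The ordering problem described above can be sketched with a small userspace model. This is only an illustration under stated assumptions, not the sunrpc code or the actual patch: every name here (struct group_list, struct gid_entry, entry_put_buggy, entry_put_deferred, grace_period_end, reader_access_is_safe) is hypothetical, "freeing" is modelled by setting a flag so the state can be inspected safely, and the grace period is modelled by an explicit function call rather than real RCU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the group list hanging off a cache entry. */
struct group_list {
    int ngroups;
    bool freed;                 /* set instead of calling free() */
};

/* Hypothetical stand-in for the cache entry that owns the group list. */
struct gid_entry {
    struct group_list *gl;
    bool freed;                 /* set instead of calling free() */
    struct gid_entry *next_deferred;
};

/* Entries whose release has been deferred until the grace period ends. */
static struct gid_entry *deferred = NULL;

/* Racy ordering: the group list is released at once, the entry later. */
static void entry_put_buggy(struct gid_entry *e)
{
    e->gl->freed = true;            /* immediate release of the group list */
    e->next_deferred = deferred;    /* only the entry waits for readers */
    deferred = e;
}

/* Safer ordering: nothing is released until the grace period has ended. */
static void entry_put_deferred(struct gid_entry *e)
{
    e->next_deferred = deferred;
    deferred = e;
}

/* Models the end of a grace period: no reader holds a pointer any more. */
static void grace_period_end(void)
{
    while (deferred) {
        struct gid_entry *e = deferred;

        deferred = e->next_deferred;
        assert(!e->freed);          /* an entry must not be released twice */
        e->gl->freed = true;        /* group list first, then the entry */
        e->freed = true;
    }
}

/* A reader follows entry -> group list, as a "show" routine would. */
static bool reader_access_is_safe(const struct gid_entry *e)
{
    return !e->freed && !e->gl->freed;
}
```

In this model, a reader that runs between entry_put_buggy() and grace_period_end() finds a live entry pointing at a released group list, which corresponds to the use-after-free pattern in the report. With entry_put_deferred() the same reader sees both objects intact until the grace period ends.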

I have a patch that I think might fix this. Is this reliably reproducible at all? I can give you a test kernel if so.

Comment 2 Zhi Li 2023-03-30 15:02:14 UTC
(In reply to Jeff Layton from comment #1)
> Given that this fired on the grouplist and not on the unix_gid object itself
> is a significant hint, I think. The unix_gid object is RCU freed, but the
> group list is not. I think we need to ensure that the grouplist is only
> freed after the RCU grace period.
> 
> I have a patch that I think might fix this. Is this reliably reproducible at
> all? I can give you a test kernel if so.

Yes, the existing internal test case reproduces this problem about 20% of the time.
I will test the fix once I get a test kernel.

Comment 5 Jeff Layton 2023-03-31 09:45:52 UTC
Thanks for testing it! I sent the patch upstream yesterday:

https://lore.kernel.org/linux-nfs/D3F3D553-C252-47FB-9D41-9C9A254557DB@oracle.com/T/#m41c9e5ad9313c80d9e083de4d1263ebf2aa9abc2

I think we should aim for 9.3.0 for this patch. I don't believe this is a regression, just a long-standing race condition that's not easy to hit.