Bug 1997177

Summary: ARK-ELN: WARNING: CPU: 2 PID: 516060 at fs/nfsd/nfs4state.c:5884 nfs4_preprocess_stateid_op+0x515/0x540 [nfsd]

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Bruno Goncalves <bgoncalv> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| kernel sub component: | NFS | QA Contact: | Boyang Xue <bxue> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | dwysocha, jlayton, pvlasin, xzhou, yoyang |
| Version: | 9.0 | Keywords: | Reopened, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-5.14.0-284.3.1.el9_2 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-09 07:55:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Bruno Goncalves, 2021-08-24 14:44:47 UTC)
This is the WARN_ON_ONCE() in nfs4_find_file, indicating that we're trying to use a stateid with sc_type NFS4_DELEG_STID but with sc_file->fi_deleg_file NULL. I also saw the same last night on 291d1e21 from the stable tree (v5.14.17-21629-g291d1e21aa48). That's the only time I've seen it, so it looks difficult to reproduce.

After a long time, we reproduced it again:

```
[ 5146.584940] run fstests generic/451 at 2022-03-24 19:08:06
[ 5155.184183] ------------[ cut here ]------------
[ 5155.188832] WARNING: CPU: 7 PID: 1032474 at fs/nfsd/nfs4state.c:5889 nfs4_preprocess_stateid_op+0x5da/0x670 [nfsd]
[ 5155.199280] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd dm_log_writes squashfs binfmt_misc n_gsm pps_ldisc slcan ppp_synctty n_hdlc ppp_async ppp_generic slip slhc serport nfsv3 nfs_acl tun brd overlay exfat loop dm_mod tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp sunrpc coretemp ipmi_ssif kvm_intel kvm irqbypass mgag200 i2c_algo_bit rapl intel_cstate acpi_ipmi drm_kms_helper syscopyarea ses ipmi_si sysfillrect sysimgblt fb_sys_fops pcspkr enclosure ipmi_devintf intel_uncore mei_me cec hpilo ioatdma ipmi_msghandler acpi_tad ext4 dca mei acpi_power_meter lpc_ich intel_pch_thermal mbcache jbd2 vfat fat drm fuse xfs libcrc32c sd_mod t10_pi sg uas crct10dif_pclmul crc32_pclmul smartpqi crc32c_intel scsi_transport_sas tg3 ghash_clmulni_intel usb_storage hpwdt wmi [last unloaded: init_module]
[ 5155.283386] CPU: 7 PID: 1032474 Comm: nfsd Kdump: loaded Tainted: G IOE --------- ---  5.14.0-70.4.1.el9_0.x86_64 #1
[ 5155.295022] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 05/21/2019
[ 5155.303595] RIP: 0010:nfs4_preprocess_stateid_op+0x5da/0x670 [nfsd]
[ 5155.309923] Code: ff 74 1a e8 e8 23 fe ff 48 89 c5 e9 8d fd ff ff 4c 89 cf e8 18 4a 83 e6 e9 9d fb ff ff 4c 89 ef e8 0b 4a 83 e6 e9 3e fb ff ff <0f> 0b 48 c7 44 24 38 00 00 00 00 e9 37 fb ff ff 4c 89 df 4c 89 44
[ 5155.328807] RSP: 0018:ffffb23201053d78 EFLAGS: 00010246
[ 5155.334063] RAX: 0000000000000010 RBX: ffff9a82d88711c0 RCX: 0000000000000028
[ 5155.341240] RDX: 0000000000000028 RSI: ffff9a8402efa1c0 RDI: 0000000000000000
[ 5155.348420] RBP: 0000000000000004 R08: ffff9a8402efa130 R09: ffff9a83eaede760
[ 5155.355596] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9a833089c000
[ 5155.362775] R13: ffff9a8320a3bf00 R14: ffff9a8402fc0030 R15: 0000000000000004
[ 5155.369952] FS:  0000000000000000(0000) GS:ffff9a85f01c0000(0000) knlGS:0000000000000000
[ 5155.378091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5155.383872] CR2: 00007f998b772df8 CR3: 0000000128c98002 CR4: 00000000007706e0
[ 5155.391049] DR0: 00000000004288d0 DR1: 0000000000000000 DR2: 0000000000000000
[ 5155.398227] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 5155.405405] PKRU: 55555554
[ 5155.408127] Call Trace:
[ 5155.410591]  ? write_bytes_to_xdr_buf+0xca/0xe0 [sunrpc]
[ 5155.415979]  nfsd4_read+0x72/0xf0 [nfsd]
[ 5155.419947]  nfsd4_proc_compound+0x3b2/0x6c0 [nfsd]
[ 5155.424875]  nfsd_dispatch+0x149/0x270 [nfsd]
[ 5155.429275]  svc_process_common+0x3b8/0x5e0 [sunrpc]
[ 5155.434305]  ? nfsd_svc+0x180/0x180 [nfsd]
[ 5155.438443]  svc_process+0xb7/0xf0 [sunrpc]
[ 5155.442682]  nfsd+0xed/0x150 [nfsd]
[ 5155.446209]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
[ 5155.451308]  kthread+0x149/0x170
[ 5155.454560]  ? set_kthread_struct+0x40/0x40
[ 5155.458769]  ret_from_fork+0x22/0x30
[ 5155.462368] ---[ end trace fc830d84166e7d17 ]---
```

I suspect this is the warning in nfs4_find_file:

```c
	switch (s->sc_type) {
	case NFS4_DELEG_STID:
		if (WARN_ON_ONCE(!s->sc_file->fi_deleg_file))
			return NULL;
		return nfsd_file_get(s->sc_file->fi_deleg_file);
```

Basically, we got a delegation stateid from the NFS client, but there was no delegation attached to the open file we have for it.
Probably, this is indicative of a race with a delegation break and return. It may even be a client bug (i.e. the client issues a READ with a delegation stateid, but then ends up returning the delegation while the READ is still in flight). Opening this bug up, as I don't see a need to keep it private. It could be related to this bug too: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394

I have a patch queued up for v6.3 that drops this WARN_ON_ONCE (alongside a locking cleanup in this area). This is not a situation that warrants a warning, and the server can handle the case where this pointer turns out to be NULL. I'll plan to pull that patch into RHEL9 once it lands upstream.

There is a patch upstream that removes this WARN_ON, since it's not terribly useful and the situation is handled anyway.

Justification: the patch that removes this WARN_ON was just merged for v6.3. We want to take that patch anyway, since it fixes a potential race condition. This problem is also difficult to test, and we'll probably need to go with a SanityOnly test anyway. Given that it won't substantially increase the QA workload, I move that we go ahead and declare this bug fixed in 9.2.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Reopening this bug, since I had _just_ asked for the exception before the bot closed it.

TEST PASS. Running generic/451 didn't reproduce this bug. Verified by running regression tests against the fixed MR kernel; no regression was found. Link to Beaker jobs: https://url.corp.redhat.com/bz1997177-1

TEST PASS. Running generic/451 didn't reproduce this bug. Verified by running regression tests against the fixed kernel; no regression was found.
Link to Beaker jobs: https://url.corp.redhat.com/bz1997177-2

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2458