Description of problem:
The latest RHEL-9 kernel crashes when running xfstests generic/650 on nfs. We have 2 reproductions so far, one on -336.el9 and the other on -339.el9, but unfortunately no vmcore has been saved. We have loop-run generic/650 hundreds of times, but none of those runs reproduced this bug.

log
```
[51935.482019] XFS (vda3): Filesystem has been shut down due to log error (0x2).
[51935.482020] XFS (vda3): Please unmount the filesystem and rectify the problem(s).
[51935.482550] vda3: writeback error on inode 25320400, offset 2097152, sector 58684120
[51935.497820] ------------[ cut here ]------------
[51935.497821] kernel BUG at fs/nfsd/xdr4.h:780!
[51935.497828] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[51935.512241] CPU: 3 PID: 56031 Comm: nfsd Kdump: loaded Not tainted 5.14.0-339.el9.x86_64 #1
[51935.515313] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[51935.517807] RIP: 0010:do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.520261] smpboot: CPU 0 is now offline
[51935.530115] Code: 5d 41 5e c3 cc cc cc cc 8b 16 48 8b bb 08 01 00 00 48 83 c6 04 89 57 58 48 83 c7 5c e8 86 a8 f3 d6 48 8b 75 00 e9 4d ff ff ff <0f> 0b b8 00 00 27 18 e9 0e ff ff ff 90 0f 1f 44 00 00 41 57 41 56
[51935.530119] RSP: 0018:ffffb7454099bdc8 EFLAGS: 00010246
[51935.530122] RAX: 0000000000000000 RBX: ffff88e504d12280 RCX: 0000000000000070
[51935.530124] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[51935.530125] RBP: ffffb7454099bdf8 R08: ffffb7454099bcf4 R09: 0000000000000000
[51935.530126] R10: ffff88e51a8af1e0 R11: 0000000000000000 R12: ffff88e527e1c028
[51935.530128] R13: ffff88e503bb8000 R14: ffff88e503bb8000 R15: ffffffff99c65d40
[51935.530130] FS: 0000000000000000(0000) GS:ffff88e53bd80000(0000) knlGS:0000000000000000
[51935.530137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[51935.530139] CR2: 00007f23ab5fb000 CR3: 0000000110356005 CR4: 00000000000206e0
[51935.530141] Call Trace:
[51935.530143]  <TASK>
[51935.530146]  ? show_trace_log_lvl+0x1c4/0x2df
[51935.530152]  ? show_trace_log_lvl+0x1c4/0x2df
[51935.530156]  ? nfsd4_open+0x26f/0x4b0 [nfsd]
[51935.530193]  ? __die_body.cold+0x8/0xd
[51935.530196]  ? die+0x2b/0x50
[51935.530200]  ? do_trap+0xce/0x120
[51935.530201]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530236]  ? do_error_trap+0x65/0x80
[51935.530238]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530271]  ? exc_invalid_op+0x4e/0x70
[51935.530274]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530308]  ? asm_exc_invalid_op+0x16/0x20
[51935.530312]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530345]  ? do_open_lookup+0x169/0x1f0 [nfsd]
[51935.530391]  nfsd4_open+0x26f/0x4b0 [nfsd]
[51935.530426]  nfsd4_proc_compound+0x44e/0x700 [nfsd]
[51935.608662]  nfsd_dispatch+0x53/0x170 [nfsd]
[51935.610697]  svc_process_common+0x357/0x640 [sunrpc]
[51935.612865]  ? nfsd_svc+0x170/0x170 [nfsd]
[51935.614767]  ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[51935.616983]  svc_process+0x12d/0x180 [sunrpc]
[51935.618969]  nfsd+0xd5/0x190 [nfsd]
[51935.620685]  kthread+0xe0/0x100
[51935.622334]  ? kthread_complete_and_exit+0x20/0x20
[51935.624482]  ret_from_fork+0x22/0x30
[51935.626210]  </TASK>
[51935.627577] Modules linked in: loop dm_mod tls rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core kvm_intel snd_hwdep qxl rfkill sunrpc snd_pcm drm_ttm_helper kvm ttm irqbypass drm_kms_helper snd_timer syscopyarea snd sysfillrect soundcore i2c_piix4 virtio_balloon pcspkr joydev sysimgblt drm fuse xfs libcrc32c ata_generic crct10dif_pclmul virtio_net crc32_pclmul ata_piix net_failover crc32c_intel virtio_console failover virtio_blk libata ghash_clmulni_intel serio_raw
```

reproduction on -336.el9: https://beaker.engineering.redhat.com/recipes/14196563
reproduction on -339.el9: https://beaker.engineering.redhat.com/recipes/14235733

Version-Release number of selected component (if applicable):
kernel-5.14.0-339.el9

How reproducible:
2 out of hundreds of runs

Steps to Reproduce:
1. Run xfstests generic/650 on nfs

Actual results:
Kernel crash

Expected results:
No kernel crash

Additional info:
N/A
I'm pretty sure the call chain where we tripped the BUG is:

    nfsd4_open
      do_open_lookup
        set_change_info

The problem is that I can't see any way that we could end up calling set_change_info with fh_pre_saved set to false. All of the paths that lead to that call ensure that it's set to true, AFAICT.

Another possibility is that something raced in and called fh_clear_pre_post_attrs while we were processing here. That seems unlikely though, as nothing else should be operating on this compound.

Either way, it seems plausible that the apparent disk failure in the log messages above this point may have been a factor here. A core would be really helpful if one turns up!
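For context, the assertion that fires is the BUG_ON() at the top of set_change_info() in fs/nfsd/xdr4.h. Roughly (paraphrased and simplified; the exact fields vary between kernel versions):

```
static inline void
set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
{
	/* This is the line the "kernel BUG at fs/nfsd/xdr4.h" report points at. */
	BUG_ON(!fhp->fh_pre_saved);

	/* Copy the saved pre/post-op directory attributes into the reply. */
	cinfo->atomic = (u32)fhp->fh_post_saved;
	cinfo->before_change = fhp->fh_pre_change;
	cinfo->after_change = fhp->fh_post_change;
}
```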
Oh! I do see one way that this can happen. If the OPEN was not a create request, and the nfsd_lookup fails, we'll call fh_fill_both_attrs(), which does this:

```
void fh_fill_both_attrs(struct svc_fh *fhp)
{
	fh_fill_post_attrs(fhp);
	if (!fhp->fh_post_saved)
		return;
	fhp->fh_pre_change = fhp->fh_post_change;
	fhp->fh_pre_mtime = fhp->fh_post_attr.mtime;
	fhp->fh_pre_ctime = fhp->fh_post_attr.ctime;
	fhp->fh_pre_size = fhp->fh_post_attr.size;
	fhp->fh_pre_saved = true;
}
```

...if fh_fill_post_attrs also fails at that point, then fh_post_saved will be false and fh_pre_saved will never be set to true. So this is likely related to the disk failure that happened just before: both operations failed, and fh_pre_saved never got set.

The code to do this changed in the last few years to lean more heavily on getattr operations to fetch this info. Prior to that, it would fetch this info directly out of the inode, so we could assert on fh_pre_saved being set like this. Now, that's no longer the case -- we need to be able to cope with this failing.

I think we probably just need to rip out the BUG_ON, but we may need to do some further work to ensure that we comply with the protocol. It's not clear to me what you're supposed to do when you can't fetch the change info. I'll have to dive into the spec.
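To make the proposed direction concrete, here is a minimal sketch of what "rip out the BUG_ON" could look like, assuming we simply report the change info as non-atomic whenever the pre-op attributes are missing. This is only an illustration of the idea, not the actual upstream patch:

```
/*
 * Sketch only: rather than BUG()ing when fh_pre_saved is false, report
 * the change info as non-atomic and return whatever values we do have.
 * Whether that is sufficient for protocol compliance is exactly the
 * open question above.
 */
static inline void
set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
{
	cinfo->atomic = (u32)(fhp->fh_pre_saved && fhp->fh_post_saved);
	cinfo->before_change = fhp->fh_pre_change;
	cinfo->after_change = fhp->fh_post_change;
}
```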
As I had not successfully reproduced this bug by loop-running generic/650 alone, I submitted 10 jobs to run the full xfstests suite (on the assumption that earlier tests affect generic/650) on the latest -340.el9, but unfortunately there was no reproduction.

Link to Beaker jobs - full xfstests runs:
https://url.corp.redhat.com/8c2741d

Link to Beaker jobs - generic/650 solo runs:
https://url.corp.redhat.com/44a7d4b
Yeah, I think this is going to be a hard one to reliably reproduce. You basically need a disk failure to coincide with a non-O_CREAT open request on a dentry that is already in cache.

The current upstream patch and discussion are here:
https://lore.kernel.org/linux-nfs/963cc8896ff6a6759c57f1d97e51f35972d4fc6d.camel@kernel.org/T/#t

I'll see if I can get this in for v6.5.
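For orientation, this is roughly the shape of the non-create OPEN branch in do_open_lookup() that the analysis above walks through. It is heavily paraphrased, with error-status handling elided, and the argument list and conditions in the RHEL kernel may differ:

```
	if (open->op_create) {
		/* ... create path elided ... */
	} else {
		/* Plain (non-O_CREAT) OPEN: look up the existing dentry. */
		status = nfsd_lookup(rqstp, current_fh, open->op_fname,
				     open->op_fnamelen, *resfh);
		/*
		 * The OPEN reply still has to carry change_info for the
		 * directory, so the "pre" attrs are synthesized from a
		 * fresh getattr via fh_fill_both_attrs().  If that getattr
		 * also fails (e.g. due to the disk error seen in the log),
		 * fh_pre_saved stays false and the later set_change_info()
		 * call trips the BUG_ON.
		 */
		fh_fill_both_attrs(current_fh);
	}
```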