Description of problem:
The latest RHEL-9 kernel crashes when running xfstests generic/650 on nfs. We have 2 reproductions so far, one on -336.el9 and the other on -339.el9, but unfortunately no vmcore has been saved. We have loop-run generic/650 hundreds of times, but none of those runs reproduced this bug.

log
```
[51935.482019] XFS (vda3): Filesystem has been shut down due to log error (0x2).
[51935.482020] XFS (vda3): Please unmount the filesystem and rectify the problem(s).
[51935.482550] vda3: writeback error on inode 25320400, offset 2097152, sector 58684120
[51935.497820] ------------[ cut here ]------------
[51935.497821] kernel BUG at fs/nfsd/xdr4.h:780!
[51935.497828] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[51935.512241] CPU: 3 PID: 56031 Comm: nfsd Kdump: loaded Not tainted 5.14.0-339.el9.x86_64 #1
[51935.515313] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[51935.517807] RIP: 0010:do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.520261] smpboot: CPU 0 is now offline
[51935.530115] Code: 5d 41 5e c3 cc cc cc cc 8b 16 48 8b bb 08 01 00 00 48 83 c6 04 89 57 58 48 83 c7 5c e8 86 a8 f3 d6 48 8b 75 00 e9 4d ff ff ff <0f> 0b b8 00 00 27 18 e9 0e ff ff ff 90 0f 1f 44 00 00 41 57 41 56
[51935.530119] RSP: 0018:ffffb7454099bdc8 EFLAGS: 00010246
[51935.530122] RAX: 0000000000000000 RBX: ffff88e504d12280 RCX: 0000000000000070
[51935.530124] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[51935.530125] RBP: ffffb7454099bdf8 R08: ffffb7454099bcf4 R09: 0000000000000000
[51935.530126] R10: ffff88e51a8af1e0 R11: 0000000000000000 R12: ffff88e527e1c028
[51935.530128] R13: ffff88e503bb8000 R14: ffff88e503bb8000 R15: ffffffff99c65d40
[51935.530130] FS: 0000000000000000(0000) GS:ffff88e53bd80000(0000) knlGS:0000000000000000
[51935.530137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[51935.530139] CR2: 00007f23ab5fb000 CR3: 0000000110356005 CR4: 00000000000206e0
[51935.530141] Call Trace:
[51935.530143]  <TASK>
[51935.530146]  ? show_trace_log_lvl+0x1c4/0x2df
[51935.530152]  ? show_trace_log_lvl+0x1c4/0x2df
[51935.530156]  ? nfsd4_open+0x26f/0x4b0 [nfsd]
[51935.530193]  ? __die_body.cold+0x8/0xd
[51935.530196]  ? die+0x2b/0x50
[51935.530200]  ? do_trap+0xce/0x120
[51935.530201]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530236]  ? do_error_trap+0x65/0x80
[51935.530238]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530271]  ? exc_invalid_op+0x4e/0x70
[51935.530274]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530308]  ? asm_exc_invalid_op+0x16/0x20
[51935.530312]  ? do_open_lookup+0x1e3/0x1f0 [nfsd]
[51935.530345]  ? do_open_lookup+0x169/0x1f0 [nfsd]
[51935.530391]  nfsd4_open+0x26f/0x4b0 [nfsd]
[51935.530426]  nfsd4_proc_compound+0x44e/0x700 [nfsd]
[51935.608662]  nfsd_dispatch+0x53/0x170 [nfsd]
[51935.610697]  svc_process_common+0x357/0x640 [sunrpc]
[51935.612865]  ? nfsd_svc+0x170/0x170 [nfsd]
[51935.614767]  ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[51935.616983]  svc_process+0x12d/0x180 [sunrpc]
[51935.618969]  nfsd+0xd5/0x190 [nfsd]
[51935.620685]  kthread+0xe0/0x100
[51935.622334]  ? kthread_complete_and_exit+0x20/0x20
[51935.624482]  ret_from_fork+0x22/0x30
[51935.626210]  </TASK>
[51935.627577] Modules linked in: loop dm_mod tls rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core kvm_intel snd_hwdep qxl rfkill sunrpc snd_pcm drm_ttm_helper kvm ttm irqbypass drm_kms_helper snd_timer syscopyarea snd sysfillrect soundcore i2c_piix4 virtio_balloon pcspkr joydev sysimgblt drm fuse xfs libcrc32c ata_generic crct10dif_pclmul virtio_net crc32_pclmul ata_piix net_failover crc32c_intel virtio_console failover virtio_blk libata ghash_clmulni_intel serio_raw
```

reproduction on -336.el9: https://beaker.engineering.redhat.com/recipes/14196563
reproduction on -339.el9: https://beaker.engineering.redhat.com/recipes/14235733

Version-Release number of selected component (if applicable):
kernel-5.14.0-339.el9

How reproducible:
2 out of hundreds of runs

Steps to Reproduce:
1. Run xfstests generic/650 on nfs

Actual results:
Kernel crash

Expected results:
No kernel crash

Additional info:
N/A
I'm pretty sure the call chain where we tripped the BUG is:

    nfsd4_open
      do_open_lookup
        set_change_info

The problem is that I can't see any way that we could end up calling set_change_info with fh_pre_saved set to false. All of the paths that lead to that call ensure that it's set to true, AFAICT.

Another possibility is that something raced in and called fh_clear_pre_post_attrs while we were processing here. That seems unlikely though, as nothing else should be operating on this compound.

Either way, it seems plausible that the apparent disk failure in the log messages above this point may have been a factor here. A core would be really helpful if one turns up!
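For context, the assertion that fires is the BUG_ON() at the top of set_change_info() in fs/nfsd/xdr4.h. Roughly (paraphrased and simplified; the exact fields vary between kernel versions):

```
static inline void
set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
{
	/* This is the line the "kernel BUG at fs/nfsd/xdr4.h" report points at. */
	BUG_ON(!fhp->fh_pre_saved);

	/* Copy the saved pre/post-op directory attributes into the reply. */
	cinfo->atomic = (u32)fhp->fh_post_saved;
	cinfo->before_change = fhp->fh_pre_change;
	cinfo->after_change = fhp->fh_post_change;
}
```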
Oh! I do see one way that this can happen. If the OPEN was not a create request, and the nfsd_lookup fails, we'll call fh_fill_both_attrs(), which does this:

```
void fh_fill_both_attrs(struct svc_fh *fhp)
{
	fh_fill_post_attrs(fhp);
	if (!fhp->fh_post_saved)
		return;
	fhp->fh_pre_change = fhp->fh_post_change;
	fhp->fh_pre_mtime = fhp->fh_post_attr.mtime;
	fhp->fh_pre_ctime = fhp->fh_post_attr.ctime;
	fhp->fh_pre_size = fhp->fh_post_attr.size;
	fhp->fh_pre_saved = true;
}
```

...if fh_fill_post_attrs also fails at that point, then fh_post_saved will be false and fh_pre_saved will never be set to true. So this is likely related to the disk failure that happened just before: both operations failed, and fh_pre_saved never got set.

The code to do this changed in the last few years to lean more heavily on getattr operations to fetch this info. Prior to that, it would fetch this info directly out of the inode, so we could assert on fh_pre_saved being set like this. Now, that's no longer the case -- we need to be able to cope with this failing.

I think we probably just need to rip out the BUG_ON, but we may need to do some further work to ensure that we comply with the protocol. It's not clear to me what you're supposed to do when you can't fetch the change info. I'll have to dive into the spec.
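To make the proposed direction concrete, here is a minimal sketch of what "rip out the BUG_ON" could look like, assuming we simply report the change info as non-atomic whenever the pre-op attributes are missing. This is only an illustration of the idea, not the actual upstream patch:

```
/*
 * Sketch only: rather than BUG()ing when fh_pre_saved is false, report
 * the change info as non-atomic and return whatever values we do have.
 * Whether that is sufficient for protocol compliance is exactly the
 * open question above.
 */
static inline void
set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
{
	cinfo->atomic = (u32)(fhp->fh_pre_saved && fhp->fh_post_saved);
	cinfo->before_change = fhp->fh_pre_change;
	cinfo->after_change = fhp->fh_post_change;
}
```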
As I had not successfully reproduced this bug by loop-running generic/650 alone, I submitted 10 jobs to run the full xfstests suite (on the assumption that earlier tests affect generic/650) on the latest -340.el9, but unfortunately there was no reproduction.

Link to Beaker jobs - full xfstests runs:
https://url.corp.redhat.com/8c2741d

Link to Beaker jobs - generic/650 solo runs:
https://url.corp.redhat.com/44a7d4b
Yeah, I think this is going to be a hard one to reliably reproduce. You basically need a disk failure to coincide with a non-O_CREAT open request on a dentry that is already in cache.

The current upstream patch and discussion are here:
https://lore.kernel.org/linux-nfs/963cc8896ff6a6759c57f1d97e51f35972d4fc6d.camel@kernel.org/T/#t

I'll see if I can get this in for v6.5.
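For orientation, this is roughly the shape of the non-create OPEN branch in do_open_lookup() that the analysis above walks through. It is heavily paraphrased, with error-status handling elided, and the argument list and conditions in the RHEL kernel may differ:

```
	if (open->op_create) {
		/* ... create path elided ... */
	} else {
		/* Plain (non-O_CREAT) OPEN: look up the existing dentry. */
		status = nfsd_lookup(rqstp, current_fh, open->op_fname,
				     open->op_fnamelen, *resfh);
		/*
		 * The OPEN reply still has to carry change_info for the
		 * directory, so the "pre" attrs are synthesized from a
		 * fresh getattr via fh_fill_both_attrs().  If that getattr
		 * also fails (e.g. due to the disk error seen in the log),
		 * fh_pre_saved stays false and the later set_change_info()
		 * call trips the BUG_ON.
		 */
		fh_fill_both_attrs(current_fh);
	}
```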