Bug 2223560
Summary: | [xfstests/nfs generic/650] kernel BUG at fs/nfsd/xdr4.h:780! | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Boyang Xue <bxue> |
Component: | kernel | Assignee: | Jeff Layton <jlayton> |
kernel sub component: | NFS | QA Contact: | Boyang Xue <bxue> |
Status: | CLOSED MIGRATED | Docs Contact: | |
Severity: | high | ||
Priority: | medium | CC: | jlayton, nfs-team, xzhou, yoyang |
Version: | 9.3 | Keywords: | MigratedToJIRA, Triaged |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-09-23 11:55:41 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2209174 |
Description
Boyang Xue
2023-07-18 08:40:35 UTC
I'm pretty sure the call chain where we tripped the BUG is:

    nfsd4_open
      do_open_lookup
        set_change_info

The problem is that I can't see any way that we could end up calling set_change_info with fh_pre_saved set to false. All of the paths that lead to that call ensure that it's set to true, AFAICT.

Another possibility is that something raced in and called fh_clear_pre_post_attrs while we were processing here. That seems unlikely, though, as nothing else should be operating on this compound.

Either way, it seems plausible that the apparent disk failure in the log messages above this point may have been a factor here. A core would be really helpful if one turns up!

Oh! I do see one way that this can happen. If the OPEN was not a create request and the nfsd_lookup fails, we'll call fh_fill_both_attrs(), which does this:

```c
void fh_fill_both_attrs(struct svc_fh *fhp)
{
	fh_fill_post_attrs(fhp);
	if (!fhp->fh_post_saved)
		return;
	fhp->fh_pre_change = fhp->fh_post_change;
	fhp->fh_pre_mtime = fhp->fh_post_attr.mtime;
	fhp->fh_pre_ctime = fhp->fh_post_attr.ctime;
	fhp->fh_pre_size = fhp->fh_post_attr.size;
	fhp->fh_pre_saved = true;
}
```

...if fh_fill_post_attrs also fails at that point, then fh_post_saved will be false and fh_pre_saved will never be set to true. So this is likely related to the disk failure that happened just before: both operations failed, and fh_pre_saved never got set.

The code to do this changed in the last few years to lean more heavily on getattr operations to fetch this info. Prior to that, it would fetch the info directly out of the inode, so we could assert on fh_pre_saved being set like this. Now that's no longer the case -- we need to be able to cope with this failing.

I think we probably just need to rip out the BUG_ON, but we may need to do some further work to ensure that we comply with the protocol. It's not clear to me what you're supposed to do when you can't fetch the change info. I'll have to dive into the spec.
As I had not managed to reproduce this bug by running generic/650 in a loop on its own, I submitted 10 jobs running the full xfstests suite (on the assumption that the preceding tests affect generic/650) on the latest -340.el9, but unfortunately there was no reproduction.

Link to Beaker jobs - full xfstests runs: https://url.corp.redhat.com/8c2741d
Link to Beaker jobs - generic/650 solo runs: https://url.corp.redhat.com/44a7d4b

Yeah, I think this is going to be a hard one to reliably reproduce. You basically need a disk failure to coincide with a non-O_CREAT open request on a dentry that is already in cache. The current upstream patch and discussion is here:

https://lore.kernel.org/linux-nfs/963cc8896ff6a6759c57f1d97e51f35972d4fc6d.camel@kernel.org/T/#t

I'll see if I can get this in for v6.5.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.