Bug 1763208
| Summary: | fuse log registers error with misleading "No such file or directory" when we interrupt a file copy | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> | |
| Component: | fuse | Assignee: | Csaba Henk <csaba> | |
| Status: | CLOSED ERRATA | QA Contact: | Bala Konda Reddy M <bmekala> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.5 | CC: | bmekala, csaba, puebele, rhs-bugs, rkothiya, sheggodu, storage-qa-internal, ubansal, vdas | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.5.z Batch Update 1 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-6.0-26 | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1790997 (view as bug list) | Environment: | ||
| Last Closed: | 2020-01-30 06:42:48 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1790997 | |||
|
Description
Nag Pavan Chilakam
2019-10-18 13:42:14 UTC
Quoting https://bugzilla.redhat.com/1699869#c61 : > This is an error that can occur with a reverse entry invalidation request. Such a request needs to specify a directory inode and a name. The effect of the request is that the cached lookup data associating the name in the dir inode with some inode is dropped. > > There is no guarantee though that the nodeid via which the reverse entry invalidation request intends to specify the dir inode actually resolves to a dir inode: it can be the case that it does not resolve to an inode, or it does resolve to an inode but that's not the inode of a directory. In former case the write(2) to /dev/fuse on behalf of the request fails with ENOENT; in latter case, it fails with ENOTDIR. > > This is the latter case. Probably there is a race? I can imagine that kernel forgets the directory inode on its own (ie. on its own agenda and not due to an invalidation request) and simultaneously invalidation of some entry of this dir is triggered in glusterfs, taking effect earlier than processing the respective FORGET / BATCH_FORGET messages. If that's the case, then it's most likely a harmless phenomenon. So the phenomenon is harmless and the error message is correct -- we do need to report "No such file or directory" if the system fails the syscall with ENOENT. (Note that this is admittedly odd, as write(2) / write(3p) does not specify ENOENT as possible errno for the write syscall; it's a FUSE idiosyncrasy to use ENOENT for indication of certain system states as part of communication through /dev/fuse.) However, we acknowledge that flooding the logs with this message is annoying. The following upstream patches seek to mitigate this issue: https://review.gluster.org/23187, "fuse: Set limit on invalidate queue size" https://review.gluster.org/23461, "glusterfs/fuse: Reduce the default lru-limit value" Please let me know what action plan to follow (should I provide a custom build with backports of these? Or wait while these patches hit downstream? Actually the first change is the important one, as the effect of the second can be reached also by setting a volume option). *** Bug 1773901 has been marked as a duplicate of this bug. *** I investigated the case. The conjecture put forward in comment #3 was not correct. It's not invalidation what happens, but interruption. If a system call is interrupted, glusterfs gets an INTERRUPT message. From that point on handling the interrupted message and handling the interrupt are in race (that's why number of threads makes a difference). If handling the interrupted message wins the race, then by the time the answer is sent to the interrupt the reference to the original message is dangling. This is indicated by the kernel by failing the write to /dev/fuse with ENOENT. It's a harmless phenomenon. So even if the analysis in comment #3 was incorrect, the conclusion is still valid: it's a FUSE idiosyncrasy. To improve the UX, I'll will send a patch to upstream that degrades the log message from Error to Warning and conceals the errno string which feels counter-intuitive in this situation. There will be no need to backport the patch. I will close the bug with NOTABUG on QA ack. Upstream change: https://review.gluster.org/23974 (In reply to Csaba Henk from comment #17) > I investigated the case. The conjecture put forward in comment #3 was not > correct. It's not invalidation what happens, but interruption. If a system > call is interrupted, glusterfs gets an INTERRUPT message. From that point on > handling the interrupted message and handling the interrupt are in race > (that's why number of threads makes a difference). If handling the > interrupted message wins the race, then by the time the answer is sent to > the interrupt the reference to the original message is dangling. This is > indicated by the kernel by failing the write to /dev/fuse with ENOENT. It's > a harmless phenomenon. So even if the analysis in comment #3 was incorrect, > the conclusion is still valid: it's a FUSE idiosyncrasy. > > To improve the UX, I'll will send a patch to upstream that degrades the log > message from Error to Warning and conceals the errno string which feels > counter-intuitive in this situation. There will be no need to backport the > patch. > > I will close the bug with NOTABUG on QA ack. Hi Csaba, After putting all the effort required from engineering side to get the bug to onqa and QE spending test cycle to validate the fix, closing as "NOTABUG" is quite unproductive. Also, I don't see a reason closing this as NOTABUG, as this is a bug given that we are seeing messages which can be false alarms. If we believe these messages occur due to fuse limitations, then we should either suppress such messages, without suppressing genuinely expected messages. For Now we will have to failqa the bug, and then we can take the necessary action (In reply to Csaba Henk from comment #17) > I investigated the case. The conjecture put forward in comment #3 was not > correct. It's not invalidation what happens, but interruption. If a system > call is interrupted, glusterfs gets an INTERRUPT message. From that point on > handling the interrupted message and handling the interrupt are in race > (that's why number of threads makes a difference). If handling the > interrupted message wins the race, then by the time the answer is sent to > the interrupt the reference to the original message is dangling. This is > indicated by the kernel by failing the write to /dev/fuse with ENOENT. It's > a harmless phenomenon. So even if the analysis in comment #3 was incorrect, > the conclusion is still valid: it's a FUSE idiosyncrasy. > > To improve the UX, I'll will send a patch to upstream that degrades the log > message from Error to Warning and conceals the errno string which feels > counter-intuitive in this situation. There will be no need to backport the > patch. Also, in the case of moving the logs to warning, are we making sure, that we won't be supressing genuine errors (ie in case of really missing files/dirs)? If the answer is yes, then you can get the patch merged and this same bug can have qe to retest, and see if the logs are being suppressed or not(ie moved to warning) as part of validation > > I will close the bug with NOTABUG on QA ack. We decided to backport the change mentioned in comment #18. Then obviously the resolution will be something other than NOTABUG. However, the patch needs to be revised/refined to address some concerns (of reviewers and of myself). When the new revision is submitted, I'll answer the remaining questions in light of that. New revision of patch addresses the following concerns: - suppressing genuinely expected messages: The degradation of /dev/fuse write errors happens context sensitively. That is, caller of the convenience function can specify which errnos are to be treated as non-errors for glusterfs. - flooding of logs: degraded log messages are degraded to DEBUG so they won't be seen under normal conditions. (Note that the frequency reported in comment #14 and the 1-1 mapping of discussed log messages with SIGINT-s does not sound to be a real flooding situation, but it will be even better now.) Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0288 |