Bug 1762161
| Summary: | ./tests/basic/playground/template-xlator-sanity.t fails on RHEL 8 fuse mount | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ravishankar N <ravishankar> |
| Component: | fuse | Assignee: | Csaba Henk <csaba> |
| Status: | CLOSED ERRATA | QA Contact: | Prasanth <pprakash> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | unspecified | CC: | csaba, nchilaka, pprakash, puebele, rcyriac, rhs-bugs, sabose, sheggodu, srakonde, storage-qa-internal, vdas |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.5.z Batch Update 2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-6.0-32, kernel-4.18.0-167.el8 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-16 06:19:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ravishankar N
2019-10-16 04:59:17 UTC
Tried to reproduce on a system I could set up myself that is as close to the specs as possible: CentOS 8 with a glusterfs build from the current rhgs-3.5.0 branch (tag v6.0-21). It did not occur. Please provide a system where it does occur.

Managed to reproduce the issue on my CentOS 8 VM. The mistake I had made so far in reproduction was issuing

    $ stat -c '%h' *

rather than

    $ stat -c '%h' FILE

With the latter, the issue is now properly reproduced.

I checked what happens across a range of distros (CentOS, Fedora, Arch) and kernels (from 3.10 to 5.3), and it was only CentOS 8 where there was no GETATTR following the LINK message. So the nlink=3 reflects the inode according to the kernel's own internal state after the link(2); the Glusterfs client was not consulted about it.

I continue the investigation regarding
- why the CentOS/RHEL* 8 kernel is not sending GETATTR;
- why the CentOS/RHEL 8 kernel thinks nlink=3 for this inode;
- whether it is the CentOS/RHEL 8 kernel that's buggy, or whether the Glusterfs code relies on some tacit assumption which is met by all other kernels but is not guaranteed.

* I can confirm the nlink=3 phenomenon also on RHEL 8, but I did not take a fusedump there, so I don't know for sure what the communication between glusterfs and the fuse vfs looks like. However, it's highly likely it's the same as in CentOS 8, because these two OSes are very similar and exactly these two produce nlink=3.

I checked RHEL8 kernel changes related to fuse. I managed to narrow down the cause of the issue to two commits. Trying to verify my findings on a live system.

Current status -- upstream: I found the kernel commit which brought in this behavior.
Original commit to linux-fuse: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/?h=for-next&id=2f1e81965fd0f672c3246e751385cdfe8f86bbee
Commit that merged it upstream: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9b5cf826ef8b607d452ba7bf683ae5510a745232
This merge then made it into kernel 4.20. However, recent kernels don't exhibit the described behavior, so a fix was probably provided for it later upstream. Continuing research to clarify this.

In comment #10 I claimed "However, recent kernels don't exhibit the described behavior". I can now narrow this down: the behavior doesn't occur on 4.20 either. So by the end of the release cycle which brought in the issue, upstream had already delivered a fix, and there is no upstream kernel release that is affected by this issue. I'm still working to find out what that fix is.

Kernel patches 18127429 (v4.20-rc1~45^2~5)¹ and 2f1e8196 (v4.20-rc1~45^2~3)² together produce correct behavior. They both got merged upstream together via 9b5cf826 (v4.20-rc1~45)³. So upstream there was no issue with these changes debuting in v4.20. However, 2f1e8196 without 18127429 causes the buggy behavior indicated by the subject testcase (and further stale cache phenomena can occur). 2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8 kernel, while 2f1e8196 wasn't. So RHEL/CentOS 8 kernels are affected by the bug; it is expected to be resolved by an upcoming kernel update including 18127429.

*It is inadvisable to use the Glusterfs fuse client on affected kernels, as the bug might produce various stale-cache-related issues.*

¹: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18127429, bitops: protect variables in set_mask_bits() macro
²: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2f1e8196, fuse: allow fine grained attr cache invaldation
³: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9b5cf826, Merge tag 'fuse-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

(In reply to Csaba Henk from comment #15)
> Kernel patches 18127429 (v4.20-rc1~45^2~5)¹ and 2f1e8196 (v4.20-rc1~45^2~3)²
> together produce correct behavior. They both got merged upstream together via
> 9b5cf826 (v4.20-rc1~45)³. So upstream there was no issue with these changes
> debuting in v4.20. However, 2f1e8196 without 18127429 causes the buggy
> behavior indicated by subject testcase (and further stale cache phenomena
> can occur). 2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8
> kernel, while 2f1e8196 wasn't. So RHEL/CentOS 8 kernels are affected by the
> bug; it is expected to be resolved by an upcoming kernel update including
> 18127429.

When? Which RHEL BZ is tracking this? We need a bug for RHEL 8.2 + a clone to backport to RHEL 8.1.z, right?
Or do we assume we'll support RHEL 8.2 only?
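
The link-count check used earlier in this thread to reproduce the issue can be sketched as a small shell function. This is a minimal sketch, not the actual test case: the function name, the FILE/HARDLINK names and the mount point in the usage note are made up for illustration, and whether this exact sequence shows the stale count on an affected kernel is an assumption. On a filesystem that reports attributes correctly, a fresh file with one additional hard link has a link count of 2.

```shell
#!/bin/sh
# Sketch: report the hard-link count of a file right after link(2).
# The function name and the FILE/HARDLINK names are illustrative only.

# nlink_after_link DIR
#   Creates a scratch directory under DIR, creates FILE in it, hard-links
#   it once, and prints the link count that stat reports for FILE.
nlink_after_link() {
    work=$(mktemp -d "$1/nlink-check.XXXXXX") || return 1
    : > "$work/FILE"                    # new file: one link
    ln "$work/FILE" "$work/HARDLINK"    # second name for the same inode
    stat -c '%h' "$work/FILE"           # stat the file by name, not via a glob
    rm -rf "$work"
}
```

Run against a fuse mount, for example `nlink_after_link /mnt/glusterfs` (a hypothetical path), anything other than 2 would suggest the kernel is answering from its own cached state.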

Hi Yaniv, there is no RHEL bz for it yet, we'll get that covered. Can you please give a pointer that sheds light on how RHEL 8.* subreleases map to el8 kernels?

(In reply to Csaba Henk from comment #15)
> 2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8
> kernel, while 2f1e8196 wasn't.

Of course this should read as "2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8 kernel, while 18127429 wasn't".

(In reply to Yaniv Kaul from comment #17)
> When? Which RHEL BZ is tracking this? We need a bug for RHEL 8.2 + a clone
> to backport to RHEL 8.1.z, right?
> Or do we assume we'll support RHEL 8.2 only?

A backport request has been raised at BZ 1694161 to backport the fix to RHEL-8.1.z.

(In reply to Sunil Kumar Acharya from comment #20)
> Backport request has been raised at BZ 1694161 to backport the fix to
> RHEL-8.1.z.

Given https://bugzilla.redhat.com/show_bug.cgi?id=1694161#c33, our only option is to support RHEL 8.2 only.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2572
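
The header of this bug lists kernel-4.18.0-167.el8 under Fixed In Version. A minimal sketch of comparing an el8 kernel release string against that build follows. It assumes the release has the form printed by `uname -r` on el8 (X.Y.Z-N.el8...), and the function name is made up for illustration. The result is only meaningful for 4.18 based el8 kernels; the thread reports no affected upstream release.

```shell
#!/bin/sh
# Sketch: is an el8 kernel release at or above the build named in
# "Fixed In Version" (kernel-4.18.0-167.el8)? Function name is illustrative.

# kernel_has_fix RELEASE
#   RELEASE is a string such as the output of `uname -r`.
#   Succeeds (exit status 0) if RELEASE is 4.18.0-167 or newer.
kernel_has_fix() {
    fixed=4.18.0-167
    # Keep only the leading "X.Y.Z-N" part, dropping ".el8.x86_64" and similar.
    current=$(printf '%s\n' "$1" |
        sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/')
    # Version-sort the two strings; the fix is present if the fixed
    # build is the lower (or equal) one.
    lowest=$(printf '%s\n%s\n' "$fixed" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}
```

Typical use would be `kernel_has_fix "$(uname -r)" || echo "kernel predates 4.18.0-167.el8"`.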