Description of problem:
./tests/basic/playground/template-xlator-sanity.t fails when run on a RHEL-8 machine. The test creates a custom volfile and invokes ./tests/basic/rpc-coverage.sh on the fuse mount. In rpc-coverage.sh, the failing test is the function test_hardlink(), where we get a hardlink count of 3 instead of the expected count of 2.

Version-Release number of selected component (if applicable):
rhgs-3.5.0 on a RHEL-8 machine.

How reproducible:
Always.

Steps to Reproduce:
Either run the .t and debug, or run this simplified reproducer:
1. Create a custom glusterfs volfile with only the following xlators:
------
volume posix
    type storage/posix
    option directory $B0/single-brick
end-volume

volume template
    type playground/template
    subvolumes posix
end-volume
------
2. Fuse mount it.
3. Create 'FILE' on the fuse mount: 'touch /mnt/glusterfs/0/FILE'
4. Create and stat a hardlink: `ln FILE HARDLINK && stat -c '%h' FILE`

Actual results:
We get '3' as the hardlink count.

Expected results:
We expect '2' as the hardlink count.

Additional info:
1. If we add a sleep, it works fine: `ln FILE HARDLINK && sleep && stat -c '%h' FILE`
2. Another workaround is to use the --entry-timeout=0 --attribute-timeout=0 fuse mount options.
3. The .t test passes without the above workarounds on my Fedora 30 and RHEL-7.4 VMs on fuse mounts. On RHEL-8, it passed on a gnfs mount too. So I am wondering whether this is an rhgs fuse bug or a RHEL-8 fuse kernel bug.
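For convenience, steps 3 and 4 of the reproducer can be wrapped in a small script. This is only a sketch: the function name nlink_after_link and the DIR argument are made up for illustration, and DIR is assumed to point at the fuse mount (for example /mnt/glusterfs/0). When no argument is given it falls back to a scratch directory on the local filesystem, where the count is expected to be 2.

```shell
#!/bin/sh
# Sketch of the simplified reproducer (steps 3 and 4).
# Assumption: the first argument is the fuse mount point; without it the
# script runs in a local scratch directory, which only exercises the logic.

# Create FILE, hardlink it, and print FILE's link count immediately.
nlink_after_link() {
    dir=$1
    rm -f "$dir/FILE" "$dir/HARDLINK"
    touch "$dir/FILE" || return 1
    ln "$dir/FILE" "$dir/HARDLINK" || return 1
    stat -c '%h' "$dir/FILE"
}

DIR=${1:-$(mktemp -d)}
count=$(nlink_after_link "$DIR")
if [ "$count" = 2 ]; then
    echo "OK: hardlink count is 2"
else
    echo "BUG: hardlink count is $count, expected 2"
fi
```

Run against the fuse mount on an affected machine, this would be expected to take the BUG branch, matching the '3' reported above.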
Tried to reproduce on the closest-to-spec system I could set up myself: CentOS 8 with glusterfs built from the current rhgs-3.5.0 branch (tag v6.0-21). The issue did not occur. Please provide a system where it does occur.
Managed to reproduce the issue on my CentOS 8 VM. The mistake I had made so far in reproducing it was issuing
$ stat -c '%h' *
rather than
$ stat -c '%h' FILE
With the latter, the issue is now properly reproduced.
I checked what happens across a range of distros (CentOS, Fedora, Arch) and kernels (from 3.10 to 5.3), and CentOS 8 was the only one where no GETATTR followed the LINK message. So nlink=3 reflects the inode according to the kernel's own internal state after the link(2); the Glusterfs client was not consulted about it.

I continue the investigation regarding:
- why the CentOS/RHEL* 8 kernel is not sending GETATTR;
- why the CentOS/RHEL 8 kernel thinks nlink=3 for this inode;
- whether it is the CentOS/RHEL 8 kernel that is buggy, or rather the Glusterfs code relies on some tacit assumption which is met by all other kernels but is not guaranteed.

* I can confirm the nlink=3 phenomenon on RHEL 8 as well, but I did not take a fusedump there, so I don't know for sure what the communication between glusterfs and the fuse vfs looks like. However, it is highly likely to be the same as on CentOS 8, because these two OSes are very similar and exactly these two produce nlink=3.
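The check described above (is each LINK followed by a GETATTR?) can be sketched as a small filter. The input format is purely an assumption for illustration: one FUSE opcode name per line, as one might obtain after decoding a fusedump by some means. It is not the output of any particular tool, and the function name is made up.

```shell
# Reads FUSE opcode names from stdin, one per line (hypothetical format),
# and prints how many LINK messages had no GETATTR before the next LINK
# or the end of the input.
links_without_getattr() {
    awk '
        $1 == "LINK"    { if (pending) missing++; pending = 1; next }
        $1 == "GETATTR" { pending = 0 }
        END             { if (pending) missing++; print missing + 0 }
    '
}
```

A result of 0 would correspond to what was seen on the other distros, while a non-zero result would correspond to the CentOS 8 observation.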
I checked RHEL8 kernel changes related to fuse. I managed to narrow down the cause of the issue to two commits. Trying to verify my findings on a live system.
Current status -- upstream: I found the kernel commit which brought in this behavior.

Original commit to linux-fuse:
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/?h=for-next&id=2f1e81965fd0f672c3246e751385cdfe8f86bbee

Commit that merged it upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9b5cf826ef8b607d452ba7bf683ae5510a745232

This merge then made it into kernel 4.20. However, recent kernels don't exhibit the described behavior, so a fix was probably provided for it later upstream. Continuing research to clarify this.
In comment #10 I claimed "However, recent kernels don't exhibit the described behavior". I can now narrow this down: the behavior doesn't occur on 4.20 either. So by the end of the release cycle which brought in the issue, upstream had already delivered a fix, and no upstream kernel release is affected by this issue. I'm still working to find out what that fix is.
Kernel patches 18127429 (v4.20-rc1~45^2~5)¹ and 2f1e8196 (v4.20-rc1~45^2~3)² together produce correct behavior. They both got merged upstream together via 9b5cf826 (v4.20-rc1~45)³. So upstream there was no issue with these changes debuting in v4.20. However, 2f1e8196 without 18127429 causes the buggy behavior indicated by subject testcase (and further stale cache phenomena can occur). 2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8 kernel, while 2f1e8196 wasn't. So RHEL/CentOS 8 kernels are affected by the bug; it is expected to be resolved by an upcoming kernel update including 18127429.

*It is unadvised to use Glusterfs fuse client on affected kernels as the bug might produce various stale cache related issues.*

¹: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18127429, bitops: protect variables in set_mask_bits() macro
²: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2f1e8196, fuse: allow fine grained attr cache invaldation
³: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9b5cf826, Merge tag 'fuse-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
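The rule "2f1e8196 without 18127429 causes the buggy behavior" can be sketched as a check over a list of patch subjects. Backported patches usually carry different commit IDs, so this sketch matches on the subject lines quoted in the footnotes instead. The function and variable names are made up, and the input (one patch subject per line on stdin) is an assumption about what one can extract from a given kernel's history or changelog.

```shell
# Subjects of the two upstream patches, as quoted in the footnotes.
TRIGGER_SUBJECT='fuse: allow fine grained attr cache invaldation'
FIX_SUBJECT='bitops: protect variables in set_mask_bits() macro'

# Reads patch subjects from stdin and prints "affected" when the fuse
# patch is present without the bitops patch, "not affected" otherwise.
classify_kernel() {
    subjects=$(cat)
    if printf '%s\n' "$subjects" | grep -qF "$TRIGGER_SUBJECT" &&
       ! printf '%s\n' "$subjects" | grep -qF "$FIX_SUBJECT"; then
        echo affected
    else
        echo "not affected"
    fi
}
```

In a kernel git tree, something like `git log --format=%s | classify_kernel` could feed it, assuming the backports keep the upstream subject lines.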
(In reply to Csaba Henk from comment #15)
> Kernel patches 18127429 (v4.20-rc1~45^2~5)¹ and 2f1e8196 (v4.20-rc1~45^2~3)²
> together produce correct behavior. They bot got merged upstream together via
> 9b5cf826 (v4.20-rc1~45)³. So upstream there was no issue with these changes
> debuting in v4.20. However, 2f1e8196 without 18127429 causes the buggy
> behavior indicated by subject testcase (and further stale cache phenomena
> can occur). 2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8
> kernel, while 2f1e8196 wasn't. So RHEL/CentOS 8 kernels are affected by the
> bug; it is expected to be resolved by an upcoming kernel update including
> 18127429.

When? Which RHEL BZ is tracking this? We need a bug for RHEL 8.2 + a clone to backport to RHEL 8.1.z, right?
Or do we assume we'll support RHEL 8.2 only?
Hi Yaniv, there is no RHEL bz for it yet, we'll get that covered. Can you please give a pointer that sheds light on how RHEL 8.* subreleases map to el8 kernels?
(In reply to Csaba Henk from comment #15)
> 2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8
> kernel, while 2f1e8196 wasn't.

Of course this should read as "2f1e8196 was cherry-picked to the 4.18 based RHEL/CentOS 8 kernel, while 18127429 wasn't".
(In reply to Yaniv Kaul from comment #17)
> When? Which RHEL BZ is tracking this? We need a bug for RHEL 8.2 + a clone
> to backport to RHEL 8.1.z, right?
> Or do we assume we'll support RHEL 8.2 only?

Backport request has been raised at BZ 1694161 to backport the fix to RHEL-8.1.z.
(In reply to Sunil Kumar Acharya from comment #20)
> (In reply to Yaniv Kaul from comment #17)
> > When? Which RHEL BZ is tracking this? We need a bug for RHEL 8.2 + a clone
> > to backport to RHEL 8.1.z, right?
> > Or do we assume we'll support RHEL 8.2 only?
>
> Backport request has been raised at BZ 1694161 to backport the fix to
> RHEL-8.1.z.

Given https://bugzilla.redhat.com/show_bug.cgi?id=1694161#c33, our only option is to support RHEL-8.2 only.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2572