Description of problem:

6x3 Distributed-Replicate volume mounted on 4 clients via 4 different VIPs (NFSv4). Create 3 directories inside the mount point, then create ~3 lakh (300,000) files of 1 MB each (1 lakh per directory). Trigger the following from the 4 clients:

Client 1: du -sh (single iteration)
Client 2: ls -laRt in a loop
Client 3: find in a loop (while true; do find . -mindepth 1 -type f; done)
Client 4: rm -rf * (on the mount point)

At each iteration of du -sh, the reported size was increasing/inconsistent. The output of each iteration was:
--------------------
[root@rhs-client6 ganesha]# du -sh
18G     .
[root@rhs-client6 ganesha]# du -sh
44G     .
[root@rhs-client6 ganesha]# du -sh
69G     .
[root@rhs-client6 ganesha]# du -sh
38G     .
[root@rhs-client6 ganesha]# du -sh
29G     .
----------------------

rm -rf exited with the message "directory not empty", which was expected. At this point around 97173 files and 3 directories were present on the mount point. Took du -sh output again while lookups were still running from the other 2 clients; the output was still inconsistent. The output at each iteration was:
----------------------
# du -sh
79G     .
[root@rhs-client6 ganesha]# du -sh
98G     .
[root@rhs-client6 ganesha]# du -sh
142G    .
[root@rhs-client6 ganesha]# du -sh
114G    .
[root@rhs-client6 ganesha]# du -sh
93G     .
---------------------------
# df -hT
Filesystem                               Type            Size  Used  Avail Use% Mounted on
/dev/mapper/rhel_rhs--client6-root       xfs              50G  1.8G   49G    4% /
devtmpfs                                 devtmpfs        7.8G     0  7.8G    0% /dev
tmpfs                                    tmpfs           7.8G     0  7.8G    0% /dev/shm
tmpfs                                    tmpfs           7.8G   18M  7.8G    1% /run
tmpfs                                    tmpfs           7.8G     0  7.8G    0% /sys/fs/cgroup
/dev/sda1                                xfs            1014M  215M  800M   22% /boot
/dev/mapper/rhel_rhs--client6-home       xfs             1.8T  149M  1.8T    1% /home
tmpfs                                    tmpfs           1.6G     0  1.6G    0% /run/user/0
rhsqe-repo.lab.eng.blr.redhat.com:/      nfs4             50G  9.8G   41G   20% /opt
10.70.34.91:/mani2                       nfs4            4.6T  1.8T  2.9T   39% /mnt/ganesha

Version-Release number of selected component (if applicable):
# rpm -qa | grep ganesha
nfs-ganesha-2.5.5-9.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-9.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-15.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-9.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create a 6-node ganesha cluster.
2. Create a 6x3 Distributed-Replicate volume and export it via ganesha.
3. Mount the volume on 4 clients via 4 different VIPs (v4 protocol).
4. Create 3 directories on the mount point.
5. Create 1 lakh files in each directory from 3 clients. Run find from client 4.
6. When the IOs are complete, trigger du -sh from 1 client along with rm -rf, ls -laRt, and find from the other 3 clients.

Actual results:
At each iteration of du -sh, the size was increasing/inconsistent. Even with only du -sh running alongside parallel lookups, the size was inconsistent at each iteration.

Expected results:
du -sh should show consistent/correct output at each iteration.

Additional info:
----------
Mounted the same volume on FUSE via glusterfs. The output was consistent:
----------
[root@rhs-client6 fuse_mount]# du -sh
1.8T    .
[root@rhs-client6 fuse_mount]# du -sh
1.8T    .
---------
Mounted the same volume on Ganesha again and performed du -sh without running parallel lookups. Still showing inconsistent output at each iteration:
---------
[root@rhs-client6 ganesha]# du -sh
202G    .
[root@rhs-client6 ganesha]# du -sh
163G    .
[root@rhs-client6 ganesha]# du -sh
232G    .
Gluster health check report:
------------
[root@moonshine gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status

[     OK] Disk used percentage  path=/  percentage=23
[     OK] Disk used percentage  path=/var  percentage=23
[     OK] Disk used percentage  path=/tmp  percentage=23
[     OK] All peers are in connected state  connected_count=5 total_peer_count=5
[     OK] no gfid mismatch
[     OK] op-version is up to date  op_version=31302 max_op_version=31302
[     OK] The maximum size of core files created is set to unlimted.
[     OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 14172/glusterd
[     OK] Ports open for glusterfsd:
          3:tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 2724/glusterfsd
          4:tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN 19962/glusterfsd
[  ERROR] Report failure  report=report_check_worker_restarts
[     OK] Glusterd is running  uptime_sec=506203
[WARNING] Errors in Glusterd log file  num_errors=22
[WARNING] Warnings in Glusterd log file  num_warning=9
[     OK] No errors seen at network card
[     OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file  num_errors=38
[WARNING] Warnings in Glusterd log file  num_warnings=16
....
----------------
[root@tettnang gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status

[     OK] Disk used percentage  path=/  percentage=44
[     OK] Disk used percentage  path=/var  percentage=44
[     OK] Disk used percentage  path=/tmp  percentage=44
[     OK] All peers are in connected state  connected_count=5 total_peer_count=5
[     OK] no gfid mismatch
[ NOT OK] Failed to check op-version
[     OK] The maximum size of core files created is set to unlimted.
[     OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 16591/glusterd
[     OK] Ports open for glusterfsd:
          3:tcp 0 0 0.0.0.0:49153 0.0.0.0:* LISTEN 20713/glusterfsd
          4:tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN 19648/glusterfsd
[  ERROR] Report failure  report=report_check_worker_restarts
[     OK] Glusterd is running  uptime_sec=506203
[WARNING] Errors in Glusterd log file  num_errors=27
[WARNING] Warnings in Glusterd log file  num_warning=9
[     OK] No errors seen at network card
[     OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file  num_errors=43
[WARNING] Warnings in Glusterd log file  num_warnings=16
....
----------------------
[root@zod gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status

[     OK] Disk used percentage  path=/  percentage=29
[     OK] Disk used percentage  path=/var  percentage=29
[     OK] Disk used percentage  path=/tmp  percentage=29
[     OK] All peers are in connected state  connected_count=5 total_peer_count=5
[     OK] no gfid mismatch
[     OK] op-version is up to date  op_version=31302 max_op_version=31302
[     OK] The maximum size of core files created is set to unlimted.
[     OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 13327/glusterd
[     OK] Ports open for glusterfsd:
          3:tcp 0 0 0.0.0.0:49153 0.0.0.0:* LISTEN 26480/glusterfsd
          4:tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN 16570/glusterfsd
[  ERROR] Report failure  report=report_check_worker_restarts
[     OK] Glusterd is running  uptime_sec=506203
[WARNING] Errors in Glusterd log file  num_errors=22
[WARNING] Warnings in Glusterd log file  num_warning=9
[     OK] No errors seen at network card
[     OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file  num_errors=38
[WARNING] Warnings in Glusterd log file  num_warnings=16
....
-------------------------------
[root@yarrow gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status

[ NOT OK] Disk used percentage is exceeding threshold, consider deleting unnecessary data  path=/  percentage=90
[ NOT OK] Disk used percentage is exceeding threshold, consider deleting unnecessary data  path=/var  percentage=90
[ NOT OK] Disk used percentage is exceeding threshold, consider deleting unnecessary data  path=/tmp  percentage=90
[     OK] All peers are in connected state  connected_count=5 total_peer_count=5
[     OK] no gfid mismatch
[ NOT OK] Failed to check op-version
[     OK] The maximum size of core files created is set to unlimted.
[     OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 1909/glusterd
[     OK] Ports open for glusterfsd:
          3:tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 31317/glusterfsd
[  ERROR] Report failure  report=report_check_worker_restarts
[     OK] Glusterd is running  uptime_sec=506203
[WARNING] Errors in Glusterd log file  num_errors=35
[WARNING] Warnings in Glusterd log file  num_warning=9
[     OK] No errors seen at network card
[     OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file  num_errors=51
[WARNING] Warnings in Glusterd log file  num_warnings=16
....
--------------------------
Changing the summary of this BZ, as this can be hit by just creating 3 lakh files on Ganesha and performing du -sh. Each time it shows a different disk usage:

[root@rhs-client6 ganesha]# du -sh
294G    .
[root@rhs-client6 ganesha]# du -sh
168G    .
[root@rhs-client6 ganesha]# du -sh
97G     .
[root@rhs-client6 ganesha]# du -sh
64G     .
[root@rhs-client6 ganesha]# du -sh
113G    .

# df -hT
Filesystem                               Type            Size  Used  Avail Use% Mounted on
/dev/mapper/rhel_rhs--client6-root       xfs              50G  2.0G   49G    4% /
devtmpfs                                 devtmpfs        7.8G     0  7.8G    0% /dev
tmpfs                                    tmpfs           7.8G     0  7.8G    0% /dev/shm
tmpfs                                    tmpfs           7.8G   18M  7.8G    1% /run
tmpfs                                    tmpfs           7.8G     0  7.8G    0% /sys/fs/cgroup
/dev/sda1                                xfs            1014M  215M  800M   22% /boot
/dev/mapper/rhel_rhs--client6-home       xfs             1.8T  149M  1.8T    1% /home
tmpfs                                    tmpfs           1.6G     0  1.6G    0% /run/user/0
rhsqe-repo.lab.eng.blr.redhat.com:/      nfs4             50G  9.8G   41G   20% /opt
10.70.34.91:/mani2                       nfs4            4.6T  341G  4.3T    8% /mnt/ganesha
moonshine.lab.eng.blr.redhat.com:/mani2  fuse.glusterfs  4.6T  341G  4.3T    8% /mnt/fuse_mount
-----------------------
Stats on a FUSE mount of the same volume and data set:
-----------------------
[root@rhs-client6 fuse_mount]# du -sh
293G    .
[root@rhs-client6 fuse_mount]# du -sh
293G    .
Proposing this as a blocker, since it has application-side impact: the ganesha mount shows incorrect usage on every run.
Marking it as a regression, as the issue is not observed with the 3.3.1 ganesha bits.
Clearing the needinfo as discussed over IRC
I think there's a misunderstanding. The recent patch didn't enable readdir_plus usage in Ganesha; that's been in for a while, and is a compile-time option. Instead, it enabled readdir chunk reloading using readdir_plus, if it was built in. This is also a compile-time option, and can't be disabled at runtime. Disabling readdir_plus on the cluster won't change Ganesha's behavior at all; it will still call the readdir_plus() API, and will still reload chunks using that API. I tested by commenting out FSAL/FSAL_GLUSTER/main.c:75, and this fixed du for me, so it's somehow related to the chunk reloading. I'll try to debug it.
So this (and bug #1613890) are caused by the readdir_plus chunking changes, somehow. It seems, in my limited testing, that Gluster is returning different dirents for the same readdir offsets on subsequent readdirs. That is, it's re-ordering dirents in a directory with no directory mutations. I'm not positive of this yet, but if this is the case, then the readdir_plus chunking optimization cannot be used with gluster. My time is limited this week, so I won't be able to work on this until next week.
Bug #1558974 is also likely the same issue
Update on this issue. This is from the POSIX definition of seekdir():

    The seekdir() function shall set the position of the next readdir()
    operation on the directory stream specified by dirp to the position
    specified by loc. The value of loc should have been returned from an
    earlier call to telldir() using the same directory stream. The new
    position reverts to the one associated with the directory stream when
    telldir() was performed. If the value of loc was not obtained from an
    earlier call to telldir(), or if a call to rewinddir() occurred between
    the call to telldir() and the call to seekdir(), the results of
    subsequent calls to readdir() are unspecified.

From this, it's clear to me that d_off (which is not defined in POSIX, but is defined on Linux as the value that telldir() would return at that point in the directory stream) is not required to be stable across directory streams. Currently, Ganesha uses a separate directory stream for each readdir call. This clearly works on Linux and Ceph, but is technically a violation of POSIX.

This is not something we can fix, either in Ganesha or in Gluster, in the very short timeframe for 3.4, so readdir chunking will have to be disabled for this release. A proper fix can be done and, if necessary, backported.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2610