Bug 1613273
| Summary: | [Ganesha] Inconsistent output of du -sh on each iteration | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Manisha Saini <msaini> |
| Component: | nfs-ganesha | Assignee: | Kaleb KEITHLEY <kkeithle> |
| Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.4 | CC: | apaladug, dang, ffilz, grajoria, jthottan, mbenjamin, msaini, rhs-bugs, sankarshan, sheggodu, skoduri, storage-qa-internal, vdas |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | nfs-ganesha-2.5.5-10 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-04 06:55:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1503137, 1613890 | | |
Description
Manisha Saini
2018-08-07 10:59:29 UTC
Gluster-Health-check report:

```
[root@moonshine gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status
[ OK] Disk used percentage path=/ percentage=23
[ OK] Disk used percentage path=/var percentage=23
[ OK] Disk used percentage path=/tmp percentage=23
[ OK] All peers are in connected state connected_count=5 total_peer_count=5
[ OK] no gfid mismatch
[ OK] op-version is up to date op_version=31302 max_op_version=31302
[ OK] The maximum size of core files created is set to unlimted.
[ OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 14172/glusterd
[ OK] Ports open for glusterfsd: 3:tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 2724/glusterfsd 4:tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN 19962/glusterfsd
[ ERROR] Report failure report=report_check_worker_restarts
[ OK] Glusterd is running uptime_sec=506203
[WARNING] Errors in Glusterd log file num_errors=22
[WARNING] Warnings in Glusterd log file num_warning=9
[ OK] No errors seen at network card
[ OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file num_errors=38
[WARNING] Warnings in Glusterd log file num_warnings=16
....
----------------
[root@tettnang gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status
[ OK] Disk used percentage path=/ percentage=44
[ OK] Disk used percentage path=/var percentage=44
[ OK] Disk used percentage path=/tmp percentage=44
[ OK] All peers are in connected state connected_count=5 total_peer_count=5
[ OK] no gfid mismatch
[ NOT OK] Failed to check op-version
[ OK] The maximum size of core files created is set to unlimted.
[ OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 16591/glusterd
[ OK] Ports open for glusterfsd: 3:tcp 0 0 0.0.0.0:49153 0.0.0.0:* LISTEN 20713/glusterfsd 4:tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN 19648/glusterfsd
[ ERROR] Report failure report=report_check_worker_restarts
[ OK] Glusterd is running uptime_sec=506203
[WARNING] Errors in Glusterd log file num_errors=27
[WARNING] Warnings in Glusterd log file num_warning=9
[ OK] No errors seen at network card
[ OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file num_errors=43
[WARNING] Warnings in Glusterd log file num_warnings=16
....
----------------------
[root@zod gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status
[ OK] Disk used percentage path=/ percentage=29
[ OK] Disk used percentage path=/var percentage=29
[ OK] Disk used percentage path=/tmp percentage=29
[ OK] All peers are in connected state connected_count=5 total_peer_count=5
[ OK] no gfid mismatch
[ OK] op-version is up to date op_version=31302 max_op_version=31302
[ OK] The maximum size of core files created is set to unlimted.
[ OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 13327/glusterd
[ OK] Ports open for glusterfsd: 3:tcp 0 0 0.0.0.0:49153 0.0.0.0:* LISTEN 26480/glusterfsd 4:tcp 0 0 0.0.0.0:49154 0.0.0.0:* LISTEN 16570/glusterfsd
[ ERROR] Report failure report=report_check_worker_restarts
[ OK] Glusterd is running uptime_sec=506203
[WARNING] Errors in Glusterd log file num_errors=22
[WARNING] Warnings in Glusterd log file num_warning=9
[ OK] No errors seen at network card
[ OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file num_errors=38
[WARNING] Warnings in Glusterd log file num_warnings=16
....
-------------------------------
[root@yarrow gluster-health-report-master]# gluster-health-report
Loaded reports: coredump, disk_usage, errors_in_logs, firewall-check, georep, gfid-mismatch-dht-report, glusterd-op-version, glusterd-peer-disconnect, glusterd, glusterd_volume_version_cksum_errors, kernel_issues, memory_usage, errors_in_logs, ifconfig, nic-health, process_status
[ NOT OK] Disk used percentage is exceeding threshold, consider deleting unnecessary data path=/ percentage=90
[ NOT OK] Disk used percentage is exceeding threshold, consider deleting unnecessary data path=/var percentage=90
[ NOT OK] Disk used percentage is exceeding threshold, consider deleting unnecessary data path=/tmp percentage=90
[ OK] All peers are in connected state connected_count=5 total_peer_count=5
[ OK] no gfid mismatch
[ NOT OK] Failed to check op-version
[ OK] The maximum size of core files created is set to unlimted.
[ OK] Ports open for glusterd: tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 1909/glusterd
[ OK] Ports open for glusterfsd: 3:tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 31317/glusterfsd
[ ERROR] Report failure report=report_check_worker_restarts
[ OK] Glusterd is running uptime_sec=506203
[WARNING] Errors in Glusterd log file num_errors=35
[WARNING] Warnings in Glusterd log file num_warning=9
[ OK] No errors seen at network card
[ OK] No errors seen at network card 0
[WARNING] Errors in Glusterd log file num_errors=51
[WARNING] Warnings in Glusterd log file num_warnings=16
....
```

Changing the summary of this BZ, as the issue can be hit by simply creating 3 lakh (300,000) files on the Ganesha mount and running du -sh. Each run reports a different disk usage:

```
[root@rhs-client6 ganesha]# du -sh
294G    .
[root@rhs-client6 ganesha]# du -sh
168G    .
[root@rhs-client6 ganesha]# du -sh
97G     .
[root@rhs-client6 ganesha]# du -sh
64G     .
[root@rhs-client6 ganesha]# du -sh
113G    .

# df -hT
Filesystem                               Type            Size   Used  Avail  Use%  Mounted on
/dev/mapper/rhel_rhs--client6-root       xfs              50G   2.0G    49G    4%  /
devtmpfs                                 devtmpfs        7.8G      0   7.8G    0%  /dev
tmpfs                                    tmpfs           7.8G      0   7.8G    0%  /dev/shm
tmpfs                                    tmpfs           7.8G    18M   7.8G    1%  /run
tmpfs                                    tmpfs           7.8G      0   7.8G    0%  /sys/fs/cgroup
/dev/sda1                                xfs            1014M   215M   800M   22%  /boot
/dev/mapper/rhel_rhs--client6-home       xfs             1.8T   149M   1.8T    1%  /home
tmpfs                                    tmpfs           1.6G      0   1.6G    0%  /run/user/0
rhsqe-repo.lab.eng.blr.redhat.com:/      nfs4             50G   9.8G    41G   20%  /opt
10.70.34.91:/mani2                       nfs4            4.6T   341G   4.3T    8%  /mnt/ganesha
moonshine.lab.eng.blr.redhat.com:/mani2  fuse.glusterfs  4.6T   341G   4.3T    8%  /mnt/fuse_mount
```

Stats on the FUSE mount with the same volume and data set:

```
# du -sh
293G    .
[root@rhs-client6 fuse_mount]# du -sh
293G    .
```
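Not part of the original report, but the comparison above can be scripted. The minimal C sketch below does a one-level `du -s` by hand: it sums st_blocks for every entry in a single directory and scans the directory twice; on a mount whose directory listing is unstable, the two totals (or entry counts) will not match, which is the symptom reported here. The program and its names are illustrative only.

```c
/*
 * Minimal sketch (not from the bug report): a one-level "du -s" done by
 * hand. Scans one directory twice and compares the totals; mismatching
 * totals or entry counts indicate an unstable directory listing.
 * Error handling is kept minimal on purpose.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

static long long scan_once(const char *dir, long long *entries)
{
    DIR *d = opendir(dir);
    if (!d) {
        perror("opendir");
        exit(EXIT_FAILURE);
    }

    long long total = 0;
    *entries = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        struct stat st;
        if (lstat(path, &st) == 0) {
            total += (long long)st.st_blocks * 512;  /* allocated bytes, like du */
            (*entries)++;
        }
    }
    closedir(d);
    return total;
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";
    long long n1, n2;
    long long t1 = scan_once(dir, &n1);   /* first pass */
    long long t2 = scan_once(dir, &n2);   /* second pass over the same, unmodified data */
    printf("pass 1: %lld entries, %lld bytes\n", n1, t1);
    printf("pass 2: %lld entries, %lld bytes\n", n2, t2);
    printf("%s\n", (t1 == t2 && n1 == n2) ? "consistent" : "INCONSISTENT");
    return 0;
}
```

Compiled with `gcc -o du_check du_check.c` and pointed at the Ganesha mount shown in the df output (e.g. `./du_check /mnt/ganesha`), a healthy mount should print matching totals on both passes.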
Proposing this as a blocker, since it has an application-side impact: the Ganesha mount reports incorrect, different usage on every run.

Marking it as a regression, as the issue is not observed with the 3.3.1 Ganesha bits.

Clearing the needinfo, as discussed over IRC.

I think there's a misunderstanding. The recent patch didn't enable readdir_plus usage in Ganesha; that's been in for a while, and is a compile-time option. Instead, it enabled readdir chunk reloading using readdir_plus, if it was built in. This is also a compile-time option, and can't be disabled. Disabling readdir_plus on the cluster won't change Ganesha's behavior at all; it will still call the readdir_plus() API, and will still reload chunks using that API. I tested by commenting out FSAL/FSAL_GLUSTER/main.c:75 and this fixed du for me, so it's somehow related to the chunk reloading. I'll try and debug.

So this (and bug #1613890) are caused by the readdir_plus chunking changes, somehow. It seems, in my limited testing, that Gluster is returning different dirents for the same readdir offsets on subsequent readdirs. That is, it's re-ordering dirents in a directory with no directory mutations. I'm not positive of this yet, but if this is the case, then the readdir_plus chunking optimization cannot be used with Gluster. My time is limited this week, so I won't be able to work on this until next week. Bug #1558974 is also likely the same issue.

Update on this issue. This is from the POSIX definition of seekdir():

> The seekdir() function shall set the position of the next readdir() operation on the directory stream specified by dirp to the position specified by loc. The value of loc should have been returned from an earlier call to telldir() using the same directory stream. The new position reverts to the one associated with the directory stream when telldir() was performed. If the value of loc was not obtained from an earlier call to telldir(), or if a call to rewinddir() occurred between the call to telldir() and the call to seekdir(), the results of subsequent calls to readdir() are unspecified.

From this, it's clear to me that d_off (which is not defined in POSIX, but is defined on Linux as the value that telldir() would return at that point in the directory stream) is not required to be stable across directory streams. Currently, Ganesha uses a separate directory stream for each readdir call. This clearly works on Linux and Ceph, but is technically a violation of POSIX. This is not something we can fix, either in Ganesha or in Gluster, in the very short timeframe for 3.4, so readdir chunking will have to be disabled for this release. A proper fix can be done and, if necessary, backported.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610
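To make the seekdir() point above concrete, here is a small standalone sketch. It is not Ganesha or Gluster code; it only imitates the cross-stream pattern the comment describes: record a telldir() position and the entry that follows it on one directory stream, then reuse that position via seekdir() on a second, independently opened stream on the same directory. POSIX only defines the result when the position comes from the same stream, so a filesystem is free to return a different entry after the seek, and that is exactly the case the chunk-reload-by-offset optimization assumes never happens.

```c
/*
 * Illustration only: checks whether a directory position taken from one
 * directory stream still points at the same entry when reused on a second,
 * independent stream. POSIX leaves this unspecified; stability here is a
 * filesystem-specific property that readdir chunk reloading relies on.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";   /* directory to probe */

    /* First stream: read a few entries, remembering the telldir() position
     * taken just before the last one and the name that came back after it. */
    DIR *d1 = opendir(dir);
    if (!d1) { perror("opendir"); return 1; }
    long pos = -1;
    char expected[256] = "";
    struct dirent *de;
    for (int i = 0; i < 5; i++) {
        long here = telldir(d1);
        de = readdir(d1);
        if (!de)
            break;
        pos = here;
        snprintf(expected, sizeof(expected), "%s", de->d_name);
    }
    closedir(d1);
    if (pos < 0) { fprintf(stderr, "directory too small\n"); return 1; }

    /* Second, independent stream on the same directory: seek to the saved
     * position and read. POSIX only guarantees this for a loc obtained from
     * telldir() on the *same* stream, so the result here is unspecified. */
    DIR *d2 = opendir(dir);
    if (!d2) { perror("opendir"); return 1; }
    seekdir(d2, pos);
    de = readdir(d2);
    printf("expected \"%s\", got \"%s\" -> offsets %s across streams\n",
           expected, de ? de->d_name : "(end of dir)",
           (de && !strcmp(expected, de->d_name)) ? "look stable" : "are NOT stable");
    closedir(d2);
    return 0;
}
```

Since Ganesha opens a separate directory stream for each readdir call, the second half of this sketch is the pattern whose result POSIX leaves unspecified, which is why the behavior can differ between backends such as Linux/Ceph and Gluster.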