+++ This bug was initially created as a clone of Bug #1365626 +++

Description of problem:
IO hang on ganesha mount during remove-brick operation

Version-Release number of selected component (if applicable):

[root@dhcp43-133 ~]# rpm -qa|grep glusterfs
glusterfs-libs-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-fuse-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-api-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-cli-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-ganesha-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-client-xlators-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-server-3.8.1-0.4.git56fcf39.el7rhgs.x86_64
glusterfs-geo-replication-3.8.1-0.4.git56fcf39.el7rhgs.x86_64

[root@dhcp43-133 ~]# rpm -qa|grep ganesha
nfs-ganesha-gluster-2.4-0.dev.26.el7rhgs.x86_64
nfs-ganesha-2.4-0.dev.26.el7rhgs.x86_64
glusterfs-ganesha-3.8.1-0.4.git56fcf39.el7rhgs.x86_64

How reproducible:
Once

Steps to Reproduce:
1. Create a 6x2 dist-rep volume and enable ganesha on the volume.
2. Do a subdir v4 mount on the client:
   mount -t nfs -o vers=4 10.70.40.192:/newvolume/subdir /mnt1470753422.46
3. Start creating nested directories and files:
   for i in {1..30}; do
       mkdir /mnt1470753422.46/a$i
       for j in {1..50}; do
           mkdir /mnt1470753422.46/a$i/b$j
           for k in {1..50}; do
               touch /mnt1470753422.46/a$i/b$j/c$k
           done
       done
   done
4. Start the remove-brick operation:
   gluster volume remove-brick newvolume replica 2 dhcp43-133.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick0 dhcp41-206.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick1 start
5. Once the remove-brick operation is complete, commit the brick removal:
   gluster volume remove-brick newvolume replica 2 dhcp43-133.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick0 dhcp41-206.lab.eng.blr.redhat.com:/bricks/brick1/newvolume_brick1 commit
6. Observe that the IO hangs on the client and the following messages are seen in /var/log/ganesha.log:

[root@dhcp46-206 ~]# ps -ef|grep mkdir
root      9288  9283  0 20:00 ?
00:00:02 bash -c cd /root && for i in {1..30}; do mkdir /mnt1470753422.46/a$i; for j in {1..50}; do mkdir /mnt1470753422.46/a$i/b$j; for k in {1..50}; do touch /mnt1470753422.46/a$i/b$j/c$k; done done done

09/08/2016 19:29:53 : epoch 57a9cca6 : dhcp43-133.lab.eng.blr.redhat.com : ganesha.nfsd-26092[dbus_heartbeat] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
09/08/2016 19:29:53 : epoch 57a9cca6 : dhcp43-133.lab.eng.blr.redhat.com : ganesha.nfsd-26092[dbus_heartbeat] glusterfs_close_my_fd :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
09/08/2016 19:29:53 : epoch 57a9cca6 : dhcp43-133.lab.eng.blr.redhat.com : ganesha.nfsd-26092[dbus_heartbeat] mdcache_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: Undefined server error
09/08/2016 19:29:53 : epoch 57a9cca6 : dhcp43-133.lab.eng.blr.redhat.com : ganesha.nfsd-26092[dbus_heartbeat] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
09/08/2016 19:29:53 : epoch 57a9cca6 : dhcp43-133.lab.eng.blr.redhat.com : ganesha.nfsd-26092[dbus_heartbeat] glusterfs_close_my_fd :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
09/08/2016 19:29:53 : epoch 57a9cca6 : dhcp43-133.lab.eng.blr.redhat.com : ganesha.nfsd-26092[dbus_heartbeat] mdcache_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: Undefined server error

Actual results:
IO hang on ganesha mount during remove-brick operation.

Expected results:

Additional info:
sosreport and logs will be attached

--- Additional comment from Shashank Raj on 2016-08-09 13:37:01 EDT ---

sosreport and logs can be found under
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1365626

--- Additional comment from Niels de Vos on 2016-09-12 01:39:46 EDT ---

All 3.8.x bugs are now reported against version 3.8 (without the ".x"). For more information, see
http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html
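For retesting, the nested-directory workload from step 3 above can be captured as a standalone script. This is only a sketch: the mount-point argument, the default scratch directory, and the scaled-down loop counts are assumptions so the script can be exercised without a cluster; on a real setup, pass the ganesha mount point and export OUTER=30 MID=50 INNER=50 to match the original reproducer.

```shell
#!/bin/sh
# Workload generator based on step 3 of the reproduction steps.
# MNT defaults to a scratch directory and the loop counts are scaled
# down (assumptions for a quick local run, not the original values).
MNT="${1:-/tmp/ganesha-workload}"
OUTER="${OUTER:-3}"
MID="${MID:-5}"
INNER="${INNER:-5}"

mkdir -p "$MNT"
for i in $(seq 1 "$OUTER"); do
    mkdir -p "$MNT/a$i"
    for j in $(seq 1 "$MID"); do
        mkdir -p "$MNT/a$i/b$j"
        for k in $(seq 1 "$INNER"); do
            touch "$MNT/a$i/b$j/c$k"
        done
    done
done

# Report how many files were created so a hang (no progress) is obvious.
echo "created $(find "$MNT" -type f | wc -l) files under $MNT"
```

Against the actual mount, a run matching the original bug would look like: OUTER=30 MID=50 INNER=50 sh workload.sh /mnt1470753422.46, started before issuing the remove-brick start/commit sequence.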
I cannot reproduce this with the latest downstream gluster and ganesha. Steps which I used:
1. Created a 2x2 volume and exported it via ganesha.
2. Mounted the volume using v4 and initiated I/O.
3. Performed the remove-brick operation.
I/O hung for 1s or 2s during "gluster volume remove-brick ... start" (I guess this is expected behavior) and resumed after that. This issue is not seen with v3 either. Hence I request QA to retest this scenario.
I have executed the same steps from the "Steps to Reproduce" section with an nfs v4 mount, and I am not hitting the issue on the latest downstream build:
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
Verified the fix in build:
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2017-0493.html