Bug 1163209
| Summary: | [USS]: cd to snap directory from fuse/nfs hangs OR takes too long when a node is brought offline | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> |
| Component: | snapshot | Assignee: | Raghavendra Bhat <rabhat> |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | rhgs-3.0 | CC: | amainkar, nsathyan, rabhat, rhs-bugs, rjoseph, senaik, storage-qa-internal, surs, vagarwal, vmallika |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.0.3 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | USS | ||
| Fixed In Version: | glusterfs-3.6.0.39-1 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-01-15 13:42:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1174205, 1175751 | ||
| Bug Blocks: | 1162694 | ||
Description
Rahul Hinduja
2014-11-12 13:30:04 UTC
Version : glusterfs 3.6.0.32
=======
Another scenario where cd to .snaps from an NFS mount hangs.
1) Fuse and NFS mount a 2x2 dist-rep volume, and enable USS
2) Create 256 snapshots in a loop while IO is going on:
for i in {1..150} ; do cp -rvf /var/log/glusterfs f_log.$i ; done
for i in {1..150} ; do cp -rvf /var/log/glusterfs n_log.$i ; done
3) After snapshot creation is complete, cd to .snaps from the fuse and NFS mounts.
From the fuse mount, .snaps was accessible; accessing .snaps from the NFS mount then failed with an I/O error.
4) Checked gluster v status of the volume; it showed that snapd on the server (through which the volume was mounted) was down.
Log messages reported:
~~~~~~~~~~~~~~~~~~~~~~
[2014-11-12 13:32:35.074996] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-11-12 13:32:35.106171] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick snapd-vol1 on port 49170
[2014-11-12 13:32:35.957462] W [socket.c:529:__socket_rwv] 0-management: readv on /var/run/22f16287a2b97835e475c3bbf5501834.socket failed (No data available)
[2014-11-12 13:32:36.109356] I [MSGID: 106006] [glusterd-handler.c:4238:__glusterd_snapd_rpc_notify] 0-management: snapd for volume vol1 has disconnected from glusterd.
5) Restarted glusterd and accessed .snaps - successful
6) Accessed .snaps from the fuse and NFS mounts again; while trying to cd to .snaps from the NFS mount, snapd on the server always went down.
7) Tried to stop the volume, start it again, and then access .snaps. From the fuse mount it was successful, but from the NFS mount cd to .snaps hung.
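The setup in the numbered steps above can be sketched as a small script. This is a dry-run sketch, not commands taken from the report: the volume name "vol1", the snapshot names, and the GLUSTER override are assumptions.

```shell
#!/bin/sh
# Dry-run sketch of steps 1-2 above. Volume name "vol1" and snapshot
# names are assumptions; set GLUSTER=gluster to run the real commands
# instead of echoing them.
GLUSTER=${GLUSTER:-"echo gluster"}
VOL=vol1

# Step 1: enable USS on the (already mounted) 2x2 dist-rep volume
$GLUSTER volume set "$VOL" features.uss enable

# Step 2: create 256 snapshots in a loop while IO runs on the mounts
i=1
while [ "$i" -le 256 ]; do
    $GLUSTER snapshot create "snap_$i" "$VOL"
    i=$((i + 1))
done
```

With the default dry-run setting this only prints the gluster commands, which makes it safe to review before pointing it at a real cluster.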
We were not able to re-create this problem with the below setup:
- Installed glusterfs-3.6.0.35
- Created a 4 node cluster
- Created a 2x2 volume
- Followed the instructions mentioned in the description
Patch https://code.engineering.redhat.com/gerrit/#/c/37398/ has fixed this issue.

Able to recreate the issue with exactly the same steps on build glusterfs-3.6.0.36-1.el6.x86_64. From Fuse it took more than a minute, and from NFS it took more than 3 minutes.
From Fuse:
==========
[root@wingo vol0]# pwd
/mnt/vol0
[root@wingo vol0]# time cd .snaps

real    1m3.043s
user    0m0.000s
sys     0m0.000s
[root@wingo .snaps]#
From NFS:
=========
[root@wingo ~]# cd /mnt/nvol0
[root@wingo nvol0]# time cd .snaps

real    3m3.043s
user    0m0.000s
sys     0m0.002s
[root@wingo .snaps]#
[root@wingo .snaps]# rpm -qa | grep glusterfs-3.6.0.36-1.el6.x86_64
glusterfs-3.6.0.36-1.el6.x86_64
[root@wingo .snaps]#
In general, with USS on and a node down, cd to .snaps takes too long. Moving back to assigned state.

Version : glusterfs 3.6.0.36
========
Another scenario where cd to .snaps hangs and sometimes fails with "Transport endpoint is not connected" from the Fuse mount and "I/O error" from the NFS mount:
- Create a 2x2 dist-rep volume
- Fuse and NFS mount the volume & enable USS
- Create some IO
- Take a few snapshots
- Bring down glusterd on node2
- Activate one of the snapshots
- From both fuse and NFS mounts, cd to .snaps and list the snaps --> it hangs
- From a different terminal, cd to .snaps and list the snaps; it fails with "Transport endpoint is not connected" from the Fuse mount and "I/O error" from the NFS mount
[root@dhcp-0-97 .snaps]# ll
ls: reading directory .: Transport endpoint is not connected
total 0
[root@dhcp-0-97 .snaps]# ll
ls: cannot open directory .: Transport endpoint is not connected
[root@dhcp-0-97 .snaps]# ll
ls: cannot open directory .: Input/output error
[root@dhcp-0-97 .snaps]# pwd
/mnt/vol0_nfs/nfs_etc.1/.snaps

Based on Comment8 and Comment9, changing the severity of this bug to Urgent since the issue is reproduced quite often.

Version : glusterfs 3.6.0.40
=======
Repeated the steps as mentioned in the Description, Comment8 and Comment9; unable to reproduce the issue. The issue mentioned in Comment4 is tracked by bz 1163750. Marking the bug as 'Verified'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html
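The scenario where glusterd is brought down on one node before activating a snapshot can likewise be sketched as a dry-run script. The node name "node2", volume name "vol0", and snapshot names are assumptions for illustration, not values from the report.

```shell
#!/bin/sh
# Dry-run sketch of the glusterd-down scenario: take a few snapshots,
# stop glusterd on one node, then activate a snapshot. Node "node2",
# volume "vol0", and snapshot names are assumptions; set GLUSTER=gluster
# and SSH=ssh to execute for real.
GLUSTER=${GLUSTER:-"echo gluster"}
SSH=${SSH:-"echo ssh"}
VOL=vol0

# Take a few snapshots while IO runs on the fuse/NFS mounts
n=0
for s in snap_a snap_b snap_c; do
    $GLUSTER snapshot create "$s" "$VOL"
    n=$((n + 1))
done

# Bring down glusterd on node2, then activate one of the snapshots
$SSH node2 "service glusterd stop"
$GLUSTER snapshot activate snap_a
# At this point, cd to .snaps on the mounts reproduced the hang / errors
```

Keeping the commands behind the GLUSTER/SSH variables means the same script documents the repro and can be pasted into a test harness without touching a live cluster by default.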