Bug 1163209 - [USS]: cd to snap directory from fuse/nfs hangs OR takes too long when a node is brought offline
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: snapshot
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Raghavendra Bhat
QA Contact: Rahul Hinduja
URL:
Whiteboard: USS
Depends On: 1174205 1175751
Blocks: 1162694
 
Reported: 2014-11-12 13:30 UTC by Rahul Hinduja
Modified: 2016-09-17 13:03 UTC

Fixed In Version: glusterfs-3.6.0.39-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-15 13:42:19 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2015:0038 (normal, SHIPPED_LIVE): Red Hat Storage 3.0 enhancement and bug fix update #3, last updated 2015-01-15 18:35:28 UTC

Description Rahul Hinduja 2014-11-12 13:30:04 UTC
Description of problem:
=======================

In a cluster of 4 nodes, when one node is brought offline and glusterd on another node is killed, cd to the snap directory from a fuse/nfs mount either hangs or takes too long.


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.6.0.32-1.el6rhs.x86_64


How reproducible:
=================
always


Steps to Reproduce:
===================
1. Create a 4-node cluster (node1 to node4)
2. Create and start a volume (2x2) consisting of one brick from each node (node1 to node4)
3. Mount the volume on a client from node1 and populate data on it (example: /mnt/)
4. Create 2 snapshots of the volume
5. Bring down node2
6. Kill glusterd on node4
7. Change the snapshot-directory option value to snap-directory
8. Enable USS on the volume
9. From the client, access snap-directory (cd /mnt/snap-directory) from both fuse and nfs (a command sketch follows this list)
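
A minimal command sketch of the steps above, assuming one brick per node under /rhs/brick1 and a volume named vol0 (host names, brick paths and mount points are illustrative, not taken from this report; features.snapshot-directory and features.uss are the standard gluster option names):

 # steps 1-2: build the 2x2 distributed-replicate volume
 gluster peer probe node2 && gluster peer probe node3 && gluster peer probe node4
 gluster volume create vol0 replica 2 node1:/rhs/brick1 node2:/rhs/brick1 \
     node3:/rhs/brick1 node4:/rhs/brick1
 gluster volume start vol0

 # step 3: fuse-mount on the client and populate some data
 mount -t glusterfs node1:/vol0 /mnt
 cp -r /etc /mnt/etc.1

 # steps 4-6: take two snapshots, then inject the faults
 gluster snapshot create snap1 vol0
 gluster snapshot create snap2 vol0
 # power off node2, then on node4:
 service glusterd stop

 # steps 7-8: rename the snapshot entry directory and enable USS
 gluster volume set vol0 features.snapshot-directory snap-directory
 gluster volume set vol0 features.uss enable

 # step 9: access the directory from both mount types
 mount -t nfs -o vers=3 node1:/vol0 /mnt-nfs
 cd /mnt/snap-directory
 cd /mnt-nfs/snap-directory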

Actual results:
===============

In two tries, observed the following results:

1. cd from fuse hung and cd from nfs took too long (more than 2 minutes)
2. cd from both fuse and nfs took too long
3. Once inside snap-directory, cd to the snapshots took too long


Expected results:
=================

cd from either fuse or nfs should succeed without any delay or hang


Additional info:
================


Bricks on node1 and node2 form a replica pair, as do the bricks on node3 and node4

Comment 4 senaik 2014-11-13 11:48:36 UTC
Version : glusterfs 3.6.0.32
=======

Another scenario where cd to .snaps from an NFS mount hangs.

1) Fuse and NFS mount a 2x2 dist-rep volume, and enable USS

2) Create 256 snapshots in a loop while IO is going on (see the command sketch after these steps)
 for i in {1..150} ; do cp -rvf /var/log/glusterfs f_log.$i ; done
 for i in {1..150} ; do cp -rvf /var/log/glusterfs n_log.$i ; done

3) After snapshot creation is complete, cd to .snaps from the fuse and NFS mounts
 From the fuse mount, .snaps was accessible; then, while accessing .snaps from the NFS mount, it failed with an I/O error

4) Checked gluster volume status of the volume; it showed that snapd on the server (through which the volume was mounted) was down

Log messages reported :
~~~~~~~~~~~~~~~~~~~~~~
[2014-11-12 13:32:35.074996] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-11-12 13:32:35.106171] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick snapd-vol1 on port 49170
[2014-11-12 13:32:35.957462] W [socket.c:529:__socket_rwv] 0-management: readv on /var/run/22f16287a2b97835e475c3bbf5501834.socket failed (No data available)
[2014-11-12 13:32:36.109356] I [MSGID: 106006] [glusterd-handler.c:4238:__glusterd_snapd_rpc_notify] 0-management: snapd for volume vol1 has disconnected from glusterd.

5) Restarted glusterd and accessed .snaps - successful

6) Accessed .snaps from the fuse and NFS mounts again; while trying to cd to .snaps from the NFS mount, snapd on the server always went down

7) Stopped the volume, started it again, and then accessed .snaps. From the fuse mount it was successful, but from the NFS mount cd to .snaps hung
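
A sketch of the snapshot loop from step 2 and the snapd check from step 4, assuming the volume is named vol1 (snapshot names are illustrative):

 # step 2: create the snapshots in a loop while the cp loops above run
 for i in {1..256} ; do gluster snapshot create snap$i vol1 ; done

 # step 4: with USS enabled, 'gluster volume status' also lists the
 # Snapshot Daemon (snapd); check whether it is still up on the server
 # the volume was mounted through
 gluster volume status vol1

 # step 5: restart glusterd on that server
 service glusterd restart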

Comment 6 Vijaikumar Mallikarjuna 2014-12-02 13:54:17 UTC
We were not able to re-create this problem with the setup below:

Installed glusterfs-3.6.0.35
Created 4 node cluster
Created 2x2 volume
Followed the instructions mentioned in the description

Comment 7 Vijaikumar Mallikarjuna 2014-12-03 09:33:22 UTC
Patch https://code.engineering.redhat.com/gerrit/#/c/37398/ has fixed this issue.

Comment 8 Rahul Hinduja 2014-12-08 11:40:36 UTC
Able to recreate the issue with exactly the same steps on build glusterfs-3.6.0.36-1.el6.x86_64

From Fuse it took more than a minute, and from NFS it took more than 3 minutes

From Fuse:
==========
[root@wingo vol0]# pwd
/mnt/vol0
[root@wingo vol0]# time cd .snaps

real    1m3.043s
user    0m0.000s
sys     0m0.000s
[root@wingo .snaps]#


From NFS:
=========
[root@wingo ~]# cd /mnt/nvol0
[root@wingo nvol0]# 
[root@wingo nvol0]# time cd .snaps

real    3m3.043s
user    0m0.000s
sys     0m0.002s
[root@wingo .snaps]# 
[root@wingo .snaps]# rpm -qa | grep glusterfs-3.6.0.36-1.el6.x86_64
glusterfs-3.6.0.36-1.el6.x86_64
[root@wingo .snaps]# 


In general, when USS is turned on while a node is down, cd to .snaps takes too long.

Moving back to ASSIGNED state

Comment 9 senaik 2014-12-09 11:51:15 UTC
Version : glusterfs 3.6.0.36 
========

Another scenario where cd to .snaps hangs and sometimes fails with "Transport endpoint is not connected" from the fuse mount and "I/O error" from the NFS mount


- Create a 2x2 dist-rep volume
- Fuse and NFS mount the volume & enable USS
- Generate some IO
- Take a few snapshots
- Bring down glusterd on node2
- Activate one of the snapshots
- From both the fuse and nfs mounts, cd to .snaps and list the snaps --> it hangs
- From a different terminal, cd to .snaps and list the snaps; it fails with "Transport endpoint is not connected" from the fuse mount and "I/O error" from the NFS mount (see the command sketch after the output below)


[root@dhcp-0-97 .snaps]# ll
ls: reading directory .: Transport endpoint is not connected
total 0
[root@dhcp-0-97 .snaps]# ll
ls: cannot open directory .: Transport endpoint is not connected


[root@dhcp-0-97 .snaps]# ll
ls: cannot open directory .: Input/output error
[root@dhcp-0-97 .snaps]# pwd
/mnt/vol0_nfs/nfs_etc.1/.snaps
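
A sketch of the fault-injection part of this scenario, assuming volume vol0 and snapshot snap1 (names are illustrative, not taken from this report):

 # on node2: stop glusterd
 service glusterd stop

 # on a node where glusterd is still running: activate one snapshot
 gluster snapshot activate snap1

 # from the fuse and NFS mounts (a second terminal reproduces the errors above)
 cd /mnt/vol0/.snaps && ls
 cd /mnt/vol0_nfs/nfs_etc.1/.snaps && ls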


Based on Comment 8 and Comment 9, changing the severity of this bug to Urgent, since the issue is reproduced quite often

Comment 12 senaik 2014-12-30 07:22:14 UTC
Version: glusterfs 3.6.0.40
=======
Repeated the steps as mentioned in the Description, Comment 8 and Comment 9; unable to reproduce the issue.
The issue mentioned in Comment 4 is tracked by bz 1163750

Marking the bug as 'Verified'

Comment 14 errata-xmlrpc 2015-01-15 13:42:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html

