Description of problem:
=======================
In a cluster of 4 nodes, when one node is brought offline and glusterd on another node is brought down, cd into the snap directory from a fuse/nfs mount either hangs or takes too long.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.0.32-1.el6rhs.x86_64

How reproducible:
=================
always

Steps to Reproduce:
===================
1. Create a 4 node cluster (node1 to node4)
2. Create and start a 2x2 volume consisting of one brick from each node (node1 to node4)
3. Mount the volume on a client from node1 and populate data on it (example: /mnt/)
4. Create 2 snapshots of the volume
5. Bring down node2
6. Kill glusterd on node4
7. Change the snapshot-directory to snap-directory
8. Enable USS on the volume
9. From the client, access snap-directory (cd /mnt/snap-directory) over fuse and nfs

Actual results:
===============
Observed the following over two tries:
1. cd from fuse hung and cd from nfs took too long (more than 2 mins)
2. cd from fuse and nfs took too long
3. Once inside snap-directory, cd into the snapshots took too long

Expected results:
=================
cd from either fuse or nfs should succeed without any delay or hang

Additional info:
================
Bricks on node1 and node2 form a replicate pair, as do the bricks on node3 and node4
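The steps above correspond roughly to the following gluster CLI sequence. This is a sketch, not a verified reproducer script: it assumes a live 4-node trusted storage pool, and the volume name (vol0) and brick paths are placeholders.

```shell
# On node1: create and start a 2x2 distributed-replicate volume
# (brick paths and volume name are assumptions)
gluster volume create vol0 replica 2 \
    node1:/bricks/b1 node2:/bricks/b2 node3:/bricks/b3 node4:/bricks/b4
gluster volume start vol0

# On the client: mount over fuse and populate some data
mount -t glusterfs node1:/vol0 /mnt
cp -r /etc /mnt/etc.1

# Take two snapshots of the volume
gluster snapshot create snap1 vol0
gluster snapshot create snap2 vol0

# Step 5: power off node2; step 6: on node4 run `pkill glusterd`

# Rename the snapshot entry point and enable USS
# (option name taken from the report)
gluster volume set vol0 snapshot-directory snap-directory
gluster volume set vol0 features.uss enable

# Step 9, from the client:
cd /mnt/snap-directory
```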
Version : glusterfs 3.6.0.32
=======
Another scenario where cd to .snaps from an NFS mount hangs:

1) Fuse and NFS mount a 2x2 dist-rep volume, and enable USS
2) Create 256 snapshots in a loop while IO is going on:
   for i in {1..150} ; do cp -rvf /var/log/glusterfs f_log.$i ; done
   for i in {1..150} ; do cp -rvf /var/log/glusterfs n_log.$i ; done
3) After snapshot creation is complete, cd to .snaps from the fuse and NFS mounts. From the fuse mount, .snaps was accessible; then, while accessing .snaps from the NFS mount, it failed with an I/O error
4) Checked gluster v status of the volume; it showed that snapd on the server (through which the volume was mounted) was down

Log messages reported :
~~~~~~~~~~~~~~~~~~~~~~
[2014-11-12 13:32:35.074996] E [rpcsvc.c:617:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2014-11-12 13:32:35.106171] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick snapd-vol1 on port 49170
[2014-11-12 13:32:35.957462] W [socket.c:529:__socket_rwv] 0-management: readv on /var/run/22f16287a2b97835e475c3bbf5501834.socket failed (No data available)
[2014-11-12 13:32:36.109356] I [MSGID: 106006] [glusterd-handler.c:4238:__glusterd_snapd_rpc_notify] 0-management: snapd for volume vol1 has disconnected from glusterd.

5) Restarted glusterd and accessed .snaps - successful
6) Accessed .snaps from the fuse and nfs mounts again; while trying to cd to .snaps from the NFS mount, snapd on the server always went down
7) Stopped the volume, started it again and then accessed .snaps. From the fuse mount it was successful, but from the NFS mount cd to .snaps hung
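A snapshot-creation loop of the kind step 2 describes might look like the following. This is a sketch only: it assumes a live cluster, and the volume name (vol1, taken from the logs above) and snapshot naming scheme are assumptions.

```shell
# Create 256 snapshots in a loop while the cp loops above
# are running on the fuse and NFS mounts
for i in $(seq 1 256); do
    gluster snapshot create snap_$i vol1
done

# Afterwards, check whether snapd is still up
gluster volume status vol1
```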
We were not able to re-create this problem with the below setup:
Installed glusterfs-3.6.0.35
Created a 4 node cluster
Created a 2x2 volume
Followed the instructions mentioned in the Description
Patch https://code.engineering.redhat.com/gerrit/#/c/37398/ has fixed this issue.
Able to recreate the issue with exactly the same steps on build glusterfs-3.6.0.36-1.el6.x86_64. From fuse it took more than a minute, and from NFS it took more than 3 mins.

From Fuse:
==========
[root@wingo vol0]# pwd
/mnt/vol0
[root@wingo vol0]# time cd .snaps

real    1m3.043s
user    0m0.000s
sys     0m0.000s
[root@wingo .snaps]#

From NFS:
=========
[root@wingo ~]# cd /mnt/nvol0
[root@wingo nvol0]# time cd .snaps

real    3m3.043s
user    0m0.000s
sys     0m0.002s
[root@wingo .snaps]#
[root@wingo .snaps]# rpm -qa | grep glusterfs-3.6.0.36-1.el6.x86_64
glusterfs-3.6.0.36-1.el6.x86_64
[root@wingo .snaps]#

In general, with USS on when a node is down, cd to .snaps takes too long. Moving back to the assigned state.
Version : glusterfs 3.6.0.36
========
Another scenario where cd to .snaps hangs and sometimes fails with "Transport endpoint is not connected" from the fuse mount and "Input/output error" from the NFS mount:

- Create a 2x2 dist-rep volume
- Fuse and NFS mount the volume & enable USS
- Generate some IO
- Take a few snapshots
- Bring down glusterd on node2
- Activate one of the snapshots
- From both fuse and nfs mounts, cd to .snaps and list the snaps --> it hangs
- From a different terminal, cd to .snaps and list the snaps; it fails with "Transport endpoint is not connected" from the fuse mount and "Input/output error" from the NFS mount

[root@dhcp-0-97 .snaps]# ll
ls: reading directory .: Transport endpoint is not connected
total 0
[root@dhcp-0-97 .snaps]# ll
ls: cannot open directory .: Transport endpoint is not connected
[root@dhcp-0-97 .snaps]# ll
ls: cannot open directory .: Input/output error
[root@dhcp-0-97 .snaps]# pwd
/mnt/vol0_nfs/nfs_etc.1/.snaps

Based on Comment8 and Comment9, changing the severity of this bug to Urgent since the issue is reproduced quite often
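The "glusterd down + snapshot activate" part of this scenario corresponds to commands like the following. This is a sketch under assumptions: the snapshot name (snap1) and mount path are placeholders, and it requires a live cluster.

```shell
# On node2: stop the management daemon (el6 service style)
service glusterd stop

# On a node where glusterd is still running: activate a snapshot
gluster snapshot activate snap1

# From the client mounts: enter the USS entry point and list snapshots
cd /mnt/vol0/.snaps && ls
```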
Version : glusterfs 3.6.0.40
=======
Repeated the steps mentioned in the Description, Comment8 and Comment9; unable to reproduce the issue. The issue mentioned in Comment4 is tracked by bz 1163750. Marking the bug as 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html