Description of problem:
=======================
Deleting a directory (on which a quota limit is set) from the fuse mount, recreating the directory, and then running 'gluster volume quota <vol-name> list' killed all the brick processes, resulting in 'Transport endpoint is not connected' on the fuse mount.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.20rhsquota5 built on Aug 26 2013 02:56:39

How reproducible:
=================
Tried 3 times and hit it every time (3/3)

Steps to Reproduce:
===================
1. On a RHS cluster of 4 nodes, create a distribute volume with 2 bricks, i.e. gluster volume create <vol-name> <brick1> <brick2>
2. Start the volume, i.e. gluster volume start <vol-name>
3. Enable quota on the volume, i.e. gluster volume quota <vol-name> enable
4. Set a quota limit on a non-existent directory, i.e. gluster volume quota <vol-name> limit-usage <non-existent-dir> 2GB
   NOTE: This step fails with an error message
5. Fuse mount the volume and create the directory that was used in step 4
6. Repeat step 4 [setting the quota limit]
   NOTE: The quota limit is now set on that directory
7. List the quota limits on the volume, i.e. gluster volume quota <vol-name> list
8. Delete the directory from the fuse mount
9. Repeat step 7 [listing the quota limits]
   NOTE: No quota entries are listed
10. Recreate the directory on the fuse mount [create a directory with the same name as the one deleted in step 8]
11. List the quota limits on the volume, i.e. gluster volume quota <vol-name> list

Actual results:
===============
All brick processes are killed.

Expected results:
=================
Not sure about the ideal/expected behaviour, but the brick processes should not get killed.

Additional info:
================

Console logs
============
[Thu Aug 29 09:31:39 UTC 2013 root.37.174:~ ] # gluster volume create dogvol 10.70.37.174:/rhs/brick1/dogdir1 10.70.37.185:/rhs/brick1/dogdir1
volume create: dogvol: success: please start the volume to access data

[Thu Aug 29 09:37:29 UTC 2013 root.37.174:~ ] # gluster volume start dogvol
volume start: dogvol: success

[Thu Aug 29 09:37:39 UTC 2013 root.37.174:~ ] # gluster volume status
Status of volume: dogvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.174:/rhs/brick1/dogdir1          49160   Y       27596
Brick 10.70.37.185:/rhs/brick1/dogdir1          49160   Y       14854
NFS Server on localhost                         2049    Y       27608
NFS Server on 10.70.37.118                      2049    Y       10212
NFS Server on 10.70.37.185                      2049    Y       14868
NFS Server on 10.70.37.95                       2049    Y       10163

There are no active volume tasks

[Thu Aug 29 09:38:32 UTC 2013 root.37.174:~ ] # gluster v info

Volume Name: dogvol
Type: Distribute
Volume ID: 0350f1f9-75bd-4e1d-ac88-4eb00378740f
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.174:/rhs/brick1/dogdir1
Brick2: 10.70.37.185:/rhs/brick1/dogdir1

[Thu Aug 29 09:38:37 UTC 2013 root.37.174:~ ] # gluster v status
Status of volume: dogvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.174:/rhs/brick1/dogdir1          49160   Y       27596
Brick 10.70.37.185:/rhs/brick1/dogdir1          49160   Y       14854
NFS Server on localhost                         2049    Y       27666
NFS Server on 10.70.37.95                       2049    Y       10206
NFS Server on 10.70.37.118                      2049    Y       10248
NFS Server on 10.70.37.185                      2049    Y       14911

There are no active volume tasks

[Thu Aug 29 09:38:40 UTC 2013 root.37.174:~ ] # ps aux | grep quotad
root     27714  0.0  0.0 103244   804 pts/0    R+   15:08   0:00 grep quotad

[Thu Aug 29 09:39:06 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol enable
volume quota : success

[Thu Aug 29 09:39:13 UTC 2013 root.37.174:~ ] # ps aux | grep quotad
root     27758  0.4  0.8 187988 18028 ?        Ssl  15:09   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/lib/glusterd/quotad/run/quotad.pid -l /var/log/glusterfs/quotad.log -S /var/run/7e63030677df5afe2fa7a9f790189502.socket --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off
root     27771  0.0  0.0 103244   812 pts/0    S+   15:09   0:00 grep quotad

[Thu Aug 29 09:39:17 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol list

[Thu Aug 29 09:39:23 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol limit-usage /master 2GB
quota command failed : Failed to get trusted.gfid attribute on path /master. Reason : No such file or directory

<CREATED THE DIRECTORY FROM FUSE MOUNT>

[Thu Aug 29 09:39:40 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol limit-usage /master 2GB
volume quota : success

[Thu Aug 29 09:41:03 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol list
        Path            Hard-limit      Soft-limit              Used    Available
--------------------------------------------------------------------------------
/master                 2.0GB           9130191673159152629     0Bytes  2.0GB

<REMOVE THE DIRECTORY FROM FUSE MOUNT>

[Thu Aug 29 09:41:05 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol list
        Path            Hard-limit      Soft-limit              Used    Available
--------------------------------------------------------------------------------

[Thu Aug 29 09:41:21 UTC 2013 root.37.174:~ ] # gluster volume status
Status of volume: dogvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.174:/rhs/brick1/dogdir1          49160   Y       27596
Brick 10.70.37.185:/rhs/brick1/dogdir1          49160   Y       14854
NFS Server on localhost                         2049    Y       27666
Quota Daemon on localhost                       N/A     Y       27758
NFS Server on 10.70.37.185                      2049    Y       14911
Quota Daemon on 10.70.37.185                    N/A     Y       14948
NFS Server on 10.70.37.118                      2049    Y       10248
Quota Daemon on 10.70.37.118                    N/A     Y       10281
NFS Server on 10.70.37.95                       2049    Y       10206
Quota Daemon on 10.70.37.95                     N/A     Y       10239

There are no active volume tasks

<REMOVE THE DIRECTORY, FROM FUSE MOUNT AFTER SETTING QUOTA ON IT>

[Thu Aug 29 09:41:27 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol limit-usage /master 2GB
quota command failed : Failed to get trusted.gfid attribute on path /master. Reason : No such file or directory

[Thu Aug 29 09:41:40 UTC 2013 root.37.174:~ ] # gluster volume status
Status of volume: dogvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.174:/rhs/brick1/dogdir1          49160   Y       27596
Brick 10.70.37.185:/rhs/brick1/dogdir1          49160   Y       14854
NFS Server on localhost                         2049    Y       27666
Quota Daemon on localhost                       N/A     Y       27758
NFS Server on 10.70.37.118                      2049    Y       10248
Quota Daemon on 10.70.37.118                    N/A     Y       10281
NFS Server on 10.70.37.185                      2049    Y       14911
Quota Daemon on 10.70.37.185                    N/A     Y       14948
NFS Server on 10.70.37.95                       2049    Y       10206
Quota Daemon on 10.70.37.95                     N/A     Y       10239

There are no active volume tasks

[Thu Aug 29 09:41:42 UTC 2013 root.37.174:~ ] # gluster volume status
Status of volume: dogvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.174:/rhs/brick1/dogdir1          49160   Y       27596
Brick 10.70.37.185:/rhs/brick1/dogdir1          49160   Y       14854
NFS Server on localhost                         2049    Y       27666
Quota Daemon on localhost                       N/A     Y       27758
NFS Server on 10.70.37.185                      2049    Y       14911
Quota Daemon on 10.70.37.185                    N/A     Y       14948
NFS Server on 10.70.37.118                      2049    Y       10248
Quota Daemon on 10.70.37.118                    N/A     Y       10281
NFS Server on 10.70.37.95                       2049    Y       10206
Quota Daemon on 10.70.37.95                     N/A     Y       10239

There are no active volume tasks

<RECREATING THE SAME DIRECTORY FROM FUSE MOUNT>

[Thu Aug 29 09:42:08 UTC 2013 root.37.174:~ ] # gluster volume quota dogvol list
        Path            Hard-limit      Soft-limit              Used    Available
--------------------------------------------------------------------------------

[Thu Aug 29 09:42:14 UTC 2013 root.37.174:~ ] # gluster volume status
Status of volume: dogvol
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.174:/rhs/brick1/dogdir1          N/A     N       27596
Brick 10.70.37.185:/rhs/brick1/dogdir1          N/A     N       14854
NFS Server on localhost                         2049    Y       27666
Quota Daemon on localhost                       N/A     Y       27758
NFS Server on 10.70.37.185                      2049    Y       14911
Quota Daemon on 10.70.37.185                    N/A     Y       14948
NFS Server on 10.70.37.95                       2049    Y       10206
Quota Daemon on 10.70.37.95                     N/A     Y       10239
NFS Server on 10.70.37.118                      2049    Y       10248
Quota Daemon on 10.70.37.118                    N/A     Y       10281

There are no active volume tasks
Created attachment 791760 [details] tar-ed sosreports
Additional Info
===============
1. Volume is fuse mounted on 10.70.36.32 (RHEL 6.4)
2. Mount point - /mnt/distvol
3. All commands are executed from RHS node 10.70.37.174
4. sosreports are attached

5. Observation
==============
I could see the following in the brick logs on 10.70.37.174:

<snip>
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-29 09:42:14
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.20rhsquota5
/lib64/libc.so.6[0x3abc432920]
/lib64/libc.so.6[0x3abc481321]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/storage/posix.so(posix_make_ancestryfromgfid+0x239)[0x7fd492a97629]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/storage/posix.so(posix_get_ancestry_directory+0xdd)[0x7fd492a913dd]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/storage/posix.so(+0x1b216)[0x7fd492a94216]
/usr/lib64/libglusterfs.so.0(dict_foreach+0x45)[0x31de014025]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/storage/posix.so(posix_lookup_xattr_fill+0x85)[0x7fd492a938f5]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/storage/posix.so(posix_lookup+0x871)[0x7fd492a90401]
/usr/lib64/libglusterfs.so.0(default_lookup+0x6d)[0x31de01befd]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/features/access-control.so(posix_acl_lookup+0x1a2)[0x7fd4924638c2]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/features/locks.so(pl_lookup+0x222)[0x7fd49224b892]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/performance/io-threads.so(iot_lookup_wrapper+0x12c)[0x7fd492037ebc]
/usr/lib64/libglusterfs.so.0(call_resume+0x122)[0x31de030172]
/usr/lib64/glusterfs/3.4.0.20rhsquota5/xlator/performance/io-threads.so(iot_worker+0x158)[0x7fd49203c9f8]
/lib64/libpthread.so.0[0x3abcc07851]
/lib64/libc.so.6(clone+0x6d)[0x3abc4e890d]
---------
</snip>
https://code.engineering.redhat.com/gerrit/#/c/12036/ fixes the issue. When a readlink was done on the gfid handle of a directory (while building the ancestry after receiving a nameless lookup on the gfid), a failure of the readlink call was not handled: the return value was collected in an unsigned variable (readlink returns -1 on failure) and was never checked. The patch mentioned above fixes this.
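For illustration, here is a minimal C sketch of that failure mode. It is not the GlusterFS source; the function and path names are hypothetical. It only shows the general readlink(2) pitfall described above: storing the return value in an unsigned variable turns -1 into a huge index, so the subsequent NUL-termination writes far out of bounds, which is consistent with the SIGSEGV seen in posix_make_ancestryfromgfid. The fixed variant keeps the result signed and propagates the error.

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Buggy pattern: 'len' is unsigned, so a readlink() failure (-1) wraps to a
 * huge positive value and 'target[len] = 0' writes out of bounds. */
int resolve_handle_buggy(const char *handle_path, char *target, size_t size)
{
        size_t len = readlink(handle_path, target, size - 1);

        target[len] = '\0';   /* crashes when readlink failed */
        return 0;
}

/* Fixed pattern: keep the result signed, check it, and bail out on failure
 * instead of crashing. */
int resolve_handle_fixed(const char *handle_path, char *target, size_t size)
{
        ssize_t len = readlink(handle_path, target, size - 1);

        if (len < 0) {
                fprintf(stderr, "readlink(%s) failed: %s\n",
                        handle_path, strerror(errno));
                return -1;
        }
        target[len] = '\0';
        return 0;
}

int main(void)
{
        char target[PATH_MAX];

        /* "/no/such/gfid-handle" stands in for a stale gfid handle whose
         * backing directory was deleted and recreated. */
        if (resolve_handle_fixed("/no/such/gfid-handle", target, sizeof(target)) < 0)
                return 1;

        printf("handle points to %s\n", target);
        return 0;
}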
Verified with RHS 2.1 containing glusterfs-3.4.0.33rhs-1.el6rhs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html