Description of problem:
=========================
Glusterd crashed after performing remove-brick

Version-Release number of selected component (if applicable):
=============================================================
3.4.0.13rhs-1.el6rhs.x86_64

How reproducible:

Steps to Reproduce:
======================
1. Create a distribute volume with 5 bricks

on 6.4 client:
--------------
2. Create directories and files inside these directories

3. Remove one brick from the volume

gluster volume remove-brick Vol13 10.70.34.85:/rhs/brick1/p1 start
volume remove-brick start: success
ID: 9071b330-da3e-4bf5-9f9b-28ab6dcbe6e2

gluster volume remove-brick Vol13 10.70.34.85:/rhs/brick1/p1 status
Node          Rebalanced-files     size  scanned  failures       status  run-time in secs
localhost                   23   23.0MB      123         0    completed              1.00
10.70.34.88                  0   0Bytes        0         0  not started              0.00
10.70.34.86                  0   0Bytes        0         0  not started              0.00
10.70.34.87                  0   0Bytes        0         0  not started              0.00

[root@boost brick1]# gluster volume remove-brick Vol13 10.70.34.85:/rhs/brick1/p1 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

gluster v i Vol13
Volume Name: Vol13
Type: Distribute
Volume ID: ec7a3214-2c23-4c2e-be2a-c803312628f2
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.34.86:/rhs/brick1/p2
Brick2: 10.70.34.87:/rhs/brick1/p3
Brick3: 10.70.34.88:/rhs/brick1/p4
Brick4: 10.70.34.85:/rhs/brick1/p5

4. Perform fix-layout

gluster v rebalance Vol13 fix-layout start
volume rebalance: Vol13: success: Starting rebalance on volume Vol13 has been successful.
ID: f1e53c20-b5aa-4c59-8645-87842eba00bb

Checked the hash range from the backend for the directory created

5. Deleted files from the mount point and unmounted the volume

on 5.9 client :
--------------
6. Mounted the volume again and created a directory and files inside the directory

7. Remove brick 10.70.34.86:/rhs/brick1/p2 from the volume

gluster volume remove-brick Vol13 10.70.34.86:/rhs/brick1/p2 start
volume remove-brick start: success
ID: 3b3fa27c-5ddb-46fb-8222-05845301a782

gluster volume remove-brick Vol13 10.70.34.86:/rhs/brick1/p2 status
Node          Rebalanced-files     size  scanned  failures       status  run-time in secs
localhost                    0   0Bytes        0         0  not started              0.00
10.70.34.88                  0   0Bytes        0         0  not started              0.00
10.70.34.86                 22   22.0MB      122         0    completed              0.00
10.70.34.87                  7   0Bytes        0         0  not started              0.00

gluster volume remove-brick Vol13 10.70.34.86:/rhs/brick1/p2 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

gluster v i Vol13
Volume Name: Vol13
Type: Distribute
Volume ID: ec7a3214-2c23-4c2e-be2a-c803312628f2
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.34.87:/rhs/brick1/p3
Brick2: 10.70.34.88:/rhs/brick1/p4
Brick3: 10.70.34.85:/rhs/brick1/p5

8. Stop the volume

gluster v stop Vol13
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Connection failed. Please check if gluster daemon is operational.

service glusterd status
glusterd dead but pid file exists

Actual results:

Expected results:

Additional info:
===================
part of the log :
---------------------
[2013-07-30 12:15:36.151673] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-07-30 12:15:36.151691] I [socket.c:3502:socket_init] 0-management: using system polling thread
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-07-30 12:15:36
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.13rhs
/lib64/libc.so.6[0x3b0ea32920]
/usr/lib64/glusterfs/3.4.0.13rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x9a)[0x7f2a96da4b1a]
/usr/lib64/glusterfs/3.4.0.13rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f2a96d988e0]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x177)[0x38b020df67]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x38b0209838]
/usr/lib64/glusterfs/3.4.0.13rhs/rpc-transport/socket.so(+0xa491)[0x7f2a96b08491]
/usr/lib64/libglusterfs.so.0[0x38afe5d3f7]
/usr/sbin/glusterd(main+0x5c6)[0x406856]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3b0ea1ecdd]
/usr/sbin/glusterd[0x4045f9]

arequal-checksum on the mount point failed with error :

[root@localhost dir2]# /opt/qa/tools/arequal-checksum /mnt/vol13
ftw (/mnt/vol13) returned -1 (No such file or directory), terminating

sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/990125/
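The remove-brick status output in steps 3 and 7 can be checked by a script before running commit. The following is a minimal sketch in Python, assuming the column layout shown in this report (other glusterfs versions may print different columns). The names `parse_status` and `safe_to_commit` are hypothetical helpers, not part of the gluster CLI, and the sample text is copied from step 7.

```python
import re

# Sample copied from the step 7 status output in this report.
STATUS_OUTPUT = """\
Node Rebalanced-files size scanned failures status run-time in secs
localhost 0 0Bytes 0 0 not started 0.00
10.70.34.88 0 0Bytes 0 0 not started 0.00
10.70.34.86 22 22.0MB 122 0 completed 0.00
10.70.34.87 7 0Bytes 0 0 not started 0.00
"""

# Assumed row layout: node, files, size, scanned, failures, status, run-time.
# The status column may contain a space ("not started"), so it is matched lazily.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<files>\d+)\s+(?P<size>\S+)\s+(?P<scanned>\d+)\s+"
    r"(?P<failures>\d+)\s+(?P<status>.+?)\s+(?P<runtime>\d+\.\d+)$"
)


def parse_status(text):
    """Return {node: row dict} for every data row; the header row is skipped."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match.group("node")] = {
                "files": int(match.group("files")),
                "size": match.group("size"),
                "scanned": int(match.group("scanned")),
                "failures": int(match.group("failures")),
                "status": match.group("status"),
                "runtime": float(match.group("runtime")),
            }
    return rows


def safe_to_commit(rows, node):
    """True only if the node hosting the removed brick completed with no failures."""
    row = rows.get(node)
    return row is not None and row["status"] == "completed" and row["failures"] == 0


rows = parse_status(STATUS_OUTPUT)
print(safe_to_commit(rows, "10.70.34.86"))
```

This only confirms that data migration finished on the node whose brick is being removed; it says nothing about the glusterd crash itself, which occurred later during volume stop.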
From the logs this seems to be a crash caused by the volume stop command. There appears to be a race in the cleanup of the rpc transport glusterd uses to connect with the brick, leading to a double free and the crash.
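The kind of race described above can be illustrated with a small analogy. This is a sketch in Python, not glusterd code and not the proposed fix: `Transport`, `Brick`, and `disconnect` are hypothetical names, and the "double free" is modelled as a second `release()` call raising an error. It shows one common way to make teardown safe when two paths (for example a stop path and a disconnect notification) may both try to clean up the same connection: detach the reference under a lock so that only one caller ever owns the release.

```python
import threading


class Transport:
    """Hypothetical stand-in for a brick RPC connection."""

    def __init__(self):
        self.released = False

    def release(self):
        # Models a double free: releasing twice is an error.
        if self.released:
            raise RuntimeError("transport released twice")
        self.released = True


class Brick:
    """Hypothetical owner of the transport."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rpc = Transport()

    def disconnect(self):
        """Release the transport at most once, whichever caller gets here first."""
        with self._lock:
            # Take ownership and clear the shared reference atomically.
            rpc, self.rpc = self.rpc, None
        if rpc is None:
            return False  # another path already cleaned up
        rpc.release()
        return True


brick = Brick()
results = []
threads = [
    threading.Thread(target=lambda: results.append(brick.disconnect()))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Whichever thread runs first performs the release; the other sees `None` and returns without touching the transport.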
proposed fix @ http://review.gluster.org/5512
Downstream fix at https://code.engineering.redhat.com/gerrit/11341
Version :
========
Found this crash while trying to verify this bug.
Followed the same steps as mentioned in 'Steps to Reproduce' (FUSE and NFS mount).

[root@junior glusterfs]# service glusterd status
glusterd dead but pid file exists

---------------Part of the log---------------------
[2013-08-16 09:48:41.311540] I [glusterd-utils.c:3560:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2013-08-16 09:48:41.311728] I [glusterd-utils.c:3565:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2013-08-16 09:48:41.311908] I [glusterd-utils.c:3570:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2013-08-16 09:48:41.312088] I [glusterd-utils.c:3575:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2013-08-16 09:48:41.312268] I [glusterd-utils.c:3580:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2013-08-16 09:48:41.312503] I [glusterd-utils.c:3585:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2013-08-16 09:48:42.319659] E [glusterd-utils.c:3526:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/0d536d1ec2d14cfee8af0da42b3a6df3.socket error: No such file or directory
[2013-08-16 09:48:42.319945] E [glusterd-hooks.c:291:glusterd_hooks_run_hooks] 0-management: Failed to open dir /var/lib/glusterd/hooks/1/stop/post, due to No such file or directory
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-16 09:48:42
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.20rhs
/lib64/libc.so.6[0x397d232920]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7fb288d9b222]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fb288d8e450]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7fb28c827180]
/lib64/libpthread.so.0[0x397da07851]
/lib64/libc.so.6(clone+0x6d)[0x397d2e890d]
---------------------------------------------------------------------
sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/990125/16_Aug_990125/
Missed specifying the version in comment 7 : 3.4.0.20rhs-2.el6rhs.x86_64
Version : 3.4.0.20rhs-2.el6rhs.x86_64
========
Faced glusterd crash again while stopping the volume.

Steps :
------
1) Created a 2x2 distributed replicate volume

2) Create some files on the mount point

for i in {100..1000} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3) While file creation is in progress, bring down one brick in the replica pair

4) After file creation is completed, bring the brick back online

gluster v start <vol_name> force

5) Execute heal command

gluster v heal Vol3 full
gluster v heal Vol3 info

6) On the mount point, deleted all files

7) Stop the volume

gluster v stop Vol3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Connection failed. Please check if gluster daemon is operational.

service glusterd status
glusterd dead but pid file exists

---------------------Part of Log --------------------------
[2013-08-20 10:16:14.645926] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2013-08-20 10:16:14.645943] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2013-08-20 10:16:14.646027] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
[2013-08-20 10:16:14.646062] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=1 total=2
[2013-08-20 10:16:14.646075] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=1 total=2
[2013-08-20 10:16:14.646188] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-20 10:16:14
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.20rhs
/lib64/libc.so.6[0x3b0ea32920]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7f3861bd0222]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f3861bc3450]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7f386565c180]
/lib64/libpthread.so.0[0x3b0f207851]
/lib64/libc.so.6(clone+0x6d)[0x3b0eae890d]
--------------------------------------------------------------------
sosreports for comment 10 : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/990125/20_Aug_990125/
https://code.engineering.redhat.com/gerrit/#/c/11710/
Version : glusterfs-3.4.0.22rhs-1
========
Repeated the steps as mentioned in 'Steps to Reproduce' and Comment 10; did not face glusterd crash.
Marking the bug as Verified.
Followed the same steps as mentioned in the bug and Comment 10, after which I stopped the volume and deleted it. glusterd crashed, which did not occur last time.
Moving the bug back to 'Assigned'.

------------Part of log------------------
[2013-08-26 08:35:42.537781] E [glusterd-utils.c:1335:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/e010aa1a569ea85f32dd59cc65072e7f.socket error: No such file or directory
[2013-08-26 08:35:43.683117] E [glusterd-utils.c:3526:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/69da6e6bc4924ccbcf633c59b5fe3d25.socket error: Permission denied
[2013-08-26 08:35:43.683467] I [glusterd-utils.c:3560:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2013-08-26 08:35:43.683702] I [glusterd-utils.c:3565:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2013-08-26 08:35:43.683890] I [glusterd-utils.c:3570:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2013-08-26 08:35:43.684087] I [glusterd-utils.c:3575:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2013-08-26 08:35:43.684288] I [glusterd-utils.c:3580:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2013-08-26 08:35:43.684493] I [glusterd-utils.c:3585:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2013-08-26 08:35:44.684800] E [glusterd-utils.c:3526:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/0d536d1ec2d14cfee8af0da42b3a6df3.socket error: No such file or directory
[2013-08-26 08:35:44.684998] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /rhs/brick1/a5 on port 49272
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-26 08:35:44
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.22rhs
/lib64/libc.so.6[0x397d232920]
/usr/lib64/glusterfs/3.4.0.22rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7fb5a17d42f2]
/usr/lib64/glusterfs/3.4.0.22rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fb5a17c7520]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7fb5a5261120]
/lib64/libpthread.so.0[0x397da07851]
/lib64/libc.so.6(clone+0x6d)[0x397d2e890d]
-----------------------------------------------------------------------

sos reports at : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/990125/990125_26_Aug/
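The backtraces in this report can be compared mechanically. The sketch below, in Python, extracts the symbol names from the frame lines of a glusterd crash log and returns the leading frames two traces share. The frame format (`object(symbol+0xoffset)[0xaddress]`) is assumed from the logs quoted in this bug; `frame_symbols` and `common_leading_frames` are hypothetical helper names, and the two samples are excerpts from the 3.4.0.13rhs and 3.4.0.22rhs traces above.

```python
import re

# Frame format assumed from the logs in this report:
#   /path/lib.so(symbol+0x9a)[0x7f2a96da4b1a]   symbol present
#   /path/lib.so(+0xa491)[0x7f2a96b08491]       symbol stripped
#   /path/lib.so[0x3b0ea32920]                  no parenthesised part
FRAME = re.compile(
    r"^(?P<object>[^(\[\s]+)"
    r"(?:\((?P<symbol>[^)+]*)(?:\+0x[0-9a-f]+)?\))?"
    r"\[0x[0-9a-f]+\]$"
)

TRACE_13RHS = """\
package-string: glusterfs 3.4.0.13rhs
/lib64/libc.so.6[0x3b0ea32920]
/usr/lib64/glusterfs/3.4.0.13rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x9a)[0x7f2a96da4b1a]
/usr/lib64/glusterfs/3.4.0.13rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f2a96d988e0]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x177)[0x38b020df67]
/usr/lib64/glusterfs/3.4.0.13rhs/rpc-transport/socket.so(+0xa491)[0x7f2a96b08491]
"""

TRACE_22RHS = """\
package-string: glusterfs 3.4.0.22rhs
/lib64/libc.so.6[0x397d232920]
/usr/lib64/glusterfs/3.4.0.22rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7fb5a17d42f2]
/usr/lib64/glusterfs/3.4.0.22rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fb5a17c7520]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7fb5a5261120]
"""


def frame_symbols(log_text):
    """Return the named symbols of all backtrace frames, innermost first."""
    symbols = []
    for line in log_text.splitlines():
        match = FRAME.match(line.strip())
        # Frames without a symbol name (None or empty) are skipped.
        if match and match.group("symbol"):
            symbols.append(match.group("symbol"))
    return symbols


def common_leading_frames(first, second):
    """Return the symbols both traces share, starting from the crash site."""
    common = []
    for a, b in zip(first, second):
        if a != b:
            break
        common.append(a)
    return common


print(common_leading_frames(frame_symbols(TRACE_13RHS), frame_symbols(TRACE_22RHS)))
```

On these excerpts the shared frames are `__glusterd_brick_rpc_notify` and `glusterd_big_locked_notify`, while the callers differ (`rpc_clnt_notify` in the first trace, `gf_timer_proc` in the later one).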
https://code.engineering.redhat.com/gerrit/#/c/12188/
Version :
============
gluster --version
glusterfs 3.4.0.30rhs built on Aug 30 2013 08:15:37

Repeated the steps as mentioned in 'Steps to Reproduce' and Comment 10; did not face glusterd crash.
Marking the bug as 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html