Description of problem:
glusterd crashed after initiating the "remove-brick start" command to remove a pair of bricks from the volume.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.52rhs.el6rhs

How reproducible:
Happened once, never tried to recreate it.

Steps to Reproduce:
1. Detach the host, which is serving bricks for the volume, from the cluster
2. Retry detaching the same host
3. Stop and delete the other volume from another RHSS node
4. Retry detaching the same host again
5. Remove a pair of bricks (with data migration)

Actual results:
After initiating the "remove-brick" command, glusterd crashed.

Expected results:
glusterd should not crash, and remove-brick should be successful.

Additional info:

SETUP INFORMATION
==================
1. All RHSS nodes installed with RHSS-2.1-20131223.n.0-RHS-x86_64-DVD1.iso
2. No additional packages installed
3. Cluster of 4 RHSS nodes was created:
   RHSS1 - 10.70.37.86
   RHSS2 - 10.70.37.187
   RHSS3 - 10.70.37.46
   RHSS4 - 10.70.37.198
4. Distributed-replicate volume (3x2) was created and optimized for virt store, i.e.
   gluster volume set <vol-name> group virt
   gluster volume set <vol-name> storage.owner-uid 36
   gluster volume set <vol-name> storage.owner-gid 36
   This volume was up and running.
5. Volume information:

[Thu Jan 2 06:56:19 UTC 2014 root.37.187:~ ] # gluster volume info

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: b6e2ed30-370d-46b1-9071-fe851aa57caf
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drdir1
Brick2: 10.70.37.187:/rhs/brick1/drdir1
Brick3: 10.70.37.86:/rhs/brick2/drdir2
Brick4: 10.70.37.187:/rhs/brick2/drdir2
Brick5: 10.70.37.46:/rhs/brick1/add-disk1
Brick6: 10.70.37.198:/rhs/brick1/add-disk1
Options Reconfigured:
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 36
storage.owner-gid: 36

STEPS PERFORMED
===============
1. Created a 4-node cluster, i.e.
   gluster peer probe <RHSS-Node>
2. Created 2 distributed-replicate volumes.
   NOTE: All of these volumes were created for serving a VM image store, i.e.
   gluster volume set <vol-name> group virt
   gluster volume set <vol-name> storage.owner-uid 36
   gluster volume set <vol-name> storage.owner-gid 36
3. Detached the host which is serving bricks for the volume, i.e.
   gluster peer detach <RHSS-Node>
   This failed with a suitable error message.
4. Retried step 3. Again this failed.
5. Stopped and removed one of the volumes from another RHSS node, i.e.
   gluster volume stop <vol-name>
   gluster volume delete <vol-name>
6. Retried step 3. Again this failed.
7. Removed a pair of bricks, i.e.
   gluster volume remove-brick <vol-name> <brick1> <brick2> start
   The command hung for around ~1.5 minutes and glusterd crashed.

CONSOLE LOG INFORMATION
=======================
All commands were executed from RHSS2 - 10.70.37.187.
The following is the console log from when the bug was encountered:

[Thu Jan 2 06:55:12 UTC 2014 root.37.187:~ ] # gluster peer detach 10.70.37.46
peer detach: failed: One of the peers is probably down.
Check with 'peer status'

[Thu Jan 2 06:55:23 UTC 2014 root.37.187:~ ] # gluster pe s
Number of Peers: 3

Hostname: 10.70.37.86
Uuid: 91a0bc99-3b1a-4c53-94b5-72864bce512d
State: Peer in Cluster (Connected)

Hostname: 10.70.37.46
Uuid: 41924551-eac0-4c22-98a1-adff049dbba8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.198
Uuid: 5c657325-6e34-4110-8c83-c2b7405a2403
State: Peer in Cluster (Disconnected)

[Thu Jan 2 06:55:29 UTC 2014 root.37.187:~ ] # gluster peer detach 10.70.37.46
peer detach: failed: Brick(s) with the peer 10.70.37.46 exist in cluster

[Thu Jan 2 06:55:58 UTC 2014 root.37.187:~ ] # gluster pe s
Number of Peers: 3

Hostname: 10.70.37.86
Uuid: 91a0bc99-3b1a-4c53-94b5-72864bce512d
State: Peer in Cluster (Connected)

Hostname: 10.70.37.46
Uuid: 41924551-eac0-4c22-98a1-adff049dbba8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.198
Uuid: 5c657325-6e34-4110-8c83-c2b7405a2403
State: Peer in Cluster (Connected)

[Thu Jan 2 06:56:01 UTC 2014 root.37.187:~ ] # gluster peer detach 10.70.37.46
peer detach: failed: Brick(s) with the peer 10.70.37.46 exist in cluster

[Thu Jan 2 06:56:07 UTC 2014 root.37.187:~ ] # gluster pe s
Number of Peers: 3

Hostname: 10.70.37.86
Uuid: 91a0bc99-3b1a-4c53-94b5-72864bce512d
State: Peer in Cluster (Connected)

Hostname: 10.70.37.46
Uuid: 41924551-eac0-4c22-98a1-adff049dbba8
State: Peer in Cluster (Connected)

Hostname: 10.70.37.198
Uuid: 5c657325-6e34-4110-8c83-c2b7405a2403
State: Peer in Cluster (Connected)

[Thu Jan 2 06:56:19 UTC 2014 root.37.187:~ ] # gluster volume info

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: b6e2ed30-370d-46b1-9071-fe851aa57caf
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drdir1
Brick2: 10.70.37.187:/rhs/brick1/drdir1
Brick3: 10.70.37.86:/rhs/brick2/drdir2
Brick4: 10.70.37.187:/rhs/brick2/drdir2
Brick5: 10.70.37.46:/rhs/brick1/add-disk1
Brick6: 10.70.37.198:/rhs/brick1/add-disk1
Options Reconfigured:
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 36
storage.owner-gid: 36

Volume Name: distrep2
Type: Distributed-Replicate
Volume ID: a64bd787-178e-44bc-9bae-1620cb665538
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.46:/rhs/brick1/dir1
Brick2: 10.70.37.198:/rhs/brick1/dir1
Brick3: 10.70.37.46:/rhs/brick2/dir2
Brick4: 10.70.37.198:/rhs/brick2/dir2
Options Reconfigured:
nfs.disable: off
user.cifs: enable
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36

[Thu Jan 2 06:56:24 UTC 2014 root.37.187:~ ] # gluster volume info

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: b6e2ed30-370d-46b1-9071-fe851aa57caf
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drdir1
Brick2: 10.70.37.187:/rhs/brick1/drdir1
Brick3: 10.70.37.86:/rhs/brick2/drdir2
Brick4: 10.70.37.187:/rhs/brick2/drdir2
Brick5: 10.70.37.46:/rhs/brick1/add-disk1
Brick6: 10.70.37.198:/rhs/brick1/add-disk1
Options Reconfigured:
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 36
storage.owner-gid: 36

[Thu Jan 2 06:57:17 UTC 2014 root.37.187:~ ] # gluster peer detach 10.70.37.46
peer detach: failed: Brick(s) with the peer 10.70.37.46 exist in cluster

[Thu Jan 2 06:57:22 UTC 2014 root.37.187:~ ] # gluster volume remove-brick distrep 10.70.37.46:/rhs/brick1/add-disk1 10.70.37.198:/rhs/brick1/add-disk1 start
Connection failed. Please check if gluster daemon is operational.

[Thu Jan 2 06:59:52 UTC 2014 root.37.187:~ ] # service glusterd status
glusterd dead but pid file exists

[Thu Jan 2 07:00:09 UTC 2014 root.37.187:~ ] # ls /var/log/core
core.3758.1388645992.dump

[Thu Jan 2 07:00:48 UTC 2014 root.37.187:~ ] # ls /var/log/core -lh
total 44M
-rw------- 1 root root 110M Jan 2 01:59 core.3758.1388645992.dump
Error snippet from the glusterd log file (/var/log/glusterd/etc-glusterfs-glusterd.vol.log) on 10.70.37.187, where glusterd crashed:

[2014-01-02 06:57:02.090532] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-02 06:57:02.090606] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2014-01-02 06:57:02.090630] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2014-01-02 06:57:03.092282] E [glusterd-utils.c:4006:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/c0dfc1a7171d0c097f48b95e254f0809.socket error: No such file or directory
[2014-01-02 06:57:03.100297] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-02 06:57:03.100376] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-02 06:57:03.100461] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-02 06:57:03.100528] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2014-01-02 06:57:03.100544] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2014-01-02 06:57:03.100545] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2014-01-02 06:57:03.100790] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-02 06:57:03.100805] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-02 06:57:03.101018] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2014-01-02 06:57:17.383116] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:57:17.384307] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:57:22.555336] I [glusterd-handler.c:916:__glusterd_handle_cli_deprobe] 0-glusterd: Received CLI deprobe req
[2014-01-02 06:57:25.938661] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:57:26.228262] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:57:26.229347] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:58:22.128594] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:58:22.454366] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:58:22.455447] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:06.511463] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:06.836698] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:06.837477] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:28.571665] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:28.855855] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:28.856770] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:34.040168] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:34.329708] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:34.330747] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:39.534002] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:39.850115] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:39.850858] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:52.557460] I [glusterd-brick-ops.c:663:__glusterd_handle_remove_brick] 0-management: Received rem brick req

pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-01-02 06:59:52
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.52rhs
/lib64/libc.so.6(+0x32960)[0x7fe1494a8960]
/usr/lib64/glusterfs/3.4.0.52rhs/xlator/mgmt/glusterd.so(__glusterd_handle_remove_brick+0x78a)[0x7fe145c8230a]
/usr/lib64/glusterfs/3.4.0.52rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fe145c1278f]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7fe14a4410c2]
/lib64/libc.so.6(+0x43bb0)[0x7fe1494b9bb0]
---------
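For anyone analyzing the attached core dump, a backtrace can usually be obtained by running gdb against the glusterd binary and the core file; the binary path below assumes a standard RHS install, with the matching glusterfs debuginfo packages installed for symbol resolution:

    gdb /usr/sbin/glusterd /var/log/core/core.3758.1388645992.dump
    (gdb) bt
    (gdb) thread apply all bt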
Created attachment 844372 [details]
core dump

Core dump that was available on the RHSS node (10.70.37.187) where glusterd crashed.
Created attachment 844374 [details]
gluster log file

glusterd log file available on the RHSS node where glusterd crashed.
Per triage 1/2, removing from list for corbett
The remove-brick operation as such doesn't cause glusterd to crash, but following the steps described in comment 0 leads to a glusterd crash. So, removing the blocker flag from this bug.
Crash occurs in the following scenario:
1. Create a dist-rep volume on a trusted storage pool.
2. Add a new node to the pool by peer-probing it.
3. Perform a remove-brick operation from this new node.
4. This causes the glusterd on the new node to crash.
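As an illustration only (the volume name, hostnames, and brick paths below are placeholders, not values from this report), the scenario boils down to roughly this command sequence, with the remove-brick issued from the freshly probed node:

    # on an existing pool member
    gluster volume create testvol replica 2 node1:/rhs/brick1/b1 node2:/rhs/brick1/b1 node1:/rhs/brick2/b2 node2:/rhs/brick2/b2
    gluster volume start testvol
    gluster peer probe newnode

    # on newnode, right after the probe succeeds
    gluster volume remove-brick testvol node1:/rhs/brick2/b2 node2:/rhs/brick2/b2 start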
Downstream patch https://code.engineering.redhat.com/gerrit/17984
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1047747#c7, adding this back to the list for u2.
Tested with glusterfs-3.4.0.55rhs-1 with the following steps:
1. Created a trusted storage pool with 2 RHSS nodes
2. Created a 2x2 distributed-replicate volume
3. Started the volume
4. FUSE mounted the volume and started writing a few files onto the mount, i.e.
   mount.glusterfs <RHSS-Node>:<vol-name> <mount-point>
   for i in {1..100}; do dd if=/dev/urandom of=<mount>/file$i bs=1024k count=100; done
5. Added a pair of bricks to make the volume a 3x2 distributed-replicate
6. Started rebalance, i.e.
   gluster volume rebalance <vol-name> start
7. After rebalance completed successfully, tried to peer probe a new node, i.e.
   gluster peer probe <RHSS-Node>
8. Immediately after peer probe returned success, tried to remove a pair of bricks from the newly probed node, i.e.
   gluster volume remove-brick <vol-name> <brick1> <brick2> start

remove-brick completed successfully and committing the removed bricks also succeeded (see the sketch below).
No glusterd crash was seen.
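For reference, the completion and commit of the removal mentioned above would typically be checked with commands along these lines (volume and brick names are placeholders):

    # watch the data migration until it reports "completed"
    gluster volume remove-brick <vol-name> <brick1> <brick2> status
    # then finalize the removal
    gluster volume remove-brick <vol-name> <brick1> <brick2> commit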
While performing the steps to reproduce provided in comment 0, I was unaware that iptables rules blocking all incoming glusterd traffic were already in place, and that is what simulated the scenario described by Ravi in comment 7. Performing the steps mentioned in comment 0 for verification of this bug.
Tested the following with glusterfs-3.4.0.55rhs-1.

Performed the steps as follows:
1. Created a 4-node trusted storage pool
2. Created 2 distributed-replicate volumes, one 3x2 and the other 2x2
3. Blocked all glusterd traffic from all other nodes, i.e.
   iptables -I INPUT 1 -p tcp --dport 24007 -j DROP
4. Removed/deleted one of the volumes after stopping it
5. Flushed the iptables rules on the RHSS node (see the sketch below)
6. Started remove-brick, which included a brick from the node where the iptables rules were just flushed
7. remove-brick was successful and no glusterd crashes were found

Apart from the above, performed the test steps in comment 10 and tested the same scenario with the following operations on the newly probed peer:
a. remove-brick
b. remove-brick start
c. remove-brick commit
d. rebalance
e. add-brick

There was no glusterd crash.
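As a sketch of the block/unblock used in steps 3 and 5 above (flushing the INPUT chain this way assumes no other rules in that chain need to be preserved):

    # block glusterd management traffic (24007/tcp) on the node
    iptables -I INPUT 1 -p tcp --dport 24007 -j DROP
    # later, remove the rule again, either by deleting it explicitly
    iptables -D INPUT -p tcp --dport 24007 -j DROP
    # or by flushing the INPUT chain
    iptables -F INPUT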
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html