Bug 1047747
| Field | Value |
| --- | --- |
| Summary | glusterd crashed after initiating 'remove brick start' |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | glusterfs |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 2.1 |
| Target Release | RHGS 2.1.2 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | glusterfs-3.4.0.54rhs |
| Doc Type | Bug Fix |
| Reporter | SATHEESARAN <sasundar> |
| Assignee | Ravishankar N <ravishankar> |
| QA Contact | senaik |
| CC | grajaiya, vagarwal, vbellur |
| Keywords | ZStream |
| Clones | 1047955 (view as bug list) |
| Bug Blocks | 1047955 |
| Last Closed | 2014-02-25 08:13:43 UTC |
| Type | Bug |
Description (SATHEESARAN, 2014-01-02 06:23:29 UTC)
Error snippet from the glusterd log file (/var/log/glusterd/etc-glusterfs-glusterd.vol.log) on 10.70.37.187, where glusterd crashed:

```
[2014-01-02 06:57:02.090532] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-02 06:57:02.090606] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2014-01-02 06:57:02.090630] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2014-01-02 06:57:03.092282] E [glusterd-utils.c:4006:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/c0dfc1a7171d0c097f48b95e254f0809.socket error: No such file or directory
[2014-01-02 06:57:03.100297] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-02 06:57:03.100376] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-02 06:57:03.100461] I [rpc-clnt.c:977:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-02 06:57:03.100528] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2014-01-02 06:57:03.100544] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2014-01-02 06:57:03.100545] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2014-01-02 06:57:03.100790] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-02 06:57:03.100805] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-02 06:57:03.101018] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2014-01-02 06:57:17.383116] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:57:17.384307] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:57:22.555336] I [glusterd-handler.c:916:__glusterd_handle_cli_deprobe] 0-glusterd: Received CLI deprobe req
[2014-01-02 06:57:25.938661] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:57:26.228262] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:57:26.229347] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:58:22.128594] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:58:22.454366] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:58:22.455447] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:06.511463] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:06.836698] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:06.837477] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:28.571665] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:28.855855] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:28.856770] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:34.040168] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:34.329708] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:34.330747] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:39.534002] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-01-02 06:59:39.850115] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:39.850858] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2014-01-02 06:59:52.557460] I [glusterd-brick-ops.c:663:__glusterd_handle_remove_brick] 0-management: Received rem brick req
pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-01-02 06:59:52
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.52rhs
/lib64/libc.so.6(+0x32960)[0x7fe1494a8960]
/usr/lib64/glusterfs/3.4.0.52rhs/xlator/mgmt/glusterd.so(__glusterd_handle_remove_brick+0x78a)[0x7fe145c8230a]
/usr/lib64/glusterfs/3.4.0.52rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fe145c1278f]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7fe14a4410c2]
/lib64/libc.so.6(+0x43bb0)[0x7fe1494b9bb0]
---------
```

Created attachment 844372 [details]
core dump
Core dump from the RHSS node (10.70.37.187) where glusterd crashed
Created attachment 844374 [details]
gluster log file
glusterd log file from the RHSS node where glusterd crashed
Per triage on 1/2, removing from the list for Corbett: a remove-brick operation as such does not cause glusterd to crash, but following the steps described in comment 0 leads to a glusterd crash. So, removing the blocker flag for this bug.

The crash occurs in the following scenario:
1. Create a dist-rep volume on a trusted storage pool.
2. Add a new node to the pool by peer-probing it.
3. Perform a remove-brick operation from this new node.
4. This causes the glusterd on the new node to crash.

Downstream patch: https://code.engineering.redhat.com/gerrit/17984

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1047747#c7, adding this back to the list for u2.

Tested with glusterfs-3.4.0.55rhs-1 using the following steps:
1. Created a trusted storage pool with 2 RHSS nodes.
2. Created a 2x2 distributed-replicate volume.
3. Started the volume.
4. FUSE-mounted the volume and started writing files to the mount, i.e.
   mount.glusterfs <RHSS Node>:<vol-name> <mount-point>
   for i in {1..100}; do dd if=/dev/urandom of=<mount>/file$i bs=1024k count=100; done
5. Added a pair of bricks to make the volume a 3x2 distributed-replicate volume.
6. Started rebalance, i.e. gluster volume rebalance <vol-name> start
7. After rebalance completed successfully, tried to peer-probe a new node, i.e. gluster peer probe <RHSS-Node>
8. Immediately after the peer probe returned success, tried to remove a pair of bricks from the newly probed node, i.e. gluster volume remove-brick <vol-name> <brick1> <brick2> start

remove-brick completed successfully and committing the removed bricks also succeeded. No glusterd crash was seen.

Regarding the steps to reproduce provided in comment 0: I was totally unaware that iptables rules blocking all incoming glusterd traffic were in place, and that simulated the steps described by Ravi in comment 7. Performing the steps mentioned in comment 0 for verification of this bug.

Tested the following with glusterfs-3.4.0.55rhs-1. Performed the steps as follows:
1. Created a 4-node trusted storage pool.
2. Created 2 distributed-replicate volumes, one 3x2 and the other 2x2.
3. Blocked all glusterd traffic from all other nodes, i.e. iptables -I INPUT 1 -p tcp --dport 24007 -j DROP
4. Removed/deleted one of the volumes after stopping it.
5. Flushed the iptables rules on the RHSS node.
6. Started remove-brick, which included a brick from the node where the iptables rules had just been flushed.
7. remove-brick was successful and no glusterd crashes were found.

In addition, performed the test steps in comment 10 and tested the same scenario with the following operations on the newly probed peer:
a. remove-brick
b. remove-brick start
c. remove-brick commit
d. rebalance
e. add-brick
There was no glusterd crash.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
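The crash scenario from comment 7 boils down to two CLI commands issued back to back against a freshly probed node. A minimal sketch follows; the volume name, node address, and brick paths are placeholders (none of these values come from the bug report), and the script deliberately does nothing but print the commands on a host without the gluster CLI:

```shell
#!/bin/sh
# Placeholders -- substitute real values from your own trusted storage pool.
VOL=distrep                      # assumed volume name
NEWNODE=192.0.2.10               # assumed address of the node being probed
BRICK1="$NEWNODE:/rhs/brick1/b1" # assumed brick paths on the new node
BRICK2="$NEWNODE:/rhs/brick2/b2"

if ! command -v gluster >/dev/null 2>&1; then
    # No gluster CLI on this host; just show what would be run.
    echo "would run: gluster peer probe $NEWNODE"
    echo "would run: gluster volume remove-brick $VOL $BRICK1 $BRICK2 start"
else
    # 1. Probe the new node into the trusted storage pool.
    gluster peer probe "$NEWNODE"
    # 2. Immediately start remove-brick involving the newly probed node,
    #    before its volume configuration has synced; on glusterfs 3.4.0.52rhs
    #    this crashed glusterd with SIGSEGV (fixed in 3.4.0.54rhs).
    gluster volume remove-brick "$VOL" "$BRICK1" "$BRICK2" start
fi
```

With the fixed build (glusterfs-3.4.0.54rhs and later), the verification comments above record the same sequence completing cleanly: remove-brick start and commit both succeed and glusterd stays up.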
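The iptables-based verification scenario can be sketched the same way. Port 24007 (glusterd's management port) and the iptables rule are taken from the comment above; the volume and brick names are placeholders, and the privileged branch only runs when both root and the gluster CLI are available:

```shell
#!/bin/sh
# Sketch of the verification steps: block glusterd traffic, stop and delete
# one volume, flush the rule, then run remove-brick. Placeholder names.
VOL=testvol                          # assumed name of the volume to delete
RULE="-p tcp --dport 24007 -j DROP"  # rule quoted in the comment above

if command -v gluster >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    iptables -I INPUT 1 $RULE              # block incoming glusterd traffic
    gluster --mode=script volume stop "$VOL"
    gluster --mode=script volume delete "$VOL"
    iptables -D INPUT $RULE                # remove the rule (the comment
                                           # flushed all rules instead)
    # remove-brick including a brick from this node (placeholder paths):
    gluster volume remove-brick othervol "$(hostname):/rhs/brick1/b1" start
else
    echo "needs root and the gluster CLI; the blocking rule would be:"
    echo "iptables -I INPUT 1 $RULE"
fi
```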