Bug 990125

Summary: glusterd crash while stopping the volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: senaik
Component: glusterfs
Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA
QA Contact: senaik
Severity: urgent
Priority: urgent
Version: 2.1
CC: amarts, rhs-bugs, sasundar, surs, vbellur
Keywords: TestBlocker
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.4.0.27rhs-1
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2013-09-23 22:29:51 UTC
Type: Bug

Description senaik 2013-07-30 12:49:56 UTC
Description of problem:
========================= 
glusterd crashed while stopping the volume, after performing remove-brick operations.


Version-Release number of selected component (if applicable):
============================================================= 
3.4.0.13rhs-1.el6rhs.x86_64

How reproducible:


Steps to Reproduce:
====================== 
1. Create a distribute volume with 5 bricks.

On the RHEL 6.4 client:
--------------
2. Create directories, and files inside those directories.

3. Remove one brick from the volume:
gluster volume remove-brick Vol13 10.70.34.85:/rhs/brick1/p1 start
volume remove-brick start: success
ID: 9071b330-da3e-4bf5-9f9b-28ab6dcbe6e2

gluster volume remove-brick Vol13 10.70.34.85:/rhs/brick1/p1 status

Node   Rebalanced-files  size     scanned  failures     status  run-time in secs

localhost       23       23.0MB     123      0         completed     1.00 
10.70.34.88     0       0Bytes      0        0         not started   0.00
10.70.34.86     0       0Bytes      0        0         not started   0.00
10.70.34.87     0       0Bytes      0        0         not started   0.00

[root@boost brick1]# gluster volume remove-brick Vol13 10.70.34.85:/rhs/brick1/p1 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

 gluster v i Vol13
 
Volume Name: Vol13
Type: Distribute
Volume ID: ec7a3214-2c23-4c2e-be2a-c803312628f2
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.34.86:/rhs/brick1/p2
Brick2: 10.70.34.87:/rhs/brick1/p3
Brick3: 10.70.34.88:/rhs/brick1/p4
Brick4: 10.70.34.85:/rhs/brick1/p5


4. Perform a fix-layout rebalance:

gluster v rebalance Vol13 fix-layout start
volume rebalance: Vol13: success: Starting rebalance on volume Vol13 has been successful.
ID: f1e53c20-b5aa-4c59-8645-87842eba00bb

Checked the hash ranges from the backend for the directories that were created.

5. Delete the files from the mount point and unmount the volume.

On the RHEL 5.9 client:
--------------
6. Mount the volume again and create a directory and files inside it.

7. Remove brick 10.70.34.86:/rhs/brick1/p2 from the volume:

 gluster volume remove-brick Vol13 10.70.34.86:/rhs/brick1/p2 start 
volume remove-brick start: success
ID: 3b3fa27c-5ddb-46fb-8222-05845301a782

gluster volume remove-brick Vol13 10.70.34.86:/rhs/brick1/p2 status

Node   Rebalanced-files  size     scanned  failures     status  run-time in secs

localhost       0       0Bytes      0        0         not started   0.00 
10.70.34.88     0       0Bytes      0        0         not started   0.00
10.70.34.86     22      22.0MB      122      0         completed     0.00
10.70.34.87     7       0Bytes      0        0         not started   0.00

gluster volume remove-brick Vol13 10.70.34.86:/rhs/brick1/p2 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

gluster v i Vol13
 
Volume Name: Vol13
Type: Distribute
Volume ID: ec7a3214-2c23-4c2e-be2a-c803312628f2
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.34.87:/rhs/brick1/p3
Brick2: 10.70.34.88:/rhs/brick1/p4
Brick3: 10.70.34.85:/rhs/brick1/p5


8. Stop the volume 

gluster v stop Vol13
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Connection failed. Please check if gluster daemon is operational.


service glusterd status
glusterd dead but pid file exists


Actual results:
glusterd crashes with SIGSEGV (signal 11) while stopping the volume; the CLI reports "Connection failed. Please check if gluster daemon is operational."

Expected results:
The volume stop should complete without glusterd crashing.

Additional info:
=================== 

Part of the log:
---------------------

[2013-07-30 12:15:36.151673] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-07-30 12:15:36.151691] I [socket.c:3502:socket_init] 0-management: using system polling thread
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-07-30 12:15:36
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.13rhs
/lib64/libc.so.6[0x3b0ea32920]
/usr/lib64/glusterfs/3.4.0.13rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x9a)[0x7f2a96da4b1a]
/usr/lib64/glusterfs/3.4.0.13rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f2a96d988e0]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x177)[0x38b020df67]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x38b0209838]
/usr/lib64/glusterfs/3.4.0.13rhs/rpc-transport/socket.so(+0xa491)[0x7f2a96b08491]
/usr/lib64/libglusterfs.so.0[0x38afe5d3f7]
/usr/sbin/glusterd(main+0x5c6)[0x406856]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3b0ea1ecdd]
/usr/sbin/glusterd[0x4045f9]

Comment 1 senaik 2013-07-30 13:04:09 UTC
arequal-checksum on the mount point failed with the error:

[root@localhost dir2]# /opt/qa/tools/arequal-checksum /mnt/vol13
ftw (/mnt/vol13) returned -1 (No such file or directory), terminating


sosreports:
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/990125/

Comment 4 Kaushal 2013-08-01 11:01:26 UTC
From the logs, this seems to be a crash caused by the volume stop command. There appears to be a race in the cleanup of the rpc transport glusterd uses to connect to the brick, leading to a double free and the crash.
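
For illustration only, here is a minimal, self-contained C sketch of the kind of race described above, together with the usual remedy of handing the pointer off under a lock so that exactly one path frees it. This is not glusterd source code and not the actual fix; the names (fake_transport, brick_ctx, brick_ctx_steal_transport) are made up for the sketch.

--------------- hypothetical illustration (not glusterd code) ---------------
/* build: cc -pthread race_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the rpc/transport object glusterd keeps per brick. */
struct fake_transport {
        int fd;
};

/* Stand-in for the per-brick state that owns the transport. */
struct brick_ctx {
        pthread_mutex_t lock;
        struct fake_transport *trans;
};

/* Take ownership of the transport pointer under the lock.  Whichever path
 * (disconnect notify or volume-stop cleanup) gets here first sees the pointer;
 * the other sees NULL and must not free anything. */
static struct fake_transport *
brick_ctx_steal_transport(struct brick_ctx *ctx)
{
        struct fake_transport *t;

        pthread_mutex_lock(&ctx->lock);
        t = ctx->trans;
        ctx->trans = NULL;
        pthread_mutex_unlock(&ctx->lock);
        return t;
}

/* Models the RPC DISCONNECT notification racing with cleanup. */
static void *
disconnect_notify(void *arg)
{
        struct fake_transport *t = brick_ctx_steal_transport(arg);
        if (t) {
                printf("notify path freed the transport\n");
                free(t);
        }
        return NULL;
}

/* Models the volume-stop cleanup path. */
static void *
stop_cleanup(void *arg)
{
        struct fake_transport *t = brick_ctx_steal_transport(arg);
        if (t) {
                printf("cleanup path freed the transport\n");
                free(t);
        }
        return NULL;
}

int
main(void)
{
        struct brick_ctx ctx;
        pthread_t t1, t2;

        pthread_mutex_init(&ctx.lock, NULL);
        ctx.trans = calloc(1, sizeof(*ctx.trans));

        /* Without the steal-under-lock above, both threads could call free()
         * on the same pointer -- the double free suspected in this bug. */
        pthread_create(&t1, NULL, disconnect_notify, &ctx);
        pthread_create(&t2, NULL, stop_cleanup, &ctx);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        pthread_mutex_destroy(&ctx.lock);
        return 0;
}
------------------------------------------------------------------------------

Without an ownership hand-off of this kind, the disconnect-notify path and the volume-stop cleanup path can both free the same transport, which would explain a SIGSEGV in __glusterd_brick_rpc_notify like the one in the backtrace above.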

Comment 5 Amar Tumballi 2013-08-07 12:41:18 UTC
proposed fix @ http://review.gluster.org/5512

Comment 6 Kaushal 2013-08-12 13:43:03 UTC
Downstream fix at https://code.engineering.redhat.com/gerrit/11341

Comment 7 senaik 2013-08-16 12:51:17 UTC
Version:
========

Hit this crash while trying to verify this bug.
Followed the same steps as mentioned in 'Steps to Reproduce' (FUSE and NFS mounts).

[root@junior glusterfs]# service glusterd status
glusterd dead but pid file exists

---------------Part of the log--------------------- 

[2013-08-16 09:48:41.311540] I [glusterd-utils.c:3560:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2013-08-16 09:48:41.311728] I [glusterd-utils.c:3565:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2013-08-16 09:48:41.311908] I [glusterd-utils.c:3570:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2013-08-16 09:48:41.312088] I [glusterd-utils.c:3575:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2013-08-16 09:48:41.312268] I [glusterd-utils.c:3580:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2013-08-16 09:48:41.312503] I [glusterd-utils.c:3585:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2013-08-16 09:48:42.319659] E [glusterd-utils.c:3526:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/0d536d1ec2d14cfee8af0da42b3a6df3.socket error: No such file or directory
[2013-08-16 09:48:42.319945] E [glusterd-hooks.c:291:glusterd_hooks_run_hooks] 0-management: Failed to open dir /var/lib/glusterd/hooks/1/stop/post, due to No such file or directory
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-16 09:48:42
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.20rhs
/lib64/libc.so.6[0x397d232920]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7fb288d9b222]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fb288d8e450]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7fb28c827180]
/lib64/libpthread.so.0[0x397da07851]
/lib64/libc.so.6(clone+0x6d)[0x397d2e890d]

---------------------------------------------------------------------

Comment 9 senaik 2013-08-16 13:14:31 UTC
Missed specifying the version in comment 7: 3.4.0.20rhs-2.el6rhs.x86_64

Comment 10 senaik 2013-08-20 10:32:03 UTC
Version: 3.4.0.20rhs-2.el6rhs.x86_64
========

Hit the glusterd crash again while stopping the volume.

Steps:
------

1) Create a 2x2 distributed-replicate volume.

2) Create some files on the mount point:
for i in {100..1000} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3) While file creation is in progress, bring down one brick in a replica pair.

4) After file creation completes, bring the brick back online:
gluster v start <vol_name> force

5) Run the heal commands:
gluster v heal Vol3 full
gluster v heal Vol3 info

6) Delete all files from the mount point.

7) Stop the volume:
gluster v stop Vol3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Connection failed. Please check if gluster daemon is operational.

service glusterd status
glusterd dead but pid file exists


---------------------Part of Log -------------------------- 

[2013-08-20 10:16:14.645926] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2013-08-20 10:16:14.645943] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2013-08-20 10:16:14.646027] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
[2013-08-20 10:16:14.646062] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=1 total=2
[2013-08-20 10:16:14.646075] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=1 total=2
[2013-08-20 10:16:14.646188] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-20 10:16:14
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.20rhs
/lib64/libc.so.6[0x3b0ea32920]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7f3861bd0222]
/usr/lib64/glusterfs/3.4.0.20rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f3861bc3450]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7f386565c180]
/lib64/libpthread.so.0[0x3b0f207851]
/lib64/libc.so.6(clone+0x6d)[0x3b0eae890d]

--------------------------------------------------------------------

Comment 13 senaik 2013-08-26 08:39:01 UTC
Version: glusterfs-3.4.0.22rhs-1
========
Repeated the steps mentioned in 'Steps to Reproduce' and comment 10; did not hit the glusterd crash.

Marking the bug as Verified

Comment 14 senaik 2013-08-27 06:19:02 UTC
Followed the same steps as mentioned in the bug and comment 10, then stopped and deleted the volume; glusterd crashed, which did not happen last time.

Moving the bug back to 'Assigned' 

------------Part of log------------------

[2013-08-26 08:35:42.537781] E [glusterd-utils.c:1335:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/e010aa1a569ea85f32dd59cc65072e7f.socket error: No such file or directory
[2013-08-26 08:35:43.683117] E [glusterd-utils.c:3526:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/69da6e6bc4924ccbcf633c59b5fe3d25.socket error: Permission denied
[2013-08-26 08:35:43.683467] I [glusterd-utils.c:3560:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2013-08-26 08:35:43.683702] I [glusterd-utils.c:3565:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2013-08-26 08:35:43.683890] I [glusterd-utils.c:3570:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2013-08-26 08:35:43.684087] I [glusterd-utils.c:3575:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2013-08-26 08:35:43.684288] I [glusterd-utils.c:3580:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2013-08-26 08:35:43.684493] I [glusterd-utils.c:3585:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2013-08-26 08:35:44.684800] E [glusterd-utils.c:3526:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/0d536d1ec2d14cfee8af0da42b3a6df3.socket error: No such file or directory
[2013-08-26 08:35:44.684998] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /rhs/brick1/a5 on port 49272

pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-08-26 08:35:44
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.22rhs
/lib64/libc.so.6[0x397d232920]
/usr/lib64/glusterfs/3.4.0.22rhs/xlator/mgmt/glusterd.so(__glusterd_brick_rpc_notify+0x92)[0x7fb5a17d42f2]
/usr/lib64/glusterfs/3.4.0.22rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fb5a17c7520]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xd0)[0x7fb5a5261120]
/lib64/libpthread.so.0[0x397da07851]
/lib64/libc.so.6(clone+0x6d)[0x397d2e890d]

----------------------------------------------------------------------- 

sos reports at : 

http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/990125/990125_26_Aug/

Comment 16 senaik 2013-09-03 10:36:05 UTC
Version:
============ 
gluster --version
glusterfs 3.4.0.30rhs built on Aug 30 2013 08:15:37

Repeated the steps mentioned in 'Steps to Reproduce' and comment 10; did not hit the glusterd crash.

Marking the bug as 'Verified'

Comment 17 Scott Haines 2013-09-23 22:29:51 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html