Description of problem:
-------------------------
glusterd is seen to crash with the following backtrace -

pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-12-16 07:13:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.49rhs
/lib64/libc.so.6[0x309fc32960]
/usr/lib64/libglusterfs.so.0(synctask_yield+0x10)[0x30a104ae00]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(gd_stop_rebalance_process+0x2c5)[0x7fd2cae55f45]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(gd_check_and_update_rebalance_info+0xd8)[0x7fd2cae5d368]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_import_friend_volume+0x19f)[0x7fd2cae6b40f]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_import_friend_volumes+0x66)[0x7fd2cae6b536]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_compare_friend_data+0x142)[0x7fd2cae6b752]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(+0x378ac)[0x7fd2cae478ac]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x19e)[0x7fd2cae47f2e]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(__glusterd_handle_incoming_friend_req+0x2fe)[0x7fd2cae4672e]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd2cae3619f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x295)[0x30a1809585]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x30a18097c3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x30a180adf8]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0x8d86)[0x7fd2c94bcd86]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0xa69d)[0x7fd2c94be69d]
/usr/lib64/libglusterfs.so.0[0x30a1062387]
/usr/sbin/glusterd(main+0x6c7)[0x4069d7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x309fc1ecdd]
/usr/sbin/glusterd[0x404619]
---------

I had performed rebalance and remove-brick a couple of times, then restarted glusterd.

Version-Release number of selected component (if applicable):
glusterfs 3.4.0.49rhs

How reproducible:
Seen it once.

Steps to Reproduce:
1. Perform rebalance and remove-brick a few times.
2. Restart glusterd a few times.

Actual results:
glusterd crashed.

Expected results:
glusterd should not crash.

Additional info:
Created attachment 837263 [details] core
Shruti, this issue is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1024316, which has the BLOCKER flag set.
I am not working on the glusterd component; I just came across this and loaded the core.

Quick core analysis:

(gdb) where
#0  synctask_yield (task=0x0) at syncop.c:247
#1  0x00007fd2cae55f45 in gd_stop_rebalance_process (volinfo=0x15a28c0) at glusterd-utils.c:9102
#2  0x00007fd2cae5d368 in gd_check_and_update_rebalance_info (old_volinfo=0x15a28c0, new_volinfo=0x15b1c40) at glusterd-utils.c:3241
#3  0x00007fd2cae6b40f in glusterd_import_friend_volume (vols=0x7fd2cd0cbef0, count=2) at glusterd-utils.c:3287
#4  0x00007fd2cae6b536 in glusterd_import_friend_volumes (vols=0x7fd2cd0cbef0) at glusterd-utils.c:3327
#5  0x00007fd2cae6b752 in glusterd_compare_friend_data (vols=0x7fd2cd0cbef0, status=0x7fffea9c40ec, hostname=0x15a3ac0 "10.70.37.169") at glusterd-utils.c:3471
#6  0x00007fd2cae478ac in glusterd_ac_handle_friend_add_req (event=<value optimized out>, ctx=0x160f7a0) at glusterd-sm.c:654
#7  0x00007fd2cae47f2e in glusterd_friend_sm () at glusterd-sm.c:1026
#8  0x00007fd2cae4672e in __glusterd_handle_incoming_friend_req (req=0x7fd2c96c702c) at glusterd-handler.c:2043
#9  0x00007fd2cae3619f in glusterd_big_locked_handler (req=0x7fd2c96c702c, actor_fn=0x7fd2cae46430 <__glusterd_handle_incoming_friend_req>) at glusterd-handler.c:77
#10 0x00000030a1809585 in rpcsvc_handle_rpc_call (svc=<value optimized out>, trans=<value optimized out>, msg=0x15cdf70) at rpcsvc.c:629
#11 0x00000030a18097c3 in rpcsvc_notify (trans=0x15d07f0, mydata=<value optimized out>, event=<value optimized out>, data=0x15cdf70) at rpcsvc.c:723
#12 0x00000030a180adf8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#13 0x00007fd2c94bcd86 in socket_event_poll_in (this=0x15d07f0) at socket.c:2119
#14 0x00007fd2c94be69d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x15d07f0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2229
#15 0x00000030a1062387 in event_dispatch_epoll_handler (event_pool=0x1584ee0) at event-epoll.c:384
#16 event_dispatch_epoll (event_pool=0x1584ee0) at event-epoll.c:445
#17 0x00000000004069d7 in main (argc=2, argv=0x7fffea9c5ed8) at glusterfsd.c:2050

synctask_yield() segfaults on a NULL pointer dereference: task is 0x0. It looks like synctask_get() in the GD_SYNCOP() macro returned NULL. A glusterd expert can say what causes the synctask to be NULL; validating the inputs and logging would be much better than assuming a valid task and crashing.

-Santosh
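For illustration only, here is a minimal sketch of the defensive style suggested above, not the actual fix posted for this bug. synctask_get() returns the synctask bound to the calling thread, or NULL when the caller is not running inside a syncenv; per frames #10-#16 above, the friend handler ran on the epoll dispatch thread, which would explain the NULL. The helper name guarded_yield is hypothetical.

    /* Sketch: guard the yield instead of dereferencing a NULL synctask. */

    #include <stdio.h>

    struct synctask;

    /* Declarations as in libglusterfs (simplified). */
    extern struct synctask *synctask_get (void);
    extern void synctask_yield (struct synctask *task);

    static int
    guarded_yield (void)
    {
            struct synctask *task = synctask_get ();

            if (!task) {
                    /* Validate and log instead of assuming and crashing. */
                    fprintf (stderr, "no synctask bound to this thread; "
                             "refusing to yield\n");
                    return -1;
            }

            synctask_yield (task);
            return 0;
    }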
Saw another crash later on the same server -

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-12-16 22:13:49
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.49rhs
/lib64/libc.so.6[0x309fc32960]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(__glusterd_defrag_notify+0x1d0)[0x7fd916e095d0]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fd916db93c0]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x109)[0x30a180f539]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x30a180adf8]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0x557c)[0x7fd91543c57c]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0xa5b8)[0x7fd9154415b8]
/usr/lib64/libglusterfs.so.0[0x30a1062387]
/usr/sbin/glusterd(main+0x6c7)[0x4069d7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x309fc1ecdd]
/usr/sbin/glusterd[0x404619]
---------
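This second crash is in the rebalance (defrag) RPC notify path. Without analyzing the new core it is hard to say more, but it looks like the same class of bug: a notify callback dereferencing state that may already have been torn down. Below is a hedged sketch of the defensive pattern, with purely illustrative type and field names (my_defrag_t, my_volinfo_t are not glusterd's actual structures): re-check that the defrag context still exists before touching it, since the rebalance process may have been stopped and its context freed.

    /* Hypothetical notify callback: bail out if the context is gone,
     * and serialize state changes under the context's own lock. */

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
            pthread_mutex_t lock;
            int             connected;
    } my_defrag_t;

    typedef struct {
            my_defrag_t *defrag;    /* may be freed/NULLed on cleanup */
    } my_volinfo_t;

    static int
    my_defrag_notify (my_volinfo_t *volinfo, int event)
    {
            my_defrag_t *defrag = NULL;

            if (!volinfo || !volinfo->defrag)
                    return -1;      /* context already gone: nothing to do */

            defrag = volinfo->defrag;

            pthread_mutex_lock (&defrag->lock);
            {
                    if (event == 1 /* e.g. a DISCONNECT event */)
                            defrag->connected = 0;
            }
            pthread_mutex_unlock (&defrag->lock);

            return 0;
    }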
Saw another crash, find core attached.
Created attachment 837656 [details] core - new
Patch posted for review at https://code.engineering.redhat.com/gerrit/17693
As per discussion with Krishnan, the simplified steps to reproduce the problem are:

Scenario 1
----------
1. Create a distributed-replicate volume.
2. Run rebalance and remove-brick on the volume.
3. Stop the volume and delete the volume.
4. Run some gluster commands.

Result:
-------
No crash in glusterd.

Scenario 2
----------
1. Create a distributed volume using a 2-node cluster.
2. Add a brick and run rebalance on the volume.
3. Bring down one of the nodes.
4. While the node is down, run a volume set command from the other node.
5. After the node comes back up, run some gluster commands.

Result:
-------
No crash in glusterd.

Verified on 3.4.0.54rhs-2.el6rhs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html