Bug 1259992
| Summary: | Glusterd crashed during heals | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Bhaskarakiran <byarlaga> |
| Component: | glusterd | Assignee: | Satish Mohan <smohan> |
| Status: | CLOSED WONTFIX | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | amukherj, mzywusko, nlevinki, rcyriac, rhinduja, sankarshan, sasundar, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | glusterd | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-08 13:22:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1260930 | | |
| Attachments: | new core (attachment 1072035) | | |
Description (Bhaskarakiran, 2015-09-04 05:25:29 UTC)
There is one more glusterd crash while enabling the heal with the `gluster v heal <volname> enable` command. Below is the backtrace. Let me know if I have to file a new bug for this.
Corefile: interstellar.lab.eng.blr.redhat.com:/core.8010 (login root/redhat) if it needs to be looked at.
```
(gdb) bt
#0  0x00007f7c5df3cf8b in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00007f7c542461e7 in glusterd_check_client_op_version_support (
    volname=0x7f7c3c5dabd0 "vol2", op_version=op_version@entry=30703,
    op_errstr=op_errstr@entry=0x7f7c40249720) at glusterd-utils.c:9930
#2  0x00007f7c5421b7f7 in glusterd_op_stage_set_volume (
    dict=dict@entry=0x7f7c3c38ddbc, op_errstr=op_errstr@entry=0x7f7c40249720)
    at glusterd-op-sm.c:1306
#3  0x00007f7c5421e2fb in glusterd_op_stage_validate (op=GD_OP_SET_VOLUME,
    dict=dict@entry=0x7f7c3c38ddbc, op_errstr=op_errstr@entry=0x7f7c40249720,
    rsp_dict=rsp_dict@entry=0x7f7c3c4d4d5c) at glusterd-op-sm.c:5406
#4  0x00007f7c5421e47f in glusterd_op_ac_stage_op (event=0x7f7c3c704190,
    ctx=0x7f7c3c5cb8d0) at glusterd-op-sm.c:5164
#5  0x00007f7c54224a4f in glusterd_op_sm () at glusterd-op-sm.c:7371
#6  0x00007f7c5420b9ab in __glusterd_handle_stage_op (req=req@entry=0x7f7c5fa6eb78)
    at glusterd-handler.c:1022
#7  0x00007f7c54209c00 in glusterd_big_locked_handler (req=0x7f7c5fa6eb78,
    actor_fn=0x7f7c5420b6c0 <__glusterd_handle_stage_op>) at glusterd-handler.c:83
#8  0x00007f7c5f794102 in synctask_wrap (old_task=<optimized out>) at syncop.c:381
#9  0x00007f7c5de520f0 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) q
```
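For readers unfamiliar with frame #1: the crash site is a strcmp() on a volume name taken from a connected-client transport entry. The sketch below is a hypothetical, simplified rendering of that traversal pattern (the type and function names are illustrative, not the actual glusterd code), showing how a freed or unlinked list entry turns into a SIGSEGV inside strcmp:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative stand-ins for glusterd's client-transport types. */
struct peer_info {
        char volname[256];
        int  max_op_version;
};

struct client_xprt {
        struct client_xprt *next;
        struct peer_info    peerinfo;
};

/* Simplified op-version check in the spirit of frame #1: walk the
 * list of connected clients and compare each entry's volume name.
 * If another thread has already freed or unlinked an entry we are
 * still pointing at, strcmp() reads garbage memory and crashes,
 * matching the __strcmp_sse42 frame above. */
static int
check_clients_support (struct client_xprt *head,
                       const char *volname, int op_version)
{
        struct client_xprt *xprt;

        for (xprt = head; xprt != NULL; xprt = xprt->next) {
                if (strcmp (volname, xprt->peerinfo.volname) != 0)
                        continue;
                if (op_version > xprt->peerinfo.max_op_version)
                        return -1; /* connected client is too old */
        }
        return 0;
}

int
main (void)
{
        struct client_xprt c = { .next = NULL };
        strcpy (c.peerinfo.volname, "vol2");
        c.peerinfo.max_op_version = 30702;

        /* 30703 (the op_version from the backtrace) exceeds what
         * this client supports, so the check rejects it. */
        printf ("result: %d\n", check_clients_support (&c, "vol2", 30703));
        return 0;
}
```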
The current RCA outlook is that the (rpc_transport_t *)xprt object got corrupted. While debugging the core file, I saw that the xprt object had been deleted by something else: printing the xprt list in gdb showed the 0xbabebabe address, which is only assigned by a list_del (node deletion) operation. We are still analysing how xprt_list can point to a deleted object in the list while executing the heal disable command; further analysis is ongoing.

Created attachment 1072035 [details]
new core
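For context on the 0xbabebabe observation above: glusterfs's intrusive list helpers follow the Linux-kernel convention of poisoning a node's pointers when it is unlinked. The sketch below is a minimal, self-contained illustration of that pattern (the poison constants follow the style of glusterfs's list.h; the demo main() and its node names are hypothetical), showing why a poisoned address surfacing during traversal means the node was already removed with list_del:

```c
#include <stdio.h>

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

/* Unlink a node and poison its pointers, in the style of
 * glusterfs's list_del(): any later traversal that follows a
 * stale reference to the node dereferences the poison values
 * and crashes, which is the signature seen in this core. */
static inline void
list_del (struct list_head *old)
{
        old->prev->next = old->next;
        old->next->prev = old->prev;

        old->next = (void *)0xbabebabe; /* poison: node was deleted */
        old->prev = (void *)0xcafecafe;
}

int
main (void)
{
        /* Hypothetical three-node ring: head <-> a <-> b. */
        struct list_head head, a, b;
        head.next = &a; a.prev = &head;
        a.next = &b;    b.prev = &a;
        b.next = &head; head.prev = &b;

        list_del (&a);

        /* A racing reader still holding &a now sees the poison. */
        printf ("a.next after list_del: %p\n", (void *)a.next);
        return 0;
}
```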
Rahul Hinduja (comment #5): Observed the glusterd crash with bt:

```
#0  0x00007f45b5ba832d in __gf_free (free_ptr=0x7f459000a4b0) at mem-pool.c:313
#1  0x00007f45aa6393d0 in glusterd_friend_sm () at glusterd-sm.c:1250
#2  0x00007f45aa63269c in __glusterd_handle_incoming_unfriend_req (req=req@entry=0x7f45b5e9706c)
    at glusterd-handler.c:2597
#3  0x00007f45aa62cc00 in glusterd_big_locked_handler (req=0x7f45b5e9706c,
    actor_fn=0x7f45aa6324d0 <__glusterd_handle_incoming_unfriend_req>) at glusterd-handler.c:83
#4  0x00007f45b593d549 in rpcsvc_handle_rpc_call (svc=0x7f45b6544040,
    trans=trans@entry=0x7f4590000920, msg=msg@entry=0x7f4590010960) at rpcsvc.c:703
#5  0x00007f45b593d7ab in rpcsvc_notify (trans=0x7f4590000920, mydata=<optimized out>,
    event=<optimized out>, data=0x7f4590010960) at rpcsvc.c:797
#6  0x00007f45b593f873 in rpc_transport_notify (this=this@entry=0x7f4590000920,
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f4590010960)
    at rpc-transport.c:543
#7  0x00007f45a83b5bb6 in socket_event_poll_in (this=this@entry=0x7f4590000920) at socket.c:2290
#8  0x00007f45a83b8aa4 in socket_event_handler (fd=fd@entry=7, idx=idx@entry=2,
    data=0x7f4590000920, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#9  0x00007f45b5bd66aa in event_dispatch_epoll_handler (event=0x7f45a61aae80,
    event_pool=0x7f45b6521c10) at event-epoll.c:575
#10 event_dispatch_epoll_worker (data=0x7f45b6544820) at event-epoll.c:678
#11 0x00007f45b49dddf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f45b43241ad in clone () from /lib64/libc.so.6
(gdb)
```

Updating this bug with the core after the discussion with the assignee.

Bug https://bugzilla.redhat.com/show_bug.cgi?id=1262236 has been raised to deliver a workaround for this bug: BZ 1262236 is the workaround for BZ 1259992, and work on the RCA of BZ 1259992 will continue.

(In reply to Rahul Hinduja from comment #5)
> Observed the glusterd crash with bt: [backtrace quoted above]
> Updating this bug with core after the discussion with assignee

Rahul, could you provide me info on how I can access the core file?
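A note on the second crash signature above: dying inside __gf_free() itself is consistent with freeing a pointer whose allocation header was already trampled or already freed. The sketch below is a hypothetical illustration of the general header-magic pattern such checked allocators use (the constant and the names here are illustrative, not glusterfs's actual mem-pool code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define HDR_MAGIC 0xCAFEBABEu   /* illustrative magic word */

/* Each allocation is prefixed with a small header; the free
 * wrapper validates the magic before releasing the block, so a
 * double free or a stray write over the header blows up inside
 * the wrapper, matching the __gf_free frame above. */
struct alloc_hdr {
        unsigned int magic;
        size_t       size;
};

static void *
xmalloc (size_t size)
{
        struct alloc_hdr *hdr = malloc (sizeof (*hdr) + size);
        if (!hdr)
                return NULL;
        hdr->magic = HDR_MAGIC;
        hdr->size  = size;
        return hdr + 1; /* hand out the memory after the header */
}

static void
xfree (void *ptr)
{
        struct alloc_hdr *hdr;

        if (!ptr)
                return;
        hdr = (struct alloc_hdr *)ptr - 1;
        assert (hdr->magic == HDR_MAGIC); /* corruption/double free trips here */
        hdr->magic = 0;                   /* poison to catch a second free */
        free (hdr);
}

int
main (void)
{
        char *p = xmalloc (16);
        strcpy (p, "ok");
        puts (p);
        xfree (p);
        return 0;
}
```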
Gaurav, the core is attached with this mail. Please find it in the attachment.

This crash was observed when ping timeout was enabled for GlusterD-to-GlusterD communication. We don't have any future plan to enable this option again, hence closing this bug.
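For reference, the option being discussed is glusterd's management-plane ping timeout. A sketch of how it appears in /etc/glusterfs/glusterd.vol, assuming the stock volfile layout (a value of 0 keeps it disabled, which is the default being retained per the closing comment):

```
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option ping-timeout 0
end-volume
```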