Bug 1131463

Summary: glusterd crash while starting volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: senaik
Component: glusterd
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.0
CC: amukherj, nlevinki, nsathyan, rhinduja, sasundar, vagarwal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-30 10:10:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description senaik 2014-08-19 10:55:03 UTC
Description of problem:
======================
glusterd crashed on one of the nodes when a stopped volume was started while glusterd was being brought back up on another node where it had been down.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.27

How reproducible:
================
1/1


Steps to Reproduce:
====================
1. Create a 2x2 distributed-replicate volume and start it.

2. FUSE and NFS mount the volume and create some files and directories. Remove some files and directories.

3. While I/O is going on, create some snapshots and delete some snapshots.

4. Stop I/O, stop the volume, and try some restore operations.

5. Bring glusterd down on node2 (snapshot14.lab.eng.blr.redhat.com).

6. Stop the volume from node1 (snapshot13.lab.eng.blr.redhat.com).

7. Bring glusterd back up on node2; while it is still coming back, immediately try to start the volume from node1:

gluster v start vol0
volume start: vol0: failed: Staging failed on snapshot14.lab.eng.blr.redhat.com. Error: Volume vol0 already started

However, the volume status still shows 'Stopped':


gluster v i vol0
 
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 9e14daed-397d-40c9-ba26-306b1ff7098d
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick1/b1
Brick2: snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick2/b1
Brick3: snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick3/b1
Brick4: snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick4/b1
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

8. Started the volume again; it was successful.

9. Retried steps 5, 6, and 7 a few times, which resulted in a glusterd crash on node2:

gluster v start vol0
volume start: vol0: failed: Commit failed on 00000000-0000-0000-0000-000000000000. Please check log file for details

bt:
===

(gdb) bt
#0  uuid_is_null (uu=0x10 <Address 0x10 out of bounds>) at ../../contrib/uuid/isnull.c:44
#1  0x00007f9b542ea867 in glusterd_brick_start (volinfo=0x23d07a0, brickinfo=0xffffffffffffc300, wait=_gf_true)
    at glusterd-utils.c:6840
#2  0x00007f9b54334761 in glusterd_start_volume (volinfo=0x23d07a0, flags=<value optimized out>, wait=_gf_true)
    at glusterd-volume-ops.c:1898
#3  0x00007f9b54336eb3 in glusterd_op_start_volume (dict=0x7f9b5ba2c4cc, op_errstr=<value optimized out>)
    at glusterd-volume-ops.c:1988
#4  0x00007f9b542cb14b in glusterd_op_commit_perform (op=GD_OP_START_VOLUME, dict=0x7f9b5ba2c4cc, 
    op_errstr=0x277d1b8, rsp_dict=0x7f9b5ba2bfe0) at glusterd-op-sm.c:4831
#5  0x00007f9b542cbf4c in glusterd_op_ac_commit_op (event=0x7f9b40000fd0, ctx=0x7f9b40001320)
    at glusterd-op-sm.c:4611
#6  0x00007f9b542c84e5 in glusterd_op_sm () at glusterd-op-sm.c:6522
#7  0x00007f9b542ab293 in __glusterd_handle_commit_op (req=0x7f9b5403db58) at glusterd-handler.c:1038
#8  0x00007f9b542a846f in glusterd_big_locked_handler (req=0x7f9b5403db58, 
    actor_fn=0x7f9b542ab170 <__glusterd_handle_commit_op>) at glusterd-handler.c:80
#9  0x000000317fe5b9d2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:333
#10 0x0000003ae4a43bf0 in ?? () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
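
The first two frames show glusterd_brick_start() being called with a stale brickinfo pointer (0xffffffffffffc300), so uuid_is_null() ends up reading from an invalid address (0x10). A minimal sketch of the failing pattern, assuming a simplified brickinfo layout (illustration only, not the actual glusterd source; the function names follow the backtrace):

/* Hypothetical reduction of the crash in the backtrace above. The
 * brickinfo pointer comes from a volinfo that another thread has
 * rebuilt, so it no longer points at a valid object. */
#include <string.h>

typedef unsigned char uuid_t[16];

struct brickinfo {
        uuid_t uuid;
        char   path[4096];
};

static int
uuid_is_null (const uuid_t uu)
{
        /* frame #0: reads 16 bytes at 'uu'; faults with
         * "Address 0x10 out of bounds" when 'uu' is garbage */
        static const uuid_t null_uuid;
        return memcmp (uu, null_uuid, sizeof (uuid_t)) == 0;
}

static int
glusterd_brick_start (struct brickinfo *brickinfo)
{
        /* frame #1: nothing here can tell that 'brickinfo' is stale,
         * so the uuid access crashes instead of failing cleanly */
        if (uuid_is_null (brickinfo->uuid))
                return -1;
        return 0;               /* would spawn the brick process here */
}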

Actual results:
==============
glusterd crashed while trying to start the volume.

Expected results:
================
glusterd should not crash.


Additional info:

Comment 3 Atin Mukherjee 2014-08-21 11:35:46 UTC
Since the sosreport doesn't capture the core dump from /root, would you please attach the core dump?

Comment 6 Atin Mukherjee 2014-08-26 06:14:25 UTC
RCA
---

As per the logs, there was a race between volume start and peer handshake. Volume start gives up the big lock while starting a brick; if a peer handshake comes in at that point, it acquires the big lock and changes the volinfo, so the next brick start fails because the volinfo got corrupted.
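
A minimal sketch of that interleaving, reduced to a plain mutex and two threads (the names big_lock, start_volume, and peer_handshake_import are illustrative; glusterd actually uses its big lock with synctasks):

/* Hypothetical illustration of the race described above. */
#include <pthread.h>

struct brickinfo { char path[4096]; };
struct volinfo   { struct brickinfo **bricks; int brick_count; };

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-ins for the real work */
static void spawn_brick_process (struct brickinfo *b) { (void) b; }
static void rebuild_volinfo_from_peer (struct volinfo *v) { (void) v; /* frees and rebuilds bricks[] */ }

/* Thread A: commit phase of 'gluster volume start' */
void start_volume (struct volinfo *volinfo)
{
        pthread_mutex_lock (&big_lock);
        for (int i = 0; i < volinfo->brick_count; i++) {
                struct brickinfo *brick = volinfo->bricks[i];

                pthread_mutex_unlock (&big_lock);  /* gives up the big lock ...  */
                spawn_brick_process (brick);       /* ... while the brick starts */
                pthread_mutex_lock (&big_lock);

                /* If thread B ran in that window, volinfo->bricks now points
                 * at freed or rebuilt memory and the next brick start reads
                 * garbage, which matches the crash in the backtrace above. */
        }
        pthread_mutex_unlock (&big_lock);
}

/* Thread B: peer handshake after glusterd comes back up on node2 */
void peer_handshake_import (struct volinfo *volinfo)
{
        pthread_mutex_lock (&big_lock);
        rebuild_volinfo_from_peer (volinfo);
        pthread_mutex_unlock (&big_lock);
}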

Comment 8 Atin Mukherjee 2015-06-22 04:09:37 UTC
Considering that the volinfo list is yet to be protected using URCU, deferring this to the next release.
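
For context, a rough sketch of what URCU protection of the volinfo list could look like, using liburcu's rcu list helpers (this is an assumption about a possible fix, not glusterd code; the volinfo fields and function names shown are made up):

#include <urcu.h>            /* liburcu: rcu_read_lock/unlock, synchronize_rcu */
#include <urcu/list.h>
#include <urcu/rculist.h>    /* cds_list_*_rcu helpers */
#include <stdlib.h>

struct volinfo {
        char                 volname[256];
        struct cds_list_head vol_list;   /* linked into the global volume list */
};

static CDS_LIST_HEAD (volumes);          /* global list of volumes */

/* Reader, e.g. a volume/brick start path: can walk the list without
 * holding the big lock; entries it sees stay valid until it leaves the
 * read-side critical section. (Each thread must have called
 * rcu_register_thread() beforehand.) */
void for_each_volume (void (*fn) (struct volinfo *))
{
        struct volinfo *vol;

        rcu_read_lock ();
        cds_list_for_each_entry_rcu (vol, &volumes, vol_list)
                fn (vol);
        rcu_read_unlock ();
}

/* Writer, e.g. the peer handshake import: unlink the old entry, wait
 * for in-flight readers, then free it instead of mutating it in place. */
void remove_volume (struct volinfo *vol)
{
        cds_list_del_rcu (&vol->vol_list);
        synchronize_rcu ();
        free (vol);
}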

Comment 10 Atin Mukherjee 2015-12-30 10:10:26 UTC
Since this is a rare race and the same type of problem can be avoided by using a central store in GlusterD 2.0, closing this bug.