Bug 1131463
| Summary: | glusterd crash while starting volume | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | senaik |
| Component: | glusterd | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED WONTFIX | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.0 | CC: | amukherj, nlevinki, nsathyan, rhinduja, sasundar, vagarwal, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-12-30 10:10:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Since the sosreport doesn't capture the core dump from /root, would you please attach the core dump?

RCA
---
As per the logs, there was a race between volume start and peer handshake. Volume start gives up the big lock while starting a brick; if a peer handshake comes in at that point, it acquires the big lock and changes the volinfo, so the next brick start fails because the volinfo it is iterating has been corrupted.

Considering that the volinfo list is yet to be protected using URCU, deferring this to the next release.

Since this is a rare race and the same type of problem can be avoided using the central store in GlusterD 2.0, closing this bug.
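To make the interleaving described in the RCA concrete, below is a minimal, self-contained C sketch. It is not glusterd source; every name in it (big_lock, volinfo, brickinfo, peer_handshake, start_bricks) is a hypothetical stand-in for the real structures. The pattern it shows is the one the RCA describes: the volume-start path drops the big lock between brick starts, the peer-handshake path takes the lock and rebuilds the brick list, and the start path resumes holding pointers into memory that has been freed.

```c
/*
 * Minimal sketch of the race described above -- NOT glusterd source.
 * All names (big_lock, volinfo, brickinfo, peer_handshake, start_bricks)
 * are hypothetical stand-ins for the real glusterd structures.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct brickinfo {
    char path[64];
    struct brickinfo *next;
};

struct volinfo {
    struct brickinfo *bricks;            /* singly linked brick list */
};

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static struct volinfo vol;

static struct brickinfo *new_brick(const char *path)
{
    struct brickinfo *b = calloc(1, sizeof(*b));
    snprintf(b->path, sizeof(b->path), "%s", path);
    return b;
}

/* Peer-handshake path: rebuilds the whole brick list under the big lock. */
static void *peer_handshake(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    struct brickinfo *old = vol.bricks;
    vol.bricks = new_brick("/rebuilt/brick1");   /* import fresh volinfo */
    while (old) {                                /* free the old list */
        struct brickinfo *next = old->next;
        free(old);
        old = next;
    }
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Volume-start path: walks the brick list, dropping the lock per brick. */
static void *start_bricks(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    struct brickinfo *head = vol.bricks;
    struct brickinfo *b = head;
    while (b) {
        printf("starting brick %s\n", b->path);
        struct brickinfo *saved_next = b->next;  /* remembered before unlock */

        /* The start path gives up the big lock while a brick is started... */
        pthread_mutex_unlock(&big_lock);
        usleep(100 * 1000);          /* window in which the handshake runs */
        pthread_mutex_lock(&big_lock);

        /* ...and resumes with pointers into a list that may have been
         * rebuilt and freed in the meantime. */
        if (vol.bricks != head) {
            printf("brick list was rebuilt while the lock was dropped; "
                   "the saved next pointer (%p) may now reference freed "
                   "memory\n", (void *)saved_next);
            break;   /* the sketch stops here; the real code kept going */
        }
        b = saved_next;
    }
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

int main(void)
{
    vol.bricks = new_brick("/original/brick1");
    vol.bricks->next = new_brick("/original/brick2");

    pthread_t starter, handshaker;
    pthread_create(&starter, NULL, start_bricks, NULL);
    usleep(20 * 1000);               /* let the start path drop the lock */
    pthread_create(&handshaker, NULL, peer_handshake, NULL);
    pthread_join(starter, NULL);
    pthread_join(handshaker, NULL);
    return 0;
}
```

Compiled with `cc -pthread` (file name of your choosing), most runs print the "rebuilt while the lock was dropped" message: the sketch detects the replacement and stops instead of dereferencing the stale pointer, whereas in the real crash the stale brickinfo is dereferenced, which is how a bogus address ends up inside uuid_is_null() in the backtrace below.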
Description of problem:
======================
glusterd crashed on one of the nodes when starting a stopped volume while glusterd was being brought back up on one of the nodes where it was down.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.6.0.27

How reproducible:
================
1/1

Steps to Reproduce:
====================
1. Create a 2x2 distributed-replicate volume and start it.
2. Fuse and NFS mount the volume and create some files and directories. Remove some files and directories.
3. While I/O is going on, create some snapshots and delete some snapshots.
4. Stop I/O, stop the volume, and try some restore operations.
5. Bring glusterd down on node2 (snapshot14.lab.eng.blr.redhat.com).
6. Stop the volume from node1 (snapshot13.lab.eng.blr.redhat.com).
7. Bring glusterd back up on node2; while it is still coming back, immediately try to start the volume from node1:

gluster v start vol0
volume start: vol0: failed: Staging failed on snapshot14.lab.eng.blr.redhat.com. Error: Volume vol0 already started

But the volume status shows 'Stopped':

gluster v i vol0

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 9e14daed-397d-40c9-ba26-306b1ff7098d
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick1/b1
Brick2: snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick2/b1
Brick3: snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick3/b1
Brick4: snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3f3f29e7b60a4ee9bf9b5bdc2ac34217/brick4/b1
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

8. Started the volume again; it was successful.
9. Retried steps 5, 6 and 7 a few times, which resulted in a glusterd crash on node2:

gluster v start vol0
volume start: vol0: failed: Commit failed on 00000000-0000-0000-0000-000000000000. Please check log file for details

bt:
===
(gdb) bt
#0  uuid_is_null (uu=0x10 <Address 0x10 out of bounds>) at ../../contrib/uuid/isnull.c:44
#1  0x00007f9b542ea867 in glusterd_brick_start (volinfo=0x23d07a0, brickinfo=0xffffffffffffc300, wait=_gf_true) at glusterd-utils.c:6840
#2  0x00007f9b54334761 in glusterd_start_volume (volinfo=0x23d07a0, flags=<value optimized out>, wait=_gf_true) at glusterd-volume-ops.c:1898
#3  0x00007f9b54336eb3 in glusterd_op_start_volume (dict=0x7f9b5ba2c4cc, op_errstr=<value optimized out>) at glusterd-volume-ops.c:1988
#4  0x00007f9b542cb14b in glusterd_op_commit_perform (op=GD_OP_START_VOLUME, dict=0x7f9b5ba2c4cc, op_errstr=0x277d1b8, rsp_dict=0x7f9b5ba2bfe0) at glusterd-op-sm.c:4831
#5  0x00007f9b542cbf4c in glusterd_op_ac_commit_op (event=0x7f9b40000fd0, ctx=0x7f9b40001320) at glusterd-op-sm.c:4611
#6  0x00007f9b542c84e5 in glusterd_op_sm () at glusterd-op-sm.c:6522
#7  0x00007f9b542ab293 in __glusterd_handle_commit_op (req=0x7f9b5403db58) at glusterd-handler.c:1038
#8  0x00007f9b542a846f in glusterd_big_locked_handler (req=0x7f9b5403db58, actor_fn=0x7f9b542ab170 <__glusterd_handle_commit_op>) at glusterd-handler.c:80
#9  0x000000317fe5b9d2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:333
#10 0x0000003ae4a43bf0 in ?? () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
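As a side note on reading the trace: the tiny C sketch below (the struct layout is hypothetical, not the real glusterd_brickinfo_t, and the exact argument values gdb shows for optimized frames can be unreliable) illustrates why a corrupted brickinfo pointer can surface as a small out-of-bounds address like uu=0x10. Assuming the uuid being checked is a member of the corrupted brickinfo, the address passed to uuid_is_null() is just the base pointer plus the member's offset, so garbage in the base yields a garbage (often tiny or wrapped) argument.

```c
/* Illustrative only: a hypothetical layout, not the real glusterd_brickinfo_t.
 * Shows how a member address is computed as base + offsetof(member), so a
 * corrupted or near-NULL base pointer surfaces in gdb as a tiny out-of-bounds
 * address -- the same shape as the uu=0x10 reported in frame #0 above. */
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct brickinfo {
    char    path[16];
    uint8_t uuid[16];               /* member placed at offset 16 (0x10) here */
};

int main(void)
{
    uintptr_t bogus_base = 0;       /* stand-in for a corrupted struct pointer */
    uintptr_t uu = bogus_base + offsetof(struct brickinfo, uuid);

    printf("offsetof(uuid) = 0x%zx, computed member address = 0x%" PRIxPTR "\n",
           offsetof(struct brickinfo, uuid), uu);
    return 0;
}
```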
Actual results:
==============
glusterd crashed while trying to start the volume.

Expected results:
================
There should be no glusterd crash.

Additional info: