Description of problem:
======================
glusterd crashed on 2 nodes while snapshot creation was in progress and file creation was in progress on the client.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.5qa2

How reproducible:

Steps to Reproduce:
==================
1. Create a dist-rep volume and start it.
2. FUSE- and NFS-mount the volume and create some files:
   for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=10M count=1; done
   for i in {1..100}; do dd if=/dev/urandom of=nfs"$i" bs=10M count=1; done
3. While file creation is in progress, create multiple snapshots:
   for i in {1..100} ; do gluster snapshot create snap_vol1_$i vol1 ; done

snapshot create: snap_vol1_1: snap created successfully
snapshot create: snap_vol1_2: snap created successfully
snapshot create: snap_vol1_3: snap created successfully
.
.
snapshot create: snap_vol1_69: snap created successfully
snapshot create: failed: Commit failed on 10.70.44.56. Please check log file for details.
Snapshot command failed
.
.
snapshot create: snap_vol1_92: snap created successfully
snapshot create: failed: Commit failed on 10.70.44.57. Please check log file for details.
Snapshot command failed

While snapshot creation was in progress, created another volume and took snapshots of it.

bt:
===
(gdb) bt
#0  0x0000003bd380f867 in ?? () from /lib64/libgcc_s.so.1
#1  0x0000003bd3810119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x0000003bcf0febf6 in backtrace () from /lib64/libc.so.6
#3  0x0000003bd041e956 in _gf_msg_backtrace_nomem (level=<value optimized out>, stacksize=200) at logging.c:971
#4  0x0000003bd0437410 in gf_print_trace (signum=11, ctx=0x1aed010) at common-utils.c:530
#5  <signal handler called>
#6  0x00000000000001c1 in ?? ()
#7  0x0000003bd0c08196 in rpcsvc_transport_submit (trans=<value optimized out>, rpchdr=<value optimized out>, rpchdrcount=<value optimized out>, proghdr=<value optimized out>, proghdrcount=<value optimized out>, progpayload=<value optimized out>, progpayloadcount=0, iobref=0x7faca041b350, priv=0x0) at rpcsvc.c:1006
#8  0x0000003bd0c08b18 in rpcsvc_submit_generic (req=0x7facb5d9902c, proghdr=0x2138bd0, hdrcount=<value optimized out>, payload=0x0, payloadcount=0, iobref=0x7faca041b350) at rpcsvc.c:1190
#9  0x0000003bd0c08f46 in rpcsvc_error_reply (req=0x7facb5d9902c) at rpcsvc.c:1238
#10 0x0000003bd0c08fbb in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x7facb5d9902c) at rpcsvc.c:492
#11 0x0000003bd0457c3a in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#12 0x0000003bcf043bf0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

Actual results:
==============
glusterd crashed while snapshot creation was in progress.

Expected results:
=================
glusterd should not crash.

Additional info:
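The reproduction flow in steps 1–3 can be collected into one script. The node names, brick paths, and mount points below are assumptions (the report only shows the loops), and the `run` wrapper prints each command instead of executing it, so the sequence can be reviewed without a gluster cluster; swap it for real execution to reproduce.

```shell
#!/bin/bash
# Dry-run sketch of the reproduction steps. Node names (node1/node2),
# brick paths, and mount points are assumptions, not from the report.
VOL=vol1
run() { echo "$*"; }   # replace with: run() { "$@"; } on a real cluster

# 1. Create a 2x2 dist-rep volume and start it.
run gluster volume create $VOL replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 node1:/bricks/b2 node2:/bricks/b2
run gluster volume start $VOL

# 2. FUSE- and NFS-mount the volume and create files on both mounts.
run mount -t glusterfs node1:/$VOL /mnt/fuse
run mount -t nfs -o vers=3 node1:/$VOL /mnt/nfs
for i in {1..100}; do run dd if=/dev/urandom of=/mnt/fuse/fuse$i bs=10M count=1; done
for i in {1..100}; do run dd if=/dev/urandom of=/mnt/nfs/nfs$i bs=10M count=1; done

# 3. While file creation is in progress, create snapshots in a loop.
for i in {1..100}; do run gluster snapshot create snap_${VOL}_$i $VOL; done
```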
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/snapshots/1088355/
Marking snapshot BZs to RHS 3.0.
Looking at the stack trace from the core file, it looks like the stack is corrupted:

#12 0x0000003bcf043bf0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

I will try to re-create this problem under valgrind and see if I can find something.
With similar steps, hit another glusterd crash:

(gdb) bt
#0  0x000000308c7904c8 in main_arena () from /lib64/libc.so.6
#1  0x000000308e008196 in rpcsvc_transport_submit (trans=<value optimized out>, rpchdr=<value optimized out>, rpchdrcount=<value optimized out>, proghdr=<value optimized out>, proghdrcount=<value optimized out>, progpayload=<value optimized out>, progpayloadcount=0, iobref=0x7fb63406f9f0, priv=0x0) at rpcsvc.c:1006
#2  0x000000308e008b18 in rpcsvc_submit_generic (req=0x7fb639d9c644, proghdr=0x1a78170, hdrcount=<value optimized out>, payload=0x0, payloadcount=0, iobref=0x7fb63406f9f0) at rpcsvc.c:1190
#3  0x000000308e008f46 in rpcsvc_error_reply (req=0x7fb639d9c644) at rpcsvc.c:1238
#4  0x000000308e008fbb in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x7fb639d9c644) at rpcsvc.c:492
#5  0x000000308d857c3a in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#6  0x000000308c443bf0 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()

Steps were:
===========
1. Create and start 4 volumes
2. Create snapshots in a loop for all 4 volumes simultaneously
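The second reproducer (4 volumes, snapshots taken simultaneously) can be sketched the same way. Volume and node names are assumptions, and the `run` wrapper prints the commands rather than executing them so the flow is reviewable without a cluster.

```shell
#!/bin/bash
# Dry-run sketch of the second reproducer. Volume and node names are
# assumptions; swap run() for real execution on a cluster.
run() { echo "$*"; }

# 1. Create and start 4 volumes.
for v in vol{1..4}; do
    run gluster volume create $v replica 2 node1:/bricks/$v node2:/bricks/$v
    run gluster volume start $v
done

# 2. Create snapshots in a loop for all 4 volumes simultaneously
#    (one background loop per volume, so the snapshot requests overlap).
for v in vol{1..4}; do
    ( for i in {1..100}; do run gluster snapshot create snap_${v}_$i $v; done ) &
done
wait
```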
While creating a snapshot we release the big-lock during the mount operation. This can lead to a deadlock-like scenario or to data-structure corruption. This is addressed in the patch http://review.gluster.org/#/c/7461/. We need to run the test with this patch and see if it solves the issue.
Patch http://review.gluster.org/#/c/7461/ is pending for review
Moving back to the Assigned state. The downstream BZ can be moved to POST once the patch is merged upstream.
Patch #7461 has multiple fixes. Posted a separate patch to address this issue: http://review.gluster.org/#/c/7579/
Marking this bug as dependent on bz 1096729, as snapshot creation on multiple volumes with IO in progress is failing.
Setting flags required to add BZs to RHS 3.0 Errata
Removed upstream bugs as dependent bugs, and also removed bugs that have no relation to this bug.
Version: glusterfs-3.6.0.12-1.el6rhs.x86_64
=======
Retried the steps as mentioned in "Steps to Reproduce"; did not hit the issue again. (Ping timeout was set to 0, which is the workaround mentioned for bz 1096729.)

Marking the bug as 'verified'.
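For reference, the verification pass can be sketched as below. The option name network.ping-timeout is the standard gluster volume option used for the ping-timeout workaround; the exact invocation and volume name are assumptions, and commands are printed as a dry run.

```shell
#!/bin/bash
# Dry-run sketch of the verification pass on glusterfs-3.6.0.12-1.
# The workaround for bz 1096729 (ping timeout = 0) is applied first;
# volume name and the run() wrapper are assumptions.
VOL=vol1
run() { echo "$*"; }

# Apply the workaround, then rerun the snapshot-under-IO loop.
run gluster volume set $VOL network.ping-timeout 0
for i in {1..100}; do run gluster snapshot create snap_${VOL}_$i $VOL; done

# Confirm glusterd is still alive on each node after the run.
run pgrep -x glusterd
```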
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html