Bug 1088355 - [SNAPSHOT] : glusterd crash on 2 nodes while snapshot was in progress when IO was in progress on the client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: snapshot
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Vijaikumar Mallikarjuna
QA Contact: senaik
URL:
Whiteboard: SNAPSHOT
Depends On: 1096729 1104459 1104462
Blocks: 1091926
 
Reported: 2014-04-16 13:30 UTC by senaik
Modified: 2016-09-17 12:58 UTC
CC: 8 users

Fixed In Version: glusterfs-3.6.0-3.0.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1091926
Environment:
Last Closed: 2014-09-22 19:35:45 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description senaik 2014-04-16 13:30:17 UTC
Description of problem:
======================
glusterd crash on 2 nodes while snapshot was in progress when file creation was in progress on the client

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.5qa2

How reproducible:


Steps to Reproduce:
==================
1. Create a dist-rep volume and start it

2. Fuse and NFS mount the volume and create some files:
for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=10M count=1; done
for i in {1..100}; do dd if=/dev/urandom of=nfs"$i" bs=10M count=1; done
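Step 2 assumes the volume is mounted over both transports; a minimal sketch of the two mounts, where the server address and mount points are assumptions and should be adjusted to the test cluster:

```shell
# Sketch only: 10.70.44.56 and the /mnt/* paths are assumed values.
mkdir -p /mnt/fuse /mnt/nfs
mount -t glusterfs 10.70.44.56:/vol1 /mnt/fuse
mount -t nfs -o vers=3 10.70.44.56:/vol1 /mnt/nfs
```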

3. While file creation is in progress, create multiple snapshots:
for i in {1..100} ; do gluster snapshot create snap_vol1_$i vol1 ; done
snapshot create: snap_vol1_1: snap created successfully
snapshot create: snap_vol1_2: snap created successfully
snapshot create: snap_vol1_3: snap created successfully
.
.
snapshot create: snap_vol1_69: snap created successfully
snapshot create: failed: Commit failed on 10.70.44.56. Please check log file for details.
Snapshot command failed
.
.
snapshot create: snap_vol1_92: snap created successfully
snapshot create: failed: Commit failed on 10.70.44.57. Please check log file for details.
Snapshot command failed

While snapshot creation was in progress, another volume was created and snapshots were taken on it as well.


bt :
===
(gdb) bt
#0  0x0000003bd380f867 in ?? () from /lib64/libgcc_s.so.1
#1  0x0000003bd3810119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x0000003bcf0febf6 in backtrace () from /lib64/libc.so.6
#3  0x0000003bd041e956 in _gf_msg_backtrace_nomem (level=<value optimized out>, stacksize=200) at logging.c:971
#4  0x0000003bd0437410 in gf_print_trace (signum=11, ctx=0x1aed010) at common-utils.c:530
#5  <signal handler called>
#6  0x00000000000001c1 in ?? ()
#7  0x0000003bd0c08196 in rpcsvc_transport_submit (trans=<value optimized out>, rpchdr=<value optimized out>, 
    rpchdrcount=<value optimized out>, proghdr=<value optimized out>, proghdrcount=<value optimized out>, 
    progpayload=<value optimized out>, progpayloadcount=0, iobref=0x7faca041b350, priv=0x0) at rpcsvc.c:1006
#8  0x0000003bd0c08b18 in rpcsvc_submit_generic (req=0x7facb5d9902c, proghdr=0x2138bd0, 
    hdrcount=<value optimized out>, payload=0x0, payloadcount=0, iobref=0x7faca041b350) at rpcsvc.c:1190
#9  0x0000003bd0c08f46 in rpcsvc_error_reply (req=0x7facb5d9902c) at rpcsvc.c:1238
#10 0x0000003bd0c08fbb in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x7facb5d9902c)
    at rpcsvc.c:492
#11 0x0000003bd0457c3a in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#12 0x0000003bcf043bf0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

Actual results:
==============
glusterd crash while snapshot was in progress 

Expected results:
=================
There should be no glusterd crash 


Additional info:

Comment 4 Nagaprasad Sathyanarayana 2014-04-21 06:17:33 UTC
Marking snapshot BZs to RHS 3.0.

Comment 5 Vijaikumar Mallikarjuna 2014-04-21 09:29:52 UTC
Looking at the stack trace from the core file, it appears the stack is corrupted:
#12 0x0000003bcf043bf0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

I will try to re-create this problem with valgrind and see if I can find something.

Comment 6 Rahul Hinduja 2014-04-22 07:15:05 UTC
With similar steps, hit another glusterd crash:

(gdb) bt
#0  0x000000308c7904c8 in main_arena () from /lib64/libc.so.6
#1  0x000000308e008196 in rpcsvc_transport_submit (trans=<value optimized out>, rpchdr=<value optimized out>, rpchdrcount=<value optimized out>, 
    proghdr=<value optimized out>, proghdrcount=<value optimized out>, progpayload=<value optimized out>, progpayloadcount=0, iobref=0x7fb63406f9f0, priv=0x0)
    at rpcsvc.c:1006
#2  0x000000308e008b18 in rpcsvc_submit_generic (req=0x7fb639d9c644, proghdr=0x1a78170, hdrcount=<value optimized out>, payload=0x0, payloadcount=0, 
    iobref=0x7fb63406f9f0) at rpcsvc.c:1190
#3  0x000000308e008f46 in rpcsvc_error_reply (req=0x7fb639d9c644) at rpcsvc.c:1238
#4  0x000000308e008fbb in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x7fb639d9c644) at rpcsvc.c:492
#5  0x000000308d857c3a in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#6  0x000000308c443bf0 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()

Steps were:
===========

1. Create and start 4 volumes
2. Create snapshots in a loop on all 4 volumes simultaneously
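The second step can be sketched as a dry run that prints the command each backgrounded loop would issue (the volume names and loop count are assumptions; drop the `echo` to run it against a real cluster):

```shell
# Dry-run sketch: vol1..vol4 and the loop count are assumed values.
# Each subshell runs in the background so the four loops overlap.
for v in vol1 vol2 vol3 vol4; do
  (
    for i in {1..3}; do
      echo gluster snapshot create "snap_${v}_${i}" "$v"
    done
  ) &
done
wait
```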

Comment 7 Vijaikumar Mallikarjuna 2014-04-22 12:42:12 UTC
While creating a snapshot, we release the big-lock during the mount operation. This can lead to a deadlock-like scenario or data-structure corruption.

This is addressed by the patch: http://review.gluster.org/#/c/7461/
We need to run the test with this patch and see whether it solves the issue.

Comment 8 Vijaikumar Mallikarjuna 2014-04-24 09:20:11 UTC
Patch http://review.gluster.org/#/c/7461/ is pending review.

Comment 9 Nagaprasad Sathyanarayana 2014-04-25 04:58:15 UTC
Moving back to Assigned state. The downstream BZ can be moved to POST once the patch is merged upstream.

Comment 10 Vijaikumar Mallikarjuna 2014-04-28 10:47:01 UTC
Patch #7461 has multiple fixes.
Posted a separate patch to address this issue: http://review.gluster.org/#/c/7579/

Comment 11 senaik 2014-05-12 10:56:49 UTC
Marking this bug as a dependent of bz 1096729, as snapshots on multiple volumes with IO are failing.

Comment 12 Nagaprasad Sathyanarayana 2014-05-19 10:56:34 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 13 rjoseph 2014-06-02 13:01:36 UTC
Removed upstream bugs as dependents, and also removed bugs which have no relation to this bug.

Comment 14 senaik 2014-06-05 09:13:07 UTC
Version : glusterfs-3.6.0.12-1.el6rhs.x86_64
=======

Retried the steps mentioned in "Steps to Reproduce"; did not hit the issue again. (Ping timeout was set to 0, which is the workaround mentioned for bz 1096729.)
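For reference, the ping-timeout workaround mentioned above is applied per volume via the standard volume option; the volume name here is an assumption:

```shell
# Workaround for bz 1096729 noted above; vol1 is an assumed volume name.
gluster volume set vol1 network.ping-timeout 0
```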
 
Marking bug as 'verified'

Comment 16 errata-xmlrpc 2014-09-22 19:35:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

