Description of problem:
======================
glusterd crashed on 2 nodes while snapshot creation was in progress and file creation was in progress on the client.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.5qa2

How reproducible:

Steps to Reproduce:
==================
1. Create a dist-rep volume and start it.
2. FUSE- and NFS-mount the volume and create some files:
   for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=10M count=1; done
   for i in {1..100}; do dd if=/dev/urandom of=nfs"$i" bs=10M count=1; done
3. While file creation is in progress, create multiple snapshots:
   for i in {1..100} ; do gluster snapshot create snap_vol1_$i vol1 ; done

snapshot create: snap_vol1_1: snap created successfully
snapshot create: snap_vol1_2: snap created successfully
snapshot create: snap_vol1_3: snap created successfully
.
.
snapshot create: snap_vol1_69: snap created successfully
snapshot create: failed: Commit failed on 10.70.44.56. Please check log file for details.
Snapshot command failed
.
.
snapshot create: snap_vol1_92: snap created successfully
snapshot create: failed: Commit failed on 10.70.44.57. Please check log file for details.
Snapshot command failed

While snapshot creation was in progress, created another volume and took snapshots of it.

bt:
===
(gdb) bt
#0  0x0000003bd380f867 in ?? () from /lib64/libgcc_s.so.1
#1  0x0000003bd3810119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x0000003bcf0febf6 in backtrace () from /lib64/libc.so.6
#3  0x0000003bd041e956 in _gf_msg_backtrace_nomem (level=<value optimized out>, stacksize=200) at logging.c:971
#4  0x0000003bd0437410 in gf_print_trace (signum=11, ctx=0x1aed010) at common-utils.c:530
#5  <signal handler called>
#6  0x00000000000001c1 in ?? ()
#7  0x0000003bd0c08196 in rpcsvc_transport_submit (trans=<value optimized out>, rpchdr=<value optimized out>, rpchdrcount=<value optimized out>, proghdr=<value optimized out>, proghdrcount=<value optimized out>, progpayload=<value optimized out>, progpayloadcount=0, iobref=0x7faca041b350, priv=0x0) at rpcsvc.c:1006
#8  0x0000003bd0c08b18 in rpcsvc_submit_generic (req=0x7facb5d9902c, proghdr=0x2138bd0, hdrcount=<value optimized out>, payload=0x0, payloadcount=0, iobref=0x7faca041b350) at rpcsvc.c:1190
#9  0x0000003bd0c08f46 in rpcsvc_error_reply (req=0x7facb5d9902c) at rpcsvc.c:1238
#10 0x0000003bd0c08fbb in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x7facb5d9902c) at rpcsvc.c:492
#11 0x0000003bd0457c3a in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#12 0x0000003bcf043bf0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

Actual results:
==============
glusterd crashed while snapshot creation was in progress.

Expected results:
=================
glusterd should not crash.

Additional info:
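The reproduction flow in steps 1–3 can be collected into one script. The node names, brick paths, and mount points below are assumptions (the report only shows the loops), and the `run` wrapper prints each command instead of executing it, so the sequence can be reviewed without a gluster cluster; swap it for real execution to reproduce.

```shell
#!/bin/bash
# Dry-run sketch of the reproduction steps. Node names (node1/node2),
# brick paths, and mount points are assumptions, not from the report.
VOL=vol1
run() { echo "$*"; }   # replace with: run() { "$@"; } on a real cluster

# 1. Create a 2x2 dist-rep volume and start it.
run gluster volume create $VOL replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 node1:/bricks/b2 node2:/bricks/b2
run gluster volume start $VOL

# 2. FUSE- and NFS-mount the volume and create files on both mounts.
run mount -t glusterfs node1:/$VOL /mnt/fuse
run mount -t nfs -o vers=3 node1:/$VOL /mnt/nfs
for i in {1..100}; do run dd if=/dev/urandom of=/mnt/fuse/fuse$i bs=10M count=1; done
for i in {1..100}; do run dd if=/dev/urandom of=/mnt/nfs/nfs$i bs=10M count=1; done

# 3. While file creation is in progress, create snapshots in a loop.
for i in {1..100}; do run gluster snapshot create snap_${VOL}_$i $VOL; done
```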
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/snapshots/1088355/
Marking snapshot BZs to RHS 3.0.
Looking at the stack trace from the core file, it looks like the stack is corrupted:

#12 0x0000003bcf043bf0 in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

I will try to re-create this problem under valgrind and see if I can find something.
With similar steps, hit another glusterd crash:

(gdb) bt
#0  0x000000308c7904c8 in main_arena () from /lib64/libc.so.6
#1  0x000000308e008196 in rpcsvc_transport_submit (trans=<value optimized out>, rpchdr=<value optimized out>, rpchdrcount=<value optimized out>, proghdr=<value optimized out>, proghdrcount=<value optimized out>, progpayload=<value optimized out>, progpayloadcount=0, iobref=0x7fb63406f9f0, priv=0x0) at rpcsvc.c:1006
#2  0x000000308e008b18 in rpcsvc_submit_generic (req=0x7fb639d9c644, proghdr=0x1a78170, hdrcount=<value optimized out>, payload=0x0, payloadcount=0, iobref=0x7fb63406f9f0) at rpcsvc.c:1190
#3  0x000000308e008f46 in rpcsvc_error_reply (req=0x7fb639d9c644) at rpcsvc.c:1238
#4  0x000000308e008fbb in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x7fb639d9c644) at rpcsvc.c:492
#5  0x000000308d857c3a in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#6  0x000000308c443bf0 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()

Steps were:
===========
1. Create and start 4 volumes
2. Create snapshots in a loop for all 4 volumes simultaneously
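The second reproducer (4 volumes, snapshots taken simultaneously) can be sketched the same way. Volume and node names are assumptions, and the `run` wrapper prints the commands rather than executing them so the flow is reviewable without a cluster.

```shell
#!/bin/bash
# Dry-run sketch of the second reproducer. Volume and node names are
# assumptions; swap run() for real execution on a cluster.
run() { echo "$*"; }

# 1. Create and start 4 volumes.
for v in vol{1..4}; do
    run gluster volume create $v replica 2 node1:/bricks/$v node2:/bricks/$v
    run gluster volume start $v
done

# 2. Create snapshots in a loop for all 4 volumes simultaneously
#    (one background loop per volume, so the snapshot requests overlap).
for v in vol{1..4}; do
    ( for i in {1..100}; do run gluster snapshot create snap_${v}_$i $v; done ) &
done
wait
```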
While creating a snapshot we release the big-lock during the mount operation. This can lead to a deadlock-like scenario or to data-structure corruption. This is addressed in the patch http://review.gluster.org/#/c/7461/. We need to run the test with this patch and see if it solves the issue.
Patch http://review.gluster.org/#/c/7461/ is pending for review
Moving back to the Assigned state. The downstream BZ can be moved to POST once the patch is merged upstream.
Patch #7461 has multiple fixes. Posted a separate patch to address this issue: http://review.gluster.org/#/c/7579/
Marking this bug as dependent on bz 1096729, as snapshot creation on multiple volumes with IO in progress is failing.
Setting flags required to add BZs to RHS 3.0 Errata
Removed upstream bugs as dependent bugs, and also removed bugs that have no relation to this bug.
Version: glusterfs-3.6.0.12-1.el6rhs.x86_64
=======
Retried the steps as mentioned in "Steps to Reproduce"; did not hit the issue again. (Ping timeout was set to 0, which is the workaround mentioned for bz 1096729.)

Marking the bug as 'verified'.
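For reference, the verification pass can be sketched as below. The option name network.ping-timeout is the standard gluster volume option used for the ping-timeout workaround; the exact invocation and volume name are assumptions, and commands are printed as a dry run.

```shell
#!/bin/bash
# Dry-run sketch of the verification pass on glusterfs-3.6.0.12-1.
# The workaround for bz 1096729 (ping timeout = 0) is applied first;
# volume name and the run() wrapper are assumptions.
VOL=vol1
run() { echo "$*"; }

# Apply the workaround, then rerun the snapshot-under-IO loop.
run gluster volume set $VOL network.ping-timeout 0
for i in {1..100}; do run gluster snapshot create snap_${VOL}_$i $VOL; done

# Confirm glusterd is still alive on each node after the run.
run pgrep -x glusterd
```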
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html