Bug 1049278

Summary: [SNAPSHOT]: glusterd crashed while performing IO and taking snapshot at the same time
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: snapshot
Assignee: Avra Sengupta <asengupt>
Status: CLOSED ERRATA
QA Contact: senaik
Severity: urgent
Docs Contact:
Priority: urgent
Version: rhgs-3.0
CC: asengupt, nsathyan, rhs-bugs, rjoseph, sdharane, senaik, ssamanta, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.0.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SNAPSHOT
Fixed In Version: glusterfs-3.4.1.snap.feb05.2014
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-09-22 19:31:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1048831
Bug Blocks:

Description Rahul Hinduja 2014-01-07 10:29:36 UTC
Description of problem:
=======================

While IO was in progress from FUSE and NFS mounts, tried to create multiple snapshots of a given volume. glusterd crashed with the following backtrace:

(gdb) bt
#0  gf_store_mkstemp (shandle=0x64656b7361) at store.c:66
#1  0x00007fc9085be0c2 in glusterd_store_perform_snap_volume_store (volinfo=0xeb4fa0, snap_volinfo=0x7fc8f0195aa0) at glusterd-store.c:1371
#2  0x00007fc9085be17f in glusterd_store_snap_volume (volinfo=0xeb4fa0, snap=0x7fc8f019ab40) at glusterd-store.c:1422
#3  0x00007fc9085be453 in glusterd_store_perform_snap_store (volinfo=0xeb4fa0) at glusterd-store.c:1523
#4  0x00007fc9085fdce8 in glusterd_do_snap (volinfo=0xeb4fa0, snapname=0x7fc8fc124570 "snap33", dict=0x7fc90a8399f8, cg=<value optimized out>, cg_id=0x0, volcount=1, 
    snap_volid=0x7fc8fc1228a0 "!\267\017*P\323D\254\274\234\247\304\365r-snaps-E", cg_name=0x0) at glusterd-snapshot.c:3171
#5  0x00007fc9085ff666 in glusterd_snapshot_create_commit (dict=<value optimized out>, op_errstr=0x110a080, rsp_dict=<value optimized out>) at glusterd-snapshot.c:4055
#6  0x00007fc908600873 in glusterd_snapshot (dict=0x7fc90a8399f8, op_errstr=0x110a080, rsp_dict=0x7fc90a839a84) at glusterd-snapshot.c:4356
#7  0x00007fc908604f3e in gd_mgmt_v3_commit_fn (op=GD_OP_SNAP, dict=0x7fc90a8399f8, op_errstr=0x110a080, rsp_dict=0x7fc90a839a84) at glusterd-mgmt.c:174
#8  0x00007fc9086021c3 in glusterd_handle_commit_fn (req=0x7fc9084ee02c) at glusterd-mgmt-handler.c:546
#9  0x00007fc9085773cf in glusterd_big_locked_handler (req=0x7fc9084ee02c, actor_fn=0x7fc908601f80 <glusterd_handle_commit_fn>) at glusterd-handler.c:78
#10 0x0000003f09c4cdd2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:293
#11 0x0000003213043bf0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.4.0.snap.dec30.2013git-1.el6.x86_64


How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create a cluster of 4 servers (server1-4)
2. Create a volume (vol0)
3. Mount the volume on a client over both FUSE and NFS
4. Start creating snapshots of the volume from server1 (a for loop was used to create 100 snapshots)
5. Start IO from the mount points
6. After 30 snapshots were created successfully, snapshot creation failed and glusterd crashed on server2

Note: IO was generated with the arequal script:
./run.sh -w /mnt/vol0 -t arequal -l /mnt/logs-vol0/arequal.log
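
The snapshot loop from step 4 could be driven roughly as sketched below. This is only an illustration, not the script that was actually used: the `gluster snapshot create <snapname> <volname>` argument order is an assumption based on the released CLI (the snap builds tested here may have used a different syntax), and the snapN naming is inferred from "snap33" in the backtrace. It runs in dry-run mode by default and only prints the commands.

```python
import subprocess
from typing import List

VOLUME = "vol0"     # volume name from step 2
SNAP_COUNT = 100    # number of snapshots attempted in step 4


def snapshot_commands(volume: str = VOLUME, count: int = SNAP_COUNT) -> List[List[str]]:
    """Build one snapshot-create command per snapshot (snap1..snapN)."""
    # ASSUMPTION: argument order follows the released gluster CLI;
    # adjust for the build under test.
    return [
        ["gluster", "snapshot", "create", f"snap{i}", volume]
        for i in range(1, count + 1)
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> int:
    """Run the commands in order and stop at the first failure.

    Returns the number of commands that completed successfully.
    """
    done = 0
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                print(f"failed: {' '.join(cmd)}: {result.stderr.strip()}")
                break
        done += 1
    return done


if __name__ == "__main__":
    run(snapshot_commands(), dry_run=True)
```

Stopping at the first failure mirrors step 6, where snapshot creation failed once glusterd had crashed on a peer.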


Actual results:
===============

glusterd crashed with logs as follows:

frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-01-07 02:59:21configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.snap.dec30.2013git
/lib64/libc.so.6[0x32130329a0]
/usr/lib64/libglusterfs.so.0(gf_store_mkstemp+0x38)[0x3f09c48ee8]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_store_perform_snap_volume_store+0x162)[0x7fc9085be0c2]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_store_snap_volume+0x5f)[0x7fc9085be17f]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_store_perform_snap_store+0x1b3)[0x7fc9085be453]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_do_snap+0x6e8)[0x7fc9085fdce8]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_snapshot_create_commit+0x556)[0x7fc9085ff666]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_snapshot+0x113)[0x7fc908600873]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(gd_mgmt_v3_commit_fn+0xae)[0x7fc908604f3e]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(+0xb41c3)[0x7fc9086021c3]
/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fc9085773cf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3f09c4cdd2]
/lib64/libc.so.6[0x3213043bf0]
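
The frames in the log above can be matched against the gdb backtrace by address. The sketch below is a hypothetical helper (the regular expression and the `parse_frame` name are not part of glusterfs) that splits a frame line into module, symbol, offset, and address. For example, the frame logged without a symbol, `(+0xb41c3)[0x7fc9086021c3]`, has the same address as frame #8 (glusterd_handle_commit_fn) in the gdb backtrace.

```python
import re

# Matches lines such as "lib.so(symbol+0x38)[0x3f09c48ee8]"; the "(...)" part is optional.
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\))?"
    r"\[(?P<address>0x[0-9a-f]+)\]$"
)


def parse_frame(line):
    """Return a dict for a crash-log frame line, or None if the line is not a frame."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    frame = match.groupdict()
    frame["symbol"] = frame["symbol"] or None  # "(+0x...)" frames carry no symbol
    return frame


# Three frame lines copied from the log above.
SAMPLE = [
    "/lib64/libc.so.6[0x32130329a0]",
    "/usr/lib64/libglusterfs.so.0(gf_store_mkstemp+0x38)[0x3f09c48ee8]",
    "/usr/lib64/glusterfs/3.4.0.snap.dec30.2013git/xlator/mgmt/glusterd.so(+0xb41c3)[0x7fc9086021c3]",
]

for line in SAMPLE:
    frame = parse_frame(line)
    print(frame["address"], frame["symbol"] or "<no symbol>")
```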


Expected results:
=================

glusterd should not crash, and snapshot creation should succeed.

Comment 3 Rahul Hinduja 2014-01-21 12:05:31 UTC
Issue is reproducible with build: glusterfs-3.4.1.snap.jan15.2014git-1.el6.x86_64

IO pattern: compile_kernel

bt:
===

#0  gf_store_mkstemp (shandle=0x2d74736f68250000) at store.c:66
#1  0x00007fcaae138e82 in glusterd_store_perform_snap_volume_store (volinfo=0x7fca98001720, snap_volinfo=0x7fcaa84581d0) at glusterd-store.c:1379
#2  0x00007fcaae138f3f in glusterd_store_snap_volume (volinfo=0x7fca98001720, snap=0x7fcaa845afa0) at glusterd-store.c:1430
#3  0x00007fcaae139213 in glusterd_store_perform_snap_store (volinfo=0x7fca98001720) at glusterd-store.c:1534
#4  0x00007fcaae17c680 in glusterd_do_snap (volinfo=0x7fca98001720, snapname=0x7fcaa843fb80 "s45", dict=0x7fcab03c5638, cg=0x0, cg_id=0x0, volcount=1, 
    snap_volid=0x7fca985d7640 "\227\250Kں\355G\215\217V\210\336z\300\245(", cg_name=0x0) at glusterd-snapshot.c:3114
#5  0x00007fcaae17d1ac in glusterd_snapshot_create_commit (dict=<value optimized out>, op_errstr=0x24d7698, rsp_dict=<value optimized out>) at glusterd-snapshot.c:4026
#6  0x00007fcaae17d5c3 in glusterd_snapshot (dict=0x7fcab03c5638, op_errstr=0x24d7698, rsp_dict=0x7fcab03c7eb0) at glusterd-snapshot.c:4404
#7  0x00007fcaae18143e in gd_mgmt_v3_commit_fn (op=GD_OP_SNAP, dict=0x7fcab03c5638, op_errstr=0x24d7698, rsp_dict=0x7fcab03c7eb0) at glusterd-mgmt.c:174
#8  0x00007fcaae181f97 in glusterd_mgmt_v3_commit (conf=0x20a7890, op=GD_OP_SNAP, op_ctx=0x7fcab03c5034, req_dict=0x7fcab03c5638, op_errstr=0x24d7698, npeers=3)
    at glusterd-mgmt.c:957
#9  0x00007fcaae1845ec in glusterd_mgmt_v3_initiate_snap_phases (req=0x209ae5c, op=GD_OP_SNAP, dict=0x7fcab03c5034) at glusterd-mgmt.c:1578
#10 0x00007fcaae17baab in glusterd_handle_snapshot_fn (req=0x209ae5c) at glusterd-snapshot.c:4656
#11 0x00007fcaae0f148f in glusterd_big_locked_handler (req=0x209ae5c, actor_fn=0x7fcaae17b580 <glusterd_handle_snapshot_fn>) at glusterd-handler.c:78
#12 0x00000033d1c4ce52 in synctask_wrap (old_task=<value optimized out>) at syncop.c:293
#13 0x0000003213043bf0 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
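
A note on frame #0 in both backtraces: the shandle values (0x64656b7361 in the description, 0x2d74736f68250000 here) do not look like the heap addresses seen in the other arguments. Read as little-endian bytes (the builds are x86_64), both decode to printable ASCII, as the sketch below shows. That is consistent with the store handle pointer holding string data rather than a valid address, but it does not by itself identify the root cause. The `pointer_as_ascii` helper is hypothetical and only for illustration.

```python
def pointer_as_ascii(value: int, width: int = 8) -> str:
    """Render a pointer value as little-endian bytes, with '.' for non-printable bytes."""
    raw = value.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# shandle arguments of gf_store_mkstemp from the two backtraces
for shandle in (0x64656B7361, 0x2D74736F68250000):
    print(f"{shandle:#018x} -> {pointer_as_ascii(shandle)!r}")

# prints:
# 0x00000064656b7361 -> 'asked...'
# 0x2d74736f68250000 -> '..%host-'
```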

Moving the bug back to the assigned state.

Comment 4 senaik 2014-02-14 12:15:46 UTC
Hit a brick crash while verifying this bug. The brick crash is tracked by bz 1048831, so marking this bug as dependent on bz 1048831.

bt :
====

(gdb) bt
#0  0x0000003e33032925 in raise () from /lib64/libc.so.6
#1  0x0000003e33034105 in abort () from /lib64/libc.so.6
#2  0x0000003e33070837 in __libc_message () from /lib64/libc.so.6
#3  0x0000003e33076166 in malloc_printerr () from /lib64/libc.so.6
#4  0x0000003e33078c93 in _int_free () from /lib64/libc.so.6
#5  0x00007fd2d06f603a in ltable_delete_locks (ltable=0x7fd2b0000ee0) at posix.c:2559
#6  0x00007fd2d06f6466 in disconnect_cbk (this=<value optimized out>, client=<value optimized out>) at posix.c:2619
#7  0x0000003555a63d9d in gf_client_disconnect (client=0x1cb7b50) at client_t.c:374
#8  0x00007fd2cbbbf608 in server_connection_cleanup (this=0x1c72570, client=0x1cb7b50, flags=<value optimized out>)
    at server-helpers.c:244
#9  0x00007fd2cbbbae0c in server_rpc_notify (rpc=<value optimized out>, xl=0x1c72570, event=<value optimized out>, 
    data=0x1cb6d50) at server.c:558
#10 0x0000003555e07cc5 in rpcsvc_handle_disconnect (svc=0x1c74490, trans=0x1cb6d50) at rpcsvc.c:682
#11 0x0000003555e09800 in rpcsvc_notify (trans=0x1cb6d50, mydata=<value optimized out>, 
    event=<value optimized out>, data=0x1cb6d50) at rpcsvc.c:720
#12 0x0000003555e0af18 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, 
    data=<value optimized out>) at rpc-transport.c:512
#13 0x00007fd2d1d72761 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, 
    data=0x1cb6d50, poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:1071
#14 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1cb6d50, 
    poll_in=<value optimized out>, poll_out=0, poll_err=24) at socket.c:2239
#15 0x0000003555a66107 in event_dispatch_epoll_handler (event_pool=0x1c44ee0) at event-epoll.c:384
#16 event_dispatch_epoll (event_pool=0x1c44ee0) at event-epoll.c:445
#17 0x000000000040680a in main (argc=19, argv=0x7ffff54e6288) at glusterfsd.c:1964

Comment 6 Avra Sengupta 2014-03-24 11:37:00 UTC
Fixed with http://review.gluster.org/#/c/6903/

Comment 7 Nagaprasad Sathyanarayana 2014-04-21 06:17:47 UTC
Marking snapshot BZs to RHS 3.0.

Comment 8 Nagaprasad Sathyanarayana 2014-05-19 10:56:31 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 10 Rahul Hinduja 2014-06-05 08:40:00 UTC
Verified with build: glusterfs-3.6.0.12-1.el6rhs.x86_64

No crash observed while taking snapshots of a volume while arequal was in progress from the FUSE and NFS clients.

Moving the bug to verified state.

Comment 12 errata-xmlrpc 2014-09-22 19:31:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html