Bug 1099833

Summary: [SNAPSHOT]: glusterd crashed while running snapshot creation for all the volumes while arequal was in progress
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: snapshot
Reporter: Rahul Hinduja <rhinduja>
Assignee: Avra Sengupta <asengupt>
QA Contact: Rahul Hinduja <rhinduja>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: rhgs-3.0
Target Release: RHGS 3.0.0
Hardware: x86_64
OS: Linux
Whiteboard: SNAPSHOT
Fixed In Version: glusterfs-3.6.0.11-1.el6rhs
Doc Type: Bug Fix
Type: Bug
CC: rhs-bugs, sdharane, ssamanta, storage-qa-internal, vagarwal, vbellur
Clones: 1100325
Bug Blocks: 1100325
Last Closed: 2014-09-22 19:39:03 UTC

Description Rahul Hinduja 2014-05-21 10:12:38 UTC
Description of problem:
=======================

While snapshots of 4 volumes were being created simultaneously and client IO was in progress, glusterd on one of the nodes crashed with the following backtrace:

(gdb) bt
#0  glusterd_handle_mgmt_v3_unlock_fn (req=0x7f47de1aa664) at glusterd-mgmt-handler.c:873
#1  0x00007f47de911acf in glusterd_big_locked_handler (req=0x7f47de1aa664, actor_fn=0x7f47de9c41f0 <glusterd_handle_mgmt_v3_unlock_fn>) at glusterd-handler.c:81
#2  0x000000384f25b742 in synctask_wrap (old_task=<value optimized out>) at syncop.c:333
#3  0x000000384de43bf0 in ?? () from /lib64/libc-2.12.so
#4  0x0000000000000000 in ?? ()
(gdb) f 0
#0  glusterd_handle_mgmt_v3_unlock_fn (req=0x7f47de1aa664) at glusterd-mgmt-handler.c:873
873	        gf_log (this->name, GF_LOG_TRACE, "Returning %d", ret);
(gdb) l
868	                        GF_FREE (ctx);
869	        }
870	
871	        free (lock_req.dict.dict_val);
872	
873	        gf_log (this->name, GF_LOG_TRACE, "Returning %d", ret);
874	        return ret;
875	}
876	
877	int
(gdb) 



Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.6.0.4-1.el6rhs.x86_64


Steps Carried:
==============
1. Create a 4-node cluster.
2. Create and start 4 volumes on it, named vol0 to vol3.
3. Mount the volumes (FUSE and NFS).
4. Start arequal from all 8 mounts of the 4 volumes [2 (fuse+nfs) x 4 (volumes) = 8].
5. While the IO is in progress, start snapshot creation of all 4 volumes simultaneously from different nodes in the cluster.
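The steps above can be sketched with the gluster CLI roughly as follows. The hostnames (node1..node4), brick paths, and snapshot names are placeholder assumptions, and the script only prints each command unless the gluster CLI is actually present on the machine.

```shell
#!/usr/bin/env bash
# Reproduction sketch for the steps above; hostnames, brick paths,
# and snapshot names are placeholders.
set -u

have_gluster=0
command -v gluster >/dev/null 2>&1 && have_gluster=1

run() {
    # Print each command; execute it only when the gluster CLI exists.
    echo "+ $*"
    if [ "$have_gluster" -eq 1 ]; then
        "$@"
    fi
}

# 1. Build the 4-node cluster (run from node1).
for peer in node2 node3 node4; do
    run gluster peer probe "$peer"
done

# 2. Create and start vol0..vol3, one brick per node.
for i in 0 1 2 3; do
    run gluster volume create "vol$i" \
        node1:/bricks/vol$i node2:/bricks/vol$i \
        node3:/bricks/vol$i node4:/bricks/vol$i force
    run gluster volume start "vol$i"
done

# 3. Mount each volume over both FUSE and NFS: 2 * 4 = 8 mounts.
for i in 0 1 2 3; do
    run mount -t glusterfs node1:/vol$i "/mnt/fuse-vol$i"
    run mount -t nfs -o vers=3 node1:/vol$i "/mnt/nfs-vol$i"
done

# 4. arequal would now be started against all 8 mounts to generate
#    client IO (not shown).

# 5. While that IO runs, create snapshots of all 4 volumes
#    simultaneously (in the original test, from different nodes).
for i in 0 1 2 3; do
    run gluster snapshot create "snap-vol$i" "vol$i" &
done
wait
```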


Actual results:
===============
After a few snapshots, glusterd crashed.


Expected results:
=================

glusterd should not crash


Additional info:
================

Log snippet:
============

[2014-05-20 13:23:39.420078] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.42.175. Please check log file for details.
[2014-05-20 13:23:43.583228] E [glusterd-mgmt-handler.c:643:glusterd_handle_post_validate_fn] 0-management: Failed to decode post validation request received from peer
[2014-05-20 13:23:43.583346] E [rpcsvc.c:533:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-05-20 13:23:43.584277] I [glusterd-snapshot.c:4625:glusterd_do_snap_cleanup] 0-management: snap b163 is not found
[2014-05-20 13:23:43.584345] E [glusterd-snapshot.c:5931:glusterd_snapshot_create_postvalidate] 0-management: unable to find snap b163
[2014-05-20 13:23:43.584552] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 3801feb3-7066-4c86-996b-366e71ab3dac
[2014-05-20 13:23:43.585458] E [glusterd-mgmt.c:1532:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
[2014-05-20 13:23:43.587079] E [glusterd-mgmt-handler.c:810:glusterd_handle_mgmt_v3_unlock_fn] 0-management: Failed to decode unlock request received from peer
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2014-05-20 13:23:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.4
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x384f21fe56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x384f23a28f]
/lib64/libc.so.6[0x384de329a0]
/usr/lib64/glusterfs/3.6.0.4/xlator/mgmt/glusterd.so(+0xe2438)[0x7f47de9c4438]
/usr/lib64/glusterfs/3.6.0.4/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7f47de911acf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x384f25b742]
/lib64/libc.so.6[0x384de43bf0]

Comment 3 Avra Sengupta 2014-05-26 06:35:45 UTC
Fix at https://code.engineering.redhat.com/gerrit/25660

Comment 4 Rahul Hinduja 2014-06-05 11:50:43 UTC
Verified with build: glusterfs-3.6.0.12-1.el6rhs.x86_64

Did not observe the glusterd crash with the steps mentioned above.

Moving the bug to the verified state.

Comment 6 errata-xmlrpc 2014-09-22 19:39:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html