Description of problem:
=======================
In a scenario where the system was creating snapshots of 4 volumes simultaneously while IO from the client was in progress, one of the glusterd instances crashed with the following backtrace:

(gdb) bt
#0  glusterd_handle_mgmt_v3_unlock_fn (req=0x7f47de1aa664) at glusterd-mgmt-handler.c:873
#1  0x00007f47de911acf in glusterd_big_locked_handler (req=0x7f47de1aa664, actor_fn=0x7f47de9c41f0 <glusterd_handle_mgmt_v3_unlock_fn>) at glusterd-handler.c:81
#2  0x000000384f25b742 in synctask_wrap (old_task=<value optimized out>) at syncop.c:333
#3  0x000000384de43bf0 in ?? () from /lib64/libc-2.12.so
#4  0x0000000000000000 in ?? ()
(gdb) f 0
#0  glusterd_handle_mgmt_v3_unlock_fn (req=0x7f47de1aa664) at glusterd-mgmt-handler.c:873
873             gf_log (this->name, GF_LOG_TRACE, "Returning %d", ret);
(gdb) l
868                     GF_FREE (ctx);
869             }
870
871             free (lock_req.dict.dict_val);
872
873             gf_log (this->name, GF_LOG_TRACE, "Returning %d", ret);
874             return ret;
875     }
876
877     int
(gdb)

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.6.0.4-1.el6rhs.x86_64

Steps Carried:
==============
1. Create a 4-node clustered system.
2. Create and start 4 volumes on the system, named vol0 to vol3.
3. Mount the volumes (FUSE and NFS).
4. Start arequal IO from all 8 mounts of the 4 volumes [2 (fuse+nfs) * 4 (volumes) = 8].
5. While IO is in progress, start snapshot creation of all 4 volumes simultaneously from different nodes in the cluster.

Actual results:
===============
After a few snapshots, glusterd crashed.

Expected results:
=================
glusterd should not crash.

Additional info:
================
Log snippet:
============
[2014-05-20 13:23:39.420078] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.42.175. Please check log file for details.
[2014-05-20 13:23:43.583228] E [glusterd-mgmt-handler.c:643:glusterd_handle_post_validate_fn] 0-management: Failed to decode post validation request received from peer
[2014-05-20 13:23:43.583346] E [rpcsvc.c:533:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-05-20 13:23:43.584277] I [glusterd-snapshot.c:4625:glusterd_do_snap_cleanup] 0-management: snap b163 is not found
[2014-05-20 13:23:43.584345] E [glusterd-snapshot.c:5931:glusterd_snapshot_create_postvalidate] 0-management: unable to find snap b163
[2014-05-20 13:23:43.584552] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 3801feb3-7066-4c86-996b-366e71ab3dac
[2014-05-20 13:23:43.585458] E [glusterd-mgmt.c:1532:glusterd_mgmt_v3_release_peer_locks] 0-management: Unlock failed on peers
[2014-05-20 13:23:43.587079] E [glusterd-mgmt-handler.c:810:glusterd_handle_mgmt_v3_unlock_fn] 0-management: Failed to decode unlock request received from peer

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-05-20 13:23:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.4
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x384f21fe56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x384f23a28f]
/lib64/libc.so.6[0x384de329a0]
/usr/lib64/glusterfs/3.6.0.4/xlator/mgmt/glusterd.so(+0xe2438)[0x7f47de9c4438]
/usr/lib64/glusterfs/3.6.0.4/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7f47de911acf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x384f25b742]
/lib64/libc.so.6[0x384de43bf0]
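
For readers triaging similar reports: the log shows "Failed to decode unlock request received from peer" at glusterd-mgmt-handler.c:810 immediately before the SIGSEGV on the trace log at line 873, i.e. the handler keeps running its cleanup/trace path after a failed request decode. The snippet below is a minimal, self-contained C sketch of the general defensive pattern (return early on decode failure and guard the trailing log). All types and function names in it are simplified, hypothetical stand-ins; it is not the actual glusterd code and not the fix referenced in the next comment.

/*
 * Illustration only: a handler that keeps running after a failed
 * request decode can crash on its trailing trace log if local state
 * was never initialized.  Names here are hypothetical stand-ins.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct xlator {            /* stand-in for glusterd's xlator_t */
        const char *name;
};

struct unlock_req {        /* stand-in for the decoded RPC payload */
        char *dict_val;
};

/* Pretend XDR decode: returns -1 on a malformed buffer. */
static int decode_request(const char *buf, struct unlock_req *req)
{
        if (!buf || strcmp(buf, "valid") != 0)
                return -1;
        req->dict_val = strdup("decoded-dict");
        return 0;
}

static int handle_unlock(struct xlator *this, const char *wire_buf)
{
        struct unlock_req lock_req = {0};
        int ret = -1;

        ret = decode_request(wire_buf, &lock_req);
        if (ret < 0) {
                /* Defensive pattern: report the error and return early
                 * instead of falling through to cleanup/trace code that
                 * assumes a successful decode. */
                fprintf(stderr, "Failed to decode unlock request\n");
                return ret;
        }

        /* ... normal unlock processing would go here ... */

        free(lock_req.dict_val);

        /* Guard the pointer before using it in the trace log; the crash
         * in this report happened on an equivalent trace-log line. */
        if (this && this->name)
                printf("%s: Returning %d\n", this->name, ret);
        return ret;
}

int main(void)
{
        struct xlator mgmt = { .name = "management" };

        handle_unlock(&mgmt, "valid");      /* normal path */
        handle_unlock(&mgmt, "garbage");    /* decode failure: early return, no crash */
        return 0;
}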
Fix at https://code.engineering.redhat.com/gerrit/25660
Verified with build: glusterfs-3.6.0.12-1.el6rhs.x86_64

Did not observe the glusterd crash with the steps mentioned above. Moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html