Description of problem:
=======================
Snapshot delete failed because of frequent peer disconnects, and the lock is held forever on the volume.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.24 built on Jul 3 2014

How reproducible:

Steps to Reproduce:
==================
Ping timer set to 30 seconds (default).

1. Create a 12-node cluster and create a 6x2 dist-repl volume.
2. Fuse and NFS mount the volume and start creating I/O.

   I/O pattern:
   for i in {1500..3000} ; do cp -rvf /etc etc.$i ; done
   for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=1024M count=1; done

3. Create up to 256 snapshots.
4. Delete a few snapshots in a loop.

Some of the snapshots were deleted successfully. After some time a snapshot delete failed, and subsequent snapshot delete operations failed with "another transaction in progress" because the volume lock was still held. This lock is held forever.

Snapshot delete failed because of the disconnects shown in the log snippet below.

-----------------Part of Log------------------------
[2014-07-14 18:13:49.121688] I [glusterd-snapshot.c:4817:glusterd_snapshot_remove_commit] 0-management: Successfully marked snap vol1_snap112 for decommission.
[2014-07-14 18:13:49.122181] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7f51c497c8c8] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7f51c497c745] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3ae240d633]))) 0-rpc_transport: invalid argument: this
[2014-07-14 18:13:49.122546] I [MSGID: 106005] [glusterd-handler.c:4165:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.12.11:/var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1/b2 has disconnected from glusterd.
[2014-07-14 18:13:49.160881] E [glusterd-utils.c:12272:glusterd_umount] 0-management: umounting /var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1 failed (Bad file descriptor)
[2014-07-14 18:14:51.506196] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-management: server 192.168.12.24:24007 has not responded in the last 30 seconds, disconnecting.
[2014-07-14 18:14:51.506734] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3ae240fe7d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x3ae240f8b1] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3ae240f7fe]))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(4)) called at 2014-07-14 18:13:50.731412 (xid=0x1442)
[2014-07-14 18:14:51.506771] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 192.168.12.24. Please check log file for details.
-------------------------------------------------------------------------------

Actual results:

Expected results:

Additional info:
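Steps 3 and 4 could be scripted roughly as follows. This is only a dry-run sketch, not the script used for this report: it prints the gluster commands rather than running them. The volume name `vol1` and the `vol1_snapN` naming are assumptions inferred from "vol1_snap112" in the log, and the helper function names are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch of steps 3 and 4: prints gluster commands, does not run them.
# VOL and the ${VOL}_snapN naming are assumptions inferred from the log snippet.
VOL="${VOL:-vol1}"

# Print one "snapshot create" command per index in [first, last].
gen_create_cmds() {
    i="$1"
    while [ "$i" -le "$2" ]; do
        echo "gluster snapshot create ${VOL}_snap${i} ${VOL}"
        i=$((i + 1))
    done
}

# Print one "snapshot delete" command per index in [first, last].
# --mode=script suppresses the interactive confirmation prompt.
gen_delete_cmds() {
    i="$1"
    while [ "$i" -le "$2" ]; do
        echo "gluster --mode=script snapshot delete ${VOL}_snap${i}"
        i=$((i + 1))
    done
}
```

To actually run the steps, the output can be piped to a shell on one of the cluster nodes, for example `gen_create_cmds 1 256 | sh` followed by `gen_delete_cmds 1 256 | sh`, while the I/O from step 2 is still in progress.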
Could you please attach the SOS report?
Won't fix in RHGS 3.0. Works fine in RHGS 3.1.1.