Bug 1119683

Summary: [SNAPSHOT]: snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume
Product: Red Hat Gluster Storage Reporter: senaik
Component: snapshot Assignee: Avra Sengupta <asengupt>
Status: CLOSED CURRENTRELEASE QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.0 CC: josferna, rhs-bugs, storage-qa-internal, vagarwal
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: SNAPSHOT
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-01-29 12:58:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description senaik 2014-07-15 09:26:45 UTC
Description of problem:
Snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume

Version-Release number of selected component (if applicable):
glusterfs built on Jul  3 2014

How reproducible:

Steps to Reproduce:
Ping timer set to 30 seconds (default).

1. Create a 12-node cluster and create a 6x2 distributed-replicate volume.

2. Mount the volume over FUSE and NFS and start I/O.
I/O pattern:
for i in {1500..3000} ; do cp -rvf /etc etc.$i ; done
for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=1024M count=1; done

3. Create up to 256 snapshots.

4. Delete a few snapshots in a loop.
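
Steps 3 and 4 can be scripted roughly as follows. This is a minimal sketch, not the script used in the original run: the volume name vol1 and the vol1_snapN naming are inferred from the log below, DELETE_COUNT is an arbitrary choice, and the GLUSTER variable is a hypothetical hook so the loops can be dry-run with echo on a machine without gluster installed.

```shell
# Sketch of steps 3 and 4. VOL, SNAP_COUNT, DELETE_COUNT and GLUSTER are
# illustrative assumptions, not values confirmed by the original test run.
VOL=${VOL:-vol1}
SNAP_COUNT=${SNAP_COUNT:-256}
DELETE_COUNT=${DELETE_COUNT:-20}
# --mode=script suppresses the interactive confirmation on snapshot delete.
GLUSTER=${GLUSTER:-gluster --mode=script}

# Step 3: create snapshots named <VOL>_snap1 .. <VOL>_snap<SNAP_COUNT>.
create_snapshots() {
    for i in $(seq 1 "$SNAP_COUNT"); do
        $GLUSTER snapshot create "${VOL}_snap${i}" "$VOL" || return 1
    done
}

# Step 4: delete the first DELETE_COUNT snapshots in a loop.
delete_snapshots() {
    for i in $(seq 1 "$DELETE_COUNT"); do
        # Stop at the first failure so the failing snapshot is easy to identify.
        $GLUSTER snapshot delete "${VOL}_snap${i}" || return 1
    done
}
```

Run create_snapshots followed by delete_snapshots on one node of the cluster while the I/O above is in progress. Setting GLUSTER=echo prints the commands instead of running them.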

Some of the snapshots were deleted successfully. After some time a snapshot delete failed, and every subsequent snapshot delete failed with "another transaction in progress" because the volume lock was still held. The lock is never released.

The snapshot delete failed because of peer disconnects, as shown in the log snippet below.

-----------------Part of Log------------------------

[2014-07-14 18:13:49.121688] I [glusterd-snapshot.c:4817:glusterd_snapshot_remove_commit] 0-management: Successfully marked snap vol1_snap112 for deco
[2014-07-14 18:13:49.122181] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/
onnect+0x38) [0x7f51c497c8c8] (-->/usr/lib64/glusterfs/ [0x7f51c497c745] (-->/usr/lib64/
libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3ae240d633]))) 0-rpc_transport: invalid argument: this
[2014-07-14 18:13:49.122546] I [MSGID: 106005] [glusterd-handler.c:4165:__glusterd_brick_rpc_notify] 0-management: Brick
r/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1/b2 has disconnected from glusterd.
[2014-07-14 18:13:49.160881] E [glusterd-utils.c:12272:glusterd_umount] 0-management: umounting /var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4
b/brick1 failed (Bad file descriptor)
[2014-07-14 18:14:51.506196] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-management: server has not responded in the last 30 seconds, disconnecting.
[2014-07-14 18:14:51.506734] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3ae240fe7d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x3ae240f8b1] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3ae240f7fe]))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(4)) called at 2014-07-14 18:13:50.731412 (xid=0x1442)
[2014-07-14 18:14:51.506771] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on Please check log file for details.
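
To check whether other nodes show the same pattern, the glusterd log can be scanned for the two messages that bracket the failure above: the ping-timer expiry and the failed commit. A minimal sketch; the function name is hypothetical, and the log path in the usage note is an assumption about where glusterd logs on this release and may differ on a given install.

```shell
# Count ping-timer disconnects and mgmt v3 commit failures in a glusterd log.
# count_disconnects is an illustrative name; pass the log file as argument 1.
count_disconnects() {
    log=$1
    # grep -c prints 0 when nothing matches, so both counts are always set.
    pings=$(grep -c 'rpc_clnt_ping_timer_expired' "$log")
    commits=$(grep -c 'gd_mgmt_v3_collate_errors.*Commit failed' "$log")
    echo "ping_timeouts=$pings commit_failures=$commits"
}
```

For example: count_disconnects /var/log/glusterfs/etc-glusterfs-glusterd.vol.log (assumed path).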


Actual results:

Expected results:

Additional info:

Comment 2 Joseph Elwin Fernandes 2014-07-21 12:54:41 UTC
Could you please attach the SOS report?

Comment 5 Avra Sengupta 2016-01-29 12:58:45 UTC
Won't fix in RHGS 3.0. Works fine in RHGS 3.1.1.