Bug 1119683

Summary:	[SNAPSHOT]: snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	senaik
Component:	snapshot	Assignee:	Avra Sengupta <asengupt>
Status:	CLOSED CURRENTRELEASE	QA Contact:	storage-qa-internal <storage-qa-internal>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.0	CC:	josferna, rhs-bugs, storage-qa-internal, vagarwal
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	SNAPSHOT
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-01-29 12:58:45 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description senaik 2014-07-15 09:26:45 UTC

Description of problem:
=======================
Snapshot delete failed because of frequent peer disconnects, and the lock on the volume is then held forever.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.24 built on Jul  3 2014

How reproducible:


Steps to Reproduce:
==================
Ping timer set to 30 seconds (default)

1. Create a 12-node cluster and create a 6x2 distributed-replicate volume

2. FUSE-mount and NFS-mount the volume and start generating I/O
I/O pattern:
for i in {1500..3000} ; do cp -rvf /etc etc.$i ; done
for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=1024M count=1; done

3. Create up to 256 snapshots

4. Delete a few snapshots in a loop
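
Steps 3 and 4 could be scripted roughly as below. This is only a sketch, not the exact commands used in the test run: the volume name "vol1" and the snapshot naming "vol1_snapN" are inferred from the log snippet, the helper names (run, create_snapshots, delete_snapshots) are made up for illustration, and DRY_RUN=1 (the default here) only prints the gluster commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch of steps 3 and 4. Assumptions: volume "vol1" and snapshot names
# "vol1_snapN" are inferred from the log; helper names are hypothetical.
# With DRY_RUN=1 (default) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_snapshots() {   # usage: create_snapshots VOLUME COUNT
    local vol="$1" count="$2" i
    for ((i = 1; i <= count; i++)); do
        run gluster snapshot create "${vol}_snap${i}" "$vol"
    done
}

delete_snapshots() {   # usage: delete_snapshots VOLUME FIRST LAST
    local vol="$1" first="$2" last="$3" i
    for ((i = first; i <= last; i++)); do
        # --mode=script suppresses the interactive confirmation prompt
        run gluster --mode=script snapshot delete "${vol}_snap${i}"
    done
}
```

For example, "create_snapshots vol1 256" followed by "delete_snapshots vol1 100 120" would print (or, with DRY_RUN=0, run) the create and delete sequence while the client I/O from step 2 is still in progress. The range 100 to 120 is arbitrary.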

Some of the snapshots were deleted successfully. After some time a snapshot delete failed, and all subsequent snapshot delete operations failed with "another transaction in progress" because the volume lock was still held. This lock is never released.

The snapshot delete failed because of peer disconnects, as shown in the log snippet below.

-----------------Part of Log------------------------

[2014-07-14 18:13:49.121688] I [glusterd-snapshot.c:4817:glusterd_snapshot_remove_commit] 0-management: Successfully marked snap vol1_snap112 for decommission.
[2014-07-14 18:13:49.122181] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7f51c497c8c8] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7f51c497c745] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3ae240d633]))) 0-rpc_transport: invalid argument: this
[2014-07-14 18:13:49.122546] I [MSGID: 106005] [glusterd-handler.c:4165:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.12.11:/var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1/b2 has disconnected from glusterd.
[2014-07-14 18:13:49.160881] E [glusterd-utils.c:12272:glusterd_umount] 0-management: umounting /var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1 failed (Bad file descriptor)
[2014-07-14 18:14:51.506196] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-management: server 192.168.12.24:24007 has not responded in the last 30 seconds, disconnecting.
[2014-07-14 18:14:51.506734] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3ae240fe7d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x3ae240f8b1] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3ae240f7fe]))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(4)) called at 2014-07-14 18:13:50.731412 (xid=0x1442)
[2014-07-14 18:14:51.506771] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 192.168.12.24. Please check log file for details.

-------------------------------------------------------------------------------
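
To check whether other nodes show the same pattern, the glusterd log can be scanned for the two messages seen above (ping timer expiry and mgmt v3 commit failure). The following is a minimal sketch: the function name scan_glusterd_log is hypothetical, and the log file path has to be supplied by the caller, since the location of the glusterd log is not stated in this report.

```shell
#!/usr/bin/env bash
# Sketch: summarise ping-timer disconnects and commit failures in a
# glusterd log. The function name is hypothetical; the message patterns
# are taken from the log snippet in this report.
scan_glusterd_log() {   # usage: scan_glusterd_log LOGFILE
    local log="$1"
    # grep -c prints 0 when nothing matches, so the counts are always numeric
    echo "ping_timeouts=$(grep -c 'rpc_clnt_ping_timer_expired' "$log")"
    echo "commit_failures=$(grep -c 'Commit failed on' "$log")"
    echo "peers_timed_out:"
    # Extract the host:port of each peer that stopped responding
    grep 'rpc_clnt_ping_timer_expired' "$log" \
        | sed -n 's/.*server \([^ ]*\) has not responded.*/\1/p' \
        | sort -u
}
```

Run against the log excerpt above, this would report one ping timeout and one commit failure, both for 192.168.12.24.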



Actual results:


Expected results:


Additional info:

Comment 2 Joseph Elwin Fernandes 2014-07-21 12:54:41 UTC

Could you please attach the sosreport?

Comment 5 Avra Sengupta 2016-01-29 12:58:45 UTC

Won't fix in RHGS 3.0. Works fine in RHGS 3.1.1.