1119683 – [SNAPSHOT]: snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	snapshot
Sub Component:
Version:	rhgs-3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Avra Sengupta
QA Contact:	storage-qa-internal@redhat.com
Docs Contact:
URL:
Whiteboard:	SNAPSHOT
Depends On:
Blocks:

Reported:	2014-07-15 09:26 UTC by senaik
Modified:	2016-09-17 12:53 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-01-29 12:58:45 UTC
Embargoed:
Dependent Products:

Description senaik 2014-07-15 09:26:45 UTC

Description of problem:
=======================
Snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.24 built on Jul  3 2014

How reproducible:


Steps to Reproduce:
==================
Ping timer set to 30 seconds (default).

1. Create a 12-node cluster and create a 6x2 distributed-replicate volume.

2. FUSE and NFS mount the volume and start I/O.
I/O pattern:
for i in {1500..3000} ; do cp -rvf /etc etc.$i ; done
for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=1024M count=1; done

3. Create up to 256 snapshots.

4. Delete a few snapshots in a loop.
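
Steps 3 and 4 could be scripted roughly as follows. This is only a sketch, not the exact commands used in this report: the volume name vol1 and the <volname>_snap<N> naming are inferred from the snapshot name vol1_snap112 in the log, and the GLUSTER and VOLNAME variables and the two helper functions are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch of steps 3 and 4. VOLNAME, the snapshot naming scheme and the
# helper function names are assumptions, not taken from the report.
GLUSTER="${GLUSTER:-gluster}"   # set GLUSTER=echo for a dry run
VOLNAME="${VOLNAME:-vol1}"

# Step 3: create snapshots <VOLNAME>_snap1 .. <VOLNAME>_snapN.
create_snapshots() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        $GLUSTER snapshot create "${VOLNAME}_snap${i}" "$VOLNAME" || return 1
        i=$((i + 1))
    done
}

# Step 4: delete snapshots first..last in a loop, stopping at the first
# failure. --mode=script suppresses the interactive confirmation prompt.
delete_snapshots() {
    i="$1"
    last="$2"
    while [ "$i" -le "$last" ]; do
        if ! $GLUSTER --mode=script snapshot delete "${VOLNAME}_snap${i}"; then
            echo "delete of ${VOLNAME}_snap${i} failed" >&2
            return 1
        fi
        i=$((i + 1))
    done
}
```

For example, `create_snapshots 256` followed by `delete_snapshots 100 120` would approximate the sequence above; the delete range is arbitrary.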

Some of the snapshots were deleted successfully. After some time a snapshot delete failed, and subsequent snapshot delete operations failed with "another transaction in progress" because the volume lock was still held. This lock is never released.

The snapshot delete failed because of the disconnects shown in the log snippet below.

-----------------Part of Log------------------------

[2014-07-14 18:13:49.121688] I [glusterd-snapshot.c:4817:glusterd_snapshot_remove_commit] 0-management: Successfully marked snap vol1_snap112 for decommission.
[2014-07-14 18:13:49.122181] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7f51c497c8c8] (-->/usr/lib64/glusterfs/3.6.0.24/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7f51c497c745] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3ae240d633]))) 0-rpc_transport: invalid argument: this
[2014-07-14 18:13:49.122546] I [MSGID: 106005] [glusterd-handler.c:4165:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.12.11:/var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1/b2 has disconnected from glusterd.
[2014-07-14 18:13:49.160881] E [glusterd-utils.c:12272:glusterd_umount] 0-management: umounting /var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1 failed (Bad file descriptor)
[2014-07-14 18:14:51.506196] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-management: server 192.168.12.24:24007 has not responded in the last 30 seconds, disconnecting.
[2014-07-14 18:14:51.506734] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3ae240fe7d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x3ae240f8b1] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3ae240f7fe]))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(4)) called at 2014-07-14 18:13:50.731412 (xid=0x1442)
[2014-07-14 18:14:51.506771] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 192.168.12.24. Please check log file for details.

-------------------------------------------------------------------------------
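
To check whether other nodes hit the same sequence, the glusterd log can be searched for the function names that appear in the snippet above. The following is a minimal sketch: scan_glusterd_log is a hypothetical helper, and the default log path is an assumption, so pass the actual glusterd log file as the first argument if it differs.

```shell
#!/bin/sh
# Count the events seen in the snippet above in a glusterd log file.
# The default path is an assumption; pass the real log file as $1.
scan_glusterd_log() {
    log="${1:-/var/log/glusterfs/etc-glusterfs-glusterd.vol.log}"
    # Ping timer expired, so the peer was disconnected.
    echo "ping timer expiries: $(grep -c 'rpc_clnt_ping_timer_expired' "$log")"
    # Pending RPC frames forcibly unwound after the disconnect.
    echo "forced frame unwinds: $(grep -c 'saved_frames_unwind' "$log")"
    # mgmt v3 phase (here: commit) reported as failed on a peer.
    echo "failed commits: $(grep -c 'Commit failed on' "$log")"
}
```

A non-zero count of ping timer expiries shortly before a failed commit would match the pattern reported here.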



Actual results:


Expected results:


Additional info:

Comment 2 Joseph Elwin Fernandes 2014-07-21 12:54:41 UTC

Could you please attach the SOS report?

Comment 5 Avra Sengupta 2016-01-29 12:58:45 UTC

Won't fix in RHGS 3.0. Works fine in RHGS 3.1.1.
