Bug 1119683 - [SNAPSHOT]: snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: snapshot
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Avra Sengupta
QA Contact: storage-qa-internal@redhat.com
Whiteboard: SNAPSHOT
Depends On:
Reported: 2014-07-15 09:26 UTC by senaik
Modified: 2016-09-17 12:53 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2016-01-29 12:58:45 UTC
Target Upstream Version:


Description senaik 2014-07-15 09:26:45 UTC
Description of problem:
Snapshot delete failed because of frequent peer disconnects and the lock is held forever on the volume

Version-Release number of selected component (if applicable):
glusterfs built on Jul  3 2014

How reproducible:

Steps to Reproduce:
Ping timer set to 30 seconds (default)

1. Create a 12-node cluster and create a 6x2 distributed-replicate volume

2. FUSE-mount and NFS-mount the volume and start generating I/O
I/O pattern:
for i in {1500..3000} ; do cp -rvf /etc etc.$i ; done
for i in {1..100}; do dd if=/dev/urandom of=fuse"$i" bs=1024M count=1; done

3. Create up to 256 snapshots

4. Delete a few snapshots in a loop

Some of the snapshots were deleted successfully. After some time, a snapshot delete failed, and subsequent snapshot delete operations failed with "another transaction in progress" because the volume lock was still held. This lock is never released.
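
Steps 3 and 4 above could be scripted roughly as follows. This is only a sketch, not the exact commands used in this test run: the volume name vol1 and the vol1_snapN naming scheme are inferred from the log snippet, and the GLUSTER variable and both function names are hypothetical helpers introduced here so the loop can be dry-run with GLUSTER=echo.

```shell
#!/bin/bash
# Sketch of steps 3 and 4. VOLNAME and the snapshot naming scheme are
# assumptions (inferred from "vol1_snap112" in the log), not confirmed.
GLUSTER=${GLUSTER:-gluster}      # set GLUSTER=echo for a dry run
VOLNAME=${VOLNAME:-vol1}

create_snapshots() {             # usage: create_snapshots COUNT
    local i
    for i in $(seq 1 "$1"); do
        $GLUSTER snapshot create "${VOLNAME}_snap${i}" "$VOLNAME" || return 1
    done
}

delete_snapshots() {             # usage: delete_snapshots FIRST LAST
    local i
    for i in $(seq "$1" "$2"); do
        # --mode=script suppresses the interactive confirmation prompt
        if ! $GLUSTER --mode=script snapshot delete "${VOLNAME}_snap${i}"; then
            echo "delete failed at ${VOLNAME}_snap${i}" >&2
            return 1             # stop at the first failure
        fi
    done
}
```

For example, `create_snapshots 256` followed by `delete_snapshots 100 120` would approximate the sequence described; stopping at the first failed delete makes it easier to see which snapshot left the volume lock held.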

The snapshot delete failed because of disconnects, as shown in the log snippet below.

-----------------Part of Log------------------------

[2014-07-14 18:13:49.121688] I [glusterd-snapshot.c:4817:glusterd_snapshot_remove_commit] 0-management: Successfully marked snap vol1_snap112 for deco
[2014-07-14 18:13:49.122181] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/
onnect+0x38) [0x7f51c497c8c8] (-->/usr/lib64/glusterfs/ [0x7f51c497c745] (-->/usr/lib64/
libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3ae240d633]))) 0-rpc_transport: invalid argument: this
[2014-07-14 18:13:49.122546] I [MSGID: 106005] [glusterd-handler.c:4165:__glusterd_brick_rpc_notify] 0-management: Brick
r/snaps/15541ed3a83342aeb2fdd2762b5dcf4b/brick1/b2 has disconnected from glusterd.
[2014-07-14 18:13:49.160881] E [glusterd-utils.c:12272:glusterd_umount] 0-management: umounting /var/run/gluster/snaps/15541ed3a83342aeb2fdd2762b5dcf4
b/brick1 failed (Bad file descriptor)
[2014-07-14 18:14:51.506196] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-management: server has not responded in the last 30 seconds, disconnecting.
[2014-07-14 18:14:51.506734] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x3ae240fe7d] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x3ae240f8b1] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3ae240f7fe]))) 0-management: forced unwinding frame type(glusterd mgmt v3) op(--(4)) called at 2014-07-14 18:13:50.731412 (xid=0x1442)
[2014-07-14 18:14:51.506771] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on Please check log file for details.
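
To pull the same kind of entries out of a glusterd log on another node, something like the following may help. It is a rough sketch: the function name scan_glusterd_log is hypothetical, the default log path is an assumption about the install, and the patterns are simply the function names and message text that appear in the snippet above.

```shell
#!/bin/bash
# GLUSTERD_LOG is an assumed default path; adjust for your installation.
GLUSTERD_LOG=${GLUSTERD_LOG:-/var/log/glusterfs/etc-glusterfs-glusterd.vol.log}

scan_glusterd_log() {            # usage: scan_glusterd_log [LOGFILE]
    # Match ping-timer expiry, forced frame unwinds, mgmt v3 commit
    # errors, and brick disconnect notices, as seen in this report.
    grep -E 'rpc_clnt_ping_timer_expired|saved_frames_unwind|gd_mgmt_v3_collate_errors|has disconnected from glusterd' \
        "${1:-$GLUSTERD_LOG}"
}
```

Running `scan_glusterd_log` on each peer and comparing timestamps should show whether the ping timeout preceded the failed commit, as it does in the excerpt above.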


Actual results:

Expected results:

Additional info:

Comment 2 Joseph Elwin Fernandes 2014-07-21 12:54:41 UTC
Could you please attach the SOS report?

Comment 5 Avra Sengupta 2016-01-29 12:58:45 UTC
Won't fix in RHGS 3.0. Works fine in RHGS 3.1.1.
