Description of problem:
======================
A few snapshot creation failures are seen with "quorum not met" and "brick ops failed" messages when multiple file/directory creations are in progress from FUSE and NFS mounts.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.6.0.11-1.el6rhs.x86_64

How reproducible:
================
1/1

Steps to Reproduce:
==================
1. Set up a cluster of 4 servers (server1, server2, server3, server4).
2. Create four volumes from these servers (vol0, vol1, vol2, vol3).
3. Mount the volumes on a client (FUSE and NFS mounts).
4. Create directories named f and n from each FUSE mount of the volumes.
5. cd into f on all FUSE mounts of the volumes.
6. cd into n on all NFS mounts of the volumes.
7. Start heavy IO from the FUSE (f) mount and NFS (n) mount of every volume:

   for i in {1..50} ; do cp -rvf /etc etc.$i ; done

8. While IO is in progress, create snapshots on all volumes from different nodes:

   for i in {1..256} ; do gluster snapshot create snap_vol0_$i vol0 ; done
   for i in {1..256} ; do gluster snapshot create snap_vol1_$i vol1 ; done
   for i in {1..256} ; do gluster snapshot create snap_vol2_$i vol2 ; done
   for i in {1..256} ; do gluster snapshot create snap_vol3_$i vol3 ; done

Initially a few snapshots were not created because the snapshot create crossed the 2-minute CLI timeout.

Then one snapshot creation failed with a "quorum is not met" error message. On snapshot13:

snapshot create: success: Snap snap_vol0_41 created successfully
snapshot create: failed: quorum is not met
Snapshot command failed

Checked gluster volume info.

Then snapshot creation failed with a "brick ops failed" error message:

snapshot create: success: Snap snap_vol0_40 created successfully
snapshot create: success: Snap snap_vol0_41 created successfully
snapshot create: failed: quorum is not met
Snapshot command failed
snapshot create: failed: Another transaction is in progress. Please try again after sometime.
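The step-8 loops above can be driven from a small helper that only prints the snapshot-create commands, so each node can pipe its own volume's subset to a shell. This is a minimal sketch, not part of the gluster CLI: the gen_snap_cmds function is hypothetical, and volume names/counts are taken from the steps above.

```shell
#!/bin/sh
# Hedged sketch of step 8: emit the "gluster snapshot create" commands for
# one volume without executing them. gen_snap_cmds is a hypothetical helper.
gen_snap_cmds() {
    vol=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "gluster snapshot create snap_${vol}_${i} ${vol}"
        i=$((i + 1))
    done
}

# On each node, run the loop for that node's volume, e.g.:
#   gen_snap_cmds vol0 256 | sh
gen_snap_cmds vol0 3
```

Printing the commands first also makes it easy to spot name collisions (two loops reusing the same snap prefix) before they hit the cluster.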
Snapshot command failed
snapshot create: failed: Brick ops failed on snapshot14.lab.eng.blr.redhat.com. Please check log file for details.
Brick ops failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Brick ops failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed

There are also many brick disconnect messages in the log:

[2014-06-03 09:57:17.220019] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-03 09:57:18.009250] I [MSGID: 106005] [glusterd-handler.c:4126:__glusterd_brick_rpc_notify] 0-management: Brick snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/550f650254c84564b8546a9905644493/brick1/b3 has disconnected from glusterd.

-------------Part of log messages----------------------------
snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/d5907135f6524917bcabeb4d69d9ea33/brick1/b0 has disconnected from glusterd.
[2014-06-03 09:55:28.962797] W [glusterd-utils.c:1558:glusterd_snap_volinfo_find] 0-management: Snap volume 7bb6a9a005814aa5868a4322b586414b.snapshot13.lab.eng.blr.redhat.com.var-run-gluster-snaps-7bb6a9a005814aa5868a4322b586414b-brick1-b2 not found
[2014-06-03 09:55:28.963226] W [glusterd-utils.c:1558:glusterd_snap_volinfo_find] 0-management: Snap volume d5907135f6524917bcabeb4d69d9ea33.snapshot13.lab.eng.blr.redhat.com.var-run-gluster-snaps-d5907135f6524917bcabeb4d69d9ea33-brick1-b0 not found
[2014-06-03 09:55:29.114718] E [glusterd-utils.c:12489:glusterd_volume_quorum_check] 0-management: quorum is not met
[2014-06-03 09:55:29.120722] W [glusterd-utils.c:12715:glusterd_snap_quorum_check_for_create] 0-management: volume d5907135f6524917bcabeb4d69d9ea33 is not in quorum
[2014-06-03 09:55:29.120749] W [glusterd-utils.c:12754:glusterd_snap_quorum_check] 0-management: Quorum check failed during snapshot create command
[2014-06-03 09:55:29.120766] W [glusterd-mgmt.c:1928:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-03 09:55:29.121124] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.11/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7f9564e2f298] (-->/usr/lib64/glusterfs/3.6.0.11/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7f9564e2f155] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3564e0d633]))) 0-rpc_transport: invalid argument: this
-------------------------------------------------------------

Actual results:

Expected results:

Additional info:
sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/1104191/
Version: glusterfs 3.6.0.22

Went through the logs and found that the barrier timed out because the operation took more than 2 minutes while parallel snapshot creation was in progress. Modifying this bug to track the issue where a snapshot create might take more than 2 minutes when parallel snapshot creates and heavy IO are in progress at the same time. Tried the same case on physical machines without any failure. This needs to be documented.
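Since the failure mode tracked here is a snapshot create crossing the 2-minute CLI timeout, one way to gather evidence for the documentation is to time each create and flag those near the limit. A minimal sketch, assuming the default 120-second timeout discussed above; timed_snap_create and near_timeout are hypothetical helpers, not gluster commands:

```shell
#!/bin/sh
# Hedged sketch: time each "gluster snapshot create" and warn when the
# elapsed time reaches the default 2-minute (120 s) CLI timeout.
CLI_TIMEOUT=120

near_timeout() {
    # succeeds (exit 0) when elapsed seconds meet or exceed the timeout
    [ "$1" -ge "$CLI_TIMEOUT" ]
}

timed_snap_create() {
    vol=$1
    snap=$2
    start=$(date +%s)
    gluster snapshot create "$snap" "$vol"
    elapsed=$(( $(date +%s) - start ))
    echo "$snap: ${elapsed}s"
    if near_timeout "$elapsed"; then
        echo "WARNING: $snap reached the ${CLI_TIMEOUT}s CLI timeout window"
    fi
}

# Example (against a real cluster):
#   timed_snap_create vol0 snap_vol0_1
```

Running this while parallel snapshot creates and heavy IO are in progress would show how close individual creates come to the timeout on a given setup.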
Please review and sign off on the edited doc text.
There is no end to increasing the CLI timeout: as the number of nodes increases, the time taken grows exponentially. The current Gluster architecture does not support implementing this feature, so this feature request is deferred until GlusterD 2.0.