Bug 1139709

Summary: [SNAPSHOT] : Attaching another node to the cluster which has a lot of snapshots times out
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: senaik
Component: snapshot
Assignee: Avra Sengupta <asengupt>
Status: CLOSED DEFERRED
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.0
CC: asengupt, nsathyan, rhs-bugs, ssaha, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: SNAPSHOT
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1179664 (view as bug list)
Environment:
Last Closed: 2016-01-29 13:49:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1179664

Description senaik 2014-09-09 13:38:38 UTC
Description of problem:
=======================
In a scenario where many snapshots have been taken on a volume, attaching another node to the cluster takes a long time (and times out), because all the snapshots have to be copied to the newly attached node. The user has to wait until all the snapshots are copied before gluster peer status shows the newly added node in the 'Peer in Cluster' state.

Also, as the cluster is scaled up, peer probe times out and returns with errno -1 as the frame times out, and the newly added node remains in the state 'Sent and Received peer request (Connected)'.

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.28

How reproducible:
================
always


Steps to Reproduce:
==================
Setup: 12-node cluster, 6x2 distributed-replicate volume

1. Fuse and NFS mount the volume and run some IO
2. Create ~170 snapshots
3. Attach another node to the cluster

[root@dhcp-8-29-222 ~]# time gluster peer probe 10.8.30.26

real	2m0.095s
user	0m0.080s
sys	0m0.030s

gluster peer status shows the state of the node as 'Sent and Received peer request' until all the snapshots are copied

Hostname: 10.8.30.26
Uuid: 77d79c8d-c1b1-41a6-870b-3c51755cc285
State: Sent and Received peer request (Connected)
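For scripted monitoring, the 'State:' line can be pulled out of the status output; a minimal sketch, assuming the gluster peer status output format shown above (the peer_state helper is hypothetical, not part of the gluster CLI):

```shell
# Hypothetical helper: extract the "State:" line from `gluster peer status`
# output passed in as a single string argument.
peer_state() {
    printf '%s\n' "$1" | sed -n 's/^State: \(.*\)$/\1/p'
}

# Example use on a live cluster (not run here):
#   status=$(gluster peer status)
#   peer_state "$status"
```

A probe script could loop on this helper until the state becomes 'Peer in Cluster (Connected)'.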


[root@dhcp-8-30-26 ~]# less /var/lib/glusterd/snaps/ | wc -l
153
[root@dhcp-8-30-26 ~]# less /var/lib/glusterd/snaps/ | wc -l
153
[root@dhcp-8-30-26 ~]# less /var/lib/glusterd/snaps/ | wc -l
172

After all the snapshots are copied, the newly added node shows its state as 'Peer in Cluster (Connected)'
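The copy progress observed above can be polled by counting entries under /var/lib/glusterd/snaps; a minimal sketch (ls suits scripted counting better than less; the directory is a parameter here so the function can be exercised anywhere, the path on the probed node being an assumption from this report):

```shell
# Count the immediate entries in a snaps directory; on the probed node this
# would be /var/lib/glusterd/snaps.
count_snaps() {
    ls -1 "$1" 2>/dev/null | wc -l
}

# Poll until the count stops growing, e.g.:
#   while :; do count_snaps /var/lib/glusterd/snaps; sleep 5; done
```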

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Also, as the cluster is scaled up with more nodes, peer probe times out and returns with errno -1 (even though all the snapshots are copied to the newly added node) because the frame times out. gluster peer status then shows the state of the node as 'Sent and Received peer request (Connected)' until the node is detached and attached again.

-------------------- Part of .cmd_log_history --------------------

[2014-09-09 11:07:14.704437]  : peer probe 10.8.30.29 : FAILED : Probe returned with unknown errno -1
[2014-09-09 11:09:20.730677]  : peer probe 10.8.30.30 : FAILED : Probe returned with unknown errno -1

-------------------------------------------------------------------

Actual results:
==============
Attaching another node to a cluster which has many snapshots takes a long time,
and as the cluster is scaled up, the probe times out and returns with errno -1 as the frame times out.

Expected results:
================
As the cluster is scaled up, peer probe should not take long and should complete successfully without timing out


Additional info:

Comment 6 Avra Sengupta 2016-01-29 13:49:31 UTC
The current Gluster architecture does not support implementation of this feature. Therefore this feature request is deferred until GlusterD 2.0.