Bug 1139709 - [SNAPSHOT] : Attaching another node to the cluster which has a lot of snapshots times out
Summary: [SNAPSHOT] : Attaching another node to the cluster which has a lot of snapshots times out
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: snapshot
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Avra Sengupta
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: SNAPSHOT
Depends On:
Blocks: 1179664
 
Reported: 2014-09-09 13:38 UTC by senaik
Modified: 2016-09-17 12:52 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1179664
Environment:
Last Closed: 2016-01-29 13:49:31 UTC
Embargoed:



Description senaik 2014-09-09 13:38:38 UTC
Description of problem:
=======================
When a volume has many snapshots and another node is attached to the cluster, the peer probe takes a long time (and times out), because all the snapshots have to be copied to the node being attached. The user has to wait until all the snapshots are copied before gluster peer status shows the newly added node in the 'Peer in Cluster' state.

Also, as the cluster is scaled up, peer probe times out and returns with errno -1 as the frame times out, and the newly added node remains in the state 'Sent and Received peer request (Connected)'.
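
One way to observe the sync (a sketch, not commands taken verbatim from this report): the snapshot configuration is replicated under /var/lib/glusterd/snaps/ on the probed node, so its entry count can be compared against an existing cluster node:

# Run on an existing node and on the newly probed node; the counts converge
# once all snapshot configurations have been copied over
ls /var/lib/glusterd/snaps/ | wc -l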

Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.6.0.28

How reproducible:
================
always


Steps to Reproduce:
==================
12-node cluster
6x2 distributed-replicate volume

1. FUSE and NFS mount the volume and generate some IO
2. Create ~170 snapshots (see the sketch below)
3. Attach another node to the cluster
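
A minimal sketch of step 2 (the volume name 'vol0' and the snapshot names are illustrative assumptions, not taken from the report):

# Create ~170 snapshots of the volume in a loop
for i in $(seq 1 170); do
    gluster snapshot create snap_$i vol0
done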

[root@dhcp-8-29-222 ~]# time gluster peer probe 10.8.30.26

real	2m0.095s
user	0m0.080s
sys	0m0.030s

gluster peer status shows the state of the node as 'Sent and Received peer request' until all the snapshots are copied:

Hostname: 10.8.30.26
Uuid: 77d79c8d-c1b1-41a6-870b-3c51755cc285
State: Sent and Received peer request (Connected)


[root@dhcp-8-30-26 ~]# less /var/lib/glusterd/snaps/ | wc -l
153
[root@dhcp-8-30-26 ~]# less /var/lib/glusterd/snaps/ | wc -l
153
[root@dhcp-8-30-26 ~]# less /var/lib/glusterd/snaps/ | wc -l
172

After all the snapshots are copied, the newly added node shows its state as 'Peer in Cluster (Connected)'
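
A small sketch for waiting on that transition (the hostname is taken from the output above; the polling interval is arbitrary):

# Poll gluster peer status until the probed node reports 'Peer in Cluster'
until gluster peer status | grep -A 2 '10.8.30.26' | grep -q 'Peer in Cluster'; do
    sleep 10
done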

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Also, as the cluster is scaled up with more nodes, peer probe times out and returns with errno -1 (even though all the snapshots are copied to the newly added node) as the frame times out, and gluster peer status shows the state of the node as 'Sent and Received peer request (Connected)' until the node is detached and attached again.

-------------------- Part of .cmd_log_history --------------------

[2014-09-09 11:07:14.704437]  : peer probe 10.8.30.29 : FAILED : Probe returned with unknown errno -1
[2014-09-09 11:09:20.730677]  : peer probe 10.8.30.30 : FAILED : Probe returned with unknown errno -1

-------------------------------------------------------------------
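
The workaround implied above is to detach the stuck peer and probe it again; a sketch using one of the IPs from the log (force may be required depending on the peer's state):

# Detach the peer stuck in 'Sent and Received peer request' and re-probe it
gluster peer detach 10.8.30.29
gluster peer probe 10.8.30.29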

Actual results:
==============
Attaching another node to a cluster which has many snapshots takes a long time, and as the cluster is scaled up the peer probe times out and returns with errno -1 as the frame times out.

Expected results:
================
As the cluster is scaled up, peer probe should not take long and should complete successfully without timing out.


Additional info:

Comment 6 Avra Sengupta 2016-01-29 13:49:31 UTC
The current Gluster architecture does not support implementation of this feature. Therefore this feature request is deferred until GlusterD 2.0.

