Description of problem:
-----------------------
When the primary slave node used in the Geo-rep command goes down, Geo-rep fails to fetch information about the other slave nodes and fails to start Geo-replication. If Geo-rep is already started and the primary slave node goes down, that worker remains Faulty since it is unable to get the other nodes' information.

Solution:
---------
Save the slave host details in the config file. When a worker goes Faulty, it tries to get the volume status using --remote-host; use this pool of hosts for the remote host. Cache the slave nodes/cluster info in the config file as slave_nodes. If the primary slave node is not available, use another available node.

Pseudo code:
------------
Two new config items: prev_main_node, slave_nodes

1. If prev_main_node is not in CONFIG, set prev_main_node to the slave node passed in the Geo-rep command.
2. Try to get the slave volinfo using prev_main_node.
3. If that fails, try to get the slave volinfo from the node specified in the Geo-rep command (if the node specified in the Geo-rep command != prev_main_node).
4. If that fails, check whether `slave_nodes` is available in CONFIG.
5. If not available, FAIL.
6. If available, try to get the slave volinfo using any one remote host except the previously failed ones.
7. If volinfo is available, match the slave volume UUID against the result to make sure it is the same slave volume.
8. If the volinfo is valid, return it, update prev_main_node in the config file, and re-update `slave_nodes`.
9. If the volinfo is invalid, FAIL.
10. If volinfo is not available (from step 7) from any node, FAIL.
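The fallback order above can be sketched in Python (the language of the geo-rep daemon). This is a minimal illustration, not the actual patch: the names `get_slave_volinfo`, `fetch_volinfo`, `config`, and `expected_uuid` are hypothetical stand-ins for the real config store and the `gluster --remote-host=<node> volume info` call.

```python
def get_slave_volinfo(cmd_node, expected_uuid, config, fetch_volinfo):
    """Sketch of the pseudo code above: try prev_main_node, then the node
    from the Geo-rep command, then any host cached in slave_nodes.

    fetch_volinfo(node) is a hypothetical helper standing in for
    `gluster --remote-host=<node> volume info`; it returns a dict
    like {"uuid": ..., "nodes": [...]} or None if the node is down.
    """
    # Step 1: default prev_main_node to the node from the Geo-rep command
    prev = config.setdefault("prev_main_node", cmd_node)
    tried = []

    # Steps 2-3: prev_main_node first, then the command node if different
    candidates = [prev] + ([cmd_node] if cmd_node != prev else [])
    # Steps 4-6: fall back to the cached slave_nodes pool,
    # skipping hosts already in the candidate list
    candidates += [n for n in config.get("slave_nodes", [])
                   if n not in candidates]

    for node in candidates:
        tried.append(node)
        volinfo = fetch_volinfo(node)
        if volinfo is None:
            continue                    # node down, try the next one
        # Step 7: verify it is the same slave volume
        if volinfo["uuid"] != expected_uuid:
            # Step 9: invalid volinfo, FAIL
            raise ValueError("Slave volume UUID mismatch from %s" % node)
        # Step 8: remember the working node and refresh the cached pool
        config["prev_main_node"] = node
        config["slave_nodes"] = volinfo["nodes"]
        return volinfo

    # Steps 5/10: no configured node could provide volinfo, FAIL
    raise RuntimeError("Unable to get slave volinfo; tried: %s" % tried)


# Example: primary slave node "s1" is down, the cached pool still lists "s2"
config = {"slave_nodes": ["s1", "s2"]}
info = {"uuid": "abc", "nodes": ["s2", "s3"]}
fetch = lambda node: info if node == "s2" else None
result = get_slave_volinfo("s1", "abc", config, fetch)
```

After this run, `prev_main_node` is updated to the host that answered ("s2") and `slave_nodes` is refreshed from the returned volinfo, so the next Faulty restart tries the known-good host first.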
Migrated to GitHub: https://github.com/gluster/glusterfs/issues/567 Please follow the GitHub issue for further updates on this bug.