1441675 – adding node to cns may fail if one of the existing node is down

Summary: adding node to cns may fail if one of the existing node is down

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	heketi
Sub Component:
Version:	cns-3.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	CNS 3.6
Assignee:	Mohamed Ashiq
QA Contact:	Apeksha
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1415606 1445447

Reported:	2017-04-12 12:39 UTC by krishnaram Karthick
Modified:	2021-03-11 15:08 UTC
CC List:	13 users
Fixed In Version:	heketi-5.0.0-7 rhgs-volmanager-docker-5.0.0-9
Doc Type:	Bug Fix
Doc Text:	Prior to this update, heketi performed the 'gluster peer probe' operation only from the first node in the trusted pool, so adding a new node failed if that first node was not reachable. With this fix, heketi tries the 'gluster peer probe' operation from the next online node if the first node in the trusted pool is not reachable.
Clone Of:
Environment:
Last Closed:	2017-10-11 07:07:22 UTC
Embargoed:
Dependent Products:

Attachments
topologyinfo (3.01 KB, text/plain) 2017-04-12 12:55 UTC, krishnaram Karthick
heketi_logs (4.04 KB, text/plain) 2017-04-12 12:56 UTC, krishnaram Karthick

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2017:2879	0	normal	SHIPPED_LIVE	heketi bug fix and enhancement update	2017-10-11 11:07:06 UTC

Description krishnaram Karthick 2017-04-12 12:39:56 UTC

Description of problem:
When a CNS cluster has three nodes (N1, N2 and N3) and N1 is down, adding a new node N4 to the cluster can fail because heketi tries to run the peer probe from N1, which is the first node in the list.

# heketi-cli topology load -j=topology.json
	Found node dhcp47-106.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd
	Creating node dhcp47-81.lab.eng.blr.redhat.com ... Unable to create node: Unable to execute command on glusterfs-d3qp1:
	Found node dhcp47-74.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd
	Found node dhcp47-82.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd

# oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP             NODE
glusterfs-6gsv2                  1/1       Running   1          36m       10.70.47.74    dhcp47-74.lab.eng.blr.redhat.com
glusterfs-d3qp1                  1/1       Running   1          36m       10.70.47.106   dhcp47-106.lab.eng.blr.redhat.com
glusterfs-hwxn0                  1/1       Running   1          36m       10.70.47.82    dhcp47-82.lab.eng.blr.redhat.com
glusterfs-l0g6c                  1/1       Running   1          36m       10.70.47.81    dhcp47-81.lab.eng.blr.redhat.com
heketi-1-z0q5n                   1/1       Running   3          29m       10.130.0.7     dhcp47-82.lab.eng.blr.redhat.com
storage-project-router-1-46m1r   1/1       Running   1          48m       10.70.47.74    dhcp47-74.lab.eng.blr.redhat.com


Version-Release number of selected component (if applicable):
heketi-client-4.0.0-6.el7rhgs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Have three nodes in a CNS setup (Node1 to Node3).
2. Bring down the first node listed in the topology file.
3. Add a new node (Node4) to replace Node1.
4. Update the topology file and try to load it.

Actual results:
Adding the new node fails because heketi tries to run the peer probe from Node1.

Expected results:
When Node1 is down, heketi should try to run the peer probe from Node2.
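
The expected fallback can be illustrated with a minimal sketch. This is not heketi's code (heketi is written in Go, and its actual fix is not reproduced here); the function name probe_from_first_reachable, the ProbeError class, and the probe callable are all hypothetical. The probe callable stands in for running 'gluster peer probe' for the new node on an existing node and is assumed to raise an exception on failure.

```python
from typing import Callable, Sequence


class ProbeError(Exception):
    """Raised when the peer probe failed from every existing node (hypothetical)."""


def probe_from_first_reachable(
    existing_nodes: Sequence[str],
    new_node: str,
    probe: Callable[[str, str], None],
) -> str:
    """Try the probe from each existing node in order.

    `probe(from_node, new_node)` is an assumed callable that stands in for
    running `gluster peer probe <new_node>` on `from_node` and raises on
    failure. Returns the name of the node from which the probe succeeded.
    """
    failures = []
    for node in existing_nodes:
        try:
            probe(node, new_node)
        except Exception as exc:
            # Record the failure and fall through to the next node
            # instead of aborting on the first unreachable one.
            failures.append(f"{node}: {exc}")
            continue
        return node
    raise ProbeError("peer probe failed from every node: " + "; ".join(failures))


if __name__ == "__main__":
    down = {"Node1"}  # simulate the first node being unreachable

    def fake_probe(from_node: str, new_node: str) -> None:
        if from_node in down:
            raise ConnectionError("unreachable")

    # With Node1 down, the probe is retried from Node2.
    print(probe_from_first_reachable(["Node1", "Node2", "Node3"], "Node4", fake_probe))
```

In this sketch the failure is reported only after every existing node has been tried, so a single unreachable node does not block adding Node4.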

Additional info:

Comment 3 krishnaram Karthick 2017-04-12 12:55:49 UTC

Created attachment 1271135
topologyinfo

Comment 4 krishnaram Karthick 2017-04-12 12:56:23 UTC

Created attachment 1271136
heketi_logs

Comment 5 Divya 2017-04-18 08:29:11 UTC

Raghavendra Talur,

Could you please add the Known Issues text for this bug?

Regards,
Divya

Comment 6 Raghavendra Talur 2017-04-18 09:29:44 UTC

I have provided the doc text for the known issue. We don't know of a workaround, so that part is skipped.

Comment 9 Mohamed Ashiq 2017-07-27 08:57:03 UTC

Upstream patch:

https://github.com/heketi/heketi/pull/819

Unit tests are pending; I will add them soon.

Comment 10 Mohamed Ashiq 2017-08-01 12:28:34 UTC

(In reply to Mohamed Ashiq from comment #9)
> Upstream patch:
> 
> https://github.com/heketi/heketi/pull/819
> 
> Unit test pending 
> 
> Will add them soon.

Patch merged upstream.

Comment 11 Apeksha 2017-09-14 06:47:33 UTC

Adding a new node while one of the nodes is down now works fine with builds cns-deploy-5.0.0-37.el7rhgs.x86_64 and heketi-client-5.0.0-11.el7rhgs.x86_64.

[root@dhcp46-156 ~]# oc get nodes
NAME                                STATUS                     AGE       VERSION
dhcp46-14.lab.eng.blr.redhat.com    Ready                      17h       v1.6.1+5115d708d7
dhcp46-223.lab.eng.blr.redhat.com   NotReady                   17h       v1.6.1+5115d708d7
dhcp47-127.lab.eng.blr.redhat.com   Ready                      17h       v1.6.1+5115d708d7
dhcp47-169.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   17h       v1.6.1+5115d708d7
dhcp47-184.lab.eng.blr.redhat.com   Ready                      17h       v1.6.1+5115d708d7

[root@dhcp46-156 ~]# heketi-cli node add --zone=1 --cluster=dbdaa75a2da75a7bd6a5fe368725d92c --management-host-name=dhcp47-127.lab.eng.blr.redhat.com --storage-host-name=10.70.47.127
Node information:
Id: 4e2e1425c5cf7fbdd30fd2d6c55a6087
State: online
Cluster Id: dbdaa75a2da75a7bd6a5fe368725d92c
Zone: 1
Management Hostname dhcp47-127.lab.eng.blr.redhat.com
Storage Hostname 10.70.47.127

Comment 14 Raghavendra Talur 2017-10-04 08:47:02 UTC

I changed the type from known issue to bug fix. Please recheck the doc text.

Comment 15 errata-xmlrpc 2017-10-11 07:07:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2879
