Bug 1441675

Summary: Adding a node to CNS may fail if one of the existing nodes is down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: heketi
Assignee: Mohamed Ashiq <mliyazud>
Status: CLOSED ERRATA
QA Contact: Apeksha <akhakhar>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.5
CC: asriram, divya, fcami, hchiramm, madam, mliyazud, pprakash, rcyriac, rhs-bugs, rtalur, srmukher, storage-qa-internal, vinug
Target Milestone: ---   
Target Release: CNS 3.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: heketi-5.0.0-7 rhgs-volmanager-docker-5.0.0-9
Doc Type: Bug Fix
Doc Text:
Previously, heketi performed the 'gluster peer probe' operation only from the first node in the trusted pool, so adding a new node failed if that node was unreachable. With this fix, if the first node in the trusted pool is unreachable, the 'gluster peer probe' operation is attempted from the next online node.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-11 07:07:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1415606, 1445447    
Attachments:
Description    Flags
topologyinfo   none
heketi_logs    none

Description krishnaram Karthick 2017-04-12 12:39:56 UTC
Description of problem:
When a CNS cluster has three nodes (N1, N2, and N3) and N1 is down, adding a new node N4 to the cluster may fail because heketi tries to perform the peer probe from N1, which is the first node in the list.

# heketi-cli topology load -j=topology.json
	Found node dhcp47-106.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd
	Creating node dhcp47-81.lab.eng.blr.redhat.com ... Unable to create node: Unable to execute command on glusterfs-d3qp1:
	Found node dhcp47-74.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd
	Found node dhcp47-82.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd

# oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP             NODE
glusterfs-6gsv2                  1/1       Running   1          36m       10.70.47.74    dhcp47-74.lab.eng.blr.redhat.com
glusterfs-d3qp1                  1/1       Running   1          36m       10.70.47.106   dhcp47-106.lab.eng.blr.redhat.com
glusterfs-hwxn0                  1/1       Running   1          36m       10.70.47.82    dhcp47-82.lab.eng.blr.redhat.com
glusterfs-l0g6c                  1/1       Running   1          36m       10.70.47.81    dhcp47-81.lab.eng.blr.redhat.com
heketi-1-z0q5n                   1/1       Running   3          29m       10.130.0.7     dhcp47-82.lab.eng.blr.redhat.com
storage-project-router-1-46m1r   1/1       Running   1          48m       10.70.47.74    dhcp47-74.lab.eng.blr.redhat.com


Version-Release number of selected component (if applicable):
heketi-client-4.0.0-6.el7rhgs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Have 3 nodes in a CNS setup - Node{1..3}
2. Bring down the first node listed in the topology file
3. Add a new node (Node4) in order to replace Node1
4. Update the topology file and try to load it

Actual results:
Adding the new node fails because heketi tries to do a peer probe from Node1.

Expected results:
When Node1 is down, heketi should try to peer probe from Node2.

Additional info:

Comment 3 krishnaram Karthick 2017-04-12 12:55:49 UTC
Created attachment 1271135 [details]
topologyinfo

Comment 4 krishnaram Karthick 2017-04-12 12:56:23 UTC
Created attachment 1271136 [details]
heketi_logs

Comment 5 Divya 2017-04-18 08:29:11 UTC
Raghavendra Talur,

Could you please add the Known Issues text for this bug?

Regards,
Divya

Comment 6 Raghavendra Talur 2017-04-18 09:29:44 UTC
I have provided the doc text for the known issue. We don't know of a workaround, hence that part is skipped.

Comment 9 Mohamed Ashiq 2017-07-27 08:57:03 UTC
Upstream patch:

https://github.com/heketi/heketi/pull/819

Unit tests are pending; I will add them soon.

Comment 10 Mohamed Ashiq 2017-08-01 12:28:34 UTC
(In reply to Mohamed Ashiq from comment #9)
> Upstream patch:
> 
> https://github.com/heketi/heketi/pull/819
> 
> Unit test pending 
> 
> Will add them soon.

Patch merged upstream.

Comment 11 Apeksha 2017-09-14 06:47:33 UTC
Adding a new node when one of the nodes is down now works. Verified on build: cns-deploy-5.0.0-37.el7rhgs.x86_64, heketi-client-5.0.0-11.el7rhgs.x86_64

[root@dhcp46-156 ~]# oc get nodes
NAME                                STATUS                     AGE       VERSION
dhcp46-14.lab.eng.blr.redhat.com    Ready                      17h       v1.6.1+5115d708d7
dhcp46-223.lab.eng.blr.redhat.com   NotReady                   17h       v1.6.1+5115d708d7
dhcp47-127.lab.eng.blr.redhat.com   Ready                      17h       v1.6.1+5115d708d7
dhcp47-169.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   17h       v1.6.1+5115d708d7
dhcp47-184.lab.eng.blr.redhat.com   Ready                      17h       v1.6.1+5115d708d7

[root@dhcp46-156 ~]# heketi-cli node add --zone=1 --cluster=dbdaa75a2da75a7bd6a5fe368725d92c --management-host-name=dhcp47-127.lab.eng.blr.redhat.com --storage-host-name=10.70.47.127
Node information:
Id: 4e2e1425c5cf7fbdd30fd2d6c55a6087
State: online
Cluster Id: dbdaa75a2da75a7bd6a5fe368725d92c
Zone: 1
Management Hostname dhcp47-127.lab.eng.blr.redhat.com
Storage Hostname 10.70.47.127

Comment 14 Raghavendra Talur 2017-10-04 08:47:02 UTC
I changed the type from known issue to bug fix. Please recheck the doc text.

Comment 15 errata-xmlrpc 2017-10-11 07:07:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2879