Bug 1441675

Summary:

Adding a node to CNS may fail if one of the existing nodes is down

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

krishnaram Karthick <kramdoss>

Component:

heketi

Assignee:

Mohamed Ashiq <mliyazud>

Status:

CLOSED ERRATA

QA Contact:

Apeksha <akhakhar>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

cns-3.5

CC:

asriram, divya, fcami, hchiramm, madam, mliyazud, pprakash, rcyriac, rhs-bugs, rtalur, srmukher, storage-qa-internal, vinug

Target Milestone:

---

Target Release:

CNS 3.6

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

heketi-5.0.0-7 rhgs-volmanager-docker-5.0.0-9

Doc Type:

Bug Fix

Doc Text:

Prior to this update, heketi performed the 'gluster peer probe' operation only from the first node in the trusted pool. Hence, adding a new node failed if the first node of the pool was not reachable. With this fix, if the first node in the trusted pool is not reachable, heketi attempts the 'gluster peer probe' operation from the next online node.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2017-10-11 07:07:22 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1415606, 1445447

Attachments:

Description	Flags
topologyinfo	none
heketi_logs	none

Description krishnaram Karthick 2017-04-12 12:39:56 UTC

Description of problem:
When CNS has three nodes (N1, N2 and N3) and N1 is down, adding a new node N4 to the cluster might fail because heketi tries to perform the peer probe from N1, which is the first node in the list.

# heketi-cli topology load -j=topology.json
	Found node dhcp47-106.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd
	Creating node dhcp47-81.lab.eng.blr.redhat.com ... Unable to create node: Unable to execute command on glusterfs-d3qp1:
	Found node dhcp47-74.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd
	Found node dhcp47-82.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
		Found device /dev/sdd

# oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP             NODE
glusterfs-6gsv2                  1/1       Running   1          36m       10.70.47.74    dhcp47-74.lab.eng.blr.redhat.com
glusterfs-d3qp1                  1/1       Running   1          36m       10.70.47.106   dhcp47-106.lab.eng.blr.redhat.com
glusterfs-hwxn0                  1/1       Running   1          36m       10.70.47.82    dhcp47-82.lab.eng.blr.redhat.com
glusterfs-l0g6c                  1/1       Running   1          36m       10.70.47.81    dhcp47-81.lab.eng.blr.redhat.com
heketi-1-z0q5n                   1/1       Running   3          29m       10.130.0.7     dhcp47-82.lab.eng.blr.redhat.com
storage-project-router-1-46m1r   1/1       Running   1          48m       10.70.47.74    dhcp47-74.lab.eng.blr.redhat.com


Version-Release number of selected component (if applicable):
heketi-client-4.0.0-6.el7rhgs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Have 3 nodes in a CNS setup: Node{1..3}
2. Bring down the first node listed in the topology file
3. Add a new node (Node4) to replace Node1
4. Update the topology file and try to load it
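
Step 4 could be scripted roughly as follows. This is only a sketch, assuming the usual heketi topology layout (clusters, then nodes, each with node.hostnames.manage, node.hostnames.storage, zone and devices). The helper name add_node is hypothetical, and the zone value and the device for the new node are assumptions. The hostnames and IPs are taken from the output pasted above.

```python
import json


def add_node(topology, manage_host, storage_ip, zone, devices, cluster_index=0):
    """Return a copy of the topology with one extra node appended (hypothetical helper)."""
    updated = json.loads(json.dumps(topology))  # deep copy via a JSON round trip
    updated["clusters"][cluster_index]["nodes"].append({
        "node": {
            "hostnames": {"manage": [manage_host], "storage": [storage_ip]},
            "zone": zone,
        },
        "devices": list(devices),
    })
    return updated


# Existing first node from the report; the zone value is an assumption.
topology = {"clusters": [{"nodes": [{
    "node": {
        "hostnames": {
            "manage": ["dhcp47-106.lab.eng.blr.redhat.com"],
            "storage": ["10.70.47.106"],
        },
        "zone": 1,
    },
    "devices": ["/dev/sdd"],
}]}]}

# New node to be added; the device path is assumed to match the other nodes.
new_topology = add_node(
    topology,
    "dhcp47-81.lab.eng.blr.redhat.com",
    "10.70.47.81",
    zone=1,
    devices=["/dev/sdd"],
)

print(json.dumps(new_topology, indent=2))
```

The printed JSON would be saved as topology.json and loaded with the same `heketi-cli topology load -j=topology.json` command shown above.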

Actual results:
Adding the new node fails because heketi tries to run the peer probe from Node-1, which is down.

Expected results:
When Node-1 is down, heketi should try the peer probe from Node-2.
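
The expected behaviour can be illustrated with a small simulation. This is a sketch of the fallback idea only, not heketi's code. The names peer_probe_with_fallback, ProbeError and fake_probe are hypothetical, and a ConnectionError stands in for "node not reachable".

```python
class ProbeError(Exception):
    """Raised when no existing node could run the peer probe."""


def peer_probe_with_fallback(existing_nodes, new_node, run_probe):
    """Try the probe from each existing node in order; return the node that succeeded.

    run_probe(from_node, new_node) is a hypothetical callable that raises
    ConnectionError when from_node cannot be reached.
    """
    failures = []
    for node in existing_nodes:
        try:
            run_probe(node, new_node)
            return node
        except ConnectionError as err:
            # Remember the failure and move on to the next node in the pool.
            failures.append(f"{node}: {err}")
    raise ProbeError("peer probe failed from every node: " + "; ".join(failures))


# Simulated cluster: Node-1 is down, so the probe should fall through to Node-2.
down = {"Node-1"}


def fake_probe(from_node, new_node):
    if from_node in down:
        raise ConnectionError("unreachable")


used = peer_probe_with_fallback(["Node-1", "Node-2", "Node-3"], "Node-4", fake_probe)
print(used)  # prints Node-2
```

With only the first node tried, the same scenario fails as soon as Node-1 is unreachable, which matches the actual results above.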

Additional info:

Comment 3 krishnaram Karthick 2017-04-12 12:55:49 UTC

Created attachment 1271135 [details]
topologyinfo

Comment 4 krishnaram Karthick 2017-04-12 12:56:23 UTC

Created attachment 1271136 [details]
heketi_logs

Comment 5 Divya 2017-04-18 08:29:11 UTC

Raghavendra Talur,

Could you please add the Known Issues text for this bug?

Regards,
Divya

Comment 6 Raghavendra Talur 2017-04-18 09:29:44 UTC

I have provided the doc text for the known issue. We don't know of a workaround, hence that part is skipped.

Comment 9 Mohamed Ashiq 2017-07-27 08:57:03 UTC

Upstream patch:

https://github.com/heketi/heketi/pull/819

Unit tests are pending.

Will add them soon.

Comment 10 Mohamed Ashiq 2017-08-01 12:28:34 UTC

(In reply to Mohamed Ashiq from comment #9)
> Upstream patch:
> 
> https://github.com/heketi/heketi/pull/819
> 
> Unit test pending 
> 
> Will add them soon.

Patch merged upstream.

Comment 11 Apeksha 2017-09-14 06:47:33 UTC

Adding a new node when one of the nodes is down now works fine with builds cns-deploy-5.0.0-37.el7rhgs.x86_64 and heketi-client-5.0.0-11.el7rhgs.x86_64.

[root@dhcp46-156 ~]# oc get nodes
NAME                                STATUS                     AGE       VERSION
dhcp46-14.lab.eng.blr.redhat.com    Ready                      17h       v1.6.1+5115d708d7
dhcp46-223.lab.eng.blr.redhat.com   NotReady                   17h       v1.6.1+5115d708d7
dhcp47-127.lab.eng.blr.redhat.com   Ready                      17h       v1.6.1+5115d708d7
dhcp47-169.lab.eng.blr.redhat.com   Ready,SchedulingDisabled   17h       v1.6.1+5115d708d7
dhcp47-184.lab.eng.blr.redhat.com   Ready                      17h       v1.6.1+5115d708d7

[root@dhcp46-156 ~]# heketi-cli node add --zone=1 --cluster=dbdaa75a2da75a7bd6a5fe368725d92c --management-host-name=dhcp47-127.lab.eng.blr.redhat.com --storage-host-name=10.70.47.127
Node information:
Id: 4e2e1425c5cf7fbdd30fd2d6c55a6087
State: online
Cluster Id: dbdaa75a2da75a7bd6a5fe368725d92c
Zone: 1
Management Hostname dhcp47-127.lab.eng.blr.redhat.com
Storage Hostname 10.70.47.127
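
A check like the one above could be automated by parsing the command output. The sketch below assumes the output has exactly the shape pasted in this comment. The function name parse_node_info is hypothetical, and the hostname lines are handled with or without a colon because the paste shows them without one.

```python
import re

SAMPLE = """Node information:
Id: 4e2e1425c5cf7fbdd30fd2d6c55a6087
State: online
Cluster Id: dbdaa75a2da75a7bd6a5fe368725d92c
Zone: 1
Management Hostname dhcp47-127.lab.eng.blr.redhat.com
Storage Hostname 10.70.47.127
"""


def parse_node_info(text):
    """Collect the fields of the node summary into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Hostname lines appear without a colon in the pasted output.
        match = re.match(r"^(Management Hostname|Storage Hostname):?\s+(.+)$", line)
        if match is None:
            # Ordinary "Key: value" lines; header lines with no value are skipped.
            match = re.match(r"^([^:]+):\s*(.+)$", line)
        if match:
            info[match.group(1)] = match.group(2)
    return info


info = parse_node_info(SAMPLE)
print(info["State"])
```

A verification script would then fail the run if the State field is anything other than online.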

Comment 14 Raghavendra Talur 2017-10-04 08:47:02 UTC

I changed the type from known issue to bug fix. Please recheck the doc text.

Comment 15 errata-xmlrpc 2017-10-11 07:07:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2879