Description of problem:
When a CNS cluster has three nodes N1, N2 and N3 and N1 is down, adding a new node N4 to the cluster can fail, because heketi tries to perform the peer probe from N1, which is the first node in the list.

# heketi-cli topology load -j=topology.json
Found node dhcp47-106.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
    Found device /dev/sdd
Creating node dhcp47-81.lab.eng.blr.redhat.com ... Unable to create node: Unable to execute command on glusterfs-d3qp1:
Found node dhcp47-74.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
    Found device /dev/sdd
Found node dhcp47-82.lab.eng.blr.redhat.com on cluster 796e6db1981f369ea0340913eeea4c9a
    Found device /dev/sdd

# oc get pods -o wide
NAME                            READY  STATUS   RESTARTS  AGE  IP            NODE
glusterfs-6gsv2                 1/1    Running  1         36m  10.70.47.74   dhcp47-74.lab.eng.blr.redhat.com
glusterfs-d3qp1                 1/1    Running  1         36m  10.70.47.106  dhcp47-106.lab.eng.blr.redhat.com
glusterfs-hwxn0                 1/1    Running  1         36m  10.70.47.82   dhcp47-82.lab.eng.blr.redhat.com
glusterfs-l0g6c                 1/1    Running  1         36m  10.70.47.81   dhcp47-81.lab.eng.blr.redhat.com
heketi-1-z0q5n                  1/1    Running  3         29m  10.130.0.7    dhcp47-82.lab.eng.blr.redhat.com
storage-project-router-1-46m1r  1/1    Running  1         48m  10.70.47.74   dhcp47-74.lab.eng.blr.redhat.com

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-6.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up CNS with 3 nodes (Node1, Node2, Node3).
2. Bring down the first node listed in the topology file.
3. Add a new node (Node4) to replace Node1.
4. Update the topology file and try to load it.

Actual results:
Adding the new node fails because heketi tries to do the peer probe from Node1.

Expected results:
When Node1 is down, heketi should try the peer probe from Node2.

Additional info:
Created attachment 1271135: topologyinfo
Created attachment 1271136: heketi_logs
Raghavendra Talur, could you please add the Known Issue text for this bug?

Regards,
Divya
I have provided the doc text for the known issue. We don't know of a workaround, so that part has been skipped.
Upstream patch: https://github.com/heketi/heketi/pull/819

Unit tests are pending; I will add them soon.
(In reply to Mohamed Ashiq from comment #9)
> Upstream patch:
>
> https://github.com/heketi/heketi/pull/819
>
> Unit test pending
>
> Will add them soon.

Patch merged upstream.
Adding a new node while one of the nodes is down now works fine.

Build: cns-deploy-5.0.0-37.el7rhgs.x86_64, heketi-client-5.0.0-11.el7rhgs.x86_64

[root@dhcp46-156 ~]# oc get nodes
NAME                               STATUS                    AGE  VERSION
dhcp46-14.lab.eng.blr.redhat.com   Ready                     17h  v1.6.1+5115d708d7
dhcp46-223.lab.eng.blr.redhat.com  NotReady                  17h  v1.6.1+5115d708d7
dhcp47-127.lab.eng.blr.redhat.com  Ready                     17h  v1.6.1+5115d708d7
dhcp47-169.lab.eng.blr.redhat.com  Ready,SchedulingDisabled  17h  v1.6.1+5115d708d7
dhcp47-184.lab.eng.blr.redhat.com  Ready                     17h  v1.6.1+5115d708d7

[root@dhcp46-156 ~]# heketi-cli node add --zone=1 --cluster=dbdaa75a2da75a7bd6a5fe368725d92c --management-host-name=dhcp47-127.lab.eng.blr.redhat.com --storage-host-name=10.70.47.127
Node information:
Id: 4e2e1425c5cf7fbdd30fd2d6c55a6087
State: online
Cluster Id: dbdaa75a2da75a7bd6a5fe368725d92c
Zone: 1
Management Hostname dhcp47-127.lab.eng.blr.redhat.com
Storage Hostname 10.70.47.127
I changed the type from known issue to bug fix. Please recheck the doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2879