Description of problem:
++++++++++++++++++++++++++
We had a 4-node CNS cluster. We created a blockvolume with HA=4 and the creation succeeded. After this, we brought down one of the 4 nodes. With 3 nodes still up and running, we tried creating blockvolumes with HA=3, HA=2, and HA=1 in separate attempts. All creation requests failed with the following error message:

# date ; heketi-cli blockvolume create --size=10 --name=volHA3 --ha=3; date
Mon Jul 16 10:37:19 IST 2018
Error: insufficient block hosts online
Mon Jul 16 10:37:27 IST 2018
+++++++++++++++++++++++++++++++++++++++
Note: This bug seems to be related to BZ#1595531:
https://bugzilla.redhat.com/show_bug.cgi?id=1595531
+++++++

We noted the host list used by heketi during blockvolume creation. The order of nodes passed by Heketi is:

Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]

For creating HA=3 volumes, we tried some corner cases, with the following results:

i)   Bringing down any one of the first 3 nodes (10.70.46.181, 10.70.46.132, 10.70.46.233): blockvolume creation with HA=3 fails.
ii)  Bringing down the last host in the list (10.70.46.150): no issue observed, and volumes with HA=3 were created successfully.
iii) Bringing down the 1st node in the list: blockvolumes with HA=1 and HA=2 failed to be created.
iv)  Bringing down node 3 in the list: HA=1 and HA=2 succeeded, but HA=3 still fails.

Hence, we avoid this issue only when the node that is down happens to be the last node of the above list at the time an HA=3 volume is created.
++++++++++++++++++++++++++++++++++++++++
# heketi-cli blockvolume info 844f58788fc0339e659ed8c621d62da5
Name: volHA4
Size: 10
Volume Id: 844f58788fc0339e659ed8c621d62da5
Cluster Id: 3fbc2bfba517118dcf3fa4a29bda4a19
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
IQN: iqn.2016-12.org.gluster-block:34858a17-be29-46d3-939d-1d919150104c
LUN: 0
Hacount: 4
Username:
Password:
Block Hosting Volume: 983d934c4f9a1c69843892542e290f6f
++++++++++++++++++++++++++++++++++++++

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++
# oc rsh heketi-storage-1-r44vr rpm -qa | grep heketi
python-heketi-7.0.0-3.el7rhgs.x86_64
heketi-client-7.0.0-3.el7rhgs.x86_64
heketi-7.0.0-3.el7rhgs.x86_64

How reproducible:
+++++++++++++++++++
3x3

Steps to Reproduce:
1. Create a 4-node CNS setup and confirm that all gluster-blockd services are up and running.
2. Create a blockvolume with HA=4. The blockvolume creation succeeds. Note the order of hosts in the "Hosts" list of the blockvolume info:

# date ; heketi-cli blockvolume create --size=10 --name=volHA4 --ha=4; date
Mon Jul 16 10:22:47 IST 2018
Name: volHA4
Size: 10
Volume Id: 844f58788fc0339e659ed8c621d62da5
Cluster Id: 3fbc2bfba517118dcf3fa4a29bda4a19
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
IQN: iqn.2016-12.org.gluster-block:34858a17-be29-46d3-939d-1d919150104c
LUN: 0
Hacount: 4
Username:
Password:
Block Hosting Volume: 983d934c4f9a1c69843892542e290f6f
Mon Jul 16 10:23:54 IST 2018

E.g., if A, B, C, D is the order of the hosts in the Hosts[] list:
Hosts: [A B C D]

3. Bring down one CNS node, preferably any of the first 3 nodes (A, B, C) in the above list.
4. Try creating a blockvolume with HA=3:

# date ; heketi-cli blockvolume create --size=10 --name=volHA3 --ha=3; date
Mon Jul 16 10:37:19 IST 2018
Error: insufficient block hosts online
Mon Jul 16 10:37:27 IST 2018

Actual results:
+++++++++++++++
Blockvolume creation with HA=3 fails, even though 3 nodes are up in the 4-node CNS cluster.

Expected results:
+++++++++++++
Heketi should pick the 3 up-and-running nodes to create new blockvolumes, even when the 4th node is down.

Additional info:
+++++++++++++++
All steps will be detailed in the next comment.
Root cause identified. The loop terminates when the loop index i reaches the HA count, instead of when length(identified_hosts) reaches the HA count. I will work on this patch.
Upstream fix: https://github.com/heketi/heketi/pull/1280
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2686