Description of problem:
++++++++++++++++++++++++++
We had a 4-node CNS cluster. We created a blockvolume with HA=4, and the creation succeeded. After this, we brought down one of the 4 nodes.
With 3 nodes still up and running, we tried creating blockvolumes with HA=3, HA=2, and HA=1 in separate attempts. All creation requests failed with the following error message:
# date ; heketi-cli blockvolume create --size=10 --name=volHA3 --ha=3; date
Mon Jul 16 10:37:19 IST 2018
Error: insufficient block hosts online
Mon Jul 16 10:37:27 IST 2018
+++++++++++++++++++++++++++++++++++++++
Note: This bug appears to be related to BZ#1595531.
+++++++
https://bugzilla.redhat.com/show_bug.cgi?id=1595531
We noted down the host list used by Heketi during blockvolume creation. The order of nodes passed by Heketi is:
+++++++++
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
Hence, for creating HA=3 volumes, we tried some corner cases, with the following results:
i) On bringing down any one of the first 3 nodes (10.70.46.181, 10.70.46.132, 10.70.46.233), blockvolume creation with HA=3 fails.
ii) On bringing down the last host in the list (10.70.46.150), no issue is observed and volumes with HA=3 are created successfully.
iii) On bringing down the 1st node in the list, blockvolume creation with HA=1 and HA=2 fails.
iv) On bringing down the 3rd node in the list, creation with HA=1 and HA=2 succeeds, but HA=3 still fails.
Hence, the issue can go unnoticed: an HA=3 volume is created successfully if the node that is down happens to be the last node in the above list.
++++++++++++++++++++++++++++++++++++++++
# heketi-cli blockvolume info 844f58788fc0339e659ed8c621d62da5
Name: volHA4
Size: 10
Volume Id: 844f58788fc0339e659ed8c621d62da5
Cluster Id: 3fbc2bfba517118dcf3fa4a29bda4a19
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
IQN: iqn.2016-12.org.gluster-block:34858a17-be29-46d3-939d-1d919150104c
LUN: 0
Hacount: 4
Username:
Password:
Block Hosting Volume: 983d934c4f9a1c69843892542e290f6f
++++++++++++++++++++++++++++++++++++++
Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++
# oc rsh heketi-storage-1-r44vr rpm -qa|grep heketi
python-heketi-7.0.0-3.el7rhgs.x86_64
heketi-client-7.0.0-3.el7rhgs.x86_64
heketi-7.0.0-3.el7rhgs.x86_64
How reproducible:
+++++++++++++++++++
3x3
Steps to Reproduce:
1. Create a 4 node CNS setup and confirm that all gluster-blockd services are UP and running.
2. Create a blockvolume with HA=4. The blockvolume creation succeeds. Note the order of hosts in the "Hosts" list of the blockvolume info.
# date ; heketi-cli blockvolume create --size=10 --name=volHA4 --ha=4; date
Mon Jul 16 10:22:47 IST 2018
Name: volHA4
Size: 10
Volume Id: 844f58788fc0339e659ed8c621d62da5
Cluster Id: 3fbc2bfba517118dcf3fa4a29bda4a19
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
IQN: iqn.2016-12.org.gluster-block:34858a17-be29-46d3-939d-1d919150104c
LUN: 0
Hacount: 4
Username:
Password:
Block Hosting Volume: 983d934c4f9a1c69843892542e290f6f
Mon Jul 16 10:23:54 IST 2018
E.g., nodes A, B, C, D appear in this order in the Hosts list:
Hosts: [A B C D]
3. Bring down one CNS node, preferably any of the first 3 nodes (A, B, C) in the above list.
4. Try creating a blockvolume with HA=3.
# date ; heketi-cli blockvolume create --size=10 --name=volHA3 --ha=3; date
Mon Jul 16 10:37:19 IST 2018
Error: insufficient block hosts online
Mon Jul 16 10:37:27 IST 2018
Actual results:
+++++++++++++++
Blockvolume creation with HA=3 fails, even when 3 nodes are UP in a 4-node CNS cluster.
Expected results:
+++++++++++++
Heketi should be able to pick the 3 nodes that are up and running to create new blockvolumes, even when the 4th node is down.
Additional info:
+++++++++++++++
All steps will be detailed in the next comment.
Comment 10 Raghavendra Talur
2018-07-19 06:03:43 UTC
Root cause identified. The loop terminates when i == ha count instead of when length(identified_hosts) == ha count.
I will work on this patch.
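
To illustrate the root cause, here is a minimal sketch of the two loop conditions. It is written in Python for brevity and is not Heketi's actual code: the function names, the InsufficientHosts exception, and the representation of online hosts as a set are all assumptions made for this example. Under those assumptions, the index-based termination reproduces results i) to iv) from the description, while terminating on the number of hosts collected picks any 3 online nodes.

```python
# Illustrative sketch only; names and data structures are hypothetical,
# not taken from the Heketi source.

class InsufficientHosts(Exception):
    """Stands in for the 'insufficient block hosts online' error."""


def pick_hosts_buggy(hosts, online, ha):
    """Stop once the loop index reaches ha (the behaviour in the root cause)."""
    picked = []
    for i, host in enumerate(hosts):
        if i == ha:  # only the first `ha` entries are ever examined
            break
        if host in online:
            picked.append(host)
    if len(picked) < ha:
        raise InsufficientHosts("insufficient block hosts online")
    return picked


def pick_hosts_fixed(hosts, online, ha):
    """Stop once `ha` online hosts have been collected."""
    picked = []
    for host in hosts:
        if len(picked) == ha:  # terminate on hosts found, not on index
            break
        if host in online:
            picked.append(host)
    if len(picked) < ha:
        raise InsufficientHosts("insufficient block hosts online")
    return picked


def succeeds(pick, hosts, online, ha):
    """Return True if `pick` finds enough hosts, False if it raises."""
    try:
        pick(hosts, online, ha)
        return True
    except InsufficientHosts:
        return False


if __name__ == "__main__":
    hosts = ["A", "B", "C", "D"]       # order as in the Hosts list
    online = set(hosts) - {"A"}        # first node in the list is down
    for ha in (1, 2, 3):
        print(ha,
              succeeds(pick_hosts_buggy, hosts, online, ha),
              succeeds(pick_hosts_fixed, hosts, online, ha))
```

In this sketch, a host that is down among the first `ha` entries is skipped but never replaced by a later online host, which is why the outcome depends on the position of the failed node in the list.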
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:2686