Bug 1601341

Summary: On a 4 node setup heketi blockvolume creation with HA=3 fails when one node is powered off
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Neha Berry <nberry>
Component: heketi Assignee: Sven Anderson <svanders>
Status: CLOSED ERRATA QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: cns-3.10 CC: hchiramm, jmulligan, kramdoss, madam, pprakash, rhs-bugs, rtalur, sankarshan, sarumuga, storage-qa-internal, svanders
Target Milestone: ---   
Target Release: CNS 3.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-12 09:23:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1568862    

Description Neha Berry 2018-07-16 06:57:25 UTC
Description of problem:
++++++++++++++++++++++++++
We had a 4-node CNS cluster. We created a blockvolume with HA=4, and the creation succeeded. After this, we brought down one of the 4 nodes.
With 3 nodes still up and running, we tried creating blockvolumes with HA=3, HA=2, and HA=1 in separate attempts, but all creation requests failed with the following error message:
 
# date ; heketi-cli blockvolume create --size=10 --name=volHA3 --ha=3; date
Mon Jul 16 10:37:19 IST 2018
Error: insufficient block hosts online
Mon Jul 16 10:37:27 IST 2018

+++++++++++++++++++++++++++++++++++++++
Note: This bug seems to be related to BZ#1595531.
+++++++ 
https://bugzilla.redhat.com/show_bug.cgi?id=1595531

We noted the host list used by heketi during blockvolume creation. The order of nodes passed by heketi is:
+++++++++
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]

Hence, for creating HA=3 volumes, we tried some corner cases, with the following results:
i) On bringing down any one of the first 3 nodes (10.70.46.181, 10.70.46.132, 10.70.46.233), blockvolume creation with HA=3 fails.
ii) On bringing down the last host in the list (10.70.46.150), no issue was observed and volumes with HA=3 were created successfully.
iii) With the 1st node in the list down, creation of blockvolumes with HA=1 and HA=2 failed.
iv) With node 3 in the list down, HA=1 and HA=2 succeeded, but HA=3 still failed.

Hence, the issue may go unnoticed if the node that happens to be down when an HA=3 volume is created is the last node in the above list.

++++++++++++++++++++++++++++++++++++++++

# heketi-cli blockvolume info  844f58788fc0339e659ed8c621d62da5
Name: volHA4
Size: 10
Volume Id: 844f58788fc0339e659ed8c621d62da5
Cluster Id: 3fbc2bfba517118dcf3fa4a29bda4a19
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
IQN: iqn.2016-12.org.gluster-block:34858a17-be29-46d3-939d-1d919150104c
LUN: 0
Hacount: 4
Username: 
Password: 
Block Hosting Volume: 983d934c4f9a1c69843892542e290f6f

++++++++++++++++++++++++++++++++++++++


Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++++++++++

# oc rsh heketi-storage-1-r44vr rpm -qa|grep heketi
python-heketi-7.0.0-3.el7rhgs.x86_64
heketi-client-7.0.0-3.el7rhgs.x86_64
heketi-7.0.0-3.el7rhgs.x86_64




How reproducible:
+++++++++++++++++++
3x3

Steps to Reproduce:
1. Create a 4 node CNS setup and confirm that all gluster-blockd services are UP and running.

2. Create a blockvolume with HA=4. The blockvolume creation succeeds. Note the order of hosts in the "Hosts" list of the blockvolume info.

# date ; heketi-cli blockvolume create --size=10 --name=volHA4 --ha=4; date
Mon Jul 16 10:22:47 IST 2018
Name: volHA4
Size: 10
Volume Id: 844f58788fc0339e659ed8c621d62da5
Cluster Id: 3fbc2bfba517118dcf3fa4a29bda4a19
Hosts: [10.70.46.181 10.70.46.132 10.70.46.233 10.70.46.150]
IQN: iqn.2016-12.org.gluster-block:34858a17-be29-46d3-939d-1d919150104c
LUN: 0
Hacount: 4
Username: 
Password: 
Block Hosting Volume: 983d934c4f9a1c69843892542e290f6f
Mon Jul 16 10:23:54 IST 2018


E.g., if nodes A, B, C, D appear in this order in the Hosts[] list:
    Hosts: [A B C D]


3. Bring down one CNS node, preferably any of the first 3 nodes (A, B, C) in the above list.

4. Try creating a blockvolume with HA=3
# date ; heketi-cli blockvolume create --size=10 --name=volHA3 --ha=3; date
Mon Jul 16 10:37:19 IST 2018
Error: insufficient block hosts online
Mon Jul 16 10:37:27 IST 2018

 

Actual results:
+++++++++++++++
Blockvolume creation with HA=3 fails, even when 3 nodes are UP in a 4 node CNS cluster

Expected results:
+++++++++++++
Heketi should be able to pick the 3 nodes that are up and running to create new blockvolumes, even when the 4th node is down.


Additional info:
+++++++++++++++

All steps will be detailed in the next comment.

Comment 10 Raghavendra Talur 2018-07-19 06:03:43 UTC
Root cause identified. Loop terminates at i == ha count instead of length(identified_hosts) == ha count.

I will work on this patch.
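
A minimal sketch of the loop condition described above, for illustration only. This is not heketi's actual code (see the upstream pull request for the real change); the names pickHostsBuggy, pickHostsFixed, and errInsufficientHosts are hypothetical, and the online map stands in for whatever health check heketi performs. It assumes candidates are scanned in the order shown in the "Hosts" list.

```go
package main

import (
	"errors"
	"fmt"
)

// errInsufficientHosts reuses the message seen in the report; the
// variable name is hypothetical.
var errInsufficientHosts = errors.New("insufficient block hosts online")

// pickHostsBuggy models the reported behavior: the loop stops once the
// index reaches ha, so only the first ha candidates are ever inspected.
func pickHostsBuggy(candidates []string, online map[string]bool, ha int) ([]string, error) {
	picked := []string{}
	for i, h := range candidates {
		if i == ha { // terminates on the loop index
			break
		}
		if online[h] {
			picked = append(picked, h)
		}
	}
	if len(picked) < ha {
		return nil, errInsufficientHosts
	}
	return picked, nil
}

// pickHostsFixed keeps scanning until ha online hosts have been found.
func pickHostsFixed(candidates []string, online map[string]bool, ha int) ([]string, error) {
	picked := []string{}
	for _, h := range candidates {
		if len(picked) == ha { // terminates on the number of hosts found
			break
		}
		if online[h] {
			picked = append(picked, h)
		}
	}
	if len(picked) < ha {
		return nil, errInsufficientHosts
	}
	return picked, nil
}

func main() {
	hosts := []string{"10.70.46.181", "10.70.46.132", "10.70.46.233", "10.70.46.150"}
	// Assumed scenario: the first host in the list is powered off.
	online := map[string]bool{
		"10.70.46.181": false,
		"10.70.46.132": true,
		"10.70.46.233": true,
		"10.70.46.150": true,
	}

	_, err := pickHostsBuggy(hosts, online, 3)
	fmt.Println("buggy, HA=3:", err)

	picked, err := pickHostsFixed(hosts, online, 3)
	fmt.Println("fixed, HA=3:", picked, err)
}
```

Under these assumptions the buggy variant matches the corner cases in the description: it fails whenever the down node is among the first HA entries of the list, and succeeds when the down node comes after them.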

Comment 11 Sven Anderson 2018-07-24 14:44:54 UTC
Upstream fix: https://github.com/heketi/heketi/pull/1280

Comment 19 errata-xmlrpc 2018-09-12 09:23:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2686