Bug 1540680

Summary: [RFE] Deeper CNS Gluster health status checks needed in order to validate health of pools during OCP/CNS upgrades
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Aaren <aren.dej>
Component: cns-ansible
Assignee: Jose A. Rivera <jarrpa>
Status: CLOSED CURRENTRELEASE
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.6
CC: aren.dej, hchiramm, jarrpa, jmulligan, kramdoss, madam, rhs-bugs, rtalur, sankarshan, sarumuga, storage-qa-internal
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-21 20:03:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1641915

Description Aaren 2018-01-31 17:41:30 UTC
Description of problem:
Upgrades of CNS clusters cascade from node to node via openshift-ansible playbooks, but those playbooks do not inspect Gluster cluster health to confirm that no healing is in progress before moving on to the next node. This RFE asks for heketi-cli and/or openshift-ansible functionality that reports whether the Gluster pool backing CNS is currently healing, and for the openshift-ansible playbooks that upgrade the Gluster nodes to consume that check, so that an upgrade cannot break the cluster's consistency.

We have seen a case where a customer's OCP-with-CNS cluster was upgraded via openshift-ansible; because the playbooks do not stop to check the heal state of the Gluster pool, the upgrade continued regardless, ruined data consistency, and necessitated a rebuild, with potential data loss.
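
As a minimal sketch, a heal-wait gate could look like the Ansible task below. The host group, the gluster_volumes variable, and the awk pipeline are illustrative assumptions, not the shipped openshift-ansible code; note that under CNS, gluster runs inside pods, so the command would need to be wrapped in oc rsh against the glusterfs pod on each node.

    # Illustrative sketch only -- variable names and the awk pipeline are
    # assumptions, not the actual openshift-ansible implementation.
    - name: Wait until no self-heal entries are pending on any volume
      shell: >
        gluster volume heal {{ item }} info
        | awk '/Number of entries:/ {sum += $NF} END {print sum+0}'
      register: heal_count
      until: heal_count.stdout | int == 0
      retries: 60
      delay: 10
      with_items: "{{ gluster_volumes }}"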

Version-Release number of selected component (if applicable):
CNS 3.6 with OpenShift 3.7

How reproducible:
Upgrade a functional OCP 3.6 cluster with CNS 3.6 to OCP 3.7.

Steps to Reproduce:
1. Build OCP 3.6 with 3 dedicated gluster nodes for CNS 3.6
2. Use openshift-ansible to install CNS 3.6
3. Upgrade the cluster to OCP 3.7 with openshift-ansible

Actual results:
There is no stage at which a health check validates that the Gluster cluster has finished healing before the next node is upgraded, which can leave the Gluster cluster in an inconsistent state.

Expected results:
The upgrade verifies that Gluster storage is completely healthy before upgrading the next node in line.
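
A sketch of where such a gate could hook into a rolling upgrade follows; the play structure is illustrative (not the real upgrade playbook), and wait_for_heal.yml is a hypothetical file holding the heal-wait task sketched in the description above.

    # Illustrative play skeleton, not the actual openshift-ansible upgrade play.
    - hosts: glusterfs
      serial: 1                    # touch one Gluster node at a time
      pre_tasks:
        - name: Block until the pool reports no pending heals
          include_tasks: wait_for_heal.yml   # hypothetical heal-wait task file
      tasks:
        - name: Upgrade this node
          debug:
            msg: "existing node-upgrade tasks would run here"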

Additional info:

suggest: assign to jrivera

Comment 3 Aaren 2018-01-31 18:30:04 UTC
(In reply to Aaren from comment #2)
> https://bugzilla.redhat.com/show_bug.cgi?id=1540685

related ^^

Comment 5 Raghavendra Talur 2019-01-23 20:27:06 UTC
Jose,

I changed the component to cns-ansible since the bug asks for better checks before running the CNS/OCS upgrade playbooks. Please triage this bug based on the current status of the OCS upgrade playbook.

Comment 6 Jose A. Rivera 2019-01-23 22:26:41 UTC
This is already taken care of in the downstream builds.