Bug 1663306

Summary: Check GlusterFS for cluster health fails when gluster nodes are SchedulingDisabled
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 3.10.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 3.11.z
Reporter: Andrew Collins <ancollin>
Assignee: Jose A. Rivera <jarrpa>
QA Contact: Ashmitha Ambastha <asambast>
Docs Contact:
CC: ancollin, aos-bugs, asambast, jokerman, kramdoss, mmccomas, pprakash
Flags: ancollin: needinfo-
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-06-17 20:21:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andrew Collins 2019-01-03 18:11:40 UTC
Description of problem:
When upgrading from OCP 3.9 to 3.10, the "Check for GlusterFS cluster health" task fails consistently. The gluster nodes are "Ready,SchedulingDisabled", and the gluster volumes are all connected and healed.

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.10.83-1.git.0.12699eb.el7.noarch
rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch
ansible --version
ansible 2.4.6.0
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 (Red Hat 4.8.5-28)]

How reproducible:
100%

Steps to Reproduce:
1. Cordon gluster nodes to give them SchedulingDisabled status (oc adm cordon <gluster nodes>)
2. Attempt to run upgrade_control_plane.yml

Actual results:

2019-01-03 11:48:40,311 p=74676 u=root |  TASK [/usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs : Check for GlusterFS cluster health] ***********************************
FAILED - RETRYING: Check for GlusterFS cluster health (113 retries left). Result was: {
    "attempts": 8,
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "cluster_name": "storage",
            "exclude_node": "locp002a.rnd.pncint.net",
            "oc_bin": "oc",
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "glusterfs"
        }
    },
    "msg": "Unable to find suitable pod in get pods output: NAME                                          READY     STATUS    RESTARTS   AGE       IP              NODE\nglusterblock-storage-provisioner-dc-1-jljkg   1/1       Running   1          13h      xx.xx.xx.xx     locp005a.rnd.pncint.net\nglusterfs-storage-kbvdj                       1/1       Running   20         23d      xx.xx.xx.xx   locp013a.rnd.pncint.net\nglusterfs-storage-rvpbs                       1/1       Running   0          25m      xx.xx.xx.xx   locp011a.rnd.pncint.net\nglusterfs-storage-tlb8d                       1/1       Running   14         23d      xx.xx.xx.xx   locp012a.rnd.pncint.net\nheketi-storage-1-k24fz                        1/1       Running   1          23d      xx.xx.xx.xx     locp004a.rnd.pncint.net\n",
    "retries": 121,
    "state": "unknown"


Expected results:
Upgrade completes as expected.

Additional info:

Comment 1 Andrew Collins 2019-01-03 18:13:06 UTC
Was able to fix by changing lib_utils/library/glusterfs_check_containerized.py line 83 from:

fields[1] == "Ready"

to:

"Ready" in fields[1]

Comment 2 Scott Dodson 2019-01-03 18:16:32 UTC
Can you open a PR?

Comment 3 Andrew Collins 2019-01-09 00:04:04 UTC
Sure thing! https://github.com/openshift/openshift-ansible/pull/10970

Comment 4 Scott Dodson 2019-01-31 15:58:43 UTC
PR merged, in openshift-ansible-3.10.99-1 and later

Comment 10 errata-xmlrpc 2020-06-17 20:21:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2477