Bug 1663306

Summary: Check GlusterFS for cluster health fails when gluster nodes are SchedulingDisabled
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 3.10.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 3.11.z
Reporter: Andrew Collins <ancollin>
Assignee: Jose A. Rivera <jarrpa>
QA Contact: Ashmitha Ambastha <asambast>
Docs Contact:
CC: ancollin, aos-bugs, asambast, jokerman, kramdoss, mmccomas, pprakash
Flags: ancollin: needinfo-
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-06-17 20:21:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andrew Collins 2019-01-03 18:11:40 UTC
Description of problem:
When upgrading from OCP 3.9 to 3.10, the "Check for GlusterFS cluster health" task fails consistently. The gluster nodes are "Ready,SchedulingDisabled", and the gluster volumes are all connected and healed.

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.10.83-1.git.0.12699eb.el7.noarch
rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch
ansible --version
ansible 2.4.6.0
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 (Red Hat 4.8.5-28)]

How reproducible:
100%

Steps to Reproduce:
1. Cordon gluster nodes to give them SchedulingDisabled status (oc adm cordon <gluster nodes>)
2. Attempt to run upgrade_control_plane.yml

Actual results:

2019-01-03 11:48:40,311 p=74676 u=root |  TASK [/usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs : Check for GlusterFS cluster health] ***********************************
FAILED - RETRYING: Check for GlusterFS cluster health (113 retries left). Result was: {
    "attempts": 8,
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "cluster_name": "storage",
            "exclude_node": "locp002a.rnd.pncint.net",
            "oc_bin": "oc",
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "glusterfs"
        }
    },
    "msg": "Unable to find suitable pod in get pods output: NAME                                          READY     STATUS    RESTARTS   AGE       IP              NODE\nglusterblock-storage-provisioner-dc-1-jljkg   1/1       Running   1          13h      xx.xx.xx.xx     locp005a.rnd.pncint.net\nglusterfs-storage-kbvdj                       1/1       Running   20         23d      xx.xx.xx.xx   locp013a.rnd.pncint.net\nglusterfs-storage-rvpbs                       1/1       Running   0          25m      xx.xx.xx.xx   locp011a.rnd.pncint.net\nglusterfs-storage-tlb8d                       1/1       Running   14         23d      xx.xx.xx.xx   locp012a.rnd.pncint.net\nheketi-storage-1-k24fz                        1/1       Running   1          23d      xx.xx.xx.xx     locp004a.rnd.pncint.net\n",
    "retries": 121,
    "state": "unknown"


Expected results:
Upgrade completes as expected.

Additional info:

Comment 1 Andrew Collins 2019-01-03 18:13:06 UTC
Was able to fix by changing lib_utils/library/glusterfs_check_containerized.py line 83 from:

fields[1] == "Ready"

to:

"Ready" in fields[1]

Comment 2 Scott Dodson 2019-01-03 18:16:32 UTC
Can you open a PR?

Comment 3 Andrew Collins 2019-01-09 00:04:04 UTC
Sure thing! https://github.com/openshift/openshift-ansible/pull/10970

Comment 4 Scott Dodson 2019-01-31 15:58:43 UTC
PR merged, in openshift-ansible-3.10.99-1 and later

Comment 10 errata-xmlrpc 2020-06-17 20:21:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2477