Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1540685

Summary: [RFE] openshift-ansible upgrade with CNS should check health of gluster before cascading to subsequent node
Product: OpenShift Container Platform Reporter: Aaren <aren.dej>
Component: Cluster Version Operator Assignee: Michael Gugino <mgugino>
Status: CLOSED CURRENTRELEASE QA Contact: liujia <jiajliu>
Severity: high Docs Contact:
Priority: high    
Version: 3.6.0 CC: aos-bugs, aren.dej, jokerman, mmccomas, wsun
Target Milestone: ---   
Target Release: 3.11.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Feature: Upgrade now verifies native gluster health before proceeding to the next node. Reason: Ensure gluster remains healthy. Result: Gluster volumes remain healthy during upgrades.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-12-21 15:16:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Aaren 2018-01-31 18:24:30 UTC
Description of problem:


Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1. build openshift 3.6 with openshift-ansible (advanced install)
2. use openshift-ansible to setup CNS 3.6 on 3 dedicated nodes
3. use openshift-ansible to upgrade said cluster to OCP 3.7

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Not a generated error.
The problem is that when the upgrade runs, the playbooks seem to cascade over the gluster/CNS nodes but do not check for gluster pool health before upgrading the next, and so data/quorum problems ensue.

Expected results:

To upgrade from OCP 3.6 to OCP 3.7 without the CNS gluster pools going bad because heals were still in progress while nodes were being upgraded and rebooted.
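The kind of guard being requested can be sketched as follows. This is an illustrative parser, not the openshift-ansible implementation: it assumes the usual per-brick `Number of entries:` lines in `gluster volume heal <volume> info` output and only declares it safe to cascade once no heal entries remain.

```python
def pending_heal_entries(heal_info_output: str) -> int:
    """Sum the per-brick 'Number of entries:' counts from
    `gluster volume heal <volume> info` output."""
    total = 0
    for line in heal_info_output.splitlines():
        line = line.strip()
        if line.startswith("Number of entries:"):
            total += int(line.split(":", 1)[1].strip())
    return total


def safe_to_upgrade_next_node(heal_info_output: str) -> bool:
    # Only cascade to the next CNS node once no heals are outstanding.
    return pending_heal_entries(heal_info_output) == 0


sample = """\
Brick node1:/var/lib/heketi/mounts/brick1
Number of entries: 0

Brick node2:/var/lib/heketi/mounts/brick1
Number of entries: 2
"""
print(safe_to_upgrade_next_node(sample))  # False: 2 entries still healing
```

In a playbook this would sit between node batches, retrying until the count drains to zero before the next node is drained and rebooted.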

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
(^ does not apply?)

If other details are required, please list them.

related to https://bugzilla.redhat.com/show_bug.cgi?id=1540680

suggest: assign to: jrivera

Comment 2 Scott Dodson 2018-08-01 15:33:51 UTC
https://github.com/openshift/openshift-ansible/pull/9348 is the pull request.

Comment 3 Scott Dodson 2018-08-14 21:25:07 UTC
Should be in openshift-ansible-3.11.0-0.15.0

Comment 4 Wenkai Shi 2018-09-10 08:58:25 UTC
Failed to verify with version openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch. The code has been merged and the openshift-ansible upgrade with CNS does check gluster health, but it hits an error:

# rpm -q openshift-ansible
openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch

# ansible-playbook -i inventory -vv /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_nodes.yml
...
TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] **********************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
Monday 10 September 2018  16:55:06 +0800 (0:00:00.192)       0:06:54.986 ****** 
fatal: [ec2-18-215-234-117.compute-1.amazonaws.com]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'first_master_client_binary' is undefined\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml': line 4, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n# lib_utils/library/glusterfs_check_containerized.py\n- name: Check for GlusterFS cluster health\n  ^ here\n"}
...

Failure summary:


  1. Hosts:    ec2-18-215-234-117.compute-1.amazonaws.com
     Play:     Drain and upgrade nodes
     Task:     Check for GlusterFS cluster health
     Message:  The task includes an option with an undefined variable. The error was: 'first_master_client_binary' is undefined
               
               The error appears to have been in '/usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml': line 4, column 3, but may
               be elsewhere in the file depending on the exact syntax problem.
               
               The offending line appears to be:
               
               # lib_utils/library/glusterfs_check_containerized.py
               - name: Check for GlusterFS cluster health
                 ^ here
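For context, the containerized health check has to run the gluster CLI inside a GlusterFS pod via the first master's client binary, which is why the task fails before it even starts when `first_master_client_binary` is undefined. A rough, hypothetical sketch of the command such a check builds (the function and argument names here are assumptions for illustration, not the actual `glusterfs_check_containerized.py` code):

```python
def build_heal_info_cmd(client_binary, namespace, pod, volume):
    """Build an `oc exec` command line that runs heal info inside a
    GlusterFS pod. Failing early on a missing client binary mirrors the
    undefined-variable error seen above."""
    if not client_binary:
        raise ValueError("'first_master_client_binary' is undefined")
    return [client_binary, "exec", "--namespace", namespace, pod,
            "--", "gluster", "volume", "heal", volume, "info"]


cmd = build_heal_info_cmd("oc", "glusterfs",
                          "glusterfs-storage-abcde", "heketidbstorage")
print(" ".join(cmd))
```

The fix referenced in comment 5 amounts to ensuring that variable is defined on the hosts where the health-check task runs.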

Comment 5 Michael Gugino 2018-09-10 16:23:23 UTC
This was fixed via https://github.com/openshift/openshift-ansible/pull/9924

Comment 6 Wenkai Shi 2018-09-12 06:07:00 UTC
Verified with version openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch; the upgrade succeeds with CNS deployed.

Comment 7 Wenkai Shi 2018-09-13 05:48:15 UTC
Moving to VERIFIED per comment #6.

Comment 9 Luke Meyer 2018-12-21 15:16:33 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.