Bug 1669194

Summary: Sanity Check in upgrade and prerequisite playbook is slow and removed vars check does not work
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 3.11.0
Target Release: 3.11.z
Hardware: All
OS: All
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Matthew Robson <mrobson>
Assignee: Michael Gugino <mgugino>
QA Contact: Weihua Meng <wmeng>
CC: gpei, hongkliu, mgugino, mifiedle, wmeng
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2019-02-20 14:11:02 UTC

Description Matthew Robson 2019-01-24 15:10:18 UTC
Description of problem:

The sanity check is taking well over 60 minutes to run:

2019-01-21 10:52:07,767 p=127636 u=root |  TASK [Run variable sanity checks] *******************************************************************************************

2019-01-21 10:52:07,767 p=127636 u=root |  task path: /usr/share/ansible/openshift-ansible/playbooks/init/sanity_checks.yml:14

2019-01-21 12:08:16,698 p=127636 u=root |  ok: [nodename] => {
    "changed": false,
    "msg": "Sanity Checks passed"
}

Doing some additional debugging, the OCS (OpenShift Container Storage) nodes account for the majority of the time spent inside check_for_removed_vars.
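The performance pattern described above can be sketched as follows. This is a hypothetical illustration, not the actual openshift-ansible code: the variable names and the exact set of removed variables are assumptions. The slow shape re-serializes each host's (potentially very large) fact tree once per removed variable; the fast shape serializes it once per host and scans that string repeatedly.

```python
import json

# Illustrative subset only; the real check covers many more removed variables.
REMOVED_VARS = ["openshift_hostname", "oreg_auth_pass", "openshift_ca_host"]

def check_slow(hostvars):
    # Slow shape: json.dumps() runs once per (host, removed_var) pair,
    # so cost grows as O(hosts x removed_vars x fact_size).
    found = []
    for host, facts in hostvars.items():
        for var in REMOVED_VARS:
            if var in json.dumps(facts):
                found.append((host, var))
    return found

def check_fast(hostvars):
    # Fast shape: serialize each host's facts once, then do cheap
    # substring scans against the cached string.
    found = []
    for host, facts in hostvars.items():
        blob = json.dumps(facts)
        for var in REMOVED_VARS:
            if var in blob:
                found.append((host, var))
    return found
```

On OCS nodes with hundreds of glusterfs volumes in the facts, the serialized fact tree is large, which is why the per-variable re-serialization dominates the runtime.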

Version-Release number of the following components:

3.11.59

How reproducible:

Always

Steps to Reproduce:
1. Run the upgrade or prerequisites playbook, especially against a cluster with large OCS nodes.

Actual results:
Very slow compared to 3.9

Expected results:
Quick execution
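To confirm where the time goes on an affected cluster, Ansible's built-in profile_tasks callback prints per-task execution times. This is a general debugging suggestion, not something from the report; the playbook path is the one shown in the log above.

```shell
# Enable the profile_tasks callback so each task's duration is printed
# at the end of the run; slow tasks like the sanity check stand out.
export ANSIBLE_CALLBACK_WHITELIST=profile_tasks
# Then run the upgrade/prerequisites playbook as usual, e.g.:
#   ansible-playbook -i inventory \
#       /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
```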

Comment 2 Matthew Robson 2019-01-24 15:13:36 UTC
PR with a fix: https://github.com/openshift/openshift-ansible/pull/11061

Comment 3 Matthew Robson 2019-01-24 15:14:33 UTC
A quick test with and without the fix shows a more than 2x speed improvement.


Without fix - 7m 37s

2019-01-23 12:46:39,467 p=39217 u=root |  TASK [Run variable sanity checks] **********************************************

2019-01-23 12:54:16,036 p=39217 u=root |  ok: [nodename] => {
    "changed": false,
    "msg": "New Sanity Checks passed"
}

With Fix - 3m 17s

2019-01-23 13:14:57,100 p=71065 u=root |  TASK [Run variable sanity checks] **********************************************

2019-01-23 13:18:14,905 p=71065 u=root |  ok: [nodename] => {
    "changed": false,
    "msg": "New Sanity Checks passed"
}

Comment 4 Scott Dodson 2019-01-24 15:50:54 UTC
https://github.com/openshift/openshift-ansible/pull/11061 merged

Comment 8 Weihua Meng 2019-02-12 07:34:59 UTC
Hi, Mike

I tested with a cluster of 6 glusterfs nodes (3 for the docker registry). For upgrade time, there is no difference between
openshift-ansible-3.11.59-1.git.0.ba8e948.el7.noarch
openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch

Could you help? 
Thanks.

Comment 9 Matthew Robson 2019-02-12 15:35:30 UTC
How many devices / volumes / PVCs do you have? Where we see this issue, there are around 700 volumes in use.

Comment 12 Weihua Meng 2019-02-13 00:21:12 UTC
Moving to VERIFIED according to comment 10.

Thanks for help, Matthew and Mike.

Comment 14 errata-xmlrpc 2019-02-20 14:11:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326