Bug 1669194

Summary: Sanity Check in upgrade and prerequisite playbook is slow and removed vars check does not work
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 3.11.0
Target Release: 3.11.z
Hardware: All
OS: All
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Matthew Robson <mrobson>
Assignee: Michael Gugino <mgugino>
QA Contact: Weihua Meng <wmeng>
CC: gpei, hongkliu, mgugino, mifiedle, wmeng
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2019-02-20 14:11:02 UTC

Description Matthew Robson 2019-01-24 15:10:18 UTC
Description of problem:

The sanity check is taking well over 60 minutes to run:

2019-01-21 10:52:07,767 p=127636 u=root |  TASK [Run variable sanity checks] *******************************************************************************************

2019-01-21 10:52:07,767 p=127636 u=root |  task path: /usr/share/ansible/openshift-ansible/playbooks/init/sanity_checks.yml:14

2019-01-21 12:08:16,698 p=127636 u=root |  ok: [nodename] => {
    "changed": false,
    "msg": "Sanity Checks passed"
}

Doing some additional debugging, the OCS (OpenShift Container Storage) nodes account for the majority of the time spent inside check_for_removed_vars.
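The performance pattern described above can be sketched as follows. This is a hypothetical illustration, not the actual openshift-ansible code: the variable names and the exact set of removed variables are assumptions. The slow shape re-serializes each host's (potentially very large) fact tree once per removed variable; the fast shape serializes it once per host and scans that string repeatedly.

```python
import json

# Illustrative subset only; the real check covers many more removed variables.
REMOVED_VARS = ["openshift_hostname", "oreg_auth_pass", "openshift_ca_host"]

def check_slow(hostvars):
    # Slow shape: json.dumps() runs once per (host, removed_var) pair,
    # so cost grows as O(hosts x removed_vars x fact_size).
    found = []
    for host, facts in hostvars.items():
        for var in REMOVED_VARS:
            if var in json.dumps(facts):
                found.append((host, var))
    return found

def check_fast(hostvars):
    # Fast shape: serialize each host's facts once, then do cheap
    # substring scans against the cached string.
    found = []
    for host, facts in hostvars.items():
        blob = json.dumps(facts)
        for var in REMOVED_VARS:
            if var in blob:
                found.append((host, var))
    return found
```

On OCS nodes with hundreds of glusterfs volumes in the facts, the serialized fact tree is large, which is why the per-variable re-serialization dominates the runtime.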

Version-Release number of the following components:

3.11.59

How reproducible:

Always

Steps to Reproduce:
1. Run the upgrade or prerequisites playbook, especially against a cluster with large OCS nodes.

Actual results:
Very slow compared to 3.9

Expected results:
Quick execution
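To confirm where the time goes on an affected cluster, Ansible's built-in profile_tasks callback prints per-task execution times. This is a general debugging suggestion, not something from the report; the playbook path is the one shown in the log above.

```shell
# Enable the profile_tasks callback so each task's duration is printed
# at the end of the run; slow tasks like the sanity check stand out.
export ANSIBLE_CALLBACK_WHITELIST=profile_tasks
# Then run the upgrade/prerequisites playbook as usual, e.g.:
#   ansible-playbook -i inventory \
#       /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml
```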

Comment 2 Matthew Robson 2019-01-24 15:13:36 UTC
PR with a fix: https://github.com/openshift/openshift-ansible/pull/11061

Comment 3 Matthew Robson 2019-01-24 15:14:33 UTC
A quick test with and without the fix shows a more than 2x speed improvement.


Without fix - 7m 37s

2019-01-23 12:46:39,467 p=39217 u=root |  TASK [Run variable sanity checks] **********************************************

2019-01-23 12:54:16,036 p=39217 u=root |  ok: [nodename] => {
    "changed": false,
    "msg": "New Sanity Checks passed"
}

With Fix - 3m 17s

2019-01-23 13:14:57,100 p=71065 u=root |  TASK [Run variable sanity checks] **********************************************

2019-01-23 13:18:14,905 p=71065 u=root |  ok: [nodename] => {
    "changed": false,
    "msg": "New Sanity Checks passed"
}

Comment 4 Scott Dodson 2019-01-24 15:50:54 UTC
https://github.com/openshift/openshift-ansible/pull/11061 merged

Comment 8 Weihua Meng 2019-02-12 07:34:59 UTC
Hi, Mike

I tested with a cluster of 6 glusterfs nodes (3 for the docker registry). For upgrade time, there is no difference between
openshift-ansible-3.11.59-1.git.0.ba8e948.el7.noarch
openshift-ansible-3.11.82-1.git.0.f29227a.el7.noarch

Could you help? 
Thanks.

Comment 9 Matthew Robson 2019-02-12 15:35:30 UTC
How many devices / volumes / PVCs do you have? Where we see this issue, there are around 700 volumes in use.

Comment 12 Weihua Meng 2019-02-13 00:21:12 UTC
Moving to VERIFIED according to comment 10.

Thanks for help, Matthew and Mike.

Comment 14 errata-xmlrpc 2019-02-20 14:11:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326