Bug 1682924 - Slow Initialization During Upgrades
Summary: Slow Initialization During Upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: All
OS: Linux
high
high
Target Milestone: ---
: 3.11.z
Assignee: Vadim Rutkovsky
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks: 1685583
TreeView+ depends on / blocked
 
Reported: 2019-02-25 20:27 UTC by Matthew Robson
Modified: 2019-06-26 09:08 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: sanity checks were using inefficient code to validate hostvars Consequence: sanity check time took several minutes on a large set of hosts Fix: hostvars are stored in the class, so there are not being copied on every check Result: sanity checks take less time to complete
Clone Of:
: 1685583 (view as bug list)
Environment:
Last Closed: 2019-06-26 09:07:55 UTC
Target Upstream Version:


Attachments (Terms of Use)
instrumented sanity_checks.py (21.85 KB, text/plain)
2019-02-26 14:06 UTC, Matthew Robson
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:1605 0 None None None 2019-06-26 09:08:04 UTC

Description Matthew Robson 2019-02-25 20:27:14 UTC
Description of problem:

Seeing Initialization taking a very long time in an average sized cluster on bare metal.

Took 1h 50 minutes for a cluster with 30 nodes.

Slowest areas are:
Initialize basic host facts - 27m 45s
Initialize cluster facts - 7m 57s
Verify Requirements - 1h 4m


2019-02-12 12:21:00,346 p=103939 u=root |  PLAYBOOK: upgrade_control_plane.yml ***************************************************************************************
2019-02-12 12:21:00,450 p=103939 u=root |  PLAY [Initialization Checkpoint Start] ************************************************************************************
2019-02-12 12:21:00,613 p=103939 u=root |  PLAY [Populate config host groups] ****************************************************************************************
2019-02-12 12:21:02,680 p=103939 u=root |  PLAY [Ensure that all non-node hosts are accessible] **********************************************************************
2019-02-12 12:21:07,270 p=103939 u=root |  PLAY [Initialize basic host facts] ****************************************************************************************
2019-02-12 12:48:52,411 p=103939 u=root |  PLAY [Retrieve existing master configs and validate] **********************************************************************
2019-02-12 12:49:03,127 p=103939 u=root |  PLAY [Initialize special first-master variables] **************************************************************************
2019-02-12 12:49:07,291 p=103939 u=root |  PLAY [Disable web console if required] ************************************************************************************
2019-02-12 12:49:08,220 p=103939 u=root |  PLAY [Setup yum repositories for all hosts] *******************************************************************************
2019-02-12 12:49:08,234 p=103939 u=root |  PLAY [Install packages necessary for installer] ***************************************************************************
2019-02-12 12:53:28,439 p=103939 u=root |  PLAY [Initialize cluster facts] *******************************************************************************************
2019-02-12 13:02:25,591 p=103939 u=root |  PLAY [Initialize etcd host variables] *************************************************************************************
2019-02-12 13:02:34,710 p=103939 u=root |  PLAY [Determine openshift_version to configure on first master] ***********************************************************
2019-02-12 13:02:44,718 p=103939 u=root |  PLAY [Set openshift_version for etcd, node, and master hosts] *************************************************************
2019-02-12 13:06:11,011 p=103939 u=root |  PLAY [Verify Requirements] ************************************************************************************************
2019-02-12 14:10:11,469 p=103939 u=root |  PLAY [Verify Node NetworkManager] *****************************************************************************************
2019-02-12 14:10:11,482 p=103939 u=root |  PLAY [Initialization Checkpoint End] **************************************************************************************

Inside Verify Requirements I had them add some extra logging to determine what was taking so long.

Master 1 is fast for each stage, taking 22 seconds:

2019-02-12 13:06:15,320 p=103939 u=root |   [WARNING]: Sanity Running 2019-02-12 13:06:15
2019-02-12 13:06:15,321 p=103939 u=root |   [WARNING]: Checking host master1 at 2019-02-12 13:06:15
2019-02-12 13:06:15,321 p=103939 u=root |   [WARNING]: Checking start at 2019-02-12 13:06:15
2019-02-12 13:06:15,654 p=103939 u=root |   [WARNING]: Checking hostvars at 2019-02-12 13:06:15
2019-02-12 13:06:15,942 p=103939 u=root |   [WARNING]: Checking deployment type at 2019-02-12 13:06:15
2019-02-12 13:06:16,206 p=103939 u=root |   [WARNING]: Checking python version at 2019-02-12 13:06:16
2019-02-12 13:06:16,480 p=103939 u=root |   [WARNING]: Checking image tag at 2019-02-12 13:06:16
2019-02-12 13:06:18,111 p=103939 u=root |   [WARNING]: Checking package version at 2019-02-12 13:06:18
2019-02-12 13:06:18,382 p=103939 u=root |   [WARNING]: Checking release format at 2019-02-12 13:06:18
2019-02-12 13:06:20,106 p=103939 u=root |   [WARNING]: Checking plugin check at 2019-02-12 13:06:20
2019-02-12 13:06:20,661 p=103939 u=root |   [WARNING]: Checking hostname vars at 2019-02-12 13:06:20
2019-02-12 13:06:21,189 p=103939 u=root |   [WARNING]: Checking session auth at 2019-02-12 13:06:21
2019-02-12 13:06:23,382 p=103939 u=root |   [WARNING]: Checking nfs configs at 2019-02-12 13:06:23
2019-02-12 13:06:23,648 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 13:06:23
2019-02-12 13:06:36,760 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 13:06:36
2019-02-12 13:06:37,070 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 13:06:37

The 3 infra nodes running logging, metrics, router, registry are also fast.

The nodes are a bit slower, example node1 at 1m 47s with 1m 2s being spent in removed vars checking.

2019-02-12 13:08:35,083 p=103939 u=root |   [WARNING]: Checking host node1 at 2019-02-12 13:08:35
2019-02-12 13:08:35,083 p=103939 u=root |   [WARNING]: Checking start at 2019-02-12 13:08:35
2019-02-12 13:08:36,299 p=103939 u=root |   [WARNING]: Checking hostvars at 2019-02-12 13:08:36
2019-02-12 13:08:37,491 p=103939 u=root |   [WARNING]: Checking deployment type at 2019-02-12 13:08:37
2019-02-12 13:08:38,681 p=103939 u=root |   [WARNING]: Checking python version at 2019-02-12 13:08:38
2019-02-12 13:08:40,466 p=103939 u=root |   [WARNING]: Checking image tag at 2019-02-12 13:08:40
2019-02-12 13:08:41,678 p=103939 u=root |   [WARNING]: Checking package version at 2019-02-12 13:08:41
2019-02-12 13:08:42,886 p=103939 u=root |   [WARNING]: Checking release format at 2019-02-12 13:08:42
2019-02-12 13:08:50,602 p=103939 u=root |   [WARNING]: Checking plugin check at 2019-02-12 13:08:50
2019-02-12 13:08:52,976 p=103939 u=root |   [WARNING]: Checking hostname vars at 2019-02-12 13:08:52
2019-02-12 13:08:55,934 p=103939 u=root |   [WARNING]: Checking session auth at 2019-02-12 13:08:55
2019-02-12 13:09:06,288 p=103939 u=root |   [WARNING]: Checking nfs configs at 2019-02-12 13:09:06
2019-02-12 13:09:08,081 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 13:09:08
2019-02-12 13:10:11,018 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 13:10:11
2019-02-12 13:10:12,194 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 13:10:12

The real issue is the 4 OCS nodes. They start at 13:39:02 and finish at 14:09:19,840, taking over 30 minutes.

The interesting thing is ocs1-3 are the original 3 and ocs4 was just added about a month ago. Over time, the OCS nodes are getting slower and slower. For all of the nodes, the majority of time is still being spent inside the removed vars check which is already better due to the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1669194 

OCS1: 9m 22s
OCS2: 9m 30s
OCS3: 9m 12s
OCS4: 2m 13s


2019-02-12 13:39:02,933 p=103939 u=root |   [WARNING]: Checking host ocs1 at 2019-02-12 13:39:02
2019-02-12 13:39:02,934 p=103939 u=root |   [WARNING]: Checking start at 2019-02-12 13:39:02
2019-02-12 13:39:10,182 p=103939 u=root |   [WARNING]: Checking hostvars at 2019-02-12 13:39:10
2019-02-12 13:39:18,036 p=103939 u=root |   [WARNING]: Checking deployment type at 2019-02-12 13:39:18
2019-02-12 13:39:25,117 p=103939 u=root |   [WARNING]: Checking python version at 2019-02-12 13:39:25
2019-02-12 13:39:32,984 p=103939 u=root |   [WARNING]: Checking image tag at 2019-02-12 13:39:32
2019-02-12 13:39:40,342 p=103939 u=root |   [WARNING]: Checking package version at 2019-02-12 13:39:40
2019-02-12 13:39:47,678 p=103939 u=root |   [WARNING]: Checking release format at 2019-02-12 13:39:47
2019-02-12 13:40:33,424 p=103939 u=root |   [WARNING]: Checking plugin check at 2019-02-12 13:40:33
2019-02-12 13:40:48,014 p=103939 u=root |   [WARNING]: Checking hostname vars at 2019-02-12 13:40:48
2019-02-12 13:41:03,790 p=103939 u=root |   [WARNING]: Checking session auth at 2019-02-12 13:41:03
2019-02-12 13:42:03,932 p=103939 u=root |   [WARNING]: Checking nfs configs at 2019-02-12 13:42:03
2019-02-12 13:42:11,357 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 13:42:11
2019-02-12 13:48:17,384 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 13:48:17
2019-02-12 13:48:24,959 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 13:48:24
2019-02-12 13:48:24,959 p=103939 u=root |   [WARNING]: Checking host ocs2 at 2019-02-12 13:48:24
2019-02-12 13:48:24,960 p=103939 u=root |   [WARNING]: Checking start at 2019-02-12 13:48:24
2019-02-12 13:48:33,253 p=103939 u=root |   [WARNING]: Checking hostvars at 2019-02-12 13:48:33
2019-02-12 13:48:40,713 p=103939 u=root |   [WARNING]: Checking deployment type at 2019-02-12 13:48:40
2019-02-12 13:48:48,294 p=103939 u=root |   [WARNING]: Checking python version at 2019-02-12 13:48:48
2019-02-12 13:48:56,530 p=103939 u=root |   [WARNING]: Checking image tag at 2019-02-12 13:48:56
2019-02-12 13:49:04,153 p=103939 u=root |   [WARNING]: Checking package version at 2019-02-12 13:49:04
2019-02-12 13:49:12,223 p=103939 u=root |   [WARNING]: Checking release format at 2019-02-12 13:49:12
2019-02-12 13:49:58,253 p=103939 u=root |   [WARNING]: Checking plugin check at 2019-02-12 13:49:58
2019-02-12 13:50:13,667 p=103939 u=root |   [WARNING]: Checking hostname vars at 2019-02-12 13:50:13
2019-02-12 13:50:29,338 p=103939 u=root |   [WARNING]: Checking session auth at 2019-02-12 13:50:29
2019-02-12 13:51:30,683 p=103939 u=root |   [WARNING]: Checking nfs configs at 2019-02-12 13:51:30
2019-02-12 13:51:37,915 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 13:51:37
2019-02-12 13:57:46,247 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 13:57:46
2019-02-12 13:57:54,076 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 13:57:54
2019-02-12 13:57:54,076 p=103939 u=root |   [WARNING]: Checking host ocs3 at 2019-02-12 13:57:54
2019-02-12 13:57:54,077 p=103939 u=root |   [WARNING]: Checking start at 2019-02-12 13:57:54
2019-02-12 13:58:01,908 p=103939 u=root |   [WARNING]: Checking hostvars at 2019-02-12 13:58:01
2019-02-12 13:58:09,264 p=103939 u=root |   [WARNING]: Checking deployment type at 2019-02-12 13:58:09
2019-02-12 13:58:16,563 p=103939 u=root |   [WARNING]: Checking python version at 2019-02-12 13:58:16
2019-02-12 13:58:24,192 p=103939 u=root |   [WARNING]: Checking image tag at 2019-02-12 13:58:24
2019-02-12 13:58:31,350 p=103939 u=root |   [WARNING]: Checking package version at 2019-02-12 13:58:31
2019-02-12 13:58:39,336 p=103939 u=root |   [WARNING]: Checking release format at 2019-02-12 13:58:39
2019-02-12 13:59:23,447 p=103939 u=root |   [WARNING]: Checking plugin check at 2019-02-12 13:59:23
2019-02-12 13:59:37,735 p=103939 u=root |   [WARNING]: Checking hostname vars at 2019-02-12 13:59:37
2019-02-12 13:59:52,792 p=103939 u=root |   [WARNING]: Checking session auth at 2019-02-12 13:59:52
2019-02-12 14:00:52,227 p=103939 u=root |   [WARNING]: Checking nfs configs at 2019-02-12 14:00:52
2019-02-12 14:00:59,527 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 14:00:59
2019-02-12 14:06:59,539 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 14:06:59
2019-02-12 14:07:06,656 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 14:07:06
2019-02-12 14:07:06,657 p=103939 u=root |   [WARNING]: Checking host ocs4 at 2019-02-12 14:07:06
2019-02-12 14:07:06,657 p=103939 u=root |   [WARNING]: Checking start at 2019-02-12 14:07:06
2019-02-12 14:07:08,938 p=103939 u=root |   [WARNING]: Checking hostvars at 2019-02-12 14:07:08
2019-02-12 14:07:10,521 p=103939 u=root |   [WARNING]: Checking deployment type at 2019-02-12 14:07:10
2019-02-12 14:07:12,099 p=103939 u=root |   [WARNING]: Checking python version at 2019-02-12 14:07:12
2019-02-12 14:07:14,330 p=103939 u=root |   [WARNING]: Checking image tag at 2019-02-12 14:07:14
2019-02-12 14:07:15,899 p=103939 u=root |   [WARNING]: Checking package version at 2019-02-12 14:07:15
2019-02-12 14:07:17,462 p=103939 u=root |   [WARNING]: Checking release format at 2019-02-12 14:07:17
2019-02-12 14:07:28,160 p=103939 u=root |   [WARNING]: Checking plugin check at 2019-02-12 14:07:28
2019-02-12 14:07:31,890 p=103939 u=root |   [WARNING]: Checking hostname vars at 2019-02-12 14:07:31
2019-02-12 14:07:35,760 p=103939 u=root |   [WARNING]: Checking session auth at 2019-02-12 14:07:35
2019-02-12 14:07:49,934 p=103939 u=root |   [WARNING]: Checking nfs configs at 2019-02-12 14:07:49
2019-02-12 14:07:52,203 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 14:07:52
2019-02-12 14:09:18,141 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 14:09:18
2019-02-12 14:09:19,840 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 14:09:19

Do we need to check removed vars against every single node in the inventory if it's only checking the static hosts file?


Version-Release number of the following components:

3.11.69

How reproducible:

Always

Steps to Reproduce:
1. Run the control plane or node upgrades
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

See above and attached logs.

Expected results:

Much faster initialization

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Attached

Comment 6 Scott Dodson 2019-02-26 13:53:59 UTC
Matt,

Do you have the instrumented code or patches that we can look at? Reading the data you've provided it isn't clear if it's htpasswd or the removed vars that's actually consuming time. Is the warning message emitted on entry or exit of the given functions?

2019-02-12 14:00:59,527 p=103939 u=root |   [WARNING]: Checking htpasswd at 2019-02-12 14:00:59

-- 6 minutes here --

2019-02-12 14:06:59,539 p=103939 u=root |   [WARNING]: Checking removed vars at 2019-02-12 14:06:59
2019-02-12 14:07:06,656 p=103939 u=root |   [WARNING]: Checking end at 2019-02-12 14:07:06

Comment 7 Matthew Robson 2019-02-26 14:05:12 UTC
Sorry, should have included that. Its instrumented at exit - should have done entry and exit both to make it easier to read. I will attached the sanity_checks.py file.

So the difference between removed and htpasswd is how long removed took.

self.check_htpasswd_provider(hostvars, host)
display.warning("Checking htpasswd at %s" % datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))
check_for_removed_vars(hostvars, host)
display.warning("Checking removed vars at %s" % datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))

Comment 8 Matthew Robson 2019-02-26 14:06:28 UTC
Created attachment 1538830 [details]
instrumented sanity_checks.py

This is the file from 3.11.59 with the first sanity check fix applied along with the instrumentation.

Comment 9 Vadim Rutkovsky 2019-02-28 16:24:25 UTC
Created 3.11 PR - https://github.com/openshift/openshift-ansible/pull/11268

Comment 10 Matthew Robson 2019-02-28 16:32:59 UTC
Vadim, is there an easy way to run the initialization / sanity check on it's own so we can test this at the customer?

Comment 14 Vadim Rutkovsky 2019-03-08 11:53:02 UTC
Fix is available in openshift-ansible-3.11.92-1

Comment 15 Weihua Meng 2019-03-18 06:09:57 UTC
Fixed.

openshift-ansible-3.11.96-1

In my test

With 3.11.69, TASK [Run variable sanity checks] costs 27 seconds

with 3.11.96, TASK [Run variable sanity checks] costs less than 1 second

Comment 17 errata-xmlrpc 2019-06-26 09:07:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605


Note You need to log in before you can comment on or make changes to this bug.