Hide Forgot
Description of problem: Seeing Initialization taking a very long time in an average sized cluster on bare metal. Took 1h 50 minutes for a cluster with 30 nodes. Slowest areas are: Initialize basic host facts - 27m 45s Initialize cluster facts - 7m 57s Verify Requirements - 1h 4m 2019-02-12 12:21:00,346 p=103939 u=root | PLAYBOOK: upgrade_control_plane.yml *************************************************************************************** 2019-02-12 12:21:00,450 p=103939 u=root | PLAY [Initialization Checkpoint Start] ************************************************************************************ 2019-02-12 12:21:00,613 p=103939 u=root | PLAY [Populate config host groups] **************************************************************************************** 2019-02-12 12:21:02,680 p=103939 u=root | PLAY [Ensure that all non-node hosts are accessible] ********************************************************************** 2019-02-12 12:21:07,270 p=103939 u=root | PLAY [Initialize basic host facts] **************************************************************************************** 2019-02-12 12:48:52,411 p=103939 u=root | PLAY [Retrieve existing master configs and validate] ********************************************************************** 2019-02-12 12:49:03,127 p=103939 u=root | PLAY [Initialize special first-master variables] ************************************************************************** 2019-02-12 12:49:07,291 p=103939 u=root | PLAY [Disable web console if required] ************************************************************************************ 2019-02-12 12:49:08,220 p=103939 u=root | PLAY [Setup yum repositories for all hosts] ******************************************************************************* 2019-02-12 12:49:08,234 p=103939 u=root | PLAY [Install packages necessary for installer] *************************************************************************** 2019-02-12 12:53:28,439 p=103939 u=root | PLAY [Initialize cluster facts] ******************************************************************************************* 2019-02-12 13:02:25,591 p=103939 u=root | PLAY [Initialize etcd host variables] ************************************************************************************* 2019-02-12 13:02:34,710 p=103939 u=root | PLAY [Determine openshift_version to configure on first master] *********************************************************** 2019-02-12 13:02:44,718 p=103939 u=root | PLAY [Set openshift_version for etcd, node, and master hosts] ************************************************************* 2019-02-12 13:06:11,011 p=103939 u=root | PLAY [Verify Requirements] ************************************************************************************************ 2019-02-12 14:10:11,469 p=103939 u=root | PLAY [Verify Node NetworkManager] ***************************************************************************************** 2019-02-12 14:10:11,482 p=103939 u=root | PLAY [Initialization Checkpoint End] ************************************************************************************** Inside Verify Requirements I had them add some extra logging to determine what was taking so long. Master 1 is fast for each stage, taking 22 seconds: 2019-02-12 13:06:15,320 p=103939 u=root | [WARNING]: Sanity Running 2019-02-12 13:06:15 2019-02-12 13:06:15,321 p=103939 u=root | [WARNING]: Checking host master1 at 2019-02-12 13:06:15 2019-02-12 13:06:15,321 p=103939 u=root | [WARNING]: Checking start at 2019-02-12 13:06:15 2019-02-12 13:06:15,654 p=103939 u=root | [WARNING]: Checking hostvars at 2019-02-12 13:06:15 2019-02-12 13:06:15,942 p=103939 u=root | [WARNING]: Checking deployment type at 2019-02-12 13:06:15 2019-02-12 13:06:16,206 p=103939 u=root | [WARNING]: Checking python version at 2019-02-12 13:06:16 2019-02-12 13:06:16,480 p=103939 u=root | [WARNING]: Checking image tag at 2019-02-12 13:06:16 2019-02-12 13:06:18,111 p=103939 u=root | [WARNING]: Checking package version at 2019-02-12 13:06:18 2019-02-12 13:06:18,382 p=103939 u=root | [WARNING]: Checking release format at 2019-02-12 13:06:18 2019-02-12 13:06:20,106 p=103939 u=root | [WARNING]: Checking plugin check at 2019-02-12 13:06:20 2019-02-12 13:06:20,661 p=103939 u=root | [WARNING]: Checking hostname vars at 2019-02-12 13:06:20 2019-02-12 13:06:21,189 p=103939 u=root | [WARNING]: Checking session auth at 2019-02-12 13:06:21 2019-02-12 13:06:23,382 p=103939 u=root | [WARNING]: Checking nfs configs at 2019-02-12 13:06:23 2019-02-12 13:06:23,648 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 13:06:23 2019-02-12 13:06:36,760 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 13:06:36 2019-02-12 13:06:37,070 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 13:06:37 The 3 infra nodes running logging, metrics, router, registry are also fast. The nodes are a bit slower, example node1 at 1m 47s with 1m 2s being spent in removed vars checking. 2019-02-12 13:08:35,083 p=103939 u=root | [WARNING]: Checking host node1 at 2019-02-12 13:08:35 2019-02-12 13:08:35,083 p=103939 u=root | [WARNING]: Checking start at 2019-02-12 13:08:35 2019-02-12 13:08:36,299 p=103939 u=root | [WARNING]: Checking hostvars at 2019-02-12 13:08:36 2019-02-12 13:08:37,491 p=103939 u=root | [WARNING]: Checking deployment type at 2019-02-12 13:08:37 2019-02-12 13:08:38,681 p=103939 u=root | [WARNING]: Checking python version at 2019-02-12 13:08:38 2019-02-12 13:08:40,466 p=103939 u=root | [WARNING]: Checking image tag at 2019-02-12 13:08:40 2019-02-12 13:08:41,678 p=103939 u=root | [WARNING]: Checking package version at 2019-02-12 13:08:41 2019-02-12 13:08:42,886 p=103939 u=root | [WARNING]: Checking release format at 2019-02-12 13:08:42 2019-02-12 13:08:50,602 p=103939 u=root | [WARNING]: Checking plugin check at 2019-02-12 13:08:50 2019-02-12 13:08:52,976 p=103939 u=root | [WARNING]: Checking hostname vars at 2019-02-12 13:08:52 2019-02-12 13:08:55,934 p=103939 u=root | [WARNING]: Checking session auth at 2019-02-12 13:08:55 2019-02-12 13:09:06,288 p=103939 u=root | [WARNING]: Checking nfs configs at 2019-02-12 13:09:06 2019-02-12 13:09:08,081 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 13:09:08 2019-02-12 13:10:11,018 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 13:10:11 2019-02-12 13:10:12,194 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 13:10:12 The real issue is the 4 OCS nodes. They start at 13:39:02 and finish at 14:09:19,840, taking over 30 minutes. The interesting thing is ocs1-3 are the original 3 and ocs4 was just added about a month ago. Over time, the OCS nodes are getting slower and slower. For all of the nodes, the majority of time is still being spent inside the removed vars check which is already better due to the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1669194 OCS1: 9m 22s OCS2: 9m 30s OCS3: 9m 12s OCS4: 2m 13s 2019-02-12 13:39:02,933 p=103939 u=root | [WARNING]: Checking host ocs1 at 2019-02-12 13:39:02 2019-02-12 13:39:02,934 p=103939 u=root | [WARNING]: Checking start at 2019-02-12 13:39:02 2019-02-12 13:39:10,182 p=103939 u=root | [WARNING]: Checking hostvars at 2019-02-12 13:39:10 2019-02-12 13:39:18,036 p=103939 u=root | [WARNING]: Checking deployment type at 2019-02-12 13:39:18 2019-02-12 13:39:25,117 p=103939 u=root | [WARNING]: Checking python version at 2019-02-12 13:39:25 2019-02-12 13:39:32,984 p=103939 u=root | [WARNING]: Checking image tag at 2019-02-12 13:39:32 2019-02-12 13:39:40,342 p=103939 u=root | [WARNING]: Checking package version at 2019-02-12 13:39:40 2019-02-12 13:39:47,678 p=103939 u=root | [WARNING]: Checking release format at 2019-02-12 13:39:47 2019-02-12 13:40:33,424 p=103939 u=root | [WARNING]: Checking plugin check at 2019-02-12 13:40:33 2019-02-12 13:40:48,014 p=103939 u=root | [WARNING]: Checking hostname vars at 2019-02-12 13:40:48 2019-02-12 13:41:03,790 p=103939 u=root | [WARNING]: Checking session auth at 2019-02-12 13:41:03 2019-02-12 13:42:03,932 p=103939 u=root | [WARNING]: Checking nfs configs at 2019-02-12 13:42:03 2019-02-12 13:42:11,357 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 13:42:11 2019-02-12 13:48:17,384 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 13:48:17 2019-02-12 13:48:24,959 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 13:48:24 2019-02-12 13:48:24,959 p=103939 u=root | [WARNING]: Checking host ocs2 at 2019-02-12 13:48:24 2019-02-12 13:48:24,960 p=103939 u=root | [WARNING]: Checking start at 2019-02-12 13:48:24 2019-02-12 13:48:33,253 p=103939 u=root | [WARNING]: Checking hostvars at 2019-02-12 13:48:33 2019-02-12 13:48:40,713 p=103939 u=root | [WARNING]: Checking deployment type at 2019-02-12 13:48:40 2019-02-12 13:48:48,294 p=103939 u=root | [WARNING]: Checking python version at 2019-02-12 13:48:48 2019-02-12 13:48:56,530 p=103939 u=root | [WARNING]: Checking image tag at 2019-02-12 13:48:56 2019-02-12 13:49:04,153 p=103939 u=root | [WARNING]: Checking package version at 2019-02-12 13:49:04 2019-02-12 13:49:12,223 p=103939 u=root | [WARNING]: Checking release format at 2019-02-12 13:49:12 2019-02-12 13:49:58,253 p=103939 u=root | [WARNING]: Checking plugin check at 2019-02-12 13:49:58 2019-02-12 13:50:13,667 p=103939 u=root | [WARNING]: Checking hostname vars at 2019-02-12 13:50:13 2019-02-12 13:50:29,338 p=103939 u=root | [WARNING]: Checking session auth at 2019-02-12 13:50:29 2019-02-12 13:51:30,683 p=103939 u=root | [WARNING]: Checking nfs configs at 2019-02-12 13:51:30 2019-02-12 13:51:37,915 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 13:51:37 2019-02-12 13:57:46,247 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 13:57:46 2019-02-12 13:57:54,076 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 13:57:54 2019-02-12 13:57:54,076 p=103939 u=root | [WARNING]: Checking host ocs3 at 2019-02-12 13:57:54 2019-02-12 13:57:54,077 p=103939 u=root | [WARNING]: Checking start at 2019-02-12 13:57:54 2019-02-12 13:58:01,908 p=103939 u=root | [WARNING]: Checking hostvars at 2019-02-12 13:58:01 2019-02-12 13:58:09,264 p=103939 u=root | [WARNING]: Checking deployment type at 2019-02-12 13:58:09 2019-02-12 13:58:16,563 p=103939 u=root | [WARNING]: Checking python version at 2019-02-12 13:58:16 2019-02-12 13:58:24,192 p=103939 u=root | [WARNING]: Checking image tag at 2019-02-12 13:58:24 2019-02-12 13:58:31,350 p=103939 u=root | [WARNING]: Checking package version at 2019-02-12 13:58:31 2019-02-12 13:58:39,336 p=103939 u=root | [WARNING]: Checking release format at 2019-02-12 13:58:39 2019-02-12 13:59:23,447 p=103939 u=root | [WARNING]: Checking plugin check at 2019-02-12 13:59:23 2019-02-12 13:59:37,735 p=103939 u=root | [WARNING]: Checking hostname vars at 2019-02-12 13:59:37 2019-02-12 13:59:52,792 p=103939 u=root | [WARNING]: Checking session auth at 2019-02-12 13:59:52 2019-02-12 14:00:52,227 p=103939 u=root | [WARNING]: Checking nfs configs at 2019-02-12 14:00:52 2019-02-12 14:00:59,527 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 14:00:59 2019-02-12 14:06:59,539 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 14:06:59 2019-02-12 14:07:06,656 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 14:07:06 2019-02-12 14:07:06,657 p=103939 u=root | [WARNING]: Checking host ocs4 at 2019-02-12 14:07:06 2019-02-12 14:07:06,657 p=103939 u=root | [WARNING]: Checking start at 2019-02-12 14:07:06 2019-02-12 14:07:08,938 p=103939 u=root | [WARNING]: Checking hostvars at 2019-02-12 14:07:08 2019-02-12 14:07:10,521 p=103939 u=root | [WARNING]: Checking deployment type at 2019-02-12 14:07:10 2019-02-12 14:07:12,099 p=103939 u=root | [WARNING]: Checking python version at 2019-02-12 14:07:12 2019-02-12 14:07:14,330 p=103939 u=root | [WARNING]: Checking image tag at 2019-02-12 14:07:14 2019-02-12 14:07:15,899 p=103939 u=root | [WARNING]: Checking package version at 2019-02-12 14:07:15 2019-02-12 14:07:17,462 p=103939 u=root | [WARNING]: Checking release format at 2019-02-12 14:07:17 2019-02-12 14:07:28,160 p=103939 u=root | [WARNING]: Checking plugin check at 2019-02-12 14:07:28 2019-02-12 14:07:31,890 p=103939 u=root | [WARNING]: Checking hostname vars at 2019-02-12 14:07:31 2019-02-12 14:07:35,760 p=103939 u=root | [WARNING]: Checking session auth at 2019-02-12 14:07:35 2019-02-12 14:07:49,934 p=103939 u=root | [WARNING]: Checking nfs configs at 2019-02-12 14:07:49 2019-02-12 14:07:52,203 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 14:07:52 2019-02-12 14:09:18,141 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 14:09:18 2019-02-12 14:09:19,840 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 14:09:19 Do we need to check removed vars against every single node in the inventory if it's only checking the static hosts file? Version-Release number of the following components: 3.11.69 How reproducible: Always Steps to Reproduce: 1. Run the control plane or node upgrades 2. 3. Actual results: Please include the entire output from the last TASK line through the end of output if an error is generated See above and attached logs. Expected results: Much faster initialization Additional info: Please attach logs from ansible-playbook with the -vvv flag Attached
Matt, Do you have the instrumented code or patches that we can look at? Reading the data you've provided it isn't clear if it's htpasswd or the removed vars that's actually consuming time. Is the warning message emitted on entry or exit of the given functions? 2019-02-12 14:00:59,527 p=103939 u=root | [WARNING]: Checking htpasswd at 2019-02-12 14:00:59 -- 6 minutes here -- 2019-02-12 14:06:59,539 p=103939 u=root | [WARNING]: Checking removed vars at 2019-02-12 14:06:59 2019-02-12 14:07:06,656 p=103939 u=root | [WARNING]: Checking end at 2019-02-12 14:07:06
Sorry, should have included that. Its instrumented at exit - should have done entry and exit both to make it easier to read. I will attached the sanity_checks.py file. So the difference between removed and htpasswd is how long removed took. self.check_htpasswd_provider(hostvars, host) display.warning("Checking htpasswd at %s" % datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')) check_for_removed_vars(hostvars, host) display.warning("Checking removed vars at %s" % datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S'))
Created attachment 1538830 [details] instrumented sanity_checks.py This is the file from 3.11.59 with the first sanity check fix applied along with the instrumentation.
Created 3.11 PR - https://github.com/openshift/openshift-ansible/pull/11268
Vadim, is there an easy way to run the initialization / sanity check on it's own so we can test this at the customer?
Fix is available in openshift-ansible-3.11.92-1
Fixed. openshift-ansible-3.11.96-1 In my test With 3.11.69, TASK [Run variable sanity checks] costs 27 seconds with 3.11.96, TASK [Run variable sanity checks] costs less than 1 second
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605