Bug 1360440 - openshift-ansible install extremely slow in openshift 3.3 with ansible 2.1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.2.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.2.1
Assignee: Andrew Butcher
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Duplicates: 1360433 1361559 (view as bug list)
Depends On:
Blocks:
 
Reported: 2016-07-26 17:49 UTC by Mike Fiedler
Modified: 2023-09-14 03:28 UTC (History)
CC: 16 users

Fixed In Version: ansible-2.2.0-0.2.pi.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-18 19:29:42 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
pbench data ansible pids (187.51 KB, image/png)
2016-07-27 10:58 UTC, Jeremy Eder


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1639 0 normal SHIPPED_LIVE OpenShift Enterprise atomic-openshift-utils bug fix and enhancement update 2016-08-18 23:26:45 UTC

Description Mike Fiedler 2016-07-26 17:49:17 UTC
Description of problem:

I'm installing a 300 node cluster on EC2. In 3.2 this install would take about 2.5 hours. On 3.3 the install is still running after 6 hours and is getting slower as it goes on. Each time it runs a step which skips all 300 nodes (examples are cluster set_fact, openshift_version: fail, docker version checks) it is taking over 2 minutes to skip all 300 nodes. After 6 hours the nodes are not even installed yet.

- I am using the EC2 internal hostname (e.g. ip-172-31-38-181.us-west-2.compute.internal) in the inventory
- docker images for install are pre-pulled to each node
- ansible forks=100
- ansible is 2.1.0.0-1
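For reference, the forks setting above corresponds roughly to an ansible.cfg like the following (a minimal sketch, not the reporter's actual file; the pipelining setting is an assumption, commonly used to cut per-task SSH overhead on large clusters):

```ini
# ansible.cfg -- sketch of the settings described above (not the actual file)
[defaults]
forks = 100                 # parallelism used for the 300-node run

[ssh_connection]
pipelining = True           # assumption: often enabled to reduce per-task SSH round trips
```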

All of the steps are slow, but the extreme slowness of the skip steps (and the numerous repetitions of them) seems to indicate a problem.


Version-Release number of selected component (if applicable): 3.3.0.10


How reproducible: 1 attempt to install 300 nodes, I may not get another given the high cost.


Steps to Reproduce:
1. Create a core cluster with 3 m4.4xlarge masters, 3 m4.4xlarge etcd, 1 m4.xlarge master load balancer, 2 m4.xlarge router/registry. Ensure this cluster is running well.
2. Create 300 additional instances and run the byo/openshift-node/scaleup.yml playbook (running from one of the m4.4xlarge masters - 16 vCPU, 64GB RAM).
3. The install will run for a very long time.   As it progresses, it will get visibly slower.   After 6 hours, time how long a step which skips all nodes takes.

Actual results:

Install not complete after 6 hours

Expected results:

Install completes in similar time frame as 3.2 (~2.5 hours).

Additional info:

I have some pbench tools running - will add a pointer to any interesting data shortly.

Comment 1 Mike Fiedler 2016-07-26 17:52:28 UTC
"nodes not event installed yet" = no atomic-openshift-node package installed.

Comment 2 Jeremy Eder 2016-07-26 17:55:58 UTC
This problem shows up at a much smaller scale.  Even 10 nodes is brutally slow.
As reported to aos-devel, I had an install fail last week.

But the important thing, as the snippet below shows, is that the install was already taking over 1 hour for just 3 masters, 2 infra nodes and 10 nodes.


PLAY RECAP *********************************************************************
192.2.0.10                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.11                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.12                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.13                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.14                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.15                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.16                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.17                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.18                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.19                 : ok=131  changed=15   unreachable=0    failed=1   
192.2.0.5                  : ok=385  changed=30   unreachable=0    failed=1   
192.2.0.6                  : ok=286  changed=23   unreachable=0    failed=1   
192.2.0.7                  : ok=286  changed=23   unreachable=0    failed=1   
192.2.0.8                  : ok=155  changed=15   unreachable=0    failed=1   
192.2.0.9                  : ok=131  changed=15   unreachable=0    failed=1   
localhost                  : ok=16   changed=9    unreachable=0    failed=0   

Saturday 23 July 2016  18:43:23 -0400 (0:00:00.010)       1:01:44.270 ********* 
=============================================================================== 
openshift_node : Configure Node settings ------------------------------ 147.89s
openshift_node : Install sdn-ovs package ------------------------------- 82.27s
openshift_node : Check for existence of virt_sandbox_use_nfs seboolean -- 70.62s
openshift_node : Check for existence of virt_use_nfs seboolean --------- 69.42s
openshift_node : Check for existence of virt_sandbox_use_fusefs seboolean -- 69.32s
openshift_node : Check for existence of virt_use_fusefs seboolean ------ 68.90s
openshift_node : Install Ceph storage plugin dependencies -------------- 61.40s
openshift_master : Restore Master Proxy Config Options ----------------- 55.49s
openshift_node : Install GlusterFS storage plugin dependencies --------- 50.82s
openshift_node : Set seboolean to allow gluster storage plugin access from containers(sandbox) -- 47.84s
openshift_node : Install NFS storage plugin dependencies --------------- 47.75s
openshift_node : Install iSCSI storage plugin dependencies ------------- 47.35s
openshift_node : Set seboolean to allow nfs storage plugin access from containers -- 46.98s
openshift_node : Set seboolean to allow nfs storage plugin access from containers(sandbox) -- 46.94s
openshift_node : Set seboolean to allow gluster storage plugin access from containers -- 46.92s
openshift_master_certificates : file ----------------------------------- 41.18s
openshift_node : Install Node dependencies docker service file --------- 32.51s
openshift_node : Install Node docker service file ---------------------- 32.39s
openshift_master : Create the ha systemd unit files -------------------- 31.80s
openshift_node : Create the openvswitch service env file --------------- 31.74s

real    61m47.068s
user    96m11.908s
sys     13m52.690s
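Per-task timing summaries like the one above are typically produced by a profiling callback; a sketch of how this is commonly enabled with Ansible 2.x (the exact mechanism used for the run above is an assumption -- openshift-ansible also carried its own profiling callback at the time, mentioned in comment 10):

```ini
# ansible.cfg -- enable per-task timing output similar to the summary above
[defaults]
callback_whitelist = profile_tasks   # prints the N slowest tasks at the end of a run
```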

Comment 3 Mike Fiedler 2016-07-26 18:24:40 UTC
pbench data is here:  http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-8-4/slow-ansible/tools-default/ip-172-31-8-4/

It was taken during several of the "Skipping..." steps with one or two green steps in between.

Comment 4 Mike Fiedler 2016-07-26 20:19:29 UTC
pbench data shows ansible maxing out 2 full cores while skipping nodes

Comment 5 Jeremy Eder 2016-07-27 10:58:27 UTC
Created attachment 1184591 [details]
pbench data ansible pids

Comment 6 Jeremy Eder 2016-07-27 10:59:05 UTC
Attached the specific picture Mike's referring to.

Comment 7 Devan Goodwin 2016-07-27 12:59:45 UTC
I can reproduce just by shortcutting the whole process with a "fail" very early on in playbooks/common/openshift-cluster/config.yml. It currently runs to my fail point in 31 seconds; if I check out 3.3.0-1 it takes only 11, so we're running about 3x as slow right now. It seems to affect *everything* as well; even simple things like skipped tasks and debug statements look absurdly slow to my eye.

The Ansible version does not appear to be causing it; it's constant in my tests.

Trying to get to the bottom of it now with bisect.

Comment 8 Jan Provaznik 2016-07-27 14:06:54 UTC
Hi, surprisingly, on my setup the performance issue does seem to be related to the Ansible version:

setup:
openshift-ansible with "git checkout 08791978fdf3ee385760761d4fc6bc47febf1732" (before ansible 2.1.0 requirement)
origin-1.2.1

running "ansible-playbook -vvvv --inventory /var/lib/ansible/inventory /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml"

takes ~1 minute on 1.9.6 and >6 minutes with any higher ansible version, tested versions:
v2.0.0.1-1
v2.1.0.0-1
v2.1.1.0-0.5.rc5
origin/devel (newest) is not compatible with openshift-ansible playbooks, failing with error:

fatal: [ansible2-origin-node-2nwfib9m.example.com]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "module_stdout": "Traceback (most recent call last):\r\n  File \"/tmp/ansible_a0IOmm/ansible_module_openshift_facts.py\", line 2168, in <module>\r\n    main()\r\n  File \"/tmp/ansible_a0IOmm/ansible_module_openshift_facts.py\", line 2149, in main\r\n    protected_facts_to_overwrite)\r\n  File \"/tmp/ansible_a0IOmm/ansible_module_openshift_facts.py\", line 1620, in __init__\r\n    self.system_facts = ansible_facts(module, ['hardware', 'network', 'virtual', 'facter'])\r\n  File \"/tmp/ansible_a0IOmm/ansible_modlib.zip/ansible/module_utils/facts.py\", line 3243, in ansible_facts\r\n  File \"/tmp/ansible_a0IOmm/ansible_modlib.zip/ansible/module_utils/facts.py\", line 1002, in populate\r\n  File \"/tmp/ansible_a0IOmm/ansible_modlib.zip/ansible/module_utils/facts.py\", line 132, in wrapper\r\nUnboundLocalError: local variable 'seconds' referenced before assignment\r\n", "msg": "MODULE FAILURE", "parsed": false}

Comment 9 Devan Goodwin 2016-07-27 14:50:15 UTC
It's possible my testing steps were off, as my fail point did include radically different things depending on which git hash I had checked out.

I am still trying to determine if something changed; I have a strong suspicion I was running this ok with 2.1 recently and it got very bad last week. I will fall back to ansible version testing soon if I can't find a "good" revision.

Comment 10 Devan Goodwin 2016-07-27 15:49:17 UTC
Jan I think you were very right.

My testing on openshift-ansible:

Fresh 3.3 rpm install single master, git master, ansible 2.1.0: 26 minutes (WHAT?)

Rerunning config.yml (i.e. a maintenance run): 16:17

I went back to the openshift-ansible 3.3.2-1 tag and re-ran config.yml: 14:56

I tried disabling the ansible profiling module, no change.

Then I installed ansible 1.9.4 and re-tried with 3.3.2-1: 3:31

It would appear Ansible 2.1 is about 4x as slow.

Comment 11 Devan Goodwin 2016-07-27 17:32:58 UTC
Suggested workaround would be using latest 3.2 openshift-ansible rpms and ansible 1.9.4.

Problem reportedly caused by new dynamic includes in Ansible 2.
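For context, Ansible 2.x made task includes dynamic by default, so they are re-evaluated per host on every pass rather than inlined once at parse time; Ansible 2.1 added a `static` keyword to restore 1.9-style behavior. A sketch (hypothetical file names, not an actual openshift-ansible change):

```yaml
# tasks/main.yml -- dynamic vs. static includes in Ansible 2.1 (illustrative)
- include: storage_plugins.yml   # dynamic by default in 2.x: processed per host, per run
- include: node_config.yml
  static: yes                    # forces 1.9-style static inlining at parse time
```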

Moving to NEW in case someone else wants to take a crack at this, I need to set aside for now.

Comment 12 Aleksandar Kostadinov 2016-07-28 20:41:05 UTC
Probably this one [1]. I wonder if we can get support from ansible dev team.

[1] https://github.com/ansible/ansible/issues/16749

Comment 13 Mike Fiedler 2016-07-29 02:10:17 UTC
Marking this TestBlocker for scalability testing. On a new run on an OpenStack cluster, the install of 3x master, 3x etcd, 2x router/registry and 2x nodes took 1 hr 45 min. We then tried to run the node scaleup playbook to 300 nodes, and after 6 hours it had not yet installed the rpms - all the time was spent repeatedly checking Docker versions, configuring docker, etc. Most steps were skipping all nodes.

Comment 14 Devan Goodwin 2016-07-29 11:32:13 UTC
That seems like a good possibility, Aleksandar. Looking at timings for very simple tasks in some of our roles, I saw them start out very fast, get slower, and later speed up again, possibly when we'd returned to the root of the tree and started down another branch.

Comment 15 Scott Dodson 2016-08-02 15:10:30 UTC
*** Bug 1360433 has been marked as a duplicate of this bug. ***

Comment 18 Scott Dodson 2016-08-04 18:48:15 UTC
*** Bug 1361559 has been marked as a duplicate of this bug. ***

Comment 22 Scott Dodson 2016-08-09 22:03:28 UTC
Addressed by updating to a patched version of ansible derived from upstream devel branch.

Comment 24 Jeremy Eder 2016-08-10 12:14:21 UTC
Clearing needinfo on jdetiber as it's out of date.

Using ansible-2.2.0-0.2.pi.el7 plus abutcher openshift-ansible performance improvements branch, a 100-node openshift cluster can be deployed in 21m48s.  This is a significant improvement over ansible-2.1.

3 masters, 3 etcd, 1 lb, 2 infra, 100 nodes. openshift-3.3.0.13.

Wednesday 10 August 2016  07:36:02 -0400 (0:00:00.262)       0:21:48.628 ****** 
=============================================================================== 
openshift_manage_node : Label nodes ----------------------------------- 150.19s
openshift_manage_node : Set node schedulability ----------------------- 108.51s
openshift_facts : Gather Cluster facts and set is_containerized if needed -- 49.31s
openshift_manage_node : Wait for Node Registration --------------------- 32.43s
openshift_master : pause ----------------------------------------------- 15.12s
openshift_master : pause ----------------------------------------------- 15.12s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 8.42s
openshift_clock : Set clock facts --------------------------------------- 7.78s
openshift_node : Set node facts ----------------------------------------- 7.46s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.35s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.26s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.23s
openshift_docker_facts : Set docker facts ------------------------------- 7.15s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.13s
openshift_cloud_provider : Set cloud provider facts --------------------- 7.12s
openshift_common : Set common Cluster facts ----------------------------- 7.08s
openshift_docker_facts : Set docker facts ------------------------------- 7.06s
openshift_facts --------------------------------------------------------- 6.80s
openshift_facts --------------------------------------------------------- 6.63s
openshift_common : Set version facts ------------------------------------ 6.59s

Comment 25 Aleksandar Kostadinov 2016-08-10 12:38:58 UTC
@Jeremy, curious to know how it compares with 3.2 on ansible 1.9.x.

Comment 26 Pete MacKinnon 2016-08-10 13:39:46 UTC
Could Ansible fact caching be enabled (if not already) to yield further speedups?

http://docs.ansible.com/ansible/playbooks_variables.html#fact-caching

May need a redis dep.
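The fact-caching configuration referred to above looks roughly like this in ansible.cfg (a sketch based on the linked docs; Ansible 2.x also ships a `jsonfile` backend that avoids the redis dependency):

```ini
# ansible.cfg -- fact caching sketch (values are illustrative)
[defaults]
gathering = smart                        # only gather facts when not already cached
fact_caching = jsonfile                  # or 'redis', which needs a running redis server
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 600               # seconds before cached facts expire
```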

Comment 27 Scott Dodson 2016-08-10 14:08:34 UTC
This bug was filed due to a very specific regression in performance when upgrading to ansible 2.x. Lets keep this bug focused on that and open other bugs as appropriate.

Comment 28 Jeremy Eder 2016-08-10 15:06:30 UTC
Alex -- 3.2 @ 100 nodes was around 38 minutes. BUT: my new base VM images are far more sophisticated and pre-load a lot more than I was doing in the 3.2 days, so it's not apples-to-apples. I expect 3.2+1.9.4 and 3.3+2.2pi are now roughly on par. I'm working to get even more data at larger scale as well.

Pete -- fact caching is enabled https://github.com/openshift/svt/blob/master/image_provisioner/ansible.cfg

Comment 30 Gaoyun Pei 2016-08-12 06:43:05 UTC
When building the 1 master + 1 node environment which is QE's basic testing env, using openshift-ansible-3.2.22-1, the installation took about 30 min with ansible-2.2.0-0.3.prerelease.el7.noarch and about 44 min with ansible-2.1.1.0-1.el7.noarch.


Installation with ansible-2.2.0
PLAY RECAP *********************************************************************
localhost                  : ok=13   changed=7    unreachable=0    failed=0   
openshift-x.com : ok=143  changed=43   unreachable=0    failed=0   
openshift-x.com : ok=448  changed=110  unreachable=0    failed=0   

Friday 12 August 2016  04:46:13 +0000 (0:00:00.232)       0:30:29.036 ********* 
=============================================================================== 
openshift_common : Install the base package for versioning ------------- 81.83s
openshift_common : Install the base package for versioning ------------- 63.66s
openshift_version : Gather common package version ---------------------- 21.88s
openshift_common : Set version facts ----------------------------------- 21.13s
os_firewall : Add iptables allow rules --------------------------------- 18.76s
openshift_node : Install Ceph storage plugin dependencies -------------- 18.11s
openshift_master : Start and enable master ----------------------------- 17.80s
setup ------------------------------------------------------------------ 15.62s
setup ------------------------------------------------------------------ 15.60s
setup ------------------------------------------------------------------ 15.60s
openshift_manageiq : Configure role/user permissions ------------------- 15.21s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.50s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.31s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.31s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.30s
openshift_node : Install Node package ---------------------------------- 14.28s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.19s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.08s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.06s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.02s


Installation with ansible-2.1.1.0
PLAY RECAP *********************************************************************
localhost                  : ok=13   changed=7    unreachable=0    failed=0   
openshift-x.com : ok=448  changed=107  unreachable=0    failed=0   
openshift-x.com : ok=143  changed=44   unreachable=0    failed=0   

Friday 12 August 2016  06:36:40 +0000 (0:00:01.470)       0:44:00.320 ********* 
=============================================================================== 
openshift_common : Install the base package for versioning ------------- 72.92s
openshift_common : Install the base package for versioning ------------- 63.21s
openshift_storage_nfs : Install nfs-utils ------------------------------ 31.10s
openshift_master : Start and enable master ----------------------------- 29.84s
openshift_common : Set common Cluster facts ---------------------------- 26.70s
openshift_version : Gather common package version ---------------------- 21.83s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 20.86s
openshift_common : Set version facts ----------------------------------- 20.67s
openshift_master_certificates : file ----------------------------------- 19.86s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 18.87s
openshift_node : Install Ceph storage plugin dependencies -------------- 17.96s
os_firewall : Add iptables allow rules --------------------------------- 17.91s
openshift_master_certificates : Check status of master certificates ---- 16.17s
openshift_ca : Create the master certificates if they do not already exist -- 16.15s
openshift_docker_facts : Set docker facts ------------------------------ 15.08s
openshift_common : Set version facts ----------------------------------- 14.77s
openshift_node : Install Node package ---------------------------------- 14.71s
openshift_node_certificates : Check status of node certificates -------- 14.25s
openshift_manageiq : Configure role/user permissions ------------------- 12.67s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 12.53s


We can see a noticeable improvement for the same installation when building with ansible-2.2.0, so moving this bug to verified.
If the time cost at much larger scale is still not acceptable, please feel free to add a comment here.

Comment 31 Devan Goodwin 2016-08-12 11:29:27 UTC
Gaoyun: Do you have a comparable number for Ansible 1.9.4 to setup a similar cluster? (just for my own curiosity)

Comment 32 Gaoyun Pei 2016-08-12 15:16:41 UTC
(In reply to Devan Goodwin from comment #31)
> Gaoyun: Do you have a comparable number for Ansible 1.9.4 to setup a similar
> cluster? (just for my own curiosity)

Since openshift-ansible-3.2.22-1 requires ansible >= 2.1, I tried the same installation with openshift-ansible-3.2.13-1 + ansible-1.9.4 instead; it took about 29 min.

Comment 33 Devan Goodwin 2016-08-12 15:21:33 UTC
Ok that is really good news, thanks!

Comment 35 errata-xmlrpc 2016-08-18 19:29:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1639

Comment 36 Red Hat Bugzilla 2023-09-14 03:28:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

