Description of problem:

I'm installing a 300-node cluster on EC2. In 3.2 this install would take about 2.5 hours. On 3.3 the install is still running after 6 hours and is getting slower as it goes. Each time it runs a step which skips all 300 nodes (examples: cluster set_fact, openshift_version: fail, docker version checks), it takes over 2 minutes to skip all 300 nodes. After 6 hours the nodes are not even installed yet.

- I am using the EC2 internal hostname (e.g. ip-172-31-38-181.us-west-2.compute.internal) in the inventory
- docker images for the install are pre-pulled to each node
- ansible forks=100
- ansible is 2.1.0.0-1

All of the steps are slow, but the extreme slowness of the skip steps (and the numerous repetitions of them) indicates a problem.

Version-Release number of selected component (if applicable):
3.3.0.10

How reproducible:
1 attempt to install 300 nodes; I may not get another given the high cost.

Steps to Reproduce:
1. Create a core cluster with 3 m4.4xlarge masters, 3 m4.4xlarge etcd, 1 m4.xlarge master load balancer, 2 m4.xlarge router/registry. Ensure this cluster is running well.
2. Create 300 additional instances and run the byo/openshift-node/scaleup.yml playbook (running from one of the m4.4xlarge masters - 16 vCPU, 64GB RAM).
3. The install will run for a very long time. As it progresses, it will get visibly slower. After 6 hours, time how long a step which skips all nodes takes.

Actual results:
Install not complete after 6 hours.

Expected results:
Install completes in a similar time frame as 3.2 (~2.5 hours).

Additional info:
I have some pbench tools running - will add a pointer to any interesting data shortly.
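For reference, the forks setting mentioned above lives in ansible.cfg. A minimal sketch of the relevant settings (the inventory path is illustrative, not taken from the actual setup):

```ini
# Illustrative ansible.cfg fragment matching the setup described above.
# The inventory path is an assumption.
[defaults]
forks = 100
inventory = /etc/ansible/hosts
```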
"nodes not even installed yet" means no atomic-openshift-node package was installed.
This problem shows up at a much smaller scale. Even 10 nodes is brutally slow. As reported to aos-devel, I had an install fail last week. The important thing, as you'll see in the snippet below, is that the install was already taking over 1 hour for just 3 masters, 2 infra and 10 nodes.

PLAY RECAP *********************************************************************
192.2.0.10                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.11                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.12                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.13                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.14                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.15                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.16                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.17                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.18                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.19                 : ok=131  changed=15  unreachable=0  failed=1
192.2.0.5                  : ok=385  changed=30  unreachable=0  failed=1
192.2.0.6                  : ok=286  changed=23  unreachable=0  failed=1
192.2.0.7                  : ok=286  changed=23  unreachable=0  failed=1
192.2.0.8                  : ok=155  changed=15  unreachable=0  failed=1
192.2.0.9                  : ok=131  changed=15  unreachable=0  failed=1
localhost                  : ok=16   changed=9   unreachable=0  failed=0

Saturday 23 July 2016  18:43:23 -0400 (0:00:00.010)       1:01:44.270 *********
===============================================================================
openshift_node : Configure Node settings ------------------------------ 147.89s
openshift_node : Install sdn-ovs package ------------------------------- 82.27s
openshift_node : Check for existence of virt_sandbox_use_nfs seboolean -- 70.62s
openshift_node : Check for existence of virt_use_nfs seboolean --------- 69.42s
openshift_node : Check for existence of virt_sandbox_use_fusefs seboolean -- 69.32s
openshift_node : Check for existence of virt_use_fusefs seboolean ------ 68.90s
openshift_node : Install Ceph storage plugin dependencies -------------- 61.40s
openshift_master : Restore Master Proxy Config Options ----------------- 55.49s
openshift_node : Install GlusterFS storage plugin dependencies --------- 50.82s
openshift_node : Set seboolean to allow gluster storage plugin access from containers(sandbox) -- 47.84s
openshift_node : Install NFS storage plugin dependencies --------------- 47.75s
openshift_node : Install iSCSI storage plugin dependencies ------------- 47.35s
openshift_node : Set seboolean to allow nfs storage plugin access from containers -- 46.98s
openshift_node : Set seboolean to allow nfs storage plugin access from containers(sandbox) -- 46.94s
openshift_node : Set seboolean to allow gluster storage plugin access from containers -- 46.92s
openshift_master_certificates : file ----------------------------------- 41.18s
openshift_node : Install Node dependencies docker service file --------- 32.51s
openshift_node : Install Node docker service file ---------------------- 32.39s
openshift_master : Create the ha systemd unit files -------------------- 31.80s
openshift_node : Create the openvswitch service env file --------------- 31.74s

real    61m47.068s
user    96m11.908s
sys     13m52.690s
pbench data is here: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-8-4/slow-ansible/tools-default/ip-172-31-8-4/

It was taken during several of the "Skipping..." steps, with one or two green steps in between.
pbench data shows ansible maxing out 2 full cores while skipping nodes
Created attachment 1184591 [details] pbench data ansible pids
Attached the specific picture Mike's referring to.
I can reproduce just by shortcutting the whole process with a "fail" very early on in playbooks/common/openshift-cluster/config.yml. Currently it runs to my fail point in 31 seconds; if I check out 3.3.0-1 it takes only 11, so we're running about 3x slower right now. It seems to affect *everything* as well: even simple things like skipped tasks and debug statements look absurdly slow to my eye. The Ansible version does not appear to be causing it; it's constant in my tests. Trying to get to the bottom of it now with git bisect.
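For anyone wanting to reproduce the shortcut, the early bail-out described above is just a fail task near the top of the playbook. A hypothetical sketch (the play placement and message are mine, not the actual edit used):

```yaml
# Hypothetical early bail-out for timing runs: drop this near the top of
# playbooks/common/openshift-cluster/config.yml so every run ends at the
# same fixed point, making wall-clock comparisons between git revisions
# meaningful.
- name: Bail out early for timing measurements
  hosts: localhost
  gather_facts: no
  tasks:
    - name: Stop the run at a fixed checkpoint
      fail:
        msg: "Timing checkpoint reached - aborting intentionally"
```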
Hi, surprisingly on my setup the performance issue does seem to be related to the ansible version.

Setup: openshift-ansible at "git checkout 08791978fdf3ee385760761d4fc6bc47febf1732" (before the ansible 2.1.0 requirement), origin-1.2.1.

Running "ansible-playbook -vvvv --inventory /var/lib/ansible/inventory /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml" takes ~1 minute on 1.9.6 and >6 minutes with any higher ansible version. Tested versions:
v2.0.0.1-1
v2.1.0.0-1
v2.1.1.0-0.5.rc5

origin/devel (newest) is not compatible with the openshift-ansible playbooks, failing with this error:

fatal: [ansible2-origin-node-2nwfib9m.example.com]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "msg": "MODULE FAILURE", "parsed": false}

with module_stdout containing:

Traceback (most recent call last):
  File "/tmp/ansible_a0IOmm/ansible_module_openshift_facts.py", line 2168, in <module>
    main()
  File "/tmp/ansible_a0IOmm/ansible_module_openshift_facts.py", line 2149, in main
    protected_facts_to_overwrite)
  File "/tmp/ansible_a0IOmm/ansible_module_openshift_facts.py", line 1620, in __init__
    self.system_facts = ansible_facts(module, ['hardware', 'network', 'virtual', 'facter'])
  File "/tmp/ansible_a0IOmm/ansible_modlib.zip/ansible/module_utils/facts.py", line 3243, in ansible_facts
  File "/tmp/ansible_a0IOmm/ansible_modlib.zip/ansible/module_utils/facts.py", line 1002, in populate
  File "/tmp/ansible_a0IOmm/ansible_modlib.zip/ansible/module_utils/facts.py", line 132, in wrapper
UnboundLocalError: local variable 'seconds' referenced before assignment
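The UnboundLocalError at the bottom of that traceback is a classic pattern: a wrapper function that binds a local variable on only one branch and then reads it unconditionally. A minimal, hypothetical Python sketch of the failure mode (this is not the actual facts.py code; the names are illustrative):

```python
# Hypothetical sketch of the failure pattern seen in the traceback above
# (not the real ansible facts.py code): the wrapper binds `seconds` only
# on one branch, then reads it unconditionally.
def timeout(func):
    def wrapper(*args, **kwargs):
        if 'timeout' in kwargs:
            seconds = kwargs.pop('timeout')
        # When no timeout kwarg is passed, `seconds` was never bound,
        # so this read raises UnboundLocalError.
        print("waiting up to %s seconds" % seconds)
        return func(*args, **kwargs)
    return wrapper

@timeout
def gather_facts():
    return {"hardware": {}, "network": {}}

err = None
try:
    gather_facts()  # no timeout kwarg -> triggers the bug
except UnboundLocalError as exc:
    err = str(exc)
print("caught:", err)
```

Calling the decorated function without the keyword argument reproduces the module failure pattern exactly.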
It's possible my testing steps were off, as my fail point included radically different things depending on which git hash I had checked out. I am still trying to determine if something changed; I have a strong suspicion I was running this fine with 2.1 recently and it got very bad last week. Will fall back to ansible version testing soon if I can't find a "good" revision.
Jan, I think you were very right. My testing on openshift-ansible:

Fresh 3.3 rpm install, single master, git master, ansible 2.1.0: 26 minutes (WHAT?)
Rerunning config.yml (i.e. a maintenance run): 16:17
Went back to the openshift-ansible 3.3.2-1 tag and re-ran config.yml: 14:56
Tried disabling the ansible profiling module: no change.
Then installed ansible 1.9.4 and re-tried with 3.3.2-1: 3:31

It would appear Ansible 2.1 is about 4x as slow.
Suggested workaround would be using the latest 3.2 openshift-ansible rpms and ansible 1.9.4. The problem is reportedly caused by the new dynamic includes in Ansible 2. Moving to NEW in case someone else wants to take a crack at this; I need to set it aside for now.
Probably this one [1]. I wonder if we can get support from ansible dev team. [1] https://github.com/ansible/ansible/issues/16749
Marking this TestBlock for scalability testing. On a new run on an OpenStack cluster, the install of 3x master, 3x etcd, 2x router/registry and 2x nodes took 1 hr 45 min. We then tried to run the node scaleup playbook to 300 nodes, and after 6 hours it had not yet installed the rpms - all the time was spent repeatedly checking Docker versions, configuring docker, etc. Most steps were skipping all nodes.
That seems like a good possibility, Aleksandar. Looking at timings for very simple tasks in some of our roles, I saw them start out very fast, get slower, and later get fast again, possibly when we'd returned to the root of the tree and started down another branch.
*** Bug 1360433 has been marked as a duplicate of this bug. ***
*** Bug 1361559 has been marked as a duplicate of this bug. ***
Addressed by updating to a patched version of ansible derived from the upstream devel branch.
Clearing needinfo on jdetiber as it's out of date.

Using ansible-2.2.0-0.2.pi.el7 plus abutcher's openshift-ansible performance improvements branch, a 100-node openshift cluster can be deployed in 21m48s. This is a significant improvement over ansible-2.1.

3 masters, 3 etcd, 1 lb, 2 infra, 100 nodes. openshift-3.3.0.13.

Wednesday 10 August 2016  07:36:02 -0400 (0:00:00.262)       0:21:48.628 ******
===============================================================================
openshift_manage_node : Label nodes ----------------------------------- 150.19s
openshift_manage_node : Set node schedulability ----------------------- 108.51s
openshift_facts : Gather Cluster facts and set is_containerized if needed -- 49.31s
openshift_manage_node : Wait for Node Registration --------------------- 32.43s
openshift_master : pause ----------------------------------------------- 15.12s
openshift_master : pause ----------------------------------------------- 15.12s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 8.42s
openshift_clock : Set clock facts --------------------------------------- 7.78s
openshift_node : Set node facts ----------------------------------------- 7.46s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.35s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.26s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.23s
openshift_docker_facts : Set docker facts ------------------------------- 7.15s
openshift_facts : Gather Cluster facts and set is_containerized if needed --- 7.13s
openshift_cloud_provider : Set cloud provider facts --------------------- 7.12s
openshift_common : Set common Cluster facts ----------------------------- 7.08s
openshift_docker_facts : Set docker facts ------------------------------- 7.06s
openshift_facts --------------------------------------------------------- 6.80s
openshift_facts --------------------------------------------------------- 6.63s
openshift_common : Set version facts ------------------------------------ 6.59s
@Jeremy, curious to know how this compares with 3.2 + ansible 1.9.x.
Could Ansible fact caching be enabled (if not already) to render further speedups? http://docs.ansible.com/ansible/playbooks_variables.html#fact-caching It may need a redis dependency.
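For reference, fact caching doesn't strictly require redis: Ansible 2.x also ships a jsonfile cache backend. A sketch of the relevant ansible.cfg settings (the cache path and timeout value are illustrative):

```ini
# Illustrative ansible.cfg fragment enabling fact caching with the
# jsonfile backend (no redis dependency). Path and timeout are
# example values only.
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
# Cache lifetime in seconds (one day here).
fact_caching_timeout = 86400
```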
This bug was filed due to a very specific performance regression when upgrading to ansible 2.x. Let's keep this bug focused on that and open other bugs as appropriate.
Alex -- 3.2 @ 100 nodes was around 38 minutes. BUT: my new base VM images are far more sophisticated and pre-load a lot more than I was doing in the 3.2 days, so it's not apples-to-apples. I expect 3.2+1.9.4 and 3.3+2.2pi are now roughly on par. I'm working to get even more data at larger scale as well.

Pete -- fact caching is enabled: https://github.com/openshift/svt/blob/master/image_provisioner/ansible.cfg
When building a 1 master + 1 node cluster (QE's basic testing env) using openshift-ansible-3.2.22-1, the installation took about 30 min with ansible-2.2.0-0.3.prerelease.el7.noarch and about 44 min with ansible-2.1.1.0-1.el7.noarch.

Installation with ansible-2.2.0:

PLAY RECAP *********************************************************************
localhost                  : ok=13   changed=7   unreachable=0  failed=0
openshift-x.com            : ok=143  changed=43  unreachable=0  failed=0
openshift-x.com            : ok=448  changed=110 unreachable=0  failed=0

Friday 12 August 2016  04:46:13 +0000 (0:00:00.232)       0:30:29.036 *********
===============================================================================
openshift_common : Install the base package for versioning ------------- 81.83s
openshift_common : Install the base package for versioning ------------- 63.66s
openshift_version : Gather common package version ---------------------- 21.88s
openshift_common : Set version facts ----------------------------------- 21.13s
os_firewall : Add iptables allow rules --------------------------------- 18.76s
openshift_node : Install Ceph storage plugin dependencies -------------- 18.11s
openshift_master : Start and enable master ----------------------------- 17.80s
setup ------------------------------------------------------------------ 15.62s
setup ------------------------------------------------------------------ 15.60s
setup ------------------------------------------------------------------ 15.60s
openshift_manageiq : Configure role/user permissions ------------------- 15.21s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.50s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.31s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.31s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.30s
openshift_node : Install Node package ---------------------------------- 14.28s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.19s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.08s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.06s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 14.02s

Installation with ansible-2.1.1.0:

PLAY RECAP *********************************************************************
localhost                  : ok=13   changed=7   unreachable=0  failed=0
openshift-x.com            : ok=448  changed=107 unreachable=0  failed=0
openshift-x.com            : ok=143  changed=44  unreachable=0  failed=0

Friday 12 August 2016  06:36:40 +0000 (0:00:01.470)       0:44:00.320 *********
===============================================================================
openshift_common : Install the base package for versioning ------------- 72.92s
openshift_common : Install the base package for versioning ------------- 63.21s
openshift_storage_nfs : Install nfs-utils ------------------------------ 31.10s
openshift_master : Start and enable master ----------------------------- 29.84s
openshift_common : Set common Cluster facts ---------------------------- 26.70s
openshift_version : Gather common package version ---------------------- 21.83s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 20.86s
openshift_common : Set version facts ----------------------------------- 20.67s
openshift_master_certificates : file ----------------------------------- 19.86s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 18.87s
openshift_node : Install Ceph storage plugin dependencies -------------- 17.96s
os_firewall : Add iptables allow rules --------------------------------- 17.91s
openshift_master_certificates : Check status of master certificates ---- 16.17s
openshift_ca : Create the master certificates if they do not already exist -- 16.15s
openshift_docker_facts : Set docker facts ------------------------------ 15.08s
openshift_common : Set version facts ----------------------------------- 14.77s
openshift_node : Install Node package ---------------------------------- 14.71s
openshift_node_certificates : Check status of node certificates -------- 14.25s
openshift_manageiq : Configure role/user permissions ------------------- 12.67s
openshift_repos : Remove any yum repo files for other deployment types RHEL/CentOS -- 12.53s

We can see a noticeable improvement for the same installation when building with ansible-2.2.0, so moving this bug to verified. If the time cost at much larger scale is still not acceptable, please feel free to add a comment here.
Gaoyun: Do you have a comparable number for Ansible 1.9.4 to setup a similar cluster? (just for my own curiosity)
(In reply to Devan Goodwin from comment #31)
> Gaoyun: Do you have a comparable number for Ansible 1.9.4 to setup a similar
> cluster? (just for my own curiosity)

Since openshift-ansible-3.2.22-1 requires ansible >= 2.1, I tried the same installation with openshift-ansible-3.2.13-1 + ansible-1.9.4; it took about 29 min.
Ok that is really good news, thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1639
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days