In the above output, the rebuild of ssh_known_hosts takes close to 1 hour (almost 3600s). We've had better results with this:

====================================================================
tasks:
  # This task is a trick to execute only on the ansible runner (undercloud) and with the compute facts
  - name: Rebuild /etc/ssh/ssh_known_hosts
    connection: local
    run_once: true
    template:
      src: /home/stack/overcloud/utilities/ssh_known_hosts.j2
      dest: /home/stack/overcloud/utilities/ssh_known_hosts_generated
      owner: stack
      group: stack

  - name: Copy generated /etc/ssh/ssh_known_hosts to compute hosts
    become: true
    copy:
      src: /home/stack/overcloud/utilities/ssh_known_hosts_generated
      dest: /etc/ssh/ssh_known_hosts
      owner: root
      group: root
      mode: 0644
    notify:
      - Restart Nova containers
====================================================================

$ cat ssh_known_hosts.j2
#jinja2: trim_blocks: "True", lstrip_blocks: "True"
{% for host in ansible_play_hosts_all %}
{# Iterate over all IPs on each host #}
{% for ip in hostvars[host]['ansible_all_ipv4_addresses'] %}
{% for ssh_pubkey, shortname in ssh_pubkey_names.items() %}
{% if ssh_pubkey in hostvars[host] %}
{% if "172.31" not in ip %}
{{ ip }} {{ shortname }} {{ hostvars[host][ssh_pubkey] }}
{% endif %}
{% endif %}
{% endfor %}
{% endfor %}
{# Iterate over all possible hostnames on each host #}
{% for ssh_suffix in ssh_host_suffixes %}
{% for ssh_pubkey, shortname in ssh_pubkey_names.items() %}
{% if ssh_pubkey in hostvars[host] %}
{{ hostvars[host]['ansible_hostname'] }}.{{ ssh_suffix }} {{ shortname }} {{ hostvars[host][ssh_pubkey] }}
{% endif %}
{% endfor %}
{% endfor %}
{% endfor %}
====================================================================
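The template above references two variables that are not shown in the snippet, ssh_pubkey_names and ssh_host_suffixes. For reference, they would need to be defined along these lines; the fact names, key types and domain suffixes below are only illustrative assumptions, not values taken from this environment:

====================================================================
# Illustrative variable definitions only -- adjust to your environment.
# ssh_pubkey_names maps the Ansible fact holding each host key to the
# key-type string written into ssh_known_hosts.
ssh_pubkey_names:
  ansible_ssh_host_key_rsa_public: ssh-rsa
  ansible_ssh_host_key_ecdsa_public: ecdsa-sha2-nistp256
  ansible_ssh_host_key_ed25519_public: ssh-ed25519

# ssh_host_suffixes lists the DNS suffixes for which one entry per
# hostname should be emitted.
ssh_host_suffixes:
  - localdomain
  - internalapi.localdomain
  - storage.localdomain
====================================================================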
The overall duration of:

# Run the actual deployment
export playbook=`find tripleo-ansible -name deploy_steps_playbook.yaml`
export ANSIBLE_FORCE_COLOR=1
export ANSIBLE_CALLBACK_WHITELIST=profile_tasks
echo "Running playbook ${playbook} on ${INVENTORY_LIMIT}"
time ansible-playbook \
  -i tripleo-ansible/inventory.yaml \
  --private-key /home/stack/.ssh/id_rsa \
  --become ${playbook} \
  --limit "${INVENTORY_LIMIT}"

...is close to 10 hours for 90 computes.
There isn't an ansible.cfg aside from the default in /etc/ansible/ansible.cfg. Testing with forks=100 overloaded the director and tasks started failing (the VM only has 16 vCPUs and 64 GB RAM). I'm now testing with an ansible.cfg that is a copy of /etc/ansible/ansible.cfg with these changes:

$ diff /etc/ansible/ansible.cfg ansible.cfg | grep '>'
> forks = 32
> display_skipped_hosts = False
> pipelining = True
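Before committing to another 10-hour deployment run, a cheap smoke test of the tuned settings could look roughly like the sketch below; the file name and scope are assumptions, and the play is only meant to measure per-host connection overhead, it is not part of the deployment:

# ping_all.yml -- illustrative sketch only
- hosts: all
  gather_facts: false
  tasks:
    - name: Measure raw SSH/connection overhead per host
      ping:

Running it with the same inventory and --limit as the deployment (e.g. time ansible-playbook -i tripleo-ansible/inventory.yaml ping_all.yml --limit "${INVENTORY_LIMIT}") should give a quick read on whether the forks/pipelining changes move the needle without waiting for a full deploy.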
Having display_skipped_hosts = False makes it clearer that there are tasks consuming CPU time even when they don't actually need to do anything. Here is an example from step 4:

PLAY [External deployment step 4] *************************
PLAY [Overcloud deploy step tasks for 4] ******************
PLAY [Overcloud common deploy step tasks 4] ***************

TASK [Create and ensure setype for /var/log/containers directory] *************
Saturday 21 March 2020 02:50:08 +0000 (0:01:03.585) 8:04:15.638 ********
TASK [Create /var/lib/tripleo-config directory] ********************************
Saturday 21 March 2020 02:51:13 +0000 (0:01:04.961) 8:05:20.600 ********
TASK [Check if puppet step_config.pp manifest exists] **************************
Saturday 21 March 2020 02:52:16 +0000 (0:01:02.973) 8:06:23.574 ********
TASK [Set fact when file existed] ***********************************************
Saturday 21 March 2020 02:53:19 +0000 (0:01:03.329) 8:07:26.903 ********
TASK [Write the puppet step_config manifest] ************************************
Saturday 21 March 2020 02:54:21 +0000 (0:01:02.326) 8:08:29.230 ********
TASK [Create /var/lib/docker-puppet] ********************************************
Saturday 21 March 2020 02:55:27 +0000 (0:01:05.220) 8:09:34.451 ********
TASK [Check if docker-puppet puppet_config.yaml configuration file exists] *****
Saturday 21 March 2020 02:56:29 +0000 (0:01:02.759) 8:10:37.211 ********
TASK [Set fact when file existed] ***********************************************
Saturday 21 March 2020 02:57:32 +0000 (0:01:02.420) 8:11:39.632 ********
TASK [Write docker-puppet.json file] ********************************************
Saturday 21 March 2020 02:58:33 +0000 (0:01:01.319) 8:12:40.951 ********
TASK [Create /var/lib/docker-config-scripts] ************************************
Saturday 21 March 2020 02:59:36 +0000 (0:01:03.106) 8:13:44.058 ********
TASK [Clean old /var/lib/docker-container-startup-configs.json file] ***********
Saturday 21 March 2020 03:00:39 +0000 (0:01:02.594) 8:14:46.653 ********
TASK [Check if docker_config_scripts.yaml file exists] *************************
Saturday 21 March 2020 03:01:41 +0000 (0:01:02.519) 8:15:49.173 ********
TASK [Set fact when file existed] ***********************************************
Saturday 21 March 2020 03:02:45 +0000 (0:01:03.878) 8:16:53.051 ********
TASK [Write docker config scripts] **********************************************
Saturday 21 March 2020 03:03:46 +0000 (0:01:00.962) 8:17:54.013 ********
TASK [Set docker_config_default fact] *******************************************
Saturday 21 March 2020 03:04:51 +0000 (0:01:04.612) 8:18:58.626 ********

(note the per-task delta in parentheses: roughly one minute for every task)
More output from step 4:

TASK [Set docker_startup_configs_with_default fact] ****************************
Saturday 21 March 2020 03:08:05 +0000 (0:01:03.469) 8:22:12.655 ********
TASK [Write docker-container-startup-configs] ***********************************
Saturday 21 March 2020 03:09:10 +0000 (0:01:04.862) 8:23:17.518 ********
TASK [Write per-step docker-container-startup-configs] **************************
Saturday 21 March 2020 03:10:13 +0000 (0:01:03.496) 8:24:21.015 ********
TASK [Create /var/lib/kolla/config_files directory] *****************************
Saturday 21 March 2020 03:11:22 +0000 (0:01:08.628) 8:25:29.643 ********
TASK [Check if kolla_config.yaml file exists] ***********************************
Saturday 21 March 2020 03:12:23 +0000 (0:01:01.316) 8:26:30.959 ********
TASK [Set fact when file existed] ***********************************************
Saturday 21 March 2020 03:13:27 +0000 (0:01:03.819) 8:27:34.779 ********
TASK [Write kolla config json files] ********************************************
Saturday 21 March 2020 03:14:29 +0000 (0:01:01.615) 8:28:36.395 ********
TASK [Clean /var/lib/docker-puppet/docker-puppet-tasks*.json files] ************
Saturday 21 March 2020 03:15:36 +0000 (0:01:07.554) 8:29:43.950 ********
Hi James, I spent most of the last 36 hours running deploy.sh and timing it. The results are quite disappointing (no major difference was seen, not even with mitogen):

************************************* ATTEMPT #1 *************************************
$ diff /etc/ansible/ansible.cfg ansible.cfg | grep '>'
> forks = 32
> display_skipped_hosts = False
> pipelining = True
RESULT: PASS
real    621m40.403s
user    700m52.525s
sys     200m50.139s

************************************* ATTEMPT #2 *************************************
$ diff /etc/ansible/ansible.cfg ansible.cfg | grep '>'
> forks = 32
> display_skipped_hosts = False
RESULT: PASS
real    631m21.758s
user    716m16.938s
sys     325m27.261s

************************************* ATTEMPT #3 *************************************
$ diff /etc/ansible/ansible.cfg ansible.cfg | grep '>'
> forks = 100
RESULT: FAIL (not enough memory)

************************************* ATTEMPT #4 *************************************
$ diff /etc/ansible/ansible.cfg ansible.cfg | grep '>'
> forks = 32
> display_skipped_hosts = False
> pipelining = True
(mitogen: ENABLED)
RESULT: PASS
real    616m32.520s
user    666m26.979s
sys     131m2.002s

************************************* ATTEMPT #5 *************************************
$ diff /etc/ansible/ansible.cfg ansible.cfg | grep '>'
> forks = 100
> display_skipped_hosts = False
> pipelining = True
(mitogen: ENABLED)
RESULT: PASS
real    667m19.010s
user    705m48.422s
sys     173m40.255s
On the other hand, looking at ansible-playbook-command.sh under /var/lib/mistral, I'm wondering whether mitogen (enabled via /home/stack/ansible.cfg) is actually being picked up at all, or whether only /etc/ansible/ansible.cfg is being read. Ansible resolves its configuration from, in order, $ANSIBLE_CONFIG, ansible.cfg in the current working directory, ~/.ansible.cfg, and finally /etc/ansible/ansible.cfg, so which file wins depends on where the Mistral-driven run executes from.
Mistral on OSP13 does seem to use something else:

[root@tenlab1-director 4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c]# ls -la /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.cfg
-rw-r--r--. 1 mistral mistral 644 May  8  2019 /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.cfg

[root@tenlab1-director 4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c]# cat /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.cfg
[defaults]
roles_path = /etc/ansible/roles:/usr/share/ansible/roles
retry_files_enabled = False
log_path = /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.log
forks = 25
timeout = 30
gather_timeout = 30

[inventory]

[privilege_escalation]

[paramiko_connection]

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
control_path_dir = /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible-ssh
retries = 8
pipelining = True

[persistent_connection]

[accelerate]

[selinux]

[colors]

[diff]
I would like to set the record straight on a few things here.

(1) Mitogen is not a Python library that we ship in RHEL or RHOSP, which means it is an unsupported solution. We have done testing with it in the past and saw negligible improvements. Instead, we are looking into improved Ansible Tower integration with RHOSP to help scale Ansible.

(2) config-download is a technology preview in RHOSP 13. We have done a massive amount of backports in the past to improve its speed and functionality. As RHOSP 13 continues to age, it has become increasingly complex and time consuming to backport all of the latest config-download updates. Fully backporting all of the required changes would essentially mean rewriting RHOSP 13 to be RHOSP 16.

(3) RHOSP 16 is the release where we have been targeting full long-term support of config-download, with a focus on large-scale deployments. Our tests have shown deployment times roughly 5x faster. We highly recommend upgrading to this long-life release for the best possible experience.

(4) We will continue to backport performance improvements wherever we can. However, none of those backports will magically make RHOSP 13 as fast as RHOSP 16 when it comes to deployment time.
A couple of observations on the output:

1) tripleo-ssh-known-hosts took > 1 hour:

TASK [tripleo-ssh-known-hosts : Add hosts key in /etc/ssh/ssh_known_hosts for live/cold-migration] ***
Saturday 14 March 2020 02:51:07 +0000 (0:01:18.577) 0:09:50.614 ********
...SNIP...
ok: [tenlab1-compute068] => (item=tenlab1-compute068)

PLAY [Server deployments] ******************************************************
TASK [include] *****************************************************************
Saturday 14 March 2020 03:50:15 +0000 (0:59:08.088) 1:08:58.702 ********

2) Each task is taking at least ~1 minute even when it's skipped. I wonder if this is related to the number of forks? Is this using the default fork count?

@Vincent, would you be able to try some patches in this environment to see if we can speed this up?
For the record, I also emulated 100 nodes and I don't see such delays when running tasks against 100 nodes via SSH. Are you sure there aren't any network-related issues (DNS, etc.) when running ansible against these hosts? It would also be helpful if you could provide the ansible.cfg that is being used. If you run a basic noop playbook against these hosts, does it show delays between task executions? For example, the following playbook took ~1 min 30 seconds against 100 hosts:

- hosts: all
  gather_facts: false
  tasks:
    - name: Generate sleep time
      set_fact:
        sleep_time: "{{ inventory_hostname[-2:] | int % 3 }}"
    - name: Sleep time
      debug:
        var: sleep_time
    - name: Do random sleep (1)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (2)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (3)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (4)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (5)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (6)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (7)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (8)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (9)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (10)
      shell: sleep "{{ sleep_time }}"
OK, so I was able to reproduce tasks taking >1 minute for 100 hosts. It appears that if you use the default ansible.cfg, it will run terribly slowly and reproduces the sluggishness from the logs. We don't have the `openstack tripleo config generate ansible` command like we do in OSP16.x, which would generate the same ansible.cfg we run the deployment with when Mistral is not the one invoking the Ansible deployment. For the record, the ansible.cfg we use in newer versions is something to the effect of...

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[defaults]
retry_files_enabled = False
roles_path = /root/.ansible/roles:/usr/share/ansible/tripleo-roles:/usr/share/ansible/roles:/etc/ansible/roles:/usr/share/openstack-tripleo-validations/roles
library = /root/.ansible/plugins/modules:/usr/share/ansible/tripleo-plugins/modules:/usr/share/ansible/plugins/modules:/usr/share/ansible-modules:/usr/share/openstack-tripleo-validations/library
callback_plugins = ~/.ansible/plugins/callback:/usr/share/ansible/tripleo-plugins/callback:/usr/share/ansible/plugins/callback:/usr/share/openstack-tripleo-validations/callback_plugins
callback_whitelist = profile_tasks
action_plugins = ~/.ansible/plugins/action:/usr/share/ansible/plugins/action:/usr/share/ansible/tripleo-plugins/action:/usr/share/openstack-tripleo-validations/action_plugins
lookup_plugins = /root/.ansible/plugins/lookup:/usr/share/ansible/tripleo-plugins/lookup:/usr/share/ansible/plugins/lookup:/usr/share/openstack-tripleo-validations/lookup_plugins
filter_plugins = ~/.ansible/plugins/filter:/usr/share/ansible/plugins/filter:/usr/share/ansible/tripleo-plugins/filter:/usr/share/openstack-tripleo-validations/filter_plugins
forks = 80
timeout = 30
gather_timeout = 30
gathering = smart
fact_caching = jsonfile
fact_caching_connection = ~/.ansible/fact_cache
internal_poll_interval = 0.05
interpreter_python = auto
fact_caching_timeout = 7200

[inventory]

[privilege_escalation]

[paramiko_connection]

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5 -o PreferredAuthentications=publickey
retries = 8
pipelining = True
scp_if_ssh = True

[persistent_connection]

[accelerate]

[selinux]

[colors]

[diff]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At a minimum I would recommend setting the following options:

[defaults]
forks = 80
timeout = 30
gather_timeout = 30
gathering = smart
fact_caching = jsonfile
fact_caching_connection = ~/.ansible/fact_cache
internal_poll_interval = 0.05
fact_caching_timeout = 7200

[ssh_connection]
pipelining = True
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=64 -o ServerAliveCountMax=1024 -o Compression=no -o TCPKeepAlive=yes -o VerifyHostKeyDNS=no -o ForwardX11=no -o ForwardAgent=yes -o PreferredAuthentications=publickey -T

For comparison, the following ansible.cfg took 8m 43s to run the playbook from Comment 17 against 100 hosts.
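One thing to keep in mind while testing these options: the tuned ansible.cfg only matters if the run actually picks it up ($ANSIBLE_CONFIG, ansible.cfg in the current directory, ~/.ansible.cfg and /etc/ansible/ansible.cfg are consulted in that order, and `ansible --version` prints which file was loaded). As a rough sanity check of the effective values, something along the lines of the sketch below could be used; the file name is an assumption and the `config` lookup depends on that lookup plugin being available in the Ansible version shipped with OSP13:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# show_effective_config.yml -- illustrative sketch only
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Show the fork count and timeout Ansible is actually running with
      debug:
        msg:
          - "forks: {{ lookup('config', 'DEFAULT_FORKS') }}"
          - "timeout: {{ lookup('config', 'DEFAULT_TIMEOUT') }}"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~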
[defaults]
host_key_checking = False

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no

By configuring internal_poll_interval, forks, pipelining and tuning SSH, the same playbook took only 1m 37s:

[defaults]
internal_poll_interval = 0.05
forks = 25
host_key_checking = False

[ssh_connection]
pipelining = True
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=64 -o ServerAliveCountMax=1024 -o Compression=no -o TCPKeepAlive=yes -o VerifyHostKeyDNS=no -o ForwardX11=no -o ForwardAgent=yes -o PreferredAuthentications=publickey -T

The ansible.cfg from /var/lib/mistral shown in Comment 13 would be an excellent starting point; begin with a copy of it. The other thing I would check is that 192.168.122.1 is not listed in /etc/resolv.conf on the deployed hosts. There was a bug in some guest images where this entry was left in and caused various sluggish responses whenever DNS was queried (e.g. yum/ssh/etc).

Vincent, please give these options a shot and let us know the results. Thanks
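If it helps, a quick way to spot any lingering 192.168.122.1 entries across all the overcloud nodes could be a throwaway play along these lines (purely illustrative, not part of the deployment; the file name is an assumption):

# check_resolv.yml -- illustrative sketch only
- hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: Look for the stale 192.168.122.1 nameserver entry
      command: grep -n '192.168.122.1' /etc/resolv.conf
      register: resolv_check
      changed_when: false
      failed_when: false

    - name: Report hosts that still reference it
      debug:
        var: resolv_check.stdout_lines
      when: resolv_check.rc == 0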
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4388