Bug 1815202 - On OSP13, deployment with 90 hosts takes 10 hours when using config-download
Summary: On OSP13, deployment with 90 hosts takes 10 hours when using config-download
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Alex Schultz
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-19 18:38 UTC by Vincent S. Cojot
Modified: 2021-03-24 13:44 UTC (History)
15 users

Fixed In Version: python-tripleoclient-9.3.1-8.el7ost
Doc Type: Enhancement
Doc Text:
When using the config-download Technology Preview functionality, the generated Ansible playbooks do not include a default ansible.cfg tailored to the config-download playbooks, and the default Ansible settings are not ideal for large-scale deployments. This enhancement allows you to use the following command to generate an ansible.cfg that can be used with the config-download playbooks: $ openstack tripleo config generate ansible
Clone Of:
Environment:
Last Closed: 2020-10-28 18:23:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1883600 0 None None None 2020-06-15 20:32:17 UTC
OpenStack gerrit 735681 0 None MERGED Generate ansible.cfg for UC/standalone deployments 2021-02-02 07:17:54 UTC
Red Hat Product Errata RHBA-2020:4388 0 None None None 2020-10-28 18:23:57 UTC

Comment 2 Vincent S. Cojot 2020-03-19 18:44:01 UTC
In the above output, the rebuild of ssh_known_hosts takes close to 1 hour (almost 3600s). We've had better results with this:

====================================================================
  tasks:
    # This task is a trick: it executes only on the ansible runner (undercloud), using each compute's facts
    - name: rebuild /etc/ssh/ssh_known_hosts
      connection: local
      run_once: true
      template:
        src: /home/stack/overcloud/utilities/ssh_known_hosts.j2
        dest: /home/stack/overcloud/utilities/ssh_known_hosts_generated
        owner: stack
        group: stack

    - name: Copy generated /etc/ssh/ssh_known_hosts to compute hosts
      become: true
      copy:
        src: /home/stack/overcloud/utilities/ssh_known_hosts_generated
        dest: /etc/ssh/ssh_known_hosts
        owner: root
        group: root
        mode: '0644'
      notify:
        - Restart Nova containers
====================================================================
$ cat ssh_known_hosts.j2
#jinja2: trim_blocks: "True", lstrip_blocks: "True"

{% for host in ansible_play_hosts_all %}

    {# Iterate over all IPs on each host #}
    {% for ip in hostvars[host]['ansible_all_ipv4_addresses'] %}
        {% for ssh_pubkey,shortname in ssh_pubkey_names.items() %}
            {% if ssh_pubkey in hostvars[host] %}
                {% if "172.31" not in ip  %}
{{ ip }} {{ shortname }} {{ hostvars[host][ssh_pubkey] }}
                {% endif %}
            {% endif %}
        {% endfor %}
    {% endfor %}

    {# Iterate over all possible hostnames on each host #}
    {% for ssh_suffix in ssh_host_suffixes %}
        {% for ssh_pubkey,shortname in ssh_pubkey_names.items() %}
            {% if ssh_pubkey in hostvars[host] %}
{{ hostvars[host]['ansible_hostname']}}.{{ ssh_suffix }} {{ shortname }} {{ hostvars[host][ssh_pubkey] }}
            {% endif %}
        {% endfor %}
    {% endfor %}

{% endfor %}
====================================================================
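The gain here comes from rendering the whole known_hosts file once on the undercloud and then copying it out, instead of building each host's entries task-by-task over ssh. A minimal Python sketch of the same aggregation logic (the sample fact names and values are illustrative, not the exact hostvars structure):

```python
# Build an ssh_known_hosts blob from per-host facts in one local pass,
# skipping the 172.31.x.x network like the Jinja2 template above.
hostvars = {  # illustrative sample data
    "tenlab1-compute001": {
        "ansible_all_ipv4_addresses": ["10.10.0.5", "172.31.0.5"],
        "ansible_ssh_host_key_rsa_public": "AAAAB3NzaSAMPLE",
    },
}
ssh_pubkey_names = {"ansible_ssh_host_key_rsa_public": "ssh-rsa"}


def render_known_hosts(hostvars, ssh_pubkey_names, exclude_prefix="172.31."):
    lines = []
    for host, facts in sorted(hostvars.items()):
        for ip in facts.get("ansible_all_ipv4_addresses", []):
            if ip.startswith(exclude_prefix):
                continue  # skip the excluded provisioning network
            for fact_name, keytype in ssh_pubkey_names.items():
                if fact_name in facts:
                    lines.append(f"{ip} {keytype} {facts[fact_name]}")
    return "\n".join(lines) + "\n"


print(render_known_hosts(hostvars, ssh_pubkey_names), end="")
# -> 10.10.0.5 ssh-rsa AAAAB3NzaSAMPLE
```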

Comment 3 Vincent S. Cojot 2020-03-19 18:46:15 UTC
The overall duration of:

# Run the actual deployment
export playbook=$(find tripleo-ansible -name deploy_steps_playbook.yaml)
export ANSIBLE_FORCE_COLOR=1
export ANSIBLE_CALLBACK_WHITELIST=profile_tasks
echo "Running playbook ${playbook} on ${INVENTORY_LIMIT}"
time ansible-playbook \
    -i tripleo-ansible/inventory.yaml \
    --private-key /home/stack/.ssh/id_rsa \
    --become ${playbook} \
    --limit "${INVENTORY_LIMIT}"

...is close to 10 hours for 90 computes.

Comment 5 Vincent S. Cojot 2020-03-21 03:05:46 UTC
There isn't an ansible.cfg aside from the default in /etc/ansible/ansible.cfg.
Testing with forks=100 overloaded the director and tasks started failing (the VM only has 16 vCPUs and 64 GB of RAM).

I'm now testing with an ansible.cfg that is a copy of /etc/ansible/ansible.cfg with these changes:

$ diff /etc/ansible/ansible.cfg ansible.cfg |grep '>'
> forks          = 32
> display_skipped_hosts = False
> pipelining = True
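Applied to a copy of /etc/ansible/ansible.cfg, those three overrides amount to something like the following (a sketch; the section placement is an assumption, since the diff above only shows the added lines, and `pipelining` normally lives under [ssh_connection]):

```ini
# Sketch of the tuned copy of /etc/ansible/ansible.cfg
[defaults]
forks = 32
display_skipped_hosts = False

[ssh_connection]
pipelining = True
```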

Comment 6 Vincent S. Cojot 2020-03-21 03:07:23 UTC
Setting display_skipped_hosts = False makes it clearer that there are tasks consuming CPU time even when they don't actually need to do anything. Here is an example from step4:

PLAY [External deployment step 4] *************************************************************************************************************************************************

PLAY [Overcloud deploy step tasks for 4] ******************************************************************************************************************************************

PLAY [Overcloud common deploy step tasks 4] ***************************************************************************************************************************************

TASK [Create and ensure setype for /var/log/containers directory] *****************************************************************************************************************
Saturday 21 March 2020  02:50:08 +0000 (0:01:03.585)       8:04:15.638 ********

TASK [Create /var/lib/tripleo-config directory] ***********************************************************************************************************************************
Saturday 21 March 2020  02:51:13 +0000 (0:01:04.961)       8:05:20.600 ********

TASK [Check if puppet step_config.pp manifest exists] *****************************************************************************************************************************
Saturday 21 March 2020  02:52:16 +0000 (0:01:02.973)       8:06:23.574 ********

TASK [Set fact when file existed] *************************************************************************************************************************************************
Saturday 21 March 2020  02:53:19 +0000 (0:01:03.329)       8:07:26.903 ********

TASK [Write the puppet step_config manifest] **************************************************************************************************************************************
Saturday 21 March 2020  02:54:21 +0000 (0:01:02.326)       8:08:29.230 ********

TASK [Create /var/lib/docker-puppet] **********************************************************************************************************************************************
Saturday 21 March 2020  02:55:27 +0000 (0:01:05.220)       8:09:34.451 ********

TASK [Check if docker-puppet puppet_config.yaml configuration file exists] ********************************************************************************************************
Saturday 21 March 2020  02:56:29 +0000 (0:01:02.759)       8:10:37.211 ********

TASK [Set fact when file existed] *************************************************************************************************************************************************
Saturday 21 March 2020  02:57:32 +0000 (0:01:02.420)       8:11:39.632 ********

TASK [Write docker-puppet.json file] **********************************************************************************************************************************************
Saturday 21 March 2020  02:58:33 +0000 (0:01:01.319)       8:12:40.951 ********

TASK [Create /var/lib/docker-config-scripts] **************************************************************************************************************************************
Saturday 21 March 2020  02:59:36 +0000 (0:01:03.106)       8:13:44.058 ********

TASK [Clean old /var/lib/docker-container-startup-configs.json file] **************************************************************************************************************
Saturday 21 March 2020  03:00:39 +0000 (0:01:02.594)       8:14:46.653 ********

TASK [Check if docker_config_scripts.yaml file exists] ****************************************************************************************************************************
Saturday 21 March 2020  03:01:41 +0000 (0:01:02.519)       8:15:49.173 ********

TASK [Set fact when file existed] *************************************************************************************************************************************************
Saturday 21 March 2020  03:02:45 +0000 (0:01:03.878)       8:16:53.051 ********

TASK [Write docker config scripts] ************************************************************************************************************************************************
Saturday 21 March 2020  03:03:46 +0000 (0:01:00.962)       8:17:54.013 ********

TASK [Set docker_config_default fact] *********************************************************************************************************************************************
Saturday 21 March 2020  03:04:51 +0000 (0:01:04.612)       8:18:58.626 ********

(check the time deltas on the right)

Comment 7 Vincent S. Cojot 2020-03-21 03:16:11 UTC
More output from step4:

TASK [Set docker_startup_configs_with_default fact] *******************************************************************************************************************************
Saturday 21 March 2020  03:08:05 +0000 (0:01:03.469)       8:22:12.655 ********

TASK [Write docker-container-startup-configs] *************************************************************************************************************************************
Saturday 21 March 2020  03:09:10 +0000 (0:01:04.862)       8:23:17.518 ********

TASK [Write per-step docker-container-startup-configs] ****************************************************************************************************************************
Saturday 21 March 2020  03:10:13 +0000 (0:01:03.496)       8:24:21.015 ********

TASK [Create /var/lib/kolla/config_files directory] *******************************************************************************************************************************
Saturday 21 March 2020  03:11:22 +0000 (0:01:08.628)       8:25:29.643 ********

TASK [Check if kolla_config.yaml file exists] *************************************************************************************************************************************
Saturday 21 March 2020  03:12:23 +0000 (0:01:01.316)       8:26:30.959 ********

TASK [Set fact when file existed] *************************************************************************************************************************************************
Saturday 21 March 2020  03:13:27 +0000 (0:01:03.819)       8:27:34.779 ********

TASK [Write kolla config json files] **********************************************************************************************************************************************
Saturday 21 March 2020  03:14:29 +0000 (0:01:01.615)       8:28:36.395 ********

TASK [Clean /var/lib/docker-puppet/docker-puppet-tasks*.json files] ***************************************************************************************************************
Saturday 21 March 2020  03:15:36 +0000 (0:01:07.554)       8:29:43.950 ********

Comment 11 Vincent S. Cojot 2020-03-23 12:50:43 UTC
Hi James, I spent most of the last 36 hours running deploy.sh and timing it. The results are quite disappointing (no major difference was seen, not even with mitogen):

************************************* ATTEMPT #1) *************************************:
$ diff /etc/ansible/ansible.cfg ansible.cfg |grep '>'
> forks          = 32
> display_skipped_hosts = False
> pipelining = True

RESULT: PASS
real    621m40.403s
user    700m52.525s
sys     200m50.139s


************************************* ATTEMPT #2) *************************************:
$ diff /etc/ansible/ansible.cfg ansible.cfg |grep '>'
> forks          = 32
> display_skipped_hosts = False

RESULT: PASS
real    631m21.758s
user    716m16.938s
sys     325m27.261s

************************************* ATTEMPT #3) *************************************:
$ diff /etc/ansible/ansible.cfg ansible.cfg |grep '>'
> forks          = 100

RESULT: FAIL
(not enough memory)

************************************* ATTEMPT #4) *************************************:
$ diff /etc/ansible/ansible.cfg ansible.cfg |grep '>'
> forks          = 32
> display_skipped_hosts = False
> pipelining = True
( mitogen : ENABLED )

RESULT: PASS
real    616m32.520s
user    666m26.979s
sys     131m2.002s

************************************* ATTEMPT #5) *************************************:
$ diff /etc/ansible/ansible.cfg ansible.cfg |grep '>'
> forks          = 100
> display_skipped_hosts = False
> pipelining = True
( mitogen : ENABLED )

RESULT: PASS
real    667m19.010s
user    705m48.422s
sys     173m40.255s

Comment 12 Vincent S. Cojot 2020-03-23 12:55:39 UTC
On the other hand, looking at ansible-playbook-command.sh under /var/lib/mistral, I'm wondering whether mitogen would pick up /home/stack/ansible.cfg at all, or only /etc/ansible/ansible.cfg.

Comment 13 Vincent S. Cojot 2020-03-23 13:50:55 UTC
Mistral on OSP13 does seem to use something else:

[root@tenlab1-director 4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c]# ls -la /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.cfg 
-rw-r--r--. 1 mistral mistral 644 May  8  2019 /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.cfg
[root@tenlab1-director 4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c]# cat /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.cfg
[defaults]
roles_path = /etc/ansible/roles:/usr/share/ansible/roles
retry_files_enabled = False
log_path = /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible.log
forks = 25
timeout = 30
gather_timeout = 30

[inventory]

[privilege_escalation]

[paramiko_connection]

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5
control_path_dir = /var/lib/mistral/4c2f0acb-d1c8-4beb-a1a3-972f7ac8299c/ansible-ssh
retries = 8
pipelining = True

[persistent_connection]

[accelerate]

[selinux]

[colors]

[diff]

Comment 15 Luke Short 2020-04-23 18:36:54 UTC
I would like to set the record straight on a few things here.

(1) Mitogen is a Python library that we do not ship in RHEL or RHOSP, meaning it is an unsupported solution. We have tested it in the past and saw negligible improvements. Instead, we are looking into improved Ansible Tower integration with RHOSP to help scale Ansible.

(2) config-download is a Technology Preview in RHOSP 13. We have done a massive amount of backports in the past to improve its speed and functionality. As RHOSP 13 continues to age, it has become increasingly complex and time-consuming to backport all of the latest config-download updates. Fully backporting all of the required changes would essentially mean rewriting RHOSP 13 to be RHOSP 16.

(3) RHOSP 16 is the release where we have been targeting full long-term support of config-download, with a focus on large-scale deployments. Our tests have shown that deployment time is 5x faster. We highly recommend upgrading to this long-life release for the best possible experience.

(4) We will continue to backport performance improvements wherever we can. However, none of those backports will magically make RHOSP 13 as fast as RHOSP 16 when it comes to deployment time.

Comment 16 Alex Schultz 2020-06-10 16:08:17 UTC
A couple of observations on the output:

1) tripleo-ssh-known-hosts took > 1 hour

TASK [tripleo-ssh-known-hosts : Add hosts key in /etc/ssh/ssh_known_hosts for live/cold-migration] ***
Saturday 14 March 2020  02:51:07 +0000 (0:01:18.577)       0:09:50.614 ********
...SNIP...
ok: [tenlab1-compute068] => (item=tenlab1-compute068)

PLAY [Server deployments] ******************************************************

TASK [include] *****************************************************************
Saturday 14 March 2020  03:50:15 +0000 (0:59:08.088)       1:08:58.702 ******** 


2) Each task is taking at least ~1 minute even when it's skipped. I wonder if this is related to the number of forks? Is this using the default fork count?


@Vincent, would you be able to try some patches in this environment to see if we can speed this up?
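The per-task times above are consistent with fork starvation: Ansible dispatches each task to at most `forks` hosts at a time, so a low fork count turns every task into many serial batches. A back-of-the-envelope model (all numbers assumed for illustration):

```python
import math


def task_seconds(hosts, forks, batch_overhead_s):
    # Per-task wall time is roughly one batch_overhead per batch of `forks` hosts.
    return math.ceil(hosts / forks) * batch_overhead_s


# With the Ansible default of forks = 5, 90 hosts need 18 batches per task;
# ~3.5 s of ssh/module overhead per batch gives the ~1 minute seen above.
print(task_seconds(90, 5, 3.5))   # 63.0
print(task_seconds(90, 80, 3.5))  # 7.0
```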

Comment 17 Alex Schultz 2020-06-10 23:21:21 UTC
For the record, I also emulated 100 nodes and I don't see such delays when running tasks against 100 nodes via ssh. Are you sure there aren't any network-related issues (or DNS, etc.) when running ansible against these hosts? It would also be helpful if you could provide the ansible.cfg that is being used. If you run a basic noop playbook against these hosts, does it have delays between task executions?

For example, the following playbook took ~1 min 30 s against 100 hosts:

- hosts: all
  gather_facts: false
  tasks:
    - name: Generate sleep time
      set_fact:
        sleep_time: "{{ inventory_hostname[-2:] | int % 3 }}"
    - name: Sleep time
      debug:
        var: sleep_time
    - name: Do random sleep (1)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (2)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (3)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (4)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (5)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (6)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (7)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (8)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (9)
      shell: sleep "{{ sleep_time }}"
    - name: Do random sleep (10)
      shell: sleep "{{ sleep_time }}"

Comment 19 Alex Schultz 2020-06-15 19:29:40 UTC
Ok, so I was able to reproduce tasks taking >1 minute for 100 hosts. If you use the default ansible.cfg, the run is terribly slow and reproduces the sluggishness seen in the logs. It appears that we don't have the `openstack tripleo config generate ansible` command that OSP16.x has, which would generate the same ansible.cfg we run the deployment with when not letting mistral invoke the ansible deployment.


For the record, the ansible.cfg we use in newer versions would be something to the effect of...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[defaults]
retry_files_enabled = False
roles_path = /root/.ansible/roles:/usr/share/ansible/tripleo-roles:/usr/share/ansible/roles:/etc/ansible/roles:/usr/share/openstack-tripleo-validations/roles
library = /root/.ansible/plugins/modules:/usr/share/ansible/tripleo-plugins/modules:/usr/share/ansible/plugins/modules:/usr/share/ansible-modules:/usr/share/openstack-tripleo-validations/library
callback_plugins = ~/.ansible/plugins/callback:/usr/share/ansible/tripleo-plugins/callback:/usr/share/ansible/plugins/callback:/usr/share/openstack-tripleo-validations/callback_plugins
callback_whitelist = profile_tasks
action_plugins = ~/.ansible/plugins/action:/usr/share/ansible/plugins/action:/usr/share/ansible/tripleo-plugins/action:/usr/share/openstack-tripleo-validations/action_plugins
lookup_plugins = /root/.ansible/plugins/lookup:/usr/share/ansible/tripleo-plugins/lookup:/usr/share/ansible/plugins/lookup:/usr/share/openstack-tripleo-validations/lookup_plugins
filter_plugins = ~/.ansible/plugins/filter:/usr/share/ansible/plugins/filter:/usr/share/ansible/tripleo-plugins/filter:/usr/share/openstack-tripleo-validations/filter_plugins
forks = 80
timeout = 30
gather_timeout = 30
gathering = smart
fact_caching = jsonfile
fact_caching_connection = ~/.ansible/fact_cache
internal_poll_interval = 0.05
interpreter_python = auto
fact_caching_timeout = 7200

[inventory]

[privilege_escalation]

[paramiko_connection]

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5 -o PreferredAuthentications=publickey
retries = 8
pipelining = True
scp_if_ssh = True

[persistent_connection]

[accelerate]

[selinux]

[colors]

[diff]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At a minimum I would recommend setting the following options:

[defaults]
forks = 80
timeout = 30
gather_timeout = 30
gathering = smart
fact_caching = jsonfile
fact_caching_connection = ~/.ansible/fact_cache
internal_poll_interval = 0.05
fact_caching_timeout = 7200
[ssh_connection]
pipelining = True
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=64 -o ServerAliveCountMax=1024 -o Compression=no -o TCPKeepAlive=yes -o VerifyHostKeyDNS=no -o ForwardX11=no -o ForwardAgent=yes -o PreferredAuthentications=publickey -T


For comparison the following ansible.cfg took 8m 43s to run the playbook from Comment 17 against 100 hosts.

[defaults]
host_key_checking = False
[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no

By configuring internal_poll_interval, forks, and pipelining, and tuning ssh, the same playbook took only 1m 37s:

[defaults]
internal_poll_interval = 0.05
forks = 25
host_key_checking = False
[ssh_connection]
pipelining = True
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=64 -o ServerAliveCountMax=1024 -o Compression=no -o TCPKeepAlive=yes -o VerifyHostKeyDNS=no -o ForwardX11=no -o ForwardAgent=yes -o PreferredAuthentications=publickey -T


The ansible.cfg from /var/lib/mistral shown in Comment 13 would be an excellent starting point; start with a copy of it.

The other thing I would check is that 192.168.122.1 is not listed in /etc/resolv.conf on the deployed hosts. There was a bug in some guest images where this entry was left in place and caused sluggish responses whenever DNS was queried (e.g. by yum/ssh/etc.).
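A quick way to spot the stale entry (the sample file below stands in for /etc/resolv.conf on a deployed host):

```shell
# Write a sample resolv.conf like the buggy guest images produced,
# then grep for the stale libvirt default-network nameserver.
cat > /tmp/resolv.conf.sample <<'EOF'
search tenlab1.example
nameserver 192.168.122.1
nameserver 10.10.0.1
EOF

if grep -q '^nameserver 192\.168\.122\.1$' /tmp/resolv.conf.sample; then
    echo "stale nameserver found - remove it"
fi
```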

Vincent, please give these options a shot and let us know the results. Thanks.

Comment 35 errata-xmlrpc 2020-10-28 18:23:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388

