Summary: | Ceph-ansible does not honor --limit passed to openstack overcloud deploy | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Dave Wilson <dwilson> |
Component: | tripleo-ansible | Assignee: | John Fulton <johfulto> |
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 16.1 (Train) | CC: | fpantano, gfidente, jamsmith, johfulto, lshort, mburns, nwolf, pgrist, psahoo, smalleni |
Target Milestone: | z2 | Keywords: | Triaged |
Target Release: | 16.1 (Train on RHEL 8.2) | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | tripleo-ansible-0.5.1-1.20200914163922.902c3c8.el8ost | Doc Type: | Bug Fix |
Doc Text: |
This update increases the speed of stack updates in certain cases.
+
Previously, the Ansible --limit option was not passed through to ceph-ansible, so during a stack update ceph-ansible ran idempotent (no-op) updates on all nodes even when the --limit argument was used, degrading stack update performance.
+
Now director intercepts the Ansible --limit option passed to `openstack overcloud deploy` commands and forwards it to the ceph-ansible execution, reducing the time required for stack updates.
+
[IMPORTANT]
Always include the undercloud in the limit list when using this feature with ceph-ansible.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-10-28 15:38:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: |
Description
Dave Wilson
2020-07-09 00:51:09 UTC
For example, the config-download command is:

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/containers-prepare-parameter.yaml \
  -e /home/stack/templates/network-environment.yaml \
  -e /home/stack/templates/deploy.yaml \
  -e /home/stack/templates/ceph.yaml \
  -e /home/stack/templates/networking_ovn_hotfix_template.yaml \
  -r /home/stack/templates/roles_data.yaml \
  --ntp-server clock1.rdu2.redhat.com,clock.redhat.com,clock2.redhat.com \
  --validation-warnings-fatal --config-download-only --config-download-timeout 960 \
  --limit Undercloud,Controller,overcloud-fc640compute-10

which results in the right ansible-playbook-command.sh being generated in /var/lib/mistral/overcloud/ansible-playbook-command.sh:

ansible-playbook-3 -v /var/lib/mistral/overcloud/deploy_steps_playbook.yaml \
  --limit Undercloud:Controller:overcloud-fc640compute-10 --become --timeout 600 \
  --inventory-file /var/lib/mistral/overcloud/tripleo-ansible-inventory.yaml \
  --skip-tags opendev-validation "$@"

However, the ceph-ansible command does not get the --limit option and effectively runs on all the existing nodes (several hundred), which ends up taking 2.5+ hours just for the ceph-ansible step even when scaling up computes by one node:

(undercloud) [stack@f17-h23-000-1029p ceph-ansible]$ cat ceph_ansible_command.sh
#!/usr/bin/env bash
set -e
echo "Running $0" >> /var/lib/mistral/overcloud/ceph-ansible/ceph_ansible_command.log
ANSIBLE_ACTION_PLUGINS=/usr/share/ceph-ansible/plugins/actions/ \
ANSIBLE_CALLBACK_PLUGINS=/usr/share/ceph-ansible/plugins/callback/ \
ANSIBLE_FILTER_PLUGINS=/usr/share/ceph-ansible/plugins/filter/ \
ANSIBLE_ROLES_PATH=/usr/share/ceph-ansible/roles/ \
ANSIBLE_LOG_PATH="/var/lib/mistral/overcloud/ceph-ansible/ceph_ansible_command.log" \
ANSIBLE_LIBRARY=/usr/share/ceph-ansible/library/ \
ANSIBLE_CONFIG=/usr/share/ceph-ansible/ansible.cfg \
ANSIBLE_REMOTE_TEMP="/tmp/ceph_ansible_tmp" \
ANSIBLE_FORKS=25 ANSIBLE_GATHER_TIMEOUT=60 \
ANSIBLE_CALLBACK_WHITELIST=profile_tasks ANSIBLE_STDOUT_CALLBACK=default \
ansible-playbook --private-key /var/lib/mistral/overcloud/ssh_private_key \
  -e ansible_python_interpreter=/usr/bin/python3 -v \
  --skip-tags package-install,with_pkg \
  --extra-vars @/var/lib/mistral/overcloud/ceph-ansible/extra_vars.yml \
  -i /var/lib/mistral/overcloud/ceph-ansible/inventory.yml \
  /usr/share/ceph-ansible/site-container.yml.sample 2>&1

As per the reference bz update below, I guess the ceph client has already been enhanced with the --limit option since the ceph-ansible-4.0.15 package, but we experienced this issue with the latest ceph-ansible package.

https://bugzilla.redhat.com/show_bug.cgi?id=1798781#c2

$ rpm -qa | grep ceph
ceph-ansible-4.0.23-1.el8cp.noarch
puppet-ceph-3.1.2-0.20200603075505.3b8ab1f.el8ost.noarch

Red Hat OpenStack Platform release 16.1.0 RC (Train)
OSP Puddle: 16.1_20200625.1

Luke, it seems the {{ ansible_limit }} variable does not work in OSP 16.1. Please suggest if any additional arguments are required in the overcloud deploy command.

We currently *do not* set the --limit option for ceph-ansible from {{ ansible_limit }}; it is set from the {{ ceph_ansible_limit }} var [1], which can be set by passing -e to ansible-playbook

1.
https://github.com/openstack/tripleo-ansible/blob/stable/train/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/main.yml#L55

So what would be the workflow for someone kicking off a deploy using "openstack overcloud deploy"?

(In reply to Giulio Fidente from comment #3)
> We currently *do not* set the --limit option for ceph-ansible from
> {{ ansible_limit }}, it is set from the {{ ceph_ansible_limit }} var [1]
> which can be set passing -e to ansible-playbook

Which "ansible-playbook" is this? Is it the one in ansible-playbook-command.sh or in ceph_ansible_command.sh?

> 1. https://github.com/openstack/tripleo-ansible/blob/stable/train/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/main.yml#L55

What would the workflow of the user be when driving this from the "openstack overcloud deploy" command?

Looks like it is by passing -e ceph_ansible_limit to ansible-playbook-command.sh? Even then, it is not clear how a user would drive this from the openstack CLI during deploy without needing to tweak the playbook manually.

(In reply to Sai Sindhur Malleni from comment #5)
> Looks like it is by passing -e ceph_ansible_limit to
> ansible-playbook-command.sh? Even then, it is not clear how a user would
> drive this from the openstack CLI during deploy without needing to tweak
> the playbook manually.

agreed, that is just a workaround; I triaged the bug and we'll work on it in a 16.1 zstream to make the ceph-ansible call reuse the {{ ansible_limit }} value

(In reply to Sai Sindhur Malleni from comment #5)
> Looks like it is by passing -e ceph_ansible_limit to
> ansible-playbook-command.sh? Even then, it is not clear how a user would
> drive this from the openstack CLI during deploy without needing to tweak
> the playbook manually.

alternatively, can you try the "DeploymentServerBlacklist" THT parameter? [1] blacklisted hostnames will *not* end up in the ceph-ansible inventory at all, basically implementing something close to what --limit would do

1.
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L53

Ack. Thanks for clarifying. I will give that a try.

Just for future reference, here is the time taken for the ansible config-download run when scaling up the number of compute nodes from 471 to 472 using --limit passed to the openstack overcloud deploy command:

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/containers-prepare-parameter.yaml \
  -e /home/stack/templates/network-environment.yaml \
  -e /home/stack/templates/deploy.yaml \
  -e /home/stack/templates/ceph.yaml \
  -e /home/stack/templates/networking_ovn_hotfix_template.yaml \
  -r /home/stack/templates/roles_data.yaml \
  --ntp-server clock1.rdu2.redhat.com,clock.redhat.com,clock2.redhat.com \
  --validation-warnings-fatal --config-download-only --config-download-timeout 960 \
  --limit Undercloud,Controller,overcloud-fc640compute-10

===============================================================================
tripleo-ceph-run-ansible : run ceph-ansible ----------------------------------------------- 8830.51s
Wait for containers to start for step 2 using paunch --------------------------------------- 152.24s
Wait for containers to start for step 3 using paunch ---------------------------------------- 84.41s
Pre-fetch all the containers ---------------------------------------------------------------- 69.84s
Wait for container-puppet tasks (generate config) to finish --------------------------------- 55.32s
Wait for containers to start for step 5 using paunch ---------------------------------------- 45.45s
tripleo-ceph-uuid : run nodes-uuid command -------------------------------------------------- 42.59s
Run tripleo-container-image-prepare logged to: /var/log/tripleo-container-image-prepare.log - 38.03s
Run NetworkConfig script -------------------------------------------------------------------- 36.75s
Write kolla config json files --------------------------------------------------------------- 23.23s
Creating container startup configs for step_4 ----------------------------------------------- 20.58s
Wait for containers to start for step 4 using paunch ---------------------------------------- 19.62s
tripleo-ceph-run-ansible : search output of ceph-ansible run(s) non-zero return codes ------- 18.16s
Wait for container-puppet tasks (bootstrap tasks) for step 4 to finish ---------------------- 16.41s
tripleo-hosts-entries : Render out the hosts entries ---------------------------------------- 15.68s
Wait for puppet host configuration to finish ------------------------------------------------ 13.34s
Wait for puppet host configuration to finish ------------------------------------------------ 13.31s
Wait for container-puppet tasks (bootstrap tasks) for step 3 to finish ---------------------- 13.27s
Wait for puppet host configuration to finish ------------------------------------------------ 13.18s
Wait for puppet host configuration to finish ------------------------------------------------ 13.13s

We can see that by default the majority of the time is spent in ceph-ansible as it runs against all the nodes.

(In reply to Sai Sindhur Malleni from comment #8)
> tripleo-ceph-run-ansible : run ceph-ansible -------------------------------- 8830.51s
>
> We can see that by default the majority of the time is spent in ceph-ansible
> as it runs against all the nodes.

yes, thanks for collecting this data! if you can, please try the DeploymentServerBlacklist param and see if it works as intended

(In reply to Giulio Fidente from comment #9)
> yes, thanks for collecting this data! if you can, please try the
> DeploymentServerBlacklist param and see if it works as intended

So looking at that, it looks like I have to list out every node I don't want ansible to run on? So that's 470+ nodes in my case. Is that right or am I missing something?
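For reference, DeploymentServerBlacklist is set in a Heat environment file as a list of server hostnames to exclude. A minimal sketch of what such a file could look like (the file path and node names here are placeholders, not taken from this report; a real 470-node case would need every excluded node listed):

```shell
# Sketch: generate a Heat environment file that blacklists nodes from
# config-download, and therefore from the ceph-ansible inventory.
# Node names below are placeholders for illustration only.
cat > /tmp/blacklist-env.yaml <<'EOF'
parameter_defaults:
  DeploymentServerBlacklist:
    - overcloud-compute-0
    - overcloud-compute-1
    - overcloud-compute-2
EOF

# The file would then be passed with -e on the deploy command, e.g.:
#   openstack overcloud deploy --templates ... -e /tmp/blacklist-env.yaml
grep -c 'overcloud-compute' /tmp/blacklist-env.yaml   # prints 3
```

Unlike `--limit`, which names the nodes to include, this parameter names the nodes to exclude, which is why it scales poorly when only one or two nodes should be touched.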
(In reply to Sai Sindhur Malleni from comment #10)
> So looking at that, it looks like I have to list out every node I don't want
> ansible to run on? So that's 470+ nodes in my case. Is that right or am I
> missing something?

correct, it's a blacklist so you have to list all nodes which you don't want to be reconfigured ... the only good news is names should all follow the same pattern when deployed by director

(In reply to Giulio Fidente from comment #11)
> correct, it's a blacklist so you have to list all nodes which you don't want
> to be reconfigured ... the only good news is names should all follow the
> same pattern when deployed by director

Ack. Just FYI, I have different composable roles and different naming schemes for all my compute nodes :-). I guess I'm just trying to figure out the best path for me currently; since you already triaged this and are going to fix it, I'm not worried about that. Quick question: is -e ceph_ansible_limit expected to be passed to ansible-playbook-command.sh or to ceph_ansible_command.sh? Something like

ansible-playbook-3 -vvv /var/lib/mistral/overcloud/deploy_steps_playbook.yaml \
  --limit Undercloud:Controller:overcloud-p1029compute-54 --become --timeout 600 \
  --inventory-file /var/lib/mistral/overcloud/tripleo-ansible-inventory.yaml \
  --skip-tags opendev-validation "$@" \
  -e "ceph_ansible_limit=Undercloud:Controller:overcloud-p1029compute-54"

seems to be working.

(In reply to Sai Sindhur Malleni from comment #12)
> Quick question: is -e ceph_ansible_limit expected to be passed to
> ansible-playbook-command.sh or to ceph_ansible_command.sh?

/var/lib/mistral/config-download-latest/ceph-ansible/ceph_ansible_command.sh

it will just be appended to the command line for ansible-playbook like it already happens in [1]

1. https://github.com/openstack/tripleo-ansible/blob/stable/train/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/main.yml#L55

(In reply to Pradipta Kumar Sahoo from comment #2)
> As per the reference bz update below, I guess the ceph client has already
> been enhanced with the --limit option since the ceph-ansible-4.0.15 package,
> but we experienced this issue with the latest ceph-ansible package.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1798781#c2
>
> $ rpm -qa | grep ceph
> ceph-ansible-4.0.23-1.el8cp.noarch
> puppet-ceph-3.1.2-0.20200603075505.3b8ab1f.el8ost.noarch
>
> Red Hat OpenStack Platform release 16.1.0 RC (Train)
> OSP Puddle: 16.1_20200625.1
>
> Luke, it seems the {{ ansible_limit }} variable does not work in OSP 16.1.
> Please suggest if any additional arguments are required in the overcloud
> deploy command.

I think this question is answered by comment #15.

As per Luke: Ansible has a "magic" variable for limit called `ansible_limit`. Whatever is passed as --limit will be saved to it. We can probably re-use that with the Ceph bits.

*** Bug 1856965 has been marked as a duplicate of this bug. ***

Using Ansible --limit with ceph-ansible
---------------------------------------

When using config-download to configure Ceph, if Ansible's `--limit` option is used, then it is passed to the execution of ceph-ansible too. This is the case for Train and newer. In the previous section an example was provided where Ceph was deployed with TripleO. The examples below show how to update the deployment and pass the `--limit` option.

If oc0-cephstorage-0 had a disk failure and a factory clean disk was put in place of the failed disk, then the following could be run so that the new disk is used to bring up the missing OSD, and so that ceph-ansible is only run on the nodes where it needs to be run. This is useful to reduce the time it takes to update the deployment::

  openstack overcloud deploy --templates -r /home/stack/roles_data.yaml \
    -n /usr/share/openstack-tripleo-heat-templates/network_data_dashboard.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
    -e ~/my-ceph-settings.yaml \
    --limit oc0-controller-0:oc0-controller-2:oc0-controller-1:oc0-cephstorage-0:undercloud

If config-download has generated an `ansible-playbook-command.sh` script, then that script may also be run with the `--limit` option and it will be passed to ceph-ansible::

  ./ansible-playbook-command.sh --limit oc0-controller-0:oc0-controller-2:oc0-controller-1:oc0-cephstorage-0:undercloud

In the above example the controllers are included because the Ceph Mons need Ansible to change their OSD definitions. Both commands above would do the same thing; the former would only be needed if there were Heat environment file updates. After either of the above has run, the `~/config-download/config-download-latest/ceph-ansible/ceph_ansible_command.sh` file should contain the `--limit` option.

.. warning:: You must always include the undercloud in the limit list or
   ceph-ansible will not be executed when using `--limit`. This is necessary
   because the ceph-ansible execution happens through the
   external_deploy_steps_tasks playbook and that playbook only runs on the
   undercloud.
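The undercloud requirement above lends itself to a small pre-flight check before kicking off a limited run. A sketch, assuming colon-separated limit strings as in the examples above (the function name is illustrative, not part of any TripleO tooling):

```shell
# Sketch: refuse a limited deploy unless the undercloud is in the limit
# list, since ceph-ansible is triggered from the undercloud via the
# external_deploy_steps_tasks playbook. Assumes ':' separators; Ansible
# also accepts commas, which this sketch does not handle.
check_limit_includes_undercloud() {
  local limit="$1"
  # Wrap in ':' so the match works at either end of the list; ${limit,,}
  # lowercases (bash 4+) to accept "Undercloud" as well.
  case ":${limit,,}:" in
    *:undercloud:*) return 0 ;;
    *) echo "ERROR: --limit must include the undercloud" >&2; return 1 ;;
  esac
}

check_limit_includes_undercloud \
  "oc0-controller-0:oc0-cephstorage-0:undercloud" && echo "limit OK"
```

A deploy wrapper could call this before invoking `openstack overcloud deploy --limit ...` and abort early instead of silently skipping the ceph-ansible execution.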
Docbug for this feature: bz 1871177

Verified on tripleo-ansible-0.5.1-1.20200914163922.902c3c8.el8ost.noarch:

cat /var/lib/mistral/overcloud/ceph-ansible/ceph_ansible_command.sh
#!/usr/bin/env bash
set -e
echo "Running $0" >> /var/lib/mistral/overcloud/ceph-ansible/ceph_ansible_command.log
ANSIBLE_ACTION_PLUGINS=/usr/share/ceph-ansible/plugins/actions/ \
ANSIBLE_CALLBACK_PLUGINS=/usr/share/ceph-ansible/plugins/callback/ \
ANSIBLE_FILTER_PLUGINS=/usr/share/ceph-ansible/plugins/filter/ \
ANSIBLE_ROLES_PATH=/usr/share/ceph-ansible/roles/ \
ANSIBLE_LOG_PATH="/var/lib/mistral/overcloud/ceph-ansible/ceph_ansible_command.log" \
ANSIBLE_SSH_CONTROL_PATH_DIR="/tmp/ceph_ansible_control_path" \
ANSIBLE_LIBRARY=/usr/share/ceph-ansible/library/ \
ANSIBLE_CONFIG=/usr/share/ceph-ansible/ansible.cfg \
ANSIBLE_REMOTE_TEMP="/tmp/ceph_ansible_tmp" \
ANSIBLE_FORKS=25 ANSIBLE_GATHER_TIMEOUT=60 \
ANSIBLE_CALLBACK_WHITELIST=profile_tasks ANSIBLE_STDOUT_CALLBACK=default \
ansible-playbook --private-key /var/lib/mistral/overcloud/ssh_private_key \
  -e ansible_python_interpreter=/usr/bin/python3 -vv \
  --skip-tags package-install,with_pkg \
  --extra-vars @/var/lib/mistral/overcloud/ceph-ansible/extra_vars.yml \
  --limit Undercloud:Controller:compute-1 \
  -i /var/lib/mistral/overcloud/ceph-ansible/inventory.yml \
  /usr/share/ceph-ansible/site-container.yml.sample 2>&1

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284
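The verification above reduces to checking that the generated ceph-ansible wrapper now carries the --limit flag. A sketch using a mocked, abbreviated copy of the script (the /tmp path stands in for /var/lib/mistral/<stack>/ceph-ansible/ceph_ansible_command.sh on a real undercloud):

```shell
# Sketch: confirm that a generated ceph_ansible_command.sh passes --limit
# through to ansible-playbook. The file below is a mocked, shortened copy;
# on a real undercloud you would grep the actual generated script.
mock=/tmp/ceph_ansible_command.sh
cat > "$mock" <<'EOF'
#!/usr/bin/env bash
set -e
ansible-playbook --limit Undercloud:Controller:compute-1 \
  -i /var/lib/mistral/overcloud/ceph-ansible/inventory.yml \
  /usr/share/ceph-ansible/site-container.yml.sample
EOF

if grep -q -- '--limit' "$mock"; then
  echo "limit pass-through present"
else
  echo "BUG: ceph-ansible would run against all nodes" >&2
fi
```

On builds without the fix (as in the original description), the same grep would find no `--limit` in the generated script.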