Description of problem:
When scaling from 200 nodes to 250 nodes in OSP 16, the ansible-playbook process consumes around 50 cores on the undercloud during the "Update /etc/hosts" task, which runs for around 22 minutes.

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200113.n.0

How reproducible:
100% at scale

Steps to Reproduce:
1. Run a scale-out on a large deployment
2. Monitor CPU usage of ansible-playbook on the undercloud

Actual results:
Ansible consumes around 50 cores on the undercloud.

Expected results:
CPU usage stays reasonable during the scale-out; consuming 50 cores is not acceptable.

Additional info:
https://snapshot.raintank.io/dashboard/snapshot/Xujs6L7FAsCM8Kpzc63khcOUi8RBRkbj?orgId=2
*** Bug 1794014 has been marked as a duplicate of this bug. ***
*** Bug 1794013 has been marked as a duplicate of this bug. ***
The link in the bug description is wrong. Here is the correct link to the CPU consumption data:
https://snapshot.raintank.io/dashboard/snapshot/g3Ije4s6TKilkNE063ykcojJsOqhSHkz?orgId=2
The 5000% spike (i.e. 50 cores) occurs during the "Update /etc/hosts" task.
To clarify, is the fork count for ansible-playbook set to 50?
This issue is not necessarily tied to the fork count (it does default to 50, although we have increased it to 500 for scale testing). Ansible was rendering a large Jinja template containing every host in the stack, and it repeated that rendering on every single host. We actually only need to render that template once. I patched this upstream and now have it available in a build for QE to test with.
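For illustration, here is a minimal sketch of the general idea (not the actual upstream patch): render the hosts block a single time with run_once and let every host reuse the pre-rendered result. The group name "overcloud", the variable name "rendered_hosts_block", and the blockinfile marker are assumptions made for this example only; the real tripleo-hosts-entries role differs in its details.

# Hypothetical sketch only, assuming a group named "overcloud"
- name: Render the hosts entries once for the whole play
  run_once: true
  set_fact:
    rendered_hosts_block: |
      {% for item in groups['overcloud'] | default(groups['all']) %}
      {{ hostvars[item]['ansible_host'] | default(item) }} {{ item }}
      {% endfor %}

- name: Update /etc/hosts with the pre-rendered block on every host
  become: true
  blockinfile:
    path: /etc/hosts
    marker: "# {mark} ANSIBLE MANAGED BLOCK (overcloud hosts)"
    # Look the fact up on the host that rendered it, so the large
    # template is never re-rendered per host.
    block: "{{ hostvars[ansible_play_hosts[0]]['rendered_hosts_block'] }}"

With run_once the expensive template is evaluated a single time rather than once per node, which is what keeps undercloud CPU usage from scaling with the node count.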
*** Bug 1792425 has been marked as a duplicate of this bug. ***
Performed two tests:

- A 1cont, 1comp, 3ceph test. In this case the "Update /etc/hosts" task from /var/lib/mistral/overcloud/ansible.log took 499ms:

2020-02-05 15:16:34,265 p=734 u=mistral | TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 15:16:34,265 p=734 u=mistral | Wednesday 05 February 2020 15:16:34 +0000 (0:00:00.469) 0:02:13.105 ****
2020-02-05 15:16:35,324 p=734 u=mistral | changed: [ceph-0]
2020-02-05 15:16:35,383 p=734 u=mistral | changed: [ceph-1]
2020-02-05 15:16:35,401 p=734 u=mistral | changed: [ceph-2]
2020-02-05 15:16:35,618 p=734 u=mistral | changed: [compute-0]
2020-02-05 15:16:35,764 p=734 u=mistral | changed: [controller-0]

- A 10 node scaling test. In this case the "Update /etc/hosts" task from /var/lib/mistral/overcloud/ansible.log took 16.152s:

2020-02-05 21:48:26,382 p=19469 u=mistral | TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 21:48:26,382 p=19469 u=mistral | Wednesday 05 February 2020 21:48:26 +0000 (0:00:01.596) 0:05:48.368 ****
2020-02-05 21:48:36,027 p=19469 u=mistral | changed: [compute-6]
2020-02-05 21:48:37,212 p=19469 u=mistral | changed: [ceph-0]
2020-02-05 21:48:37,848 p=19469 u=mistral | changed: [compute-10]
2020-02-05 21:48:38,146 p=19469 u=mistral | changed: [compute-0]
2020-02-05 21:48:38,291 p=19469 u=mistral | changed: [ceph-2]
2020-02-05 21:48:38,589 p=19469 u=mistral | changed: [compute-1]
2020-02-05 21:48:39,483 p=19469 u=mistral | changed: [ceph-1]
2020-02-05 21:48:39,584 p=19469 u=mistral | changed: [compute-11]
2020-02-05 21:48:39,829 p=19469 u=mistral | changed: [compute-2]
2020-02-05 21:48:39,836 p=19469 u=mistral | changed: [compute-3]
2020-02-05 21:48:40,439 p=19469 u=mistral | changed: [compute-4]
2020-02-05 21:48:40,751 p=19469 u=mistral | changed: [controller-1]
2020-02-05 21:48:41,010 p=19469 u=mistral | changed: [compute-7]
2020-02-05 21:48:41,015 p=19469 u=mistral | changed: [compute-8]
2020-02-05 21:48:41,060 p=19469 u=mistral | changed: [compute-9]
2020-02-05 21:48:41,251 p=19469 u=mistral | changed: [compute-5]
2020-02-05 21:48:41,767 p=19469 u=mistral | changed: [controller-2]
2020-02-05 21:48:42,534 p=19469 u=mistral | changed: [controller-0]

In both cases the update took much less than a minute, as specified in Comment 6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0655