Bug 1794012 - Ansible consumes a large amount of CPU and RAM resources when running the update /etc/hosts task on a scale deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Luke Short
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Duplicates: 1792425 1794013 1794014
Depends On:
Blocks:
Reported: 2020-01-22 13:56 UTC by Sai Sindhur Malleni
Modified: 2020-03-03 09:45 UTC (History)

Fixed In Version: tripleo-ansible-0.4.2-0.20200207140442.b750574.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-03 09:45:05 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 704152 | 0 | None | MERGED | Generate the /etc/hosts content once. | 2020-07-30 05:15:52 UTC
Red Hat Product Errata | RHBA-2020:0655 | 0 | None | None | None | 2020-03-03 09:45:28 UTC

Description Sai Sindhur Malleni 2020-01-22 13:56:04 UTC
Description of problem:
When scaling from 200 nodes to 250 nodes in OSP 16, the ansible-playbook process consumes around 50 cores on the undercloud during the update /etc/hosts task, which runs for around 22 minutes.

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200113.n.0

How reproducible:
100% at scale

Steps to Reproduce:
1. Run a scale-out on a large deployment.
2. Monitor the CPU usage of Ansible.

Actual results:
Ansible consumes around 50 cores on the undercloud

Expected results:
Ansible should use far fewer CPU resources for this task; consuming 50 cores isn't acceptable.

Additional info:
https://snapshot.raintank.io/dashboard/snapshot/Xujs6L7FAsCM8Kpzc63khcOUi8RBRkbj?orgId=2

Comment 1 Sai Sindhur Malleni 2020-01-22 14:54:32 UTC
*** Bug 1794014 has been marked as a duplicate of this bug. ***

Comment 2 Sai Sindhur Malleni 2020-01-22 14:54:39 UTC
*** Bug 1794013 has been marked as a duplicate of this bug. ***

Comment 3 Sai Sindhur Malleni 2020-01-22 16:04:48 UTC
The link in the bug description is wrong. Here is the correct link to the CPU consumption snapshot:
https://snapshot.raintank.io/dashboard/snapshot/g3Ije4s6TKilkNE063ykcojJsOqhSHkz?orgId=2

The spike to 5000% CPU (50 cores) occurs during the update /etc/hosts task.

Comment 4 James 2020-01-28 16:42:16 UTC
To clarify, the forkcount for ansible-playbook is set to 50?

Comment 5 Luke Short 2020-01-28 16:53:28 UTC
This issue was not necessarily related to the fork count (although it does default to 50, and we have increased it to 500 for scale testing). Ansible would process a large Jinja template covering all of the hosts in the stack, and it would repeat this processing for every single host. We only need to render that template once. I patched this upstream, and the fix is now available in a build for QE to test.
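
To illustrate the difference, here is a minimal sketch in plain Python rather than Ansible/Jinja. It is not the tripleo-ansible patch itself: the `HostsRenderer` class, the two `update_*` functions, and the 250-host inventory are hypothetical names and made-up data, used only to count how often the full hosts block gets rendered under each approach.

```python
from typing import Dict


class HostsRenderer:
    """Stand-in for the expensive template render (hypothetical, not tripleo-ansible code)."""

    def __init__(self, inventory: Dict[str, str]) -> None:
        self.inventory = inventory
        self.calls = 0  # how many times the full block was rendered

    def render(self) -> str:
        # The cost of one render grows with the number of hosts in the stack.
        self.calls += 1
        return "\n".join(f"{ip} {name}" for name, ip in sorted(self.inventory.items()))


def update_per_host(renderer: HostsRenderer) -> Dict[str, str]:
    """Behaviour described in the bug: render the whole block again for each host."""
    return {host: renderer.render() for host in renderer.inventory}


def update_render_once(renderer: HostsRenderer) -> Dict[str, str]:
    """Behaviour described in the fix: render once, reuse the result for every host."""
    block = renderer.render()
    return {host: block for host in renderer.inventory}


# Made-up inventory of 250 hosts, matching the scale mentioned in the report.
inventory = {f"compute-{i}": f"192.168.0.{i + 1}" for i in range(250)}

slow = HostsRenderer(inventory)
fast = HostsRenderer(inventory)
assert update_per_host(slow) == update_render_once(fast)  # same content either way
print(slow.calls, fast.calls)  # 250 1
```

Because each render walks every host, the per-host approach does work proportional to the square of the node count, while rendering once keeps it linear.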

Comment 7 Luke Short 2020-01-28 17:14:54 UTC
*** Bug 1792425 has been marked as a duplicate of this bug. ***

Comment 8 David Rosenfeld 2020-02-05 23:36:04 UTC
Performed two tests:

- A test with 1 controller, 1 compute, and 3 Ceph nodes. In this case the Update /etc/hosts task took 1.499s according to /var/lib/mistral/overcloud/ansible.log:

2020-02-05 15:16:34,265 p=734 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 15:16:34,265 p=734 u=mistral |  Wednesday 05 February 2020  15:16:34 +0000 (0:00:00.469)       0:02:13.105 ****
2020-02-05 15:16:35,324 p=734 u=mistral |  changed: [ceph-0]
2020-02-05 15:16:35,383 p=734 u=mistral |  changed: [ceph-1]
2020-02-05 15:16:35,401 p=734 u=mistral |  changed: [ceph-2]
2020-02-05 15:16:35,618 p=734 u=mistral |  changed: [compute-0]
2020-02-05 15:16:35,764 p=734 u=mistral |  changed: [controller-0]


- A 10-node scaling test. In this case the Update /etc/hosts task took 16.152s according to /var/lib/mistral/overcloud/ansible.log:

2020-02-05 21:48:26,382 p=19469 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 21:48:26,382 p=19469 u=mistral |  Wednesday 05 February 2020  21:48:26 +0000 (0:00:01.596)       0:05:48.368 ****
2020-02-05 21:48:36,027 p=19469 u=mistral |  changed: [compute-6]
2020-02-05 21:48:37,212 p=19469 u=mistral |  changed: [ceph-0]
2020-02-05 21:48:37,848 p=19469 u=mistral |  changed: [compute-10]
2020-02-05 21:48:38,146 p=19469 u=mistral |  changed: [compute-0]
2020-02-05 21:48:38,291 p=19469 u=mistral |  changed: [ceph-2]
2020-02-05 21:48:38,589 p=19469 u=mistral |  changed: [compute-1]
2020-02-05 21:48:39,483 p=19469 u=mistral |  changed: [ceph-1]
2020-02-05 21:48:39,584 p=19469 u=mistral |  changed: [compute-11]
2020-02-05 21:48:39,829 p=19469 u=mistral |  changed: [compute-2]
2020-02-05 21:48:39,836 p=19469 u=mistral |  changed: [compute-3]
2020-02-05 21:48:40,439 p=19469 u=mistral |  changed: [compute-4]
2020-02-05 21:48:40,751 p=19469 u=mistral |  changed: [controller-1]
2020-02-05 21:48:41,010 p=19469 u=mistral |  changed: [compute-7]
2020-02-05 21:48:41,015 p=19469 u=mistral |  changed: [compute-8]
2020-02-05 21:48:41,060 p=19469 u=mistral |  changed: [compute-9]
2020-02-05 21:48:41,251 p=19469 u=mistral |  changed: [compute-5]
2020-02-05 21:48:41,767 p=19469 u=mistral |  changed: [controller-2]
2020-02-05 21:48:42,534 p=19469 u=mistral |  changed: [controller-0]

In both cases the update took much less than a minute as specified in Comment 6.
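
For anyone repeating this check, one way to get a task duration from ansible.log is to subtract the timestamp of the TASK line from the timestamp of the last host result. The sketch below assumes that method and the timestamp layout seen in the excerpts above; the function names are hypothetical, and the sample is an abridged copy of the second log.

```python
from datetime import datetime
from typing import Iterable

# Layout of the leading timestamp, e.g. "2020-02-05 21:48:26,382".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
STAMP_LENGTH = 23  # characters occupied by the timestamp at the start of a line


def log_timestamp(line: str) -> datetime:
    """Parse the timestamp at the start of an ansible.log line."""
    return datetime.strptime(line[:STAMP_LENGTH], LOG_TIME_FORMAT)


def task_duration_seconds(lines: Iterable[str]) -> float:
    """Seconds between the earliest and latest timestamp in the given lines."""
    stamps = [log_timestamp(line) for line in lines if line.strip()]
    return (max(stamps) - min(stamps)).total_seconds()


# Abridged lines from the 10-node scaling test: task start, first and last result.
sample = [
    "2020-02-05 21:48:26,382 p=19469 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts]",
    "2020-02-05 21:48:36,027 p=19469 u=mistral |  changed: [compute-6]",
    "2020-02-05 21:48:42,534 p=19469 u=mistral |  changed: [controller-0]",
]
print(task_duration_seconds(sample))  # 16.152
```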

Comment 13 errata-xmlrpc 2020-03-03 09:45:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0655

