Bug 1794012

Summary: Ansible consumes a large amount of CPU and RAM resources when running the update /etc/hosts task on a scale deployment
Product: Red Hat OpenStack Reporter: Sai Sindhur Malleni <smalleni>
Component: tripleo-ansible    Assignee: Luke Short <lshort>
Status: CLOSED ERRATA QA Contact: Sasha Smolyak <ssmolyak>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.0 (Train)    CC: aschultz, bdobreli, drosenfe, jschluet, jtanner
Target Milestone: ---    Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tripleo-ansible-0.4.2-0.20200207140442.b750574.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-03 09:45:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2020-01-22 13:56:04 UTC
Description of problem:
When scaling from 200 nodes to 250 nodes in OSP 16, we see the ansible-playbook process consume around 50 cores on the undercloud during the update /etc/hosts task, which runs for around 22 minutes.

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200113.n.0

How reproducible:
100% at scale

Steps to Reproduce:
1. Run a scale out on a large deployment
2. Monitor CPU usage of ansible

Actual results:
Ansible consumes around 50 cores on the undercloud

Expected results:
Ansible should consume far fewer resources for this task; consuming around 50 cores isn't acceptable.

Additional info:
https://snapshot.raintank.io/dashboard/snapshot/Xujs6L7FAsCM8Kpzc63khcOUi8RBRkbj?orgId=2

Comment 1 Sai Sindhur Malleni 2020-01-22 14:54:32 UTC
*** Bug 1794014 has been marked as a duplicate of this bug. ***

Comment 2 Sai Sindhur Malleni 2020-01-22 14:54:39 UTC
*** Bug 1794013 has been marked as a duplicate of this bug. ***

Comment 3 Sai Sindhur Malleni 2020-01-22 16:04:48 UTC
The link in the bug description is wrong. Here is the correct link to the CPU consumption data:
https://snapshot.raintank.io/dashboard/snapshot/g3Ije4s6TKilkNE063ykcojJsOqhSHkz?orgId=2

The spike to 5000% (50 cores, since per-process CPU is reported as a percentage of a single core) occurs during the update /etc/hosts task.

Comment 4 James 2020-01-28 16:42:16 UTC
To clarify, is the fork count for ansible-playbook set to 50?

Comment 5 Luke Short 2020-01-28 16:53:28 UTC
This issue was not related to the fork count (although it does default to 50, and we have increased it to 500 for scale testing). Ansible was processing a large Jinja template containing all of the hosts in the stack, and it was doing that processing on every single host. We actually only need to render that template once. I patched this upstream, and the fix is now available in a build for QE to test with.
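
For illustration, here is a minimal sketch (in Ansible YAML) of the render-once pattern described above. It is not the actual tripleo-ansible patch; the template name, fact name, and blockinfile marker below are hypothetical, and the real tripleo-hosts-entries role may structure this differently.

  # Hypothetical sketch: render the expensive hosts Jinja template a single
  # time instead of re-rendering it independently for every host in the play.
  - name: Render hosts entries once for the whole play
    run_once: true    # the lookup runs on the first host only
    ansible.builtin.set_fact:
      tripleo_hosts_entries: "{{ lookup('template', 'hosts-entries.j2') }}"

  - name: Update /etc/hosts on every node with the pre-rendered content
    become: true
    ansible.builtin.blockinfile:
      path: /etc/hosts
      create: true
      # Read the fact from the host that rendered it, so this works even if
      # the run_once fact is not propagated to every host in the play.
      block: "{{ hostvars[ansible_play_hosts_all[0]]['tripleo_hosts_entries'] }}"
      marker: "# {mark} TRIPLEO HOSTS ENTRIES (sketch)"

The point, per the comment above, is that the expensive Jinja rendering happens once on the controller rather than once per target host; how the rendered block is then written out to each node is a secondary detail.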

Comment 7 Luke Short 2020-01-28 17:14:54 UTC
*** Bug 1792425 has been marked as a duplicate of this bug. ***

Comment 8 David Rosenfeld 2020-02-05 23:36:04 UTC
Performed two tests:

- A 1-controller, 1-compute, 3-ceph test. In this case the Update /etc/hosts task from /var/lib/mistral/overcloud/ansible.log took 499ms:

2020-02-05 15:16:34,265 p=734 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 15:16:34,265 p=734 u=mistral |  Wednesday 05 February 2020  15:16:34 +0000 (0:00:00.469)       0:02:13.105 ****
2020-02-05 15:16:35,324 p=734 u=mistral |  changed: [ceph-0]
2020-02-05 15:16:35,383 p=734 u=mistral |  changed: [ceph-1]
2020-02-05 15:16:35,401 p=734 u=mistral |  changed: [ceph-2]
2020-02-05 15:16:35,618 p=734 u=mistral |  changed: [compute-0]
2020-02-05 15:16:35,764 p=734 u=mistral |  changed: [controller-0]


- A 10-node scaling test. In this case the Update /etc/hosts task from /var/lib/mistral/overcloud/ansible.log took 16.152s.

2020-02-05 21:48:26,382 p=19469 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 21:48:26,382 p=19469 u=mistral |  Wednesday 05 February 2020  21:48:26 +0000 (0:00:01.596)       0:05:48.368 ****
2020-02-05 21:48:36,027 p=19469 u=mistral |  changed: [compute-6]
2020-02-05 21:48:37,212 p=19469 u=mistral |  changed: [ceph-0]
2020-02-05 21:48:37,848 p=19469 u=mistral |  changed: [compute-10]
2020-02-05 21:48:38,146 p=19469 u=mistral |  changed: [compute-0]
2020-02-05 21:48:38,291 p=19469 u=mistral |  changed: [ceph-2]
2020-02-05 21:48:38,589 p=19469 u=mistral |  changed: [compute-1]
2020-02-05 21:48:39,483 p=19469 u=mistral |  changed: [ceph-1]
2020-02-05 21:48:39,584 p=19469 u=mistral |  changed: [compute-11]
2020-02-05 21:48:39,829 p=19469 u=mistral |  changed: [compute-2]
2020-02-05 21:48:39,836 p=19469 u=mistral |  changed: [compute-3]
2020-02-05 21:48:40,439 p=19469 u=mistral |  changed: [compute-4]
2020-02-05 21:48:40,751 p=19469 u=mistral |  changed: [controller-1]
2020-02-05 21:48:41,010 p=19469 u=mistral |  changed: [compute-7]
2020-02-05 21:48:41,015 p=19469 u=mistral |  changed: [compute-8]
2020-02-05 21:48:41,060 p=19469 u=mistral |  changed: [compute-9]
2020-02-05 21:48:41,251 p=19469 u=mistral |  changed: [compute-5]
2020-02-05 21:48:41,767 p=19469 u=mistral |  changed: [controller-2]
2020-02-05 21:48:42,534 p=19469 u=mistral |  changed: [controller-0]

In both cases the update took much less than a minute, as specified in Comment 6.

Comment 13 errata-xmlrpc 2020-03-03 09:45:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0655