Bug 1794012

Summary: Ansible consumes a large amount of CPU and RAM resources when running the update /etc/hosts task on a scale deployment
Product: Red Hat OpenStack Reporter: Sai Sindhur Malleni <smalleni>
Component: tripleo-ansible    Assignee: Luke Short <lshort>
Status: CLOSED ERRATA QA Contact: Sasha Smolyak <ssmolyak>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.0 (Train)    CC: aschultz, bdobreli, drosenfe, jschluet, jtanner
Target Milestone: ---    Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tripleo-ansible-0.4.2-0.20200207140442.b750574.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-03 09:45:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2020-01-22 13:56:04 UTC
Description of problem:
When scaling from 200 nodes to 250 nodes in OSP 16, we see the ansible-playbook process consume around 50 cores on the undercloud during the update /etc/hosts task, which runs for around 22 minutes.

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200113.n.0

How reproducible:
100% at scale

Steps to Reproduce:
1. Run a scale out on a large deployment
2. Monitor CPU usage of ansible

Actual results:
Ansible consumes around 50 cores on the undercloud

Expected results:
Ansible should consume far fewer resources for this task; consuming around 50 cores isn't acceptable.

Additional info:
https://snapshot.raintank.io/dashboard/snapshot/Xujs6L7FAsCM8Kpzc63khcOUi8RBRkbj?orgId=2

Comment 1 Sai Sindhur Malleni 2020-01-22 14:54:32 UTC
*** Bug 1794014 has been marked as a duplicate of this bug. ***

Comment 2 Sai Sindhur Malleni 2020-01-22 14:54:39 UTC
*** Bug 1794013 has been marked as a duplicate of this bug. ***

Comment 3 Sai Sindhur Malleni 2020-01-22 16:04:48 UTC
The link in the bug description is wrong. Here is the correct link to the CPU consumption data:
https://snapshot.raintank.io/dashboard/snapshot/g3Ije4s6TKilkNE063ykcojJsOqhSHkz?orgId=2

The spike to 5000% (50 cores, since per-process CPU is reported as a percentage of a single core) occurs during the update /etc/hosts task.

Comment 4 James 2020-01-28 16:42:16 UTC
To clarify, is the fork count for ansible-playbook set to 50?

Comment 5 Luke Short 2020-01-28 16:53:28 UTC
This issue was not related to the fork count (although it does default to 50, and we have increased it to 500 for scale testing). Ansible was processing a large Jinja template containing all of the hosts in the stack, and it was doing that processing on every single host. We actually only need to render that template once. I patched this upstream, and the fix is now available in a build for QE to test with.
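
For illustration, here is a minimal sketch (in Ansible YAML) of the render-once pattern described above. It is not the actual tripleo-ansible patch; the template name, fact name, and blockinfile marker below are hypothetical, and the real tripleo-hosts-entries role may structure this differently.

  # Hypothetical sketch: render the expensive hosts Jinja template a single
  # time instead of re-rendering it independently for every host in the play.
  - name: Render hosts entries once for the whole play
    run_once: true    # the lookup runs on the first host only
    ansible.builtin.set_fact:
      tripleo_hosts_entries: "{{ lookup('template', 'hosts-entries.j2') }}"

  - name: Update /etc/hosts on every node with the pre-rendered content
    become: true
    ansible.builtin.blockinfile:
      path: /etc/hosts
      create: true
      # Read the fact from the host that rendered it, so this works even if
      # the run_once fact is not propagated to every host in the play.
      block: "{{ hostvars[ansible_play_hosts_all[0]]['tripleo_hosts_entries'] }}"
      marker: "# {mark} TRIPLEO HOSTS ENTRIES (sketch)"

The point, per the comment above, is that the expensive Jinja rendering happens once on the controller rather than once per target host; how the rendered block is then written out to each node is a secondary detail.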

Comment 7 Luke Short 2020-01-28 17:14:54 UTC
*** Bug 1792425 has been marked as a duplicate of this bug. ***

Comment 8 David Rosenfeld 2020-02-05 23:36:04 UTC
Performed two tests:

- A 1-controller, 1-compute, 3-ceph test. In this case the Update /etc/hosts task from /var/lib/mistral/overcloud/ansible.log took 499ms:

2020-02-05 15:16:34,265 p=734 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 15:16:34,265 p=734 u=mistral |  Wednesday 05 February 2020  15:16:34 +0000 (0:00:00.469)       0:02:13.105 ****
2020-02-05 15:16:35,324 p=734 u=mistral |  changed: [ceph-0]
2020-02-05 15:16:35,383 p=734 u=mistral |  changed: [ceph-1]
2020-02-05 15:16:35,401 p=734 u=mistral |  changed: [ceph-2]
2020-02-05 15:16:35,618 p=734 u=mistral |  changed: [compute-0]
2020-02-05 15:16:35,764 p=734 u=mistral |  changed: [controller-0]


- A 10-node scaling test. In this case the Update /etc/hosts task from /var/lib/mistral/overcloud/ansible.log took 16.152s.

2020-02-05 21:48:26,382 p=19469 u=mistral |  TASK [tripleo-hosts-entries : Update /etc/hosts] *******************************
2020-02-05 21:48:26,382 p=19469 u=mistral |  Wednesday 05 February 2020  21:48:26 +0000 (0:00:01.596)       0:05:48.368 ****
2020-02-05 21:48:36,027 p=19469 u=mistral |  changed: [compute-6]
2020-02-05 21:48:37,212 p=19469 u=mistral |  changed: [ceph-0]
2020-02-05 21:48:37,848 p=19469 u=mistral |  changed: [compute-10]
2020-02-05 21:48:38,146 p=19469 u=mistral |  changed: [compute-0]
2020-02-05 21:48:38,291 p=19469 u=mistral |  changed: [ceph-2]
2020-02-05 21:48:38,589 p=19469 u=mistral |  changed: [compute-1]
2020-02-05 21:48:39,483 p=19469 u=mistral |  changed: [ceph-1]
2020-02-05 21:48:39,584 p=19469 u=mistral |  changed: [compute-11]
2020-02-05 21:48:39,829 p=19469 u=mistral |  changed: [compute-2]
2020-02-05 21:48:39,836 p=19469 u=mistral |  changed: [compute-3]
2020-02-05 21:48:40,439 p=19469 u=mistral |  changed: [compute-4]
2020-02-05 21:48:40,751 p=19469 u=mistral |  changed: [controller-1]
2020-02-05 21:48:41,010 p=19469 u=mistral |  changed: [compute-7]
2020-02-05 21:48:41,015 p=19469 u=mistral |  changed: [compute-8]
2020-02-05 21:48:41,060 p=19469 u=mistral |  changed: [compute-9]
2020-02-05 21:48:41,251 p=19469 u=mistral |  changed: [compute-5]
2020-02-05 21:48:41,767 p=19469 u=mistral |  changed: [controller-2]
2020-02-05 21:48:42,534 p=19469 u=mistral |  changed: [controller-0]

In both cases the update took much less than a minute, as specified in Comment 6.

Comment 13 errata-xmlrpc 2020-03-03 09:45:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0655