Bug 1850244

Summary: Very long playbook execution time on compute and controller that have lots of interfaces
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: David Hill <dhill>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED NOTABUG
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact:
Priority: high
Version: 3.3
CC: aschoen, ceph-eng-bugs, gabrioux, gfidente, gmeno, gsitlani, johfulto, kmehta, lithomas, nchilaka, nthomas, vereddy, ykaul, yrabl
Target Milestone: z7
Flags: yrabl: automate_bug+
Target Release: 3.3
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-25 16:40:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1578730

Description David Hill 2020-06-23 19:33:29 UTC
Very long playbook execution time on compute and controller that have lots of interfaces

Across all of the clusters we were seeing timeouts during the following portion of stack updates:


2020-06-20 11:47:58Z [overcloud-fozzie-AllNodesDeploySteps-wkbflvledpui.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS  state changed
This is where ceph-ansible is run; the timeouts occurred because Ansible would run for ~18 hours or more. In some cases, /var/log/mistral/ceph-ansible.log showed the initial fact-gathering portion of the site-docker.yml playbook taking ~24 hours:

2020-06-19 03:30:20,793 p=18969 u=mistral |  TASK [gather and delegate facts] ***********************************************
2020-06-19 03:30:20,793 p=18969 u=mistral |  Friday 19 June 2020  03:30:20 +0000 (0:00:00.245)       0:00:00.314 *********** 
2020-06-19 05:48:06,716 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.13] => (item=10.10.10.13)
2020-06-19 11:18:54,263 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.9] => (item=10.10.10.9)
2020-06-19 16:15:47,032 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.8] => (item=10.10.10.8)
2020-06-19 16:15:56,109 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.16] => (item=10.10.10.16)
2020-06-19 16:15:56,171 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.15] => (item=10.10.10.15)
2020-06-19 16:15:56,235 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.14] => (item=10.10.10.14)
2020-06-19 16:15:56,606 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.7] => (item=10.10.10.7)
2020-06-19 19:48:01,037 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.12] => (item=10.10.10.12)
2020-06-19 19:48:10,701 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.5] => (item=10.10.10.5)
2020-06-20 01:44:17,531 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.10] => (item=10.10.10.10)
2020-06-20 02:49:53,998 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.11] => (item=10.10.10.11)
2020-06-20 02:50:00,836 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.6] => (item=10.10.10.6)
Besides the obvious impact on overall stack deployment time, the length of the ceph-ansible run during this TripleO deployment step also seemed to cause the errors below when fetching files from Swift on the undercloud:


        [action_ex_id=a5b25e5d-c6d7-4c3c-93b6-ae666f48ba11, idx=0]: {u'msg': u'Error attempting an operation on container: Container GET failed: http://10.10.10.3:8080/v1/AUTH_463490ef385049a5a03c36d5a67e85fc/overcloud-fozzie_ceph_ansible_fetch_dir?format=json 401 Unauthorized  [first 60 chars of response] <html><h1>Unauthorized</h1><p>This server could not verify t'}

We were able to speed up the run of ceph-ansible significantly by making a change to the playbook.


After some troubleshooting, it seems that, because of the number of interfaces on our computes and controllers, Ansible's gathering of networking/hardware facts for each overcloud node is very slow. Reading through the playbook, the only fact it actually uses is ansible_hostname. Changing the task to gather only the minimal subset of facts using !all [1] significantly speeds up that task and does not otherwise affect the playbook.
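For illustration, the change was along these lines (a sketch only; the exact layout of the "gather and delegate facts" task in site-docker.yml may differ by version, but gather_subset is the documented setup module option for skipping network/hardware fact collection):

    - name: gather and delegate facts
      setup:
        gather_subset:
          - '!all'          # collect only the minimal facts (ansible_hostname is still included)
      delegate_to: "{{ item }}"
      delegate_facts: true
      run_once: true
      with_items: "{{ groups['all'] }}"

With !all, the setup module still returns the minimal fact set (hostname, distribution, and similar), which is all this playbook consumes, while skipping the slow per-interface network and hardware enumeration.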


After making this change, ceph-ansible runs on all of our clusters completed in ~1.5 hours, and the specific task above took only seconds:


2020-06-20 21:31:52,230 p=26304 u=mistral |  TASK [gather and delegate facts] ***********************************************
2020-06-20 21:31:52,230 p=26304 u=mistral |  Saturday 20 June 2020  21:31:52 +0000 (0:00:00.283)       0:00:00.516 ********* 
2020-06-20 21:31:53,294 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.13] => (item=10.10.10.13)
2020-06-20 21:31:54,183 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.9] => (item=10.10.10.9)
2020-06-20 21:31:54,913 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.8] => (item=10.10.10.8)
2020-06-20 21:31:55,353 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.16] => (item=10.10.10.16)
2020-06-20 21:31:55,797 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.15] => (item=10.10.10.15)
2020-06-20 21:31:56,239 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.14] => (item=10.10.10.14)
2020-06-20 21:31:56,760 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.7] => (item=10.10.10.7)
2020-06-20 21:31:57,437 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.12] => (item=10.10.10.12)
2020-06-20 21:31:58,008 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.5] => (item=10.10.10.5)
2020-06-20 21:31:58,748 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.10] => (item=10.10.10.10)
2020-06-20 21:31:59,421 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.11] => (item=10.10.10.11)
2020-06-20 21:31:59,938 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.6] => (item=10.10.10.6)
2020-06-20 21:32:00,026 p=26304 u=mistral |  TASK [check if it is atomic host] **********************************************

Comment 7 Veera Raghava Reddy 2020-10-07 11:22:25 UTC
Hi David Hill,
From customer case 02745633, it looks like fixing a hardware issue resolved the reported problem.
Is this still a valid BZ?

Comment 8 David Hill 2020-10-07 13:03:15 UTC
If it's already fixed in the code (I'm pretty sure it is in later releases), then we can close this BZ as long as the case is solved.

Comment 9 Veera Raghava Reddy 2020-10-09 11:42:53 UTC
Hi Guillaume,
For this BZ, is there any fix to verify? From the customer case, this seems to be a hardware issue.
Please advise.