Bug 1850244 - Very long playbook execution time on compute and controller that have lots of interfaces
Summary: Very long playbook execution time on compute and controller that have lots of interfaces
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.3
Hardware: x86_64
OS: All
Priority: high
Severity: high
Target Milestone: z7
Target Release: 3.3
Assignee: Guillaume Abrioux
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 1578730
 
Reported: 2020-06-23 19:33 UTC by David Hill
Modified: 2023-12-15 18:15 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-25 16:40:58 UTC
Embargoed:
yrabl: automate_bug+




Links:
Red Hat Issue Tracker RHCEPH-5988 (last updated 2023-01-20 10:07:07 UTC)
Red Hat Knowledge Base (Solution) 4848951 (last updated 2020-06-23 20:00:18 UTC)
Red Hat Knowledge Base (Solution) 5177921 (last updated 2020-06-23 19:45:30 UTC)

Description David Hill 2020-06-23 19:33:29 UTC
Very long playbook execution time on compute and controller that have lots of interfaces

Across all of the clusters we were seeing timeouts during the following portion of stack updates:


2020-06-20 11:47:58Z [overcloud-fozzie-AllNodesDeploySteps-wkbflvledpui.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS  state changed
This is where ceph-ansible is run, and the timeouts occurred because the Ansible run would last roughly 18 hours or more. In some cases, looking at /var/log/mistral/ceph-ansible.log, we saw the initial portion of the site-docker.yml playbook taking around 24 hours just to gather facts:

2020-06-19 03:30:20,793 p=18969 u=mistral |  TASK [gather and delegate facts] ***********************************************
2020-06-19 03:30:20,793 p=18969 u=mistral |  Friday 19 June 2020  03:30:20 +0000 (0:00:00.245)       0:00:00.314 *********** 
2020-06-19 05:48:06,716 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.13] => (item=10.10.10.13)
2020-06-19 11:18:54,263 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.9] => (item=10.10.10.9)
2020-06-19 16:15:47,032 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.8] => (item=10.10.10.8)
2020-06-19 16:15:56,109 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.16] => (item=10.10.10.16)
2020-06-19 16:15:56,171 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.15] => (item=10.10.10.15)
2020-06-19 16:15:56,235 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.14] => (item=10.10.10.14)
2020-06-19 16:15:56,606 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.7] => (item=10.10.10.7)
2020-06-19 19:48:01,037 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.12] => (item=10.10.10.12)
2020-06-19 19:48:10,701 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.5] => (item=10.10.10.5)
2020-06-20 01:44:17,531 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.10] => (item=10.10.10.10)
2020-06-20 02:49:53,998 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.11] => (item=10.10.10.11)
2020-06-20 02:50:00,836 p=18969 u=mistral |  ok: [10.10.10.13 -> 10.10.10.6] => (item=10.10.10.6)
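
For context, this "gather and delegate facts" task runs once and loops over the other overcloud nodes, gathering facts from each one via delegation. Below is a minimal sketch of what such a delegated fact-gathering task typically looks like; it is illustrative only, not the exact ceph-ansible source, and the groups['all'] host list is an assumption:

  - name: gather and delegate facts
    setup:                                # full fact gathering by default
    delegate_to: "{{ item }}"
    delegate_facts: true
    with_items: "{{ groups['all'] }}"     # illustrative; ceph-ansible builds its own host list
    run_once: true

With default (full) fact gathering, every iteration collects the complete hardware and network fact set from the delegated node, which is where the time goes on hosts with many interfaces.
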
Besides the obvious impact on how long stack deployments take, the length of time it took to complete the ceph-ansible run during this TripleO deployment step also seemed to cause the errors below when fetching files from Swift on the undercloud:


        [action_ex_id=a5b25e5d-c6d7-4c3c-93b6-ae666f48ba11, idx=0]: {u'msg': u'Error attempting an operation on container: Container GET failed: http://10.10.10.3:8080/v1/AUTH_463490ef385049a5a03c36d5a67e85fc/overcloud-fozzie_ceph_ansible_fetch_dir?format=json 401 Unauthorized  [first 60 chars of response] <html><h1>Unauthorized</h1><p>This server could not verify t'}

We were able to speed up the run of ceph-ansible significantly by making a change to the playbook.


After some troubleshooting, it appears that, due to the large number of interfaces on our computes and controllers, Ansible fact gathering of the networking/hardware details for each overcloud node is very slow. Reading through the playbook, the only fact it actually uses is ansible_hostname. Changing the task to gather only the minimal subset of facts using !all [1] significantly speeds up that task's runtime and does not otherwise affect the playbook.
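
As a rough illustration of the kind of change described (assuming the task is shaped like the sketch above; the exact ceph-ansible task may differ), gather_subset is a documented parameter of Ansible's setup module and '!all' restricts collection to the minimal fact set:

  - name: gather and delegate facts
    setup:
      gather_subset:
        - '!all'                          # skip hardware/network fact collection
    delegate_to: "{{ item }}"
    delegate_facts: true
    with_items: "{{ groups['all'] }}"     # illustrative host list, as above
    run_once: true

The minimal subset still includes ansible_hostname, which is the only fact the playbook relies on, so the rest of the run behaves the same.
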


After making this change, ceph-ansible runs on all of our clusters took ~1.5 hours to complete, and the specific fact-gathering step above took only seconds:


2020-06-20 21:31:52,230 p=26304 u=mistral |  TASK [gather and delegate facts] ***********************************************
2020-06-20 21:31:52,230 p=26304 u=mistral |  Saturday 20 June 2020  21:31:52 +0000 (0:00:00.283)       0:00:00.516 ********* 
2020-06-20 21:31:53,294 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.13] => (item=10.10.10.13)
2020-06-20 21:31:54,183 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.9] => (item=10.10.10.9)
2020-06-20 21:31:54,913 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.8] => (item=10.10.10.8)
2020-06-20 21:31:55,353 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.16] => (item=10.10.10.16)
2020-06-20 21:31:55,797 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.15] => (item=10.10.10.15)
2020-06-20 21:31:56,239 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.14] => (item=10.10.10.14)
2020-06-20 21:31:56,760 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.7] => (item=10.10.10.7)
2020-06-20 21:31:57,437 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.12] => (item=10.10.10.12)
2020-06-20 21:31:58,008 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.5] => (item=10.10.10.5)
2020-06-20 21:31:58,748 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.10] => (item=10.10.10.10)
2020-06-20 21:31:59,421 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.11] => (item=10.10.10.11)
2020-06-20 21:31:59,938 p=26304 u=mistral |  ok: [10.10.10.13 -> 10.10.10.6] => (item=10.10.10.6)
2020-06-20 21:32:00,026 p=26304 u=mistral |  TASK [check if it is atomic host] **********************************************

Comment 7 Veera Raghava Reddy 2020-10-07 11:22:25 UTC
Hi David Hill,
From customer case 02745633, it looks like fixing a hardware issue resolved the reported problem.
Is this BZ still valid?

Comment 8 David Hill 2020-10-07 13:03:15 UTC
If it's already fixed in the code (I'm pretty sure it is in later releases), then we can close this BZ once the case is solved.

Comment 9 Veera Raghava Reddy 2020-10-09 11:42:53 UTC
Hi Guillaume,
For this BZ, is there any fix to verify? From the customer case, this seems to be a hardware issue.
Please suggest.

