1425834 – Introspection fails when running on many nodes (35+) simultaneously.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-common
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	z5
Target Release:	11.0 (Ocata)
Assignee:	Toure Dunnon
QA Contact:	Dan Yasny
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-02-22 14:41 UTC by Harald Jensås
Modified:	2020-12-14 08:14 UTC
CC List:	8 users
Fixed In Version:	openstack-tripleo-common-6.1.4-1.el7ost
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-05-18 17:08:29 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	465518	0	None	MERGED	Run introspection 20 nodes at a time	2020-05-15 11:28:25 UTC
Red Hat Product Errata	RHBA-2018:1622	0	None	None	None	2018-05-18 17:09:49 UTC

Description Harald Jensås 2017-02-22 14:41:22 UTC

Description of problem:
When introspection is started on many nodes simultaneously, it fails.


I have a large number of overcloud nodes configured. Every time introspection is started by executing 'openstack overcloud node introspect --all-manageable --provide', it fails with a timeout, even though the overcloud nodes are still on. Nodes are still reporting after introspection has timed out.

Version-Release number of selected component (if applicable):


How reproducible:
100% of the time.

Steps to Reproduce:
1.  for node in $(openstack baremetal node list --fields uuid -f value) ; do openstack baremetal node manage $node ; done
2. openstack overcloud node introspect --all-manageable --provide

Actual results:
$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow. Execution ID: 1f37c026-aaec-428f-8f44-2dcd1de2e92b
Waiting for introspection to finish...
Exception introspecting nodes: {u'result': u"Failure caused by error in tasks: wait_for_introspection_to_finish\n\n  wait_for_introspection_to_finish [task_ex_id=87fbc331-a389-4bb7-b30f-5d8c8f779c60] -> Failed to run action [action_ex_id=b43293c2-e5a5-4f09-b21f-642643307013, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'wait_for_finish'}', params='{u'max_retries': 3600, u'retry_interval': 10, u'uuids': [u'a26be1e5-5604-42e3-869e-4c9f78263a8f', u'b3d2263e-4c65-4815-ab27-538309926cd0', u'9fb565d1-3290-4b08-9db4-a7e2b6e495b7', u'5da2d3af-24b1-4000-adf2-6fe5234bd253', u'1090e20c-9996-41ad-9d66-5bf0dd3f8cec', u'b559165f-1251-402c-8797-c3bbb00acadf', u'85ecb772-1de9-477e-b0f3-cd7a22e60e66',

. . .

u'218e3ff2-06d9-47cc-855d-8fc2d7dea338', u'b8eb680b-8816-4c4b-afc4-53f8ebd8adc9', u'd48095e5-72da-4eff-8fe5-9a6cbbeb400b', u'4b30011f-308a-4659-928b-2e485a99cf8c', u'00c9a249-23bb-4eaa-aded-b070167d24d3', u'b0e39485-f25b-45ec-9ac2-5de7c4bee2bb', u'0425cee6-2136-425a-a6e0-b61ea126f058', u'03c2680b-2cf4-42a9-8097-2355dba2e9d3']}']\n BaremetalIntrospectionAction.wait_for_finish failed: <class 'keystoneauth1.exceptions.connection.ConnectFailure'>: Connection failure that may be retried.\n"}


Expected results:
Introspection should finish without errors.

Additional info:
Workaround:
Start introspection node by node, sleeping 5 seconds between each start.

for node in $(openstack baremetal node list -f value | cut -d' ' -f1); do openstack overcloud node introspect $node --provide > /tmp/$node.log & sleep 5; done
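
The linked upstream change is titled "Run introspection 20 nodes at a time". The same idea can be approximated on the client side by batching the node UUIDs. The following is only a sketch and not the upstream workflow change: the function names batch_nodes and introspect_in_batches are hypothetical, and it assumes that 'openstack overcloud node introspect' accepts several node UUIDs in one invocation and that all listed nodes are already in the manageable state.

```shell
#!/usr/bin/env bash
# Sketch only: client-side batching, not the upstream workflow change.

# Read whitespace-separated node UUIDs from stdin and print them back
# grouped into lines of at most $1 UUIDs each (default 20).
batch_nodes() {
    local size="${1:-20}"
    xargs -n "$size"
}

# Introspect all listed nodes one batch at a time. Each batch must finish
# before the next one starts.
introspect_in_batches() {
    local size="${1:-20}"
    local batch
    openstack baremetal node list --fields uuid -f value |
        batch_nodes "$size" |
        while read -r batch; do
            # Skip the empty line xargs may emit when there are no nodes.
            [ -n "$batch" ] || continue
            # $batch is intentionally unquoted so that each UUID becomes
            # its own argument; stdin is redirected so the client cannot
            # consume the remaining batches.
            openstack overcloud node introspect $batch --provide < /dev/null
        done
}
```

Running 'introspect_in_batches 20' after step 1 of the reproduction steps would then start at most 20 introspections at once, instead of all nodes together as with --all-manageable.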

Comment 1 Justin Kilpatrick 2017-03-16 15:49:34 UTC

Reproduced in the scale lab. Significantly more data is available here:

https://docs.google.com/document/d/1IOLuU7KOMsubLKvo0tam4eYd25PymxW2QlygDkKxLJY/edit#

Comment 2 Toure Dunnon 2017-08-30 13:28:57 UTC

This has landed upstream for Ocata.

Comment 3 Bob Fournier 2018-02-19 17:01:38 UTC

Changing the DFG assignment to Workflows. The fix was made by the Workflows team, and they will have a better handle on it.

We have been talking with Joe Talerico about how to test this in the scale lab with more than 35 nodes.

Comment 7 errata-xmlrpc 2018-05-18 17:08:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1622
