Bug 1425834 - Introspection fails when running on many nodes (35+) simultaneously.
Summary: Introspection fails when running on many nodes (35+) simultaneously.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z5
Target Release: 11.0 (Ocata)
Assignee: Toure Dunnon
QA Contact: Dan Yasny
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-02-22 14:41 UTC by Harald Jensås
Modified: 2020-12-14 08:14 UTC (History)
CC List: 8 users

Fixed In Version: openstack-tripleo-common-6.1.4-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-18 17:08:29 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 465518 | 0 | None | MERGED | Run introspection 20 nodes at a time | 2020-05-15 11:28:25 UTC
Red Hat Product Errata | RHBA-2018:1622 | 0 | None | None | None | 2018-05-18 17:09:49 UTC

Description Harald Jensås 2017-02-22 14:41:22 UTC
Description of problem:
When starting introspection on many nodes simultaneously, introspection fails.


I have a large number of nodes configured for the overcloud. Every time introspection is started by executing 'openstack overcloud node introspect --all-manageable --provide', it fails with a timeout, even while the overcloud nodes are still powered on. Nodes are still reporting after introspection has timed out.

Version-Release number of selected component (if applicable):


How reproducible:
100% of the time.

Steps to Reproduce:
1.  for node in $(openstack baremetal node list --fields uuid -f value) ; do openstack baremetal node manage $node ; done
2. openstack overcloud node introspect --all-manageable --provide

Actual results:
$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow. Execution ID: 1f37c026-aaec-428f-8f44-2dcd1de2e92b
Waiting for introspection to finish...
Exception introspecting nodes: {u'result': u"Failure caused by error in tasks: wait_for_introspection_to_finish\n\n  wait_for_introspection_to_finish [task_ex_id=87fbc331-a389-4bb7-b30f-5d8c8f779c60] -> Failed to run action [action_ex_id=b43293c2-e5a5-4f09-b21f-642643307013, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'wait_for_finish'}', params='{u'max_retries': 3600, u'retry_interval': 10, u'uuids': [u'a26be1e5-5604-42e3-869e-4c9f78263a8f', u'b3d2263e-4c65-4815-ab27-538309926cd0', u'9fb565d1-3290-4b08-9db4-a7e2b6e495b7', u'5da2d3af-24b1-4000-adf2-6fe5234bd253', u'1090e20c-9996-41ad-9d66-5bf0dd3f8cec', u'b559165f-1251-402c-8797-c3bbb00acadf', u'85ecb772-1de9-477e-b0f3-cd7a22e60e66',

. . .

u'218e3ff2-06d9-47cc-855d-8fc2d7dea338', u'b8eb680b-8816-4c4b-afc4-53f8ebd8adc9', u'd48095e5-72da-4eff-8fe5-9a6cbbeb400b', u'4b30011f-308a-4659-928b-2e485a99cf8c', u'00c9a249-23bb-4eaa-aded-b070167d24d3', u'b0e39485-f25b-45ec-9ac2-5de7c4bee2bb', u'0425cee6-2136-425a-a6e0-b61ea126f058', u'03c2680b-2cf4-42a9-8097-2355dba2e9d3']}']\n BaremetalIntrospectionAction.wait_for_finish failed: <class 'keystoneauth1.exceptions.connection.ConnectFailure'>: Connection failure that may be retried.\n"}


Expected results:
Introspection should finish without errors.

Additional info:
Workaround:
Start introspection node by node, adding a 5-second sleep between each start.

for node in $(openstack baremetal node list --fields uuid -f value); do openstack overcloud node introspect $node --provide > /tmp/$node.log & sleep 5; done
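
An alternative to staggering single nodes is to introspect in fixed-size batches, in the spirit of the linked upstream change titled "Run introspection 20 nodes at a time". The sketch below is not that change; it is only a client-side illustration. The names run_in_batches, introspect_batch and BATCH_SIZE are hypothetical, and it assumes that 'openstack overcloud node introspect' accepts several node UUIDs in one call, which should be verified on the installed version.

```shell
#!/bin/sh
# Sketch only: run a command over node UUIDs in fixed-size batches.
# All names here (BATCH_SIZE, introspect_batch, run_in_batches) are
# hypothetical helpers, not part of any OpenStack tool.

BATCH_SIZE="${BATCH_SIZE:-20}"

# Action for one batch. ASSUMPTION: the introspect command accepts
# more than one node UUID per invocation.
introspect_batch() {
    openstack overcloud node introspect "$@" --provide
}

# Read one UUID per line on stdin and invoke the given command once
# per batch, passing at most $1 UUIDs as trailing arguments.
run_in_batches() {
    size="$1"; shift
    batch=""; count=0
    while read -r uuid; do
        [ -n "$uuid" ] || continue
        batch="$batch $uuid"
        count=$((count + 1))
        if [ "$count" -ge "$size" ]; then
            # $batch is left unquoted on purpose: UUIDs contain no
            # whitespace, so word splitting yields one argument each.
            # stdin is redirected so the command cannot eat the list.
            "$@" $batch < /dev/null || return 1
            batch=""; count=0
        fi
    done
    # Flush the final, possibly smaller, batch.
    if [ "$count" -gt 0 ]; then
        "$@" $batch < /dev/null || return 1
    fi
}
```

With the functions defined, a run might look like: openstack baremetal node list --fields uuid -f value | run_in_batches "$BATCH_SIZE" introspect_batch. Each batch runs to completion before the next one starts, so at most BATCH_SIZE nodes are introspected at once, and the loop stops at the first failing batch.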

Comment 1 Justin Kilpatrick 2017-03-16 15:49:34 UTC
Reproduced in the scale lab; significantly more data is available here:

https://docs.google.com/document/d/1IOLuU7KOMsubLKvo0tam4eYd25PymxW2QlygDkKxLJY/edit#

Comment 2 Toure Dunnon 2017-08-30 13:28:57 UTC
This has landed upstream for Ocata.

Comment 3 Bob Fournier 2018-02-19 17:01:38 UTC
Changing the DFG assignment to Workflows. The fix was made in Workflows by that team, and they'll have a better handle on the fix.

We have been talking with Joe Talerico about how to test this in the scale lab with > 35 nodes.

Comment 7 errata-xmlrpc 2018-05-18 17:08:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1622

