Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1425834

Summary: Introspection fails when running on many nodes (35+) simultaneously.
Product: Red Hat OpenStack Reporter: Harald Jensås <hjensas>
Component: openstack-tripleo-common Assignee: Toure Dunnon <tdunnon>
Status: CLOSED ERRATA QA Contact: Dan Yasny <dyasny>
Severity: medium Docs Contact:
Priority: medium    
Version: 10.0 (Newton) CC: aschultz, bfournie, dbecker, jjoyce, mburns, morazi, rhel-osp-director-maint, slinaber
Target Milestone: z5 Keywords: TestOnly, Triaged, ZStream
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-6.1.4-1.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-18 17:08:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Harald Jensås 2017-02-22 14:41:22 UTC
Description of problem:
When starting introspection on many nodes simultaneously, introspection fails.


I have a high number of nodes registered for the overcloud. Every time introspection is started by executing 'openstack overcloud node introspect --all-manageable --provide', it fails with a timeout, even while the overcloud nodes are still powered on. Nodes are still reporting after introspection has timed out.

Version-Release number of selected component (if applicable):


How reproducible:
100% of the time.

Steps to Reproduce:
1.  for node in $(openstack baremetal node list --fields uuid -f value) ; do openstack baremetal node manage $node ; done
2. openstack overcloud node introspect --all-manageable --provide

Actual results:
$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow. Execution ID: 1f37c026-aaec-428f-8f44-2dcd1de2e92b
Waiting for introspection to finish...
Exception introspecting nodes: {u'result': u"Failure caused by error in tasks: wait_for_introspection_to_finish\n\n  wait_for_introspection_to_finish [task_ex_id=87fbc331-a389-4bb7-b30f-5d8c8f779c60] -> Failed to run action [action_ex_id=b43293c2-e5a5-4f09-b21f-642643307013, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'wait_for_finish'}', params='{u'max_retries': 3600, u'retry_interval': 10, u'uuids': [u'a26be1e5-5604-42e3-869e-4c9f78263a8f', u'b3d2263e-4c65-4815-ab27-538309926cd0', u'9fb565d1-3290-4b08-9db4-a7e2b6e495b7', u'5da2d3af-24b1-4000-adf2-6fe5234bd253', u'1090e20c-9996-41ad-9d66-5bf0dd3f8cec', u'b559165f-1251-402c-8797-c3bbb00acadf', u'85ecb772-1de9-477e-b0f3-cd7a22e60e66',

. . .

u'218e3ff2-06d9-47cc-855d-8fc2d7dea338', u'b8eb680b-8816-4c4b-afc4-53f8ebd8adc9', u'd48095e5-72da-4eff-8fe5-9a6cbbeb400b', u'4b30011f-308a-4659-928b-2e485a99cf8c', u'00c9a249-23bb-4eaa-aded-b070167d24d3', u'b0e39485-f25b-45ec-9ac2-5de7c4bee2bb', u'0425cee6-2136-425a-a6e0-b61ea126f058', u'03c2680b-2cf4-42a9-8097-2355dba2e9d3']}']\n BaremetalIntrospectionAction.wait_for_finish failed: <class 'keystoneauth1.exceptions.connection.ConnectFailure'>: Connection failure that may be retried.\n"}


Expected results:
Introspection should finish without errors.

Additional info:
Workaround:
Start introspection node by node, adding a 5-second sleep between each start.

for node in `openstack baremetal node list -f value | cut -d' ' -f1`; do openstack overcloud node introspect $node --provide > /tmp/$node.log & sleep 5; done
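The same throttling idea can be expressed as a small reusable function. This is only a sketch of the workaround above; the function name `throttled_introspect` and the `INTROSPECT_CMD` override are hypothetical, introduced here for illustration (in real use `INTROSPECT_CMD` would default to the actual `openstack overcloud node introspect` command):

```shell
# Hypothetical helper wrapping the workaround loop: start introspection on
# each node in the background, but pause between launches so the services
# are not hit by all nodes at once.
throttled_introspect() {
    delay="$1"; shift            # first argument: seconds to sleep between launches
    for node in "$@"; do         # remaining arguments: node UUIDs
        # Run one node's introspection in the background, logging per node.
        # INTROSPECT_CMD is left unquoted on purpose so a multi-word default
        # command splits into words.
        ${INTROSPECT_CMD:-openstack overcloud node introspect} "$node" --provide \
            > "/tmp/$node.log" 2>&1 &
        sleep "$delay"
    done
    wait   # block until all background introspections have finished
}
```

Called as `throttled_introspect 5 $(openstack baremetal node list -f value | cut -d' ' -f1)`, this behaves like the one-liner above but keeps the delay in one place.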

Comment 1 Justin Kilpatrick 2017-03-16 15:49:34 UTC
Reproduced in the scale lab, significantly more data available here. 

https://docs.google.com/document/d/1IOLuU7KOMsubLKvo0tam4eYd25PymxW2QlygDkKxLJY/edit#

Comment 2 Toure Dunnon 2017-08-30 13:28:57 UTC
This has landed upstream for Ocata.

Comment 3 Bob Fournier 2018-02-19 17:01:38 UTC
Changing the DFG assignment to Workflows. The fix was made in Workflows by that team, and they'll have a better handle on it.

We have been talking with Joe Talerico about how to test this in scale lab with > 35 nodes.

Comment 7 errata-xmlrpc 2018-05-18 17:08:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1622