Description of problem:
When starting introspection on many nodes simultaneously, introspection fails. I have a large number of nodes configured for the overcloud, and every time introspection is started by executing 'openstack overcloud node introspect --all-manageable --provide' it fails with a timeout, even while the overcloud nodes are still on. Nodes are still reporting after introspection has timed out.

Version-Release number of selected component (if applicable):

How reproducible:
100% of the time.

Steps to Reproduce:
1. for node in $(openstack baremetal node list --fields uuid -f value) ; do openstack baremetal node manage $node ; done
2. openstack overcloud node introspect --all-manageable --provide

Actual results:
$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow. Execution ID: 1f37c026-aaec-428f-8f44-2dcd1de2e92b
Waiting for introspection to finish...
Exception introspecting nodes: {u'result': u"Failure caused by error in tasks: wait_for_introspection_to_finish\n\n wait_for_introspection_to_finish [task_ex_id=87fbc331-a389-4bb7-b30f-5d8c8f779c60] -> Failed to run action [action_ex_id=b43293c2-e5a5-4f09-b21f-642643307013, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'wait_for_finish'}', params='{u'max_retries': 3600, u'retry_interval': 10, u'uuids': [u'a26be1e5-5604-42e3-869e-4c9f78263a8f', u'b3d2263e-4c65-4815-ab27-538309926cd0', u'9fb565d1-3290-4b08-9db4-a7e2b6e495b7', u'5da2d3af-24b1-4000-adf2-6fe5234bd253', u'1090e20c-9996-41ad-9d66-5bf0dd3f8cec', u'b559165f-1251-402c-8797-c3bbb00acadf', u'85ecb772-1de9-477e-b0f3-cd7a22e60e66', . . .
u'218e3ff2-06d9-47cc-855d-8fc2d7dea338', u'b8eb680b-8816-4c4b-afc4-53f8ebd8adc9', u'd48095e5-72da-4eff-8fe5-9a6cbbeb400b', u'4b30011f-308a-4659-928b-2e485a99cf8c', u'00c9a249-23bb-4eaa-aded-b070167d24d3', u'b0e39485-f25b-45ec-9ac2-5de7c4bee2bb', u'0425cee6-2136-425a-a6e0-b61ea126f058', u'03c2680b-2cf4-42a9-8097-2355dba2e9d3']}']\n BaremetalIntrospectionAction.wait_for_finish failed: <class 'keystoneauth1.exceptions.connection.ConnectFailure'>: Connection failure that may be retried.\n"}

Expected results:
Introspection should finish without errors.

Additional info:
Workaround: Start introspection node by node, adding a 5-second sleep between each start:

for node in `openstack baremetal node list -f value | cut -d' ' -f1`; do openstack overcloud node introspect $node --provide > /tmp/$node.log & sleep 5; done
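The staggered-start workaround can also be wrapped in a small function so the delay and log location are adjustable. This is only a sketch of the same idea, not a fix for the underlying failure. The names stagger_introspection, DELAY and LOG_DIR are hypothetical and do not come from the report. The openstack invocation itself is the one used in the workaround.

```shell
#!/bin/sh
# Sketch of the staggered-start workaround from this report.
# DELAY and LOG_DIR are illustrative names (assumptions), with the
# defaults taken from the workaround: 5 seconds and /tmp.
DELAY="${DELAY:-5}"
LOG_DIR="${LOG_DIR:-/tmp}"

# stagger_introspection NODE...
# Starts introspection for each node in the background, pausing DELAY
# seconds between starts, then waits for all of them to finish.
stagger_introspection() {
    for node in "$@"; do
        # One log file per node, capturing both stdout and stderr.
        openstack overcloud node introspect "$node" --provide \
            > "$LOG_DIR/$node.log" 2>&1 &
        sleep "$DELAY"
    done
    # Block until every background introspection has exited.
    wait
}
```

It could be invoked with the node list from the reproduction steps, for example: stagger_introspection $(openstack baremetal node list --fields uuid -f value). Unlike the one-liner, it waits for the background jobs, so the per-node logs are complete when it returns.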
Reproduced in the scale lab; significantly more data is available here: https://docs.google.com/document/d/1IOLuU7KOMsubLKvo0tam4eYd25PymxW2QlygDkKxLJY/edit#
This has landed upstream for Ocata.
Changing the DFG assignment to Workflows. The fix was made in Workflows by that team, and they'll have a better handle on it. We have been talking with Joe Talerico about how to test this in the scale lab with > 35 nodes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1622