Description of problem:
Failure to introspect with OSP13:

Introspection completed.
Started Mistral Workflow tripleo.baremetal.v1.provide_manageable_nodes. Execution ID: 88f1a771-7d65-473d-9572-2dc5d31538e3
[{},
 {u'result': u"Failed to run action [action_ex_id=6ff49be2-89b3-496b-9abd-7a867ac32f3e, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'68caf961-0502-4c6f-9184-64ae101abb70'}']\n IronicAction.node.set_provision_state failed: Node 68caf961-0502-4c6f-9184-64ae101abb70 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)"},
 {u'result': u"Failed to run action [action_ex_id=a8950e5d-161f-4745-aa0d-86f2108fe6a1, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'6d291dbe-e0e7-4bec-a2d0-491bd77b91b6'}']\n IronicAction.node.set_provision_state failed: Node 6d291dbe-e0e7-4bec-a2d0-491bd77b91b6 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)"},
 {u'result': u"Failed to run action [action_ex_id=2bb4096f-ae4e-49c5-ae5e-ffd89c2103a5, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'b62cdd39-3ca8-4339-925b-bc32a2aaeb7a'}']\n IronicAction.node.set_provision_state failed: Node b62cdd39-3ca8-4339-925b-bc32a2aaeb7a is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)"},
 {u'result': u"Failed to run action [action_ex_id=6395e5a0-66f9-4097-bda6-b7e837aa7772, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'45ecc552-15e3-4891-88e8-1d7e5bf3277c'}']\n IronicAction.node.set_provision_state failed: Node 45ecc552-15e3-4891-88e8-1d7e5bf3277c is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)"},
 {u'result': u"Failed to run action [action_ex_id=a420ff5e-3115-4b74-8283-f4917503c1fa, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'34e2d4a7-7811-4b8d-9a92-3bfb1d8970ec'}']\n IronicAction.node.set_provision_state failed: Node 34e2d4a7-7811-4b8d-9a92-3bfb1d8970ec is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)"},
 {u'result': u"Failed to run action [action_ex_id=eea2af23-ee95-48df-90a5-3aeb56d3a704, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'232396e9-0e84-46c0-90db-91fc9c3ea130'}']\n IronicAction.node.set_provision_state failed: Node 232396e9-0e84-46c0-90db-91fc9c3ea130 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 40

And in the mistral logs we see:

Conflict: Node 45ecc552-15e3-4891-88e8-1d7e5bf3277c is locked by host localhost.localdomain, please retry after the current operation is completed.
(HTTP 409) : Conflict: Node 45ecc552-15e3-4891-88e8-1d7e5bf3277c is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor [req-7ee0d872-637f-459d-b4e9-aa20bf14c594 7380fe039807448186f3df84d4b0649e 6baa1c5efb7247518accff3736b331ac - default default] Failed to run action [action_ex_id=6395e5a0-66f9-4097-bda6-b7e837aa7772, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'45ecc552-15e3-4891-88e8-1d7e5bf3277c'}'] IronicAction.node.set_provision_state failed: Node 45ecc552-15e3-4891-88e8-1d7e5bf3277c is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409): ActionException: IronicAction.node.set_provision_state failed: Node 45ecc552-15e3-4891-88e8-1d7e5bf3277c is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor Traceback (most recent call last):
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 114, in run_action
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor     result = action.run(action_ctx)
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 130, in run
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor     (self.__class__.__name__, self.client_method_name, str(e))
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor ActionException: IronicAction.node.set_provision_state failed: Node 45ecc552-15e3-4891-88e8-1d7e5bf3277c is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-01-21 11:50:18.671 9447 ERROR mistral.executors.default_executor

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Deploy an undercloud on RHEL 7.5
2. Try introspecting 7 nodes

Actual results:
Failure to set the provision_state at the very end, which breaks the introspection process.

Expected results:

Additional info:
So the node is locked. Is this happening on all the nodes? Is there an issue with the BMCs responding? Can we get the Ironic logs?
Dave - do you have more info on this? It's normal for the nodes to be locked for a short period of time. We don't really have anything to go on.
The case seems to be closed; please reopen it if you have the requested logs.
Hi David, I wonder why this issue was closed. It happens in roughly 1 of 15 deployments. Is there a workaround?

Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded
2020-04-01 09:36:47,227 - CbisDeployment - ERROR - e is: error occurred during command: openstack overcloud node provide --all-manageable error: Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded
It was closed because the customer reporting this issue closed the case. Are you using the latest bits? Did you open a support case?
Yes, the latest. Not opening a ticket, just wondering whether a workaround exists.
I'd suggest you read comment #2 and open a support case if you hit this issue.
The supplied logs show that the actions to set BMC state took 80-300+ seconds, with some requests backing up: requests to change and manipulate the node appear to be issued faster than its configuration can actually be changed. This seems to be because configuration jobs have to be executed to change the running state of the BMC configuration, e.g. changing the boot device when power cycling, or switching between UEFI and BIOS mode. Unfortunately it is hard for me to classify this as a bug in Ironic; Ironic seems to be doing its work as fast as possible. Which realistically leaves two options:

1) tripleo-common needs to instantiate Ironic's client with more retries. Currently this is 12 retries every 5 seconds, i.e. 60 seconds in total (a rough sketch of this follows below).

2) Or tripleo-common's code could poll and see if the node is locked for manipulation before issuing actions.
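To make option 1 concrete, here is a minimal sketch of constructing python-ironicclient with a larger retry budget. This is illustrative only, not the actual tripleo-common code; the Keystone URL, credentials, and retry values are assumptions.

# Illustrative sketch only (not the tripleo-common implementation): build an
# ironic client whose HTTP layer retries HTTP 409 "node locked" responses for
# longer than the 12 retries / 5 second interval cited above.
from keystoneauth1 import loading, session
from ironicclient import client as ironic_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://192.168.24.1:5000/v3',   # assumed undercloud Keystone URL
    username='admin',                          # placeholder credentials
    password='secret',
    project_name='admin',
    user_domain_name='Default',
    project_domain_name='Default',
)
sess = session.Session(auth=auth)

# 60 retries at 10 second intervals gives roughly 10 minutes of tolerance for
# a held conductor lock, instead of the ~60 seconds mentioned above.
ironic = ironic_client.get_client(1, session=sess,
                                  max_retries=60, retry_interval=10)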
One additional item: often these delays are due to power-on self-test memory checks while the configuration job is being applied, as the machine boots into the BIOS. Ensure that memory testing is disabled in the BIOS firmware settings; that should speed up the configuration times and may actually make this issue go away.
The introspection timeout is 20 minutes, which I assume is enough even for slow servers. The problem seems to be that after a successful introspection, the provide fails because the node is still locked. It is not clear to me whether the node stays locked indefinitely. If the introspect finishes (failure or success) and the node remains locked, then that is something we need to investigate.

If the lock is *eventually* released, I would suggest the following as a workaround. Stop specifying --provide as an argument and run the provide command separately, so:

openstack overcloud node introspect --all-manageable
openstack overcloud node provide --all-manageable

or

openstack overcloud node introspect <node1> <node2> ...
openstack overcloud node provide <node1> <node2> ...

If the provide command fails with the locked node error, wait, then try again (a sketch of this wait-then-retry approach follows this comment).

We need to know how long this lock lingers after introspect completes. Is it seconds, triggering a race when doing introspect --provide? Is it minutes? Does it remain on that node forever? (setting NEEDINFO for this information)
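If the lock does clear eventually, the wait-then-retry workaround above can also be expressed programmatically by polling the node's reservation field before requesting 'provide' (essentially option 2 suggested earlier in this bug). This is only a sketch under that assumption; `ironic` is taken to be an already-constructed python-ironicclient v1 client, and the timeout and poll interval are arbitrary.

import time

def provide_when_unlocked(ironic, node_uuid, timeout=600, poll=10):
    """Wait for the conductor lock (the node's 'reservation') to clear,
    then request the 'provide' provision state.

    Sketch only: `ironic` is assumed to be a python-ironicclient v1 client;
    the timeout/poll values are arbitrary.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        node = ironic.node.get(node_uuid, fields=['uuid', 'reservation'])
        if not node.reservation:
            # No conductor holds the lock any more, so the provide call
            # should not hit the HTTP 409 seen in the logs above.
            ironic.node.set_provision_state(node_uuid, 'provide')
            return
        time.sleep(poll)  # still locked, e.g. by undercloud.localdomain
    raise RuntimeError('Node %s stayed locked for %s seconds'
                       % (node_uuid, timeout))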
Hi, regarding running the command separately:

openstack overcloud node provide <node1> <node2> ...

We usually see 1 or 2 blades stuck with this lock, and the command:

openstack overcloud node provide --all-manageable

will wait forever for those blades. If, while it is hanging, we identify those blades and run a separate command (while openstack overcloud node provide --all-manageable is still executing), e.g. assuming node-1 was the locked one:

openstack overcloud node provide <node1>

then, surprisingly, the node changes to available, but the still-hanging command:

openstack overcloud node provide --all-manageable

does not go through until it hits the total timeout of the deployment.

Important note: running each node separately is something we want to avoid, as for large numbers of blades this is not practical. I also suspect that once the command openstack overcloud node provide --all-manageable identifies a lock, it is not releasing something internally and waits forever.

regards
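For what it's worth, identifying which blades currently hold a conductor lock can be scripted rather than done by eye. A minimal sketch, assuming (as in the sketch above) an already-constructed python-ironicclient v1 client named `ironic`:

def locked_nodes(ironic):
    """Return the nodes whose conductor lock (reservation) is still held.

    Sketch only: `ironic` is assumed to be a python-ironicclient v1 client.
    """
    nodes = ironic.node.list(fields=['uuid', 'name', 'reservation',
                                     'provision_state'])
    return [n for n in nodes if n.reservation]

# Example usage: print the stuck blades so they can be provided individually.
for node in locked_nodes(ironic):
    print('%s (%s) locked by %s' % (node.uuid, node.name, node.reservation))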
Interesting, I'll look into why provide --all-manageable hits lock issues when provide <node> does not
Just updating this: I've proposed a potential fix upstream [1], which can be backported if it proves to help.

[1] https://review.opendev.org/#/c/745991/
I haven't seen this issue in a while now ... if I see it again, I'll update this BZ. The attached case is closed, so we can't ask the customer to test it for us either.
Attached is an upstream stable/queens fix which should help with this issue. OSP-17 will have a similar fix, but in Ansible instead of Mistral.
This bug is to track the OSP-13 backport of bug #1848560
The fix for this is ready for rhpkg push when this bug gets its +1 flags.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5575