Bug 1668028
Summary: | Failure to introspect with OSP13 | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | David Hill <dhill> |
Component: | openstack-tripleo-common | Assignee: | Steve Baker <sbaker> |
Status: | CLOSED ERRATA | QA Contact: | Alistair Tonner <atonner> |
Severity: | high | Docs Contact: | |
Priority: | low | | |
Version: | 13.0 (Queens) | CC: | bfournie, dhill, jjoyce, jkreger, jschluet, kobi.ginon, mburns, mgarciac, mzheng, pkesavar, pweeks, sbaker, slinaber, tvignaud, ukalifon |
Target Milestone: | z14 | Keywords: | Reopened, Triaged, ZStream |
Target Release: | 13.0 (Queens) | Flags: | pweeks: needinfo+ |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | openstack-tripleo-common-8.7.1-25.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-12-16 13:55:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description (David Hill, 2019-01-21 17:55:13 UTC)
So the node is locked. Is this happening on all the nodes? Is there an issue with the BMCs responding? Can we get Ironic logs?

Dave, do you have more info on this? It's normal for the nodes to be locked for a short period of time. We don't really have anything to go on.

The case seems closed; please reopen if you have the requested logs.

Hi David, I wonder why this issue was closed. It happens in about 1 of 15 deployments. Is there a workaround?

Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes=' {u'client_method_name': u'node.set_provision_state'} ', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded
2020-04-01 09:36:47,227 - CbisDeployment - ERROR - e is: error occurred during command: openstack overcloud node provide --all-manageable
error: Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes=' {u'client_method_name': u'node.set_provision_state'} ', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded
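The "locked by host undercloud.localdomain" part of that 409 is Ironic's per-node reservation field, which names the conductor currently holding the lock. Purely as an illustration, a small python-ironicclient helper along these lines can watch the lock from the undercloud; the helper name, timeout and poll interval are made up here, and the client object is assumed to be already authenticated:

```python
import time

# Node UUID taken from the 409 message above.
NODE_UUID = '30925bac-bc22-4e74-bd40-de7e77b82db3'


def wait_for_unlock(ironic, node_uuid, timeout=600, interval=10):
    """Poll a node until its Ironic reservation (conductor lock) clears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        node = ironic.node.get(node_uuid)
        if not node.reservation:
            # Lock released; a retried provide should no longer hit HTTP 409.
            return node.provision_state
        # node.reservation holds the conductor hostname, e.g.
        # "undercloud.localdomain" as in the error above.
        time.sleep(interval)
    raise RuntimeError('Node %s is still locked after %s seconds'
                       % (node_uuid, timeout))
```

Once the reservation comes back empty, retrying openstack overcloud node provide for that node should go through.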
It was closed because the customer reporting this issue closed the case. Are you using the latest bits? Did you open a support case?

Yes, the latest. I'm not opening a ticket, just wondering whether a workaround exists.

I'd suggest you read comment #2 and open a support case if you hit this issue.

The supplied logs show the actions to set BMC state took 80-300+ seconds, with some requests even backing up, because it seems requests are being sent to make changes and manipulate the node faster than its configuration can actually be applied. This appears to be due to configuration jobs having to be executed to change the running state of the BMC configuration, i.e. "change the boot device when power cycling" or "boot in UEFI mode or BIOS mode". Unfortunately it is hard for me to classify this as a bug in Ironic; Ironic seems to be doing the needful as fast as possible. Which realistically leaves two options (see the sketch after this comment):

1) tripleo-common needs to instantiate Ironic's client with more retries. Currently this is 12 retries every 5 seconds, or 60 seconds in total.
2) Or tripleo-common's code could poll and see whether the node is locked before issuing actions.

One additional item: often these delays are due to power-on self-test memory checks while the configuration job is being applied, as the machine boots into the BIOS. Ensure that memory testing is disabled in the BIOS firmware settings, which should speed up the configuration times and may actually make this issue go away.
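A rough sketch of what option 1 could look like, for illustration only: max_retries and retry_interval are the python-ironicclient keyword arguments that control its retry-on-HTTP-409 behaviour, but the values, the helper names and the way authentication is passed in are assumptions, not the actual tripleo-common code:

```python
from ironicclient import client as ironic_client


def get_patient_baremetal_client(**auth_kwargs):
    """Build an Ironic client with a longer 409-retry window.

    auth_kwargs stands in for whatever endpoint/token/session arguments
    the caller already uses; only the retry settings are the point here.
    """
    return ironic_client.get_client(
        1,
        max_retries=60,     # reported current value: 12
        retry_interval=10,  # reported current value: 5 seconds
        **auth_kwargs)


def provide(ironic, node_uuid):
    # With the longer retry window the client keeps retrying internally
    # while the node is still locked, instead of surfacing HTTP 409 to
    # the Mistral action after roughly 60 seconds.
    ironic.node.set_provision_state(node_uuid, 'provide')
```

Option 2 would amount to calling something like the wait_for_unlock() helper sketched earlier before each set_provision_state call.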
The introspection timeout is 20 minutes, which I assume is enough even for slow servers. The problem seems to be that after a successful introspection the provide fails because the node is still locked. It is not clear to me whether the node stays locked indefinitely. If the introspection finishes (failure or success) and the node remains locked, then that is something we need to investigate.

If the lock is *eventually* released, I would suggest the following as a workaround. Stop specifying --provide as an argument and run the provide command separately, so

openstack overcloud node introspect --all-manageable
openstack overcloud node provide --all-manageable

or

openstack overcloud node introspect <node1> <node2> ...
openstack overcloud node provide <node1> <node2> ...

If the provide command fails with the locked-node error, wait, then try again.

We need to know how long this lock lingers after introspection completes. Is it seconds, triggering a race when doing introspect --provide? Is it minutes? Does it remain on that node forever? (Setting NEEDINFO for this information.)

Hi, regarding running the command separately (openstack overcloud node provide <node1> <node2> ...): we do see that usually 1 or 2 blades are stuck with this lock, and the command openstack overcloud node provide --all-manageable will wait forever for those blades. If, while it is hanging, we identify those blades and run a separate command (while openstack overcloud node provide --all-manageable is still executing), for example openstack overcloud node provide <node1> (assuming node1 was the locked one), then surprisingly the node changes to available, but the still-hanging command openstack overcloud node provide --all-manageable does not go through until it hits the total timeout of the deployment.

Important note: running each node separately is something we want to avoid, since for a large number of blades it is not practical. I also suspect that once the command openstack overcloud node provide --all-manageable identifies a lock, it does not release something internally and waits forever. Regards.

Interesting, I'll look into why provide --all-manageable hits lock issues when provide <node> does not.

Just updating this: I've proposed a potential fix upstream [1], which can be backported if it is shown to help.

[1] https://review.opendev.org/#/c/745991/

I haven't seen this issue in a while now. If I see it again, I'll update this BZ. The attached case is closed, so we can't ask the customer to test it for us either.

Attached is an upstream stable/queens fix which should help with this issue. OSP-17 will have a similar fix, but in Ansible instead of Mistral. This bug is to track the OSP-13 backport of bug #1848560.

The fix for this is ready for rhpkg push when this bug gets its +1 flags.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5575