Bug 1581450 - introspection results in a node locked , and stuck in manageable state
Summary: introspection results in a node locked , and stuck in manageable state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: ---
Assignee: RHOS Maint
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-05-22 19:17 UTC by Jeremy
Modified: 2023-09-15 01:27 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 18:45:17 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-9057 0 None None None 2022-08-09 10:43:03 UTC

Comment 3 Bob Fournier 2018-05-29 15:21:30 UTC
As per the comments in the case, yes it looks like introspection was successful and the node was powered off:

ironic-inspector.log:2018-05-22 12:43:29.602 24744 DEBUG ironic_inspector.process [-] Forcing power off of node 885792b1-2875-4f5a-a378-af98ab9ed377 _finish /usr/lib/python2.7/site-packages/ironic_inspector/process.py:288
ironic-inspector.log:2018-05-22 12:43:29.691 24744 INFO ironic_inspector.process [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377 MAC 24:6e:96:4a:d8:c8 BMC 172.16.232.45] Node powered-off
ironic-inspector.log:2018-05-22 12:43:29.691 24744 DEBUG ironic_inspector.node_cache [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377] Successfully released lock release_lock /usr/lib/python2.7/site-packages/ironic_inspector/node_cache.py:135
ironic-inspector.log:2018-05-22 12:43:29.691 24744 DEBUG ironic_inspector.node_cache [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377] Committing fields: {'finished_at': datetime.datetime(2018, 5, 22, 16, 43, 29, 691715), 'error': None} _commit /usr/lib/python2.7/site-packages/ironic_inspector/node_cache.py:167
ironic-inspector.log:2018-05-22 12:43:29.703 24744 INFO ironic_inspector.process [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377 MAC 24:6e:96:4a:d8:c8 BMC 172.16.232.45] Introspection finished successfully


Ironic conductor grabs the lock for the node here and releases it at 12:44:27...

ironic-conductor.log:2018-05-22 12:44:00.594 21575 DEBUG ironic.conductor.task_manager [req-b9690023-77a0-42b8-b9c9-d683694ccf11 5cd08fbf7ade4cfdbcb019e4414d7d66 218074fe49d24a52aab59a10d6b5f395 - default default] Attempting to get exclusive lock on node 885792b1-2875-4f5a-a378-af98ab9ed377 (for provision action provide) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:225
ironic-conductor.log:2018-05-22 12:44:27.677 21575 DEBUG ironic.conductor.task_manager [req-483613c3-ae65-424f-8db8-079aefe65352 - - - - -] Successfully released shared lock for power state sync on node 885792b1-2875-4f5a-a378-af98ab9ed377 (lock was held 657.01 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:352
ironic-conductor.log:2018-05-22 12:44:45.196 21575 DEBUG ironic.conductor.task_manager [req-cf4c09ba-7256-493a-9470-7e74dbe72d70 b701f5b232bf4397aba41b707e214137 2babf577ff6a4fd9931292494377932d - default default] Successfully released exclusive lock for changing node power state on node 885792b1-2875-4f5a-a378-af98ab9ed377 (lock was held 75.55 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:352


Following the successful introspection, mistral attempts to move node to provision state, but the node is locked by ironic when it attempts to grab it at 12:44:02, while it is still locked by ironic...
It looks like there is no retry mechanism for a locked node

engine.log:2018-05-22 12:44:02.709 25028 INFO workflow_trace [req-5a24be00-9549-4c58-96ab-fbea3ae9623a 5cd08fbf7ade4cfdbcb019e4414d7d66 218074fe49d24a52aab59a10d6b5f395 - default default] Task 'set_provision_state' (584e5259-823d-46e7-9191-ed6d15a47a2c) [RUNNING -> ERROR, msg=Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']
engine.log: IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)] (execution_id=979db328-2f04-40ca-b676-1083eda019c7)
engine.log:  set_provision_state [task_ex_id=584e5259-823d-46e7-9191-ed6d15a47a2c] -> Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']
engine.log: IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)


So mistral attempts to move the node to "Aborted" state:
ironic-inspector.log:2018-05-22 12:57:49.199 24744 DEBUG ironic_inspector.introspect [-] Aborting introspection for node 885792b1-2875-4f5a-a378-af98ab9ed377 abort /usr/lib/python2.7/site-packages/ironic_inspector/introspect.py:139

This state transition fails as the node has completed introspection:
ironic-inspector.log:2018-05-22 12:57:49.206 24744 ERROR ironic_inspector.node_cache [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377 state finished] Invalid event: Can not transition from state 'finished' on event 'abort' (no defined transition): NotFound: Can not transition from state 'finished' on event 'abort' (no defined transition)


Mistral returns a failure as it couldn't set the node to provision:

engine.log:2018-05-22 12:44:04.511 25028 DEBUG mistral.expressions.yaql_expression [req-5a24be00-9549-4c58-96ab-fbea3ae9623a 5cd08fbf7ade4cfdbcb019e4414d7d66 218074fe49d24a52aab59a10d6b5f395 - default default] Finished evaluation. [expression='<% $.get('message', '') %>', result: [{u'result': u"Failure caused by error in tasks: set_provision_state\n\n  set_provision_state [task_ex_id=584e5259-823d-46e7-9191-ed6d15a47a2c] -> Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']\n IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)\n    [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, idx=0]: Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']\n IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)\n"}]] evaluate /usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py:128

Comment 4 Jeremy 2018-05-29 17:40:52 UTC
Hello,
Thanks for the in depth analysis! Do you think this would be a bug for mistral retry mechanism? 

Thanks

Comment 5 Dmitry Tantsur 2018-06-08 15:41:25 UTC
> lock was held 657.01 sec

This sounds like some big troubles with the BMC. While we can try looking into retries, if a BMC takes more than 10 minutes to return the power status, nothing will work reliably. Changing power state is also slightly too much (though acceptable):

> lock was held 75.55 sec

Have you tried updating all firmware on the server?

Comment 6 Bob Fournier 2018-06-27 18:45:17 UTC
Closing this as the 10 minute delay to return power status from the BMC is not retryable.  Please check H/W.

It appears that the case has also been closed.

Comment 7 Red Hat Bugzilla 2023-09-15 01:27:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days


Note You need to log in before you can comment on or make changes to this bug.