Bug 1581450

Summary: introspection results in a node locked , and stuck in manageable state
Product: Red Hat OpenStack Reporter: Jeremy <jmelvin>
Component: openstack-ironicAssignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG QA Contact: mlammon
Severity: medium Docs Contact:
Priority: medium    
Version: 12.0 (Pike)CC: bfournie, dtantsur, jmelvin, mburns, srevivo
Target Milestone: ---Keywords: Unconfirmed
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 18:45:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 3 Bob Fournier 2018-05-29 15:21:30 UTC
As per the comments in the case, yes it looks like introspection was successful and the node was powered off:

ironic-inspector.log:2018-05-22 12:43:29.602 24744 DEBUG ironic_inspector.process [-] Forcing power off of node 885792b1-2875-4f5a-a378-af98ab9ed377 _finish /usr/lib/python2.7/site-packages/ironic_inspector/process.py:288
ironic-inspector.log:2018-05-22 12:43:29.691 24744 INFO ironic_inspector.process [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377 MAC 24:6e:96:4a:d8:c8 BMC 172.16.232.45] Node powered-off
ironic-inspector.log:2018-05-22 12:43:29.691 24744 DEBUG ironic_inspector.node_cache [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377] Successfully released lock release_lock /usr/lib/python2.7/site-packages/ironic_inspector/node_cache.py:135
ironic-inspector.log:2018-05-22 12:43:29.691 24744 DEBUG ironic_inspector.node_cache [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377] Committing fields: {'finished_at': datetime.datetime(2018, 5, 22, 16, 43, 29, 691715), 'error': None} _commit /usr/lib/python2.7/site-packages/ironic_inspector/node_cache.py:167
ironic-inspector.log:2018-05-22 12:43:29.703 24744 INFO ironic_inspector.process [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377 MAC 24:6e:96:4a:d8:c8 BMC 172.16.232.45] Introspection finished successfully


Ironic conductor grabs the lock for the node here and releases it at 12:44:27...

ironic-conductor.log:2018-05-22 12:44:00.594 21575 DEBUG ironic.conductor.task_manager [req-b9690023-77a0-42b8-b9c9-d683694ccf11 5cd08fbf7ade4cfdbcb019e4414d7d66 218074fe49d24a52aab59a10d6b5f395 - default default] Attempting to get exclusive lock on node 885792b1-2875-4f5a-a378-af98ab9ed377 (for provision action provide) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:225
ironic-conductor.log:2018-05-22 12:44:27.677 21575 DEBUG ironic.conductor.task_manager [req-483613c3-ae65-424f-8db8-079aefe65352 - - - - -] Successfully released shared lock for power state sync on node 885792b1-2875-4f5a-a378-af98ab9ed377 (lock was held 657.01 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:352
ironic-conductor.log:2018-05-22 12:44:45.196 21575 DEBUG ironic.conductor.task_manager [req-cf4c09ba-7256-493a-9470-7e74dbe72d70 b701f5b232bf4397aba41b707e214137 2babf577ff6a4fd9931292494377932d - default default] Successfully released exclusive lock for changing node power state on node 885792b1-2875-4f5a-a378-af98ab9ed377 (lock was held 75.55 sec) release_resources /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:352


Following the successful introspection, mistral attempts to move node to provision state, but the node is locked by ironic when it attempts to grab it at 12:44:02, while it is still locked by ironic...
It looks like there is no retry mechanism for a locked node

engine.log:2018-05-22 12:44:02.709 25028 INFO workflow_trace [req-5a24be00-9549-4c58-96ab-fbea3ae9623a 5cd08fbf7ade4cfdbcb019e4414d7d66 218074fe49d24a52aab59a10d6b5f395 - default default] Task 'set_provision_state' (584e5259-823d-46e7-9191-ed6d15a47a2c) [RUNNING -> ERROR, msg=Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']
engine.log: IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)] (execution_id=979db328-2f04-40ca-b676-1083eda019c7)
engine.log:  set_provision_state [task_ex_id=584e5259-823d-46e7-9191-ed6d15a47a2c] -> Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']
engine.log: IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)


So mistral attempts to move the node to "Aborted" state:
ironic-inspector.log:2018-05-22 12:57:49.199 24744 DEBUG ironic_inspector.introspect [-] Aborting introspection for node 885792b1-2875-4f5a-a378-af98ab9ed377 abort /usr/lib/python2.7/site-packages/ironic_inspector/introspect.py:139

This state transition fails as the node has completed introspection:
ironic-inspector.log:2018-05-22 12:57:49.206 24744 ERROR ironic_inspector.node_cache [-] [node: 885792b1-2875-4f5a-a378-af98ab9ed377 state finished] Invalid event: Can not transition from state 'finished' on event 'abort' (no defined transition): NotFound: Can not transition from state 'finished' on event 'abort' (no defined transition)


Mistral returns a failure as it couldn't set the node to provision:

engine.log:2018-05-22 12:44:04.511 25028 DEBUG mistral.expressions.yaql_expression [req-5a24be00-9549-4c58-96ab-fbea3ae9623a 5cd08fbf7ade4cfdbcb019e4414d7d66 218074fe49d24a52aab59a10d6b5f395 - default default] Finished evaluation. [expression='<% $.get('message', '') %>', result: [{u'result': u"Failure caused by error in tasks: set_provision_state\n\n  set_provision_state [task_ex_id=584e5259-823d-46e7-9191-ed6d15a47a2c] -> Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']\n IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)\n    [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, idx=0]: Failed to run action [action_ex_id=a2dbe6f8-dba4-409d-bf05-3fba4168fd2c, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'885792b1-2875-4f5a-a378-af98ab9ed377'}']\n IronicAction.node.set_provision_state failed: Node 885792b1-2875-4f5a-a378-af98ab9ed377 is locked by host ucdr-a-001.ussc1.owfms.net, please retry after the current operation is completed. (HTTP 409)\n"}]] evaluate /usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py:128

Comment 4 Jeremy 2018-05-29 17:40:52 UTC
Hello,
Thanks for the in depth analysis! Do you think this would be a bug for mistral retry mechanism? 

Thanks

Comment 5 Dmitry Tantsur 2018-06-08 15:41:25 UTC
> lock was held 657.01 sec

This sounds like some big troubles with the BMC. While we can try looking into retries, if a BMC takes more than 10 minutes to return the power status, nothing will work reliably. Changing power state is also slightly too much (though acceptable):

> lock was held 75.55 sec

Have you tried updating all firmware on the server?

Comment 6 Bob Fournier 2018-06-27 18:45:17 UTC
Closing this as the 10 minute delay to return power status from the BMC is not retryable.  Please check H/W.

It appears that the case has also been closed.

Comment 7 Red Hat Bugzilla 2023-09-15 01:27:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days