Bug 1678868
Summary: | OSP14 - networking-ansible - Error contacting Ironic server: Node <uuid> can not be updated while a state transition is in progress. (HTTP 409) | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Chris Janiszewski <cjanisze> |
Component: | python-networking-ansible | Assignee: | Dan Radez <dradez> |
Status: | CLOSED DUPLICATE | QA Contact: | Arkady Shtempler <ashtempl> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 14.0 (Rocky) | CC: | cjanisze, jlibosva, michapma |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-21 08:59:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Chris Janiszewski
2019-02-19 18:50:04 UTC
We've encountered an issue with similar symptoms on a QE environment. The cause was that the Dell machines took a long time to boot after a restart: about 5 minutes from the point the node was powered on until it reached the boot loader phase. The workaround was to bump the api_max_retries setting for nova's ironic client. The original issue is described in bug 1647005. Chris, any chance you can change this value and re-test whether you still hit the issue?

I'm pasting the doc text from bug 1647005 here for your convenience:

> The nova-compute ironic driver tries to update a BM node while the node is being cleaned. Cleaning takes approximately five minutes, but nova-compute only attempts to update the node for approximately two minutes. After the timeout, nova-compute gives up and puts the nova instance into ERROR state. As a workaround, set the following configuration option for the nova-compute service:
>
> [ironic]
> api_max_retries = 180
>
> As a result, nova-compute keeps attempting to update the BM node for longer and eventually succeeds.

Adjusting the value of api_max_retries has worked around this issue:

[ironic]
api_max_retries = 180

The extra time that networking-ansible needs to set up the port in my environment must push things past the default retry budget. I wonder if we could inject this value into the templates as an ironic default. BM nodes take longer than VMs to delete, due to the cleaning steps and whatever needs to be done on the switch for multi-tenancy.
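For reference, a minimal sketch of where the workaround quoted above lands, assuming the usual nova.conf location on the compute host (path and comment wording are mine, not from the bug; restarting the nova-compute service after the change is the usual requirement):

```ini
# /etc/nova/nova.conf on the nova-compute (ironic) host -- workaround from bug 1647005
[ironic]
# Number of times nova-compute retries Ironic API calls that return
# 409 Conflict (e.g. "state transition is in progress") before giving up
# and marking the instance ERROR.
api_max_retries = 180
```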
(In reply to Chris Janiszewski from comment #2)

We should, and I'm working on it here: https://review.openstack.org/#/c/638119/

I'm closing this bug as a duplicate of bug 1647005, but thanks for your findings. I raised the priority of the original bug and started working on it right away.

*** This bug has been marked as a duplicate of bug 1647005 ***
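The timing interaction described in the comments can be sketched as simple arithmetic. Assuming the nova ironic driver defaults of api_max_retries = 60 and api_retry_interval = 2 seconds (an assumption based on documented defaults; verify against your release), the default retry budget is about two minutes, which matches the doc text and loses the race against a roughly five-minute cleaning cycle, while 180 retries outlasts it:

```python
def retry_budget_seconds(api_max_retries: int, api_retry_interval: int = 2) -> int:
    """Total time nova-compute keeps retrying a conflicting node update.

    api_retry_interval defaults to 2 seconds here, mirroring the assumed
    nova [ironic] driver default.
    """
    return api_max_retries * api_retry_interval

# Observed bare-metal cleaning time from the bug report: ~5 minutes.
CLEANING_SECONDS = 5 * 60

default_budget = retry_budget_seconds(60)   # 120 s, i.e. "approximately two minutes"
bumped_budget = retry_budget_seconds(180)   # 360 s with the workaround applied

print(default_budget >= CLEANING_SECONDS)   # False: default gives up before cleaning ends
print(bumped_budget >= CLEANING_SECONDS)    # True: 180 retries outlasts the cleaning cycle
```

This is only a back-of-the-envelope model; the real driver's retry loop also spends time in the API calls themselves, so the actual window is somewhat longer than retries times interval.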