Bug 1400488

Summary: Error contacting Ironic server: Node can not be updated while a state transition is in progress. (HTTP 409)
Product: Red Hat OpenStack
Reporter: Francisco Javier Lopez Y Grueber <flg>
Component: openstack-nova
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Prasanth Anbalagan <panbalag>
Severity: high
Priority: unspecified
Version: 9.0 (Mitaka)
CC: berrange, dasmith, eglynn, flg, kchamart, mwitt, sbauza, sferdjao, sgordon, srevivo, vromanso
Hardware: x86_64
OS: Linux
Type: Bug
Last Closed: 2017-07-18 19:45:20 UTC

Attachments:
Error Messages Nova Compute log
Log Entries nova-conductor

Description Francisco Javier Lopez Y Grueber 2016-12-01 11:20:40 UTC
Created attachment 1226768 [details]
Error Messages Nova Compute log

Description of problem:

OpenStack director deployment fails on 5 of 7 nodes after successful introspection, ending with "No valid hosts" errors.

The nova-compute log shows the following:

2016-11-28 11:56:59.182 3170 ERROR ironicclient.common.http [req-0d2f2389-5cdd-411f-bcd6-04f403ff7ccd 1d8702a55de4409a96a8a331dbaded27 74a68f4ebd684585814850cac75f5f53 - - -] Error contacting Ironic server: Node c5f8b9da-4c0b-4296-ab80-64ec78a90bb8 can not be updated while a state transition is in progress. (HTTP 409). Attempt 2 of 2
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [req-0d2f2389-5cdd-411f-bcd6-04f403ff7ccd 1d8702a55de4409a96a8a331dbaded27 74a68f4ebd684585814850cac75f5f53 - - -] [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8] Instance failed to spawn
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8] Traceback (most recent call last):
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2218, in _build_resources
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     yield resources
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2064, in _build_and_run_instance
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     block_device_info=block_device_info)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 704, in spawn
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     self._add_driver_fields(node, instance, image_meta, flavor)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 372, in _add_driver_fields
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     retry_on_conflict=False)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 143, in call
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     return self._multi_getattr(client, method)(*args, **kwargs)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 198, in update
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     method=http_method)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 171, in _update
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     resp, body = self.api.json_request(method, url, body=patch)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 552, in json_request
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     resp = self._http_request(url, method, **kwargs)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 189, in wrapper
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     return func(self, url, method, **kwargs)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 534, in _http_request
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8]     error_json.get('debuginfo'), method, url)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8] Conflict: Node c5f8b9da-4c0b-4296-ab80-64ec78a90bb8 can not be updated while a state transition is in progress. (HTTP 409)
2016-11-28 11:56:59.182 3170 ERROR nova.compute.manager [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8] 
2016-11-28 11:56:59.211 3170 INFO nova.compute.manager [req-0d2f2389-5cdd-411f-bcd6-04f403ff7ccd 1d8702a55de4409a96a8a331dbaded27 74a68f4ebd684585814850cac75f5f53 - - -] [instance: 070c6876-b746-4fb8-96f1-b8e0332d99d8] Terminating instance

Version-Release number of selected component (if applicable):

[root@cci06-util01 ironic-inspector]# rpm -qa |grep ironic 
python-ironic-inspector-client-1.5.0-3.el7ost.noarch
python-ironicclient-1.3.1-1.el7ost.noarch
python-ironic-lib-1.1.0-1.el7ost.noarch
openstack-ironic-api-5.1.2-3.el7ost.noarch
openstack-ironic-inspector-3.2.2-4.el7ost.noarch
openstack-ironic-common-5.1.2-3.el7ost.noarch
openstack-ironic-conductor-5.1.2-3.el7ost.noarch
python-ironic-tests-5.1.2-3.el7ost.noarch
[root@cci06-util01 ironic-inspector]# rpm -qa |grep nova
openstack-nova-13.1.1-7.el7ost.noarch
python-novaclient-3.3.2-1.el7ost.noarch
openstack-nova-common-13.1.1-7.el7ost.noarch
openstack-nova-compute-13.1.1-7.el7ost.noarch
openstack-nova-cert-13.1.1-7.el7ost.noarch
openstack-nova-conductor-13.1.1-7.el7ost.noarch
openstack-nova-novncproxy-13.1.1-7.el7ost.noarch
python-nova-13.1.1-7.el7ost.noarch
openstack-nova-cells-13.1.1-7.el7ost.noarch
openstack-nova-api-13.1.1-7.el7ost.noarch
openstack-nova-network-13.1.1-7.el7ost.noarch
openstack-nova-scheduler-13.1.1-7.el7ost.noarch
python-nova-tests-13.1.1-7.el7ost.noarch
openstack-nova-console-13.1.1-7.el7ost.noarch
[root@cci06-util01 ironic-inspector]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)

How reproducible:


Steps to Reproduce:
1. Delete the stack
2. Redeploy the overcloud (roughly the commands sketched below)
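
A rough sketch of the delete/redeploy cycle, run from the undercloud; the prompt and user are illustrative, and the exact deploy options and environment files are site-specific and omitted here:

# delete the failed overcloud stack, then redeploy with the same templates
[stack@undercloud ~]$ heat stack-delete overcloud
[stack@undercloud ~]$ openstack overcloud deploy --templates -e <site-specific environment files>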

Actual results:

Stack deployment fails; only two nodes have been provisioned (1 Controller and 1 Compute). Nevertheless, the logs tell another story: 'nova list' shows all requested nodes, with the ones that failed in ERROR state.

Our deployment has been ported from OSP 8 to OSP 9.

All templates have been recreated using the newest version of openstack-tripleo-heat-templates. 

The introspection finished successfully after disabling SELinux (which was enabled on OSPd 8).

I also noticed that after the stack is successfully deleted, the failing instances' UUIDs are not cleared from the nodes.

We also deleted all nodes and reinitiated the introspection step, with the same result. 



Expected results:

Stack deploys successfully as expected. 

Additional info:

Comment 1 Francisco Javier Lopez Y Grueber 2016-12-01 11:24:16 UTC
Created attachment 1226769 [details]
Log Entries nova-conductor

Comment 2 Francisco Javier Lopez Y Grueber 2016-12-01 11:27:45 UTC
I've linked:

Bug 1257950 - NodeAssociated errors due to apparent race condition between nova-scheduler and ironic

as it seems to be a very similar problem.

Comment 3 Francisco Javier Lopez Y Grueber 2016-12-01 13:02:33 UTC
Added another bug that is also present in our case.

The fix is targeted for OSP 10.

Deleting the UUIDs from the database does not solve the issue, as it persists when redeploying.

Comment 4 melanie witt 2016-12-01 21:42:19 UTC
From the log, it looks like the root cause is that the network setup failed because the Neutron port was "still in use":

2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager [req-0d2f2389-5cdd-411f-bcd6-04f403ff7ccd 1d8702a55de4409a96a8a331dbaded27 74a68f4ebd684585814850cac75f5f53 - - -] Instance failed network setup after 1 attempt(s)
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager Traceback (most recent call last):
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1570, in _allocate_network_async
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager     bind_host_id=bind_host_id)
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 729, in allocate_for_instance
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager     self._delete_ports(neutron, instance, created_port_ids)
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager     self.force_reraise()
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager     six.reraise(self.type_, self.value, self.tb)
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 718, in allocate_for_instance
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager     security_group_ids, available_macs, dhcp_opts)
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 337, in _create_port
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager     raise exception.PortInUse(port_id=mac_address)
2016-11-28 11:56:54.694 3170 ERROR nova.compute.manager PortInUse: Port f0:4d:a2:74:c7:a6 is still in use.

Are you provisioning nodes in parallel or one at a time?

Are pre-created Neutron ports being used during the deployment? That is, are Neutron ports created ahead of booting the nodes?

I found a ML thread [1] that sounds like the problem you're having: they're able to provision multiple nodes one at a time, but when deploying in a batch, some fail.

There's also a bug [2] mentioned in the thread around pre-created Neutron ports. That fix is part of the Newton release and so should be in OSP 10.

If you're not using pre-created ports, there are a couple of workarounds suggested in the thread. If you're already using Heat ResourceGroup, you could use its batching capability to do batches of 1 with some delay. Or, you can adjust the Nova scheduler settings to increase scheduler_host_subset_size (default is 1) to increase randomness of the host chosen by the scheduler and thus reduce collisions and retries. This is apparently a known issue in the Nova-Ironic interaction that is supposed to be solved by the ongoing Resource Providers work upstream.
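
A minimal sketch of the scheduler_host_subset_size workaround on the undercloud, assuming the Mitaka option name and section ([DEFAULT] in /etc/nova/nova.conf) and an example value of 4; verify both before applying, and crudini is only one convenient way to edit the file:

# pick randomly among the 4 best-ranked hosts instead of always the top one
[stack@undercloud ~]$ sudo crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_subset_size 4
[stack@undercloud ~]$ sudo systemctl restart openstack-nova-scheduler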

If you are using pre-created ports, I think you would also need the bug fix of [2].

[1] http://lists.openstack.org/pipermail/openstack/2016-June/016497.html
[2] https://bugs.launchpad.net/nova/+bug/1544195

Comment 5 Francisco Javier Lopez Y Grueber 2016-12-07 05:53:56 UTC
Hello, 

we are provisioning the whole stack at once. It's an initial overcloud deployment. Requested are 3 Control and 4 Compute nodes. The environment integrates in an existing Ceph environment.

Comment 6 melanie witt 2016-12-08 01:55:56 UTC
Okay. You didn't mention whether you're pre-creating Neutron ports or not.

Assuming you're not pre-creating ports, I would recommend trying to increase the Nova config setting "scheduler_host_subset_size" and see if it improves the situation.

This is a known issue with Nova-Ironic and will be addressed by the ongoing Resource Providers work upstream.

Comment 7 melanie witt 2017-07-18 19:45:20 UTC
Closing this out: there is not enough data to recommend a possible fix, and we didn't receive a response about the suggested workarounds.