Bug 1257950

Summary: NodeAssociated errors due to apparent race condition between nova-scheduler and ironic
Product: Red Hat OpenStack
Reporter: James Slagle <jslagle>
Component: openstack-nova
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED ERRATA
QA Contact: Joe H. Rahme <jhakimra>
Severity: medium
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: berrange, dasmith, dmaley, dtantsur, eglynn, felipe.alfaro, ggillies, hbrock, jcoufal, jmelvin, jraju, kchamart, kmorey, lmartins, mburns, mcornea, mwagner, ndipanov, nlevinki, ohochman, pcaruana, rbiba, rhel-osp-director-maint, sbauza, sferdjao, sgordon, skinjo, vromanso, yeylon
Target Milestone: async
Keywords: OtherQA, Triaged, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-nova-2015.1.2-9.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1337951 (view as bug list)
Environment:
Last Closed: 2016-02-18 16:08:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1264236, 1337951    

Description James Slagle 2015-08-28 13:00:35 UTC
There appears to be a race condition between nova-scheduler and ironic resulting in tracebacks with an error that looks like:

NodeAssociated: Node 2bdb3ad1-4854-4956-977f-41030221984d is associated with instance 4c986d52-31ec-40ae-93bf-5960ccd0c4db.

nova-scheduler is apparently trying to associate an instance with a node that has already been claimed by another instance.

The full traceback from ironic-api.log is:

2015-08-28 12:52:03.004 1695 WARNING wsme.api [-] Client-side error: Node 4c8a48d4-de85-4133-bc25-b0be2b7cd83a is associated with instance 45797815-cd88-4a77-bd93-f323a7f74544.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
    executor_callback))
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
    executor_callback)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 405, in update_node
    node_obj.save()
  File "/usr/lib/python2.7/site-packages/ironic/objects/base.py", line 143, in wrapper
    return fn(self, ctxt, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 265, in save
    self.dbapi.update_node(self.uuid, updates)
  File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 338, in update_node
    return self._do_update_node(node_id, values)
  File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 364, in _do_update_node
    instance=ref.instance_uuid)
NodeAssociated: Node 4c8a48d4-de85-4133-bc25-b0be2b7cd83a is associated with instance 45797815-cd88-4a77-bd93-f323a7f74544.


Although the race is somewhat transient, it is made worse under the following observed scenarios:

- when deploying the same number of instances as there are available nodes (or close to the same number). This is obviously because there are fewer nodes for the scheduler to pick from, and thus a much higher chance of picking one that is already in use.

- when deploying with a very performant undercloud. For instance, this race condition is affecting us nearly 100% of the time in the scale lab with a 24 core / 64GB RAM undercloud. We're trying to deploy 65 instances on 65 nodes, so the previous scenario is also in play.
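
For illustration, the point where the conflict surfaces is the node claim: nova-compute's ironic driver patches instance_uuid onto the node the scheduler picked, and the second writer against an already-claimed node gets the NodeAssociated / HTTP 409 error above. A minimal sketch of that update using python-ironicclient (purely illustrative; credentials and UUIDs below are placeholders, not from our environment):

    # Illustrative sketch only -- the same JSON-patch update the driver issues.
    from ironicclient import client

    ironic = client.get_client(1,
                               os_username='admin',
                               os_password='PASSWORD',
                               os_tenant_name='admin',
                               os_auth_url='http://undercloud.example.com:5000/v2.0')

    patch = [{'op': 'add', 'path': '/instance_uuid',
              'value': '4c986d52-31ec-40ae-93bf-5960ccd0c4db'}]
    ironic.node.update('2bdb3ad1-4854-4956-977f-41030221984d', patch)
    # If another build claimed the node first, this raises
    # ironicclient.exc.Conflict (HTTP 409): "Node ... is associated with
    # instance ...", i.e. the NodeAssociated error in the traceback below.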

Comment 3 James Slagle 2015-08-28 13:05:10 UTC
Likely related upstream bug: https://bugs.launchpad.net/nova/+bug/1341420

However, I've tried the patch from https://review.openstack.org/#/c/106716/2 (referenced from the above launchpad bug) and it had no positive effect in the scale lab (I did verify the patch was applied and was getting the sleep log messages in nova-scheduler.log).

I also tried several other sleep values:
1 second - no improvement
2, 5, and 10 seconds - all of these values caused instances not to be scheduled at all after the first batch of max_concurrent_builds (10) was done.

Comment 4 James Slagle 2015-08-28 14:49:00 UTC
I've uploaded nova and ironic logs to http://file.rdu.redhat.com/~jslagle/bug-1257950/

These are from the scale lab. I reset the log files prior to doing a new deployment, and they show numerous examples of the problem.

Comment 5 James Slagle 2015-08-28 14:52:14 UTC
Note that one suggested workaround is to add a bunch of fake nodes to ironic. This decreases the chance that the scheduler will pick an already in-use node. But you then have to wait for the deployment on the fake node to time out and get rescheduled.

Would it be possible to add special-case handling somewhere for the NodeAssociated error to force a reschedule as soon as it's detected?

Comment 6 Lucas Alvares Gomes 2015-08-28 15:59:22 UTC
Hi @James,

Talking to @Nikola, he pointed me to this spec in nova that was drafted to fix this problem: http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/host-state-level-locking.html

Unfortunately it won't land in Liberty since it got cut at the non-priority feature freeze, but it still gives a good overview of where the problem happens and proposes a fix for it (which may be less hacky and faster to implement than a possible workaround).

Comment 7 James Slagle 2015-08-28 20:11:54 UTC
Through some trial and error, I did find a configuration that allowed me to get 
working deployments again in the scale lab.

I set osapi_compute_workers to 16 to reduce some of the load on the undercloud. This was previously set to 24 since it's a 24 core box.

First, I set ironic.api_max_retries to 5 and ironic.api_retry_interval to 1
in nova.conf. Effectively, we want to retry only a few times and fail quickly.
I also applied this patch from Dmitry so we don't have compounded retries
between nova and ironicclient:
https://code.engineering.redhat.com/gerrit/#/c/54864/

Secondly, I set scheduler_max_attempts=100 in nova.conf. The default is 3, and 
we need to retry a lot more times than that to get around the issue.

Comment 8 James Slagle 2015-08-28 20:15:27 UTC
Also, I set max_concurrent_builds to 100 in nova.conf. The reasoning behind
that is that we don't want to get blocked early on if a few nodes hit the issue
and have to retry a bunch of times. We might as well kick everything off
immediately.
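
For reference, the combined workaround settings from comments 7 and 8 would look roughly like this in nova.conf on the undercloud (values taken from the comments above; adjust osapi_compute_workers for your core count):

    [DEFAULT]
    # cap API workers to reduce load (this box has 24 cores)
    osapi_compute_workers = 16
    # retry scheduling far more often than the default of 3
    scheduler_max_attempts = 100
    # don't block new builds while earlier ones are busy retrying
    max_concurrent_builds = 100

    [ironic]
    # fail fast instead of retrying against ironic for a long time
    api_max_retries = 5
    api_retry_interval = 1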

Comment 9 Graeme Gillies 2015-09-09 02:22:07 UTC
Has there been any progress on fixing this properly? I'm testing 4-node deployments with 1 controller, 2 computes, and 1 ceph node, and I hit this problem in around 80% of my deployments, even with the patches that have been submitted above (I am using the latest code from CDN).

openstack-nova-console-2015.1.0-18.el7ost.noarch
openstack-nova-common-2015.1.0-18.el7ost.noarch
openstack-nova-cert-2015.1.0-18.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch
python-nova-2015.1.0-18.el7ost.noarch
openstack-nova-api-2015.1.0-18.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-18.el7ost.noarch
openstack-nova-conductor-2015.1.0-18.el7ost.noarch
openstack-nova-compute-2015.1.0-18.el7ost.noarch
openstack-nova-scheduler-2015.1.0-18.el7ost.noarch

Not to mention we pretty much hit this at all client sites.

Regards,

Graeme

Comment 10 James Slagle 2015-09-14 17:36:50 UTC
(In reply to Graeme Gillies from comment #9)
> Has there been any progress on fixing this properly? I'm testing 4-node
> deployments with 1 controller, 2 computes, and 1 ceph node, and I hit this
> problem in around 80% of my deployments, even with the patches that have
> been submitted above (I am using the latest code from CDN).

There aren't any additional updates I'm aware of. The bug is aligned against y2, so the plan is to fix it properly at that time.

Comment 11 Lucas Alvares Gomes 2015-09-22 10:23:59 UTC
I don't have an environment where I can reproduce this bug, but I've proposed a workaround that I hope helps to mitigate the problem [1]. Could someone give it a go, please? It basically causes the deployment to fail faster when the node in Ironic is already associated with another instance, so the RetryFilter can find another node for that instance more quickly.

I'm afraid the proper fix for this bug should go into the scheduler and resource tracker in Nova as described in Nikola's spec [2]. The spec hasn't made it into the Liberty cycle (it was cut at the spec freeze), but talking to Nikola he said it will be re-proposed for Mitaka.

[1] https://review.openstack.org/#/c/226235/ 
[2] http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/host-state-level-locking.html
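
The gist of the workaround in [1], as a simplified and purely illustrative sketch (the names below are hypothetical, not the actual nova code): if the node handed to the driver already carries another instance_uuid, give up immediately instead of retrying the 409, so the build fails fast and the RetryFilter can pick a different node.

    # Hypothetical sketch only -- illustrates the fail-fast idea, not the
    # actual patch in https://review.openstack.org/#/c/226235/
    def claim_node(ironicclient, node, instance):
        if node.instance_uuid and node.instance_uuid != instance.uuid:
            # Another build already claimed this node: raise immediately so
            # the instance gets rescheduled, rather than retrying the HTTP 409.
            raise Exception('Node %s is already associated with instance %s'
                            % (node.uuid, node.instance_uuid))
        patch = [{'op': 'add', 'path': '/instance_uuid',
                  'value': instance.uuid}]
        ironicclient.call('node.update', node.uuid, patch)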

Comment 12 Mike Burns 2015-11-06 14:25:01 UTC
Based on comment 11, moving this to nova.

Comment 13 Mike Burns 2015-11-12 13:40:13 UTC
*** Bug 1281318 has been marked as a duplicate of this bug. ***

Comment 14 Eoghan Glynn 2015-11-25 13:15:17 UTC
"We're trying to deploy 65 instances on 65 nodes" ==> this attempt to hit 100% capacity is not the intended usage model (based on the designed statelessness/optimistism in the nova scheduler)

IIUC, a correct solution to this would require an as-yet unimplemented Mitaka blueprint, https://blueprints.launchpad.net/nova/+spec/host-state-level-locking (which obviously is not going to happen for OSP7/z3).

Comment 17 Hugh Brock 2015-11-27 10:49:19 UTC
We were able to verify in internal testing on bare metal that Lucas' patch in Comment 11 above mitigates the problem, and it does not appear to cause any other issues. I'd therefore recommend we move forward with that approach given it at least mitigates the problem, pending a real scheduler fix upstream.

Comment 18 Graeme Gillies 2015-11-30 07:06:00 UTC
Hi,

I applied this patch in one of the environments I was building today (concurrency in nova of 10, the default), and the only thing I noticed was that we seemed to get a few scheduling errors with 'no valid host found', which we have never seen before. It could be related to this, but probably not; I will do some more digging and confirm or deny whether that is the case.

Regards,

Graeme

Comment 19 Graeme Gillies 2015-11-30 07:13:27 UTC
False alarm; it turns out the undercloud disk was full: 22GB of keystone tokens. I guess we don't have a cron job to flush them out on the undercloud? :(
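
(For anyone who hits the same thing: a periodic token flush along these lines, assuming a crontab for the keystone user on the undercloud, keeps the token table from growing without bound.)

    # Hypothetical crontab entry; keystone-manage token_flush purges
    # expired tokens, here run hourly.
    0 * * * * /usr/bin/keystone-manage token_flush >/dev/null 2>&1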

Comment 20 Graeme Gillies 2015-12-01 01:18:58 UTC
Scaled up multiple times today with a concurrency of 10 and haven't seen the problem with this patch, so consider it a +1 from me.

Regards,

Graeme

Comment 21 Lucas Alvares Gomes 2015-12-15 17:28:21 UTC
FYI, the patch has been approved upstream in Nova [0]

[0] https://review.openstack.org/#/c/226235/

Comment 24 Graeme Gillies 2016-01-21 01:45:25 UTC
Created attachment 1116760 [details]
logs of issue potentially occurring on openstack-nova-compute-2015.1.2-13.el7ost.noarch

I believe I am still experiencing this issue (or something like it) on one of my environments currently running

openstack-nova-compute-2015.1.2-13.el7ost.noarch

Attached are the nova log files; towards the top of the log file is where I believe you will see the errors.

Comment 25 Dmitry Tantsur 2016-01-21 09:12:38 UTC
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 354, in _cleanup_deploy
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     self.ironicclient.call('node.update', node.uuid, patch)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 136, in call
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     return self._multi_getattr(client, method)(*args, **kwargs)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 169, in update
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     return self._update(self._path(node_id), patch, method=http_method)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 130, in _update
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     resp, body = self.api.json_request(method, url, body=body)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 353, in json_request
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     resp, body_iter = self._http_request(url, method, **kwargs)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 162, in wrapper
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     return func(self, url, method, **kwargs)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 336, in _http_request
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     error_json.get('debuginfo'), method, url)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e] Conflict: Node 4903f2f3-69de-4a55-9d8e-1f51c5266217 can not be updated while a state transition is in progress. (HTTP 409)

So it looks like we now have conflicts during cleanup. I wonder if we could do fewer retries on cleanup as well. A second thought is why we do cleanup here at all. I need to investigate further what happened.

Comment 26 Dmitry Tantsur 2016-01-21 09:23:38 UTC
Maybe it's not the biggest problem. What worries me is a lot of messages like:

2016-01-18 22:56:20.108 2641 ERROR nova.virt.ironic.driver [req-f5f61ebf-db6a-4365-ac06-e0bb272c7ab6 d36c91db26434f0d9f2bf7f8c5f036f0 330d5e1c76a1470fa083438510c1630c - - -] (u'Failed to request Ironic to provision instance %(inst)s: %(reason)s', {'reason': u'The requested action "active" can not be performed on node "38240f0d-2bf8-45df-91b2-0ba255a85fc1" while it is in state "active". (HTTP 400)', 'inst': '44503803-6417-4fbf-b3fb-22f8775887eb'})
2016-01-18 22:56:20.238 2641 TRACE nova.compute.manager [instance: 44503803-6417-4fbf-b3fb-22f8775887eb] BadRequest: The requested action "active" can not be performed on node "38240f0d-2bf8-45df-91b2-0ba255a85fc1" while it is in state "active". (HTTP 400)

So now we retry on another problem; I'm not sure why.

I would really like to get ironic-conductor logs for the same deployment to track the timeline; could you grab them, please?

Comment 27 Lon Hohberger 2016-02-01 14:33:23 UTC
This should be resolved by openstack-nova-2015.1.2-13.el7ost, available via the OpenStack 7 repository.

Comment 28 Perry Myers 2016-02-01 20:24:29 UTC
Marking a bunch of comments that were private as public. Folks, let's keep the comments public unless they have internal URLs or customer info, etc.

(In reply to Lon Hohberger from comment #27)
> This should be resolved by openstack-nova-2015.1.2-13.el7ost, available via
> the OpenStack 7 repository.

Yes, but @ggillies has concerns that it is not completely fixed, as per the now-public comment #24.

So it could be that -13 fixes the issue completely, partially, or not at all. In any case, it would be good for people to try -13 to see if it resolves the issue, and we can compare with what @ggillies has seen in testing.

Comment 32 nlevinki 2016-02-18 09:58:57 UTC
We lack the HW to test it.
From comment #20 it looks like it is working.

Comment 34 errata-xmlrpc 2016-02-18 16:08:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0261.html