Bug 1257950
Summary: NodeAssociated errors due to apparent race condition between nova-scheduler and ironic
Product: Red Hat OpenStack
Component: openstack-nova
Reporter: James Slagle <jslagle>
Assignee: Lucas Alvares Gomes <lmartins>
QA Contact: Joe H. Rahme <jhakimra>
Status: CLOSED ERRATA
Severity: medium
Priority: high
Version: 7.0 (Kilo)
Target Release: 7.0 (Kilo)
Target Milestone: async
Keywords: OtherQA, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Fixed In Version: openstack-nova-2015.1.2-9.el7ost
Cloned to: 1337951 (view as bug list)
Last Closed: 2016-02-18 16:08:51 UTC
CC: berrange, dasmith, dmaley, dtantsur, eglynn, felipe.alfaro, ggillies, hbrock, jcoufal, jmelvin, jraju, kchamart, kmorey, lmartins, mburns, mcornea, mwagner, ndipanov, nlevinki, ohochman, pcaruana, rbiba, rhel-osp-director-maint, sbauza, sferdjao, sgordon, skinjo, vromanso, yeylon
Bug Blocks: 1264236, 1337951
Description
James Slagle
2015-08-28 13:00:35 UTC
Likely related upstream bug: https://bugs.launchpad.net/nova/+bug/1341420

However, I've tried the patch from https://review.openstack.org/#/c/106716/2 (referenced from the above launchpad bug) and it had no positive effect or improvement in the scale lab (I did verify the patch was applied and was getting the sleep log messages in nova-scheduler.log). I also tried increasing the sleep value to various other values:

1 second - no improvement
2, 5, and 10 seconds - all these values caused instances to not get scheduled at all after the first batch of max_concurrent_builds (10) was done.

I've uploaded nova and ironic logs to http://file.rdu.redhat.com/~jslagle/bug-1257950/ These are from the scale lab. I reset the log files prior to doing a new deployment, and they show numerous examples of the problem.

Note that one suggested workaround is to add a bunch of fake nodes to ironic. This decreases the chance that the scheduler will pick an already in-use node, but you have to wait for the deployment on the fake node to time out and get rescheduled. Would it be possible to add special-case handling somewhere for the NodeAssociated error to force a reschedule as soon as it's detected?

Hi @James, talking to @Nikola he pointed me to this spec in nova that was drafted to fix this problem: http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/host-state-level-locking.html Unfortunately it won't land in Liberty since it got cut at the non-priority freeze, but it still gives a good overview of where the problem happens and proposes a fix for it (which may be less hacky and faster to implement than a possible workaround).

Through some trial and error, I did find a configuration that allowed me to get working deployments again in the scale lab. I set osapi_compute_workers to 16 to reduce some of the load on the undercloud. This was previously set to 24 since it's a 24-core box.
First, I set ironic.api_max_retries to 5 and ironic.api_retry_interval to 1 in nova.conf. Effectively, we want to retry only a few times and fail quickly. I also applied this patch from Dmitry so we don't have compounded retries between nova and ironicclient: https://code.engineering.redhat.com/gerrit/#/c/54864/

Secondly, I set scheduler_max_attempts=100 in nova.conf. The default is 3, and we need to retry a lot more times than that to get around the issue. Also, I set max_concurrent_builds to 100 in nova.conf. The reasoning is that we don't want to get blocked early on if a few nodes hit the issue and have to retry a bunch of times. We might as well kick everything off immediately.

Has there been progress on this to fix it properly? I'm doing testing with 4-node deployments (1 controller, 2 compute, 1 ceph) and I hit this problem in around 80% of my deployments, even with the patches that have been submitted above (I am using the latest code from CDN).

openstack-nova-console-2015.1.0-18.el7ost.noarch
openstack-nova-common-2015.1.0-18.el7ost.noarch
openstack-nova-cert-2015.1.0-18.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch
python-nova-2015.1.0-18.el7ost.noarch
openstack-nova-api-2015.1.0-18.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-18.el7ost.noarch
openstack-nova-conductor-2015.1.0-18.el7ost.noarch
openstack-nova-compute-2015.1.0-18.el7ost.noarch
openstack-nova-scheduler-2015.1.0-18.el7ost.noarch

Not to mention we pretty much hit this at all client sites.

Regards,

Graeme

(In reply to Graeme Gillies from comment #9)
> Has there been progress on this to fix it properly? I'm doing testing with 4
> node deployments doing 1 controller, 2 compute, 1 ceph and I hit this
> problem around 80% of my deployments, even with the patches that have been
> submitted above (I am using latest code from CDN).
>
> openstack-nova-console-2015.1.0-18.el7ost.noarch
> openstack-nova-common-2015.1.0-18.el7ost.noarch
> openstack-nova-cert-2015.1.0-18.el7ost.noarch
> python-novaclient-2.23.0-1.el7ost.noarch
> python-nova-2015.1.0-18.el7ost.noarch
> openstack-nova-api-2015.1.0-18.el7ost.noarch
> openstack-nova-novncproxy-2015.1.0-18.el7ost.noarch
> openstack-nova-conductor-2015.1.0-18.el7ost.noarch
> openstack-nova-compute-2015.1.0-18.el7ost.noarch
> openstack-nova-scheduler-2015.1.0-18.el7ost.noarch
>
> Not to mention we pretty much hit this at all client sites.
>
> Regards,
>
> Graeme

There aren't any additional updates I'm aware of. The bug is aligned against y2, so the plan is to fix it properly at that time.

I don't have an environment where I can reproduce this bug, but I've proposed a workaround that I hope helps to mitigate the problem [1]. Could someone give it a go, please? It basically causes the deployment to fail faster when the node in Ironic is already associated with another instance, so the RetryFilter can go and find another node for that instance more quickly. I'm afraid the proper fix for this bug should go into the scheduler and resource tracker in Nova as described in Nikola's spec [2]; the spec hasn't made it into the Liberty cycle (it was cut off at the spec freeze), but talking to Nikola he said that it will be re-proposed for Mitaka.

[1] https://review.openstack.org/#/c/226235/
[2] http://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/host-state-level-locking.html

Based on comment 11, moving this to nova.

*** Bug 1281318 has been marked as a duplicate of this bug.
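The fail-fast workaround settings discussed in the comments above can be collected into a nova.conf fragment. This is a sketch based on the values reported to work in the scale lab; the section placement is assumed for Kilo-era nova, and the numbers should be tuned per environment:

```ini
# nova.conf on the undercloud (values from this bug's scale-lab testing)
[DEFAULT]
scheduler_max_attempts = 100    # default is 3; far too few when nodes collide
max_concurrent_builds = 100     # don't block later builds behind retrying ones
osapi_compute_workers = 16      # reduce load on the (24-core) undercloud

[ironic]
api_max_retries = 5             # retry only a few times and fail quickly
api_retry_interval = 1          # seconds between retries
```

The intent is not to eliminate the race but to make a losing build fail fast enough that the RetryFilter can reschedule it onto a free node before the deployment times out.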
*** "We're trying to deploy 65 instances on 65 nodes" ==> this attempt to hit 100% capacity is not the intended usage model (based on the designed statelessness/optimism in the nova scheduler). IIUC, a correct solution to this would require an as-yet unimplemented Mitaka blueprint https://blueprints.launchpad.net/nova/+spec/host-state-level-locking (which obviously is not going to happen for OSP7/z3).

We were able to verify in internal testing on bare metal that Lucas' patch in comment 11 above mitigates the problem, and it does not appear to cause any other issues. I'd therefore recommend we move forward with that approach, given it at least mitigates the problem, pending a real scheduler fix upstream.

Hi, I applied this patch in one of the environments I was building today (concurrency in nova of 10, the default), and the only thing I noticed was that we seemed to get a few scheduling errors with 'no valid host found', which we have never seen before. Could be related to this but probably not; I will do some more digging and confirm/deny if that is the case.

Regards,

Graeme

False alarm: it turns out the undercloud disk was full. 22GB of keystone tokens. I guess we don't have a cron job to flush them out on the undercloud? :(

Scaled up multiple times today with concurrency of 10 and haven't seen the problem with this patch, so consider it a +1 from me.

Regards,

Graeme

FYI, the patch has been approved upstream in Nova [0]

[0] https://review.openstack.org/#/c/226235/

Created attachment 1116760 [details]
logs of the issue potentially occurring on openstack-nova-compute-2015.1.2-13.el7ost.noarch
I believe I am still experiencing this issue (or something like it) on one of my environments currently running
openstack-nova-compute-2015.1.2-13.el7ost.noarch
Attached are the nova log files; towards the top of the log file is where I believe you will see the errors:
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 354, in _cleanup_deploy
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     self.ironicclient.call('node.update', node.uuid, patch)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 136, in call
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     return self._multi_getattr(client, method)(*args, **kwargs)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 169, in update
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     return self._update(self._path(node_id), patch, method=http_method)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 130, in _update
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     resp, body = self.api.json_request(method, url, body=body)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 353, in json_request
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     resp, body_iter = self._http_request(url, method, **kwargs)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 162, in wrapper
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     return func(self, url, method, **kwargs)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 336, in _http_request
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e]     error_json.get('debuginfo'), method, url)
2016-01-18 22:36:01.470 2641 TRACE nova.compute.manager [instance: 9e13a6c8-8f27-4e84-abba-46e38e2f157e] Conflict: Node 4903f2f3-69de-4a55-9d8e-1f51c5266217 can not be updated while a state transition is in progress. (HTTP 409)

So it looks like we now have conflicts during cleanup. I wonder if we could make fewer retries on cleanup as well. Second thought: why do we do cleanup here at all? I need to investigate more into what happened. Maybe it's not the biggest problem. What worries me is a lot of messages like:

2016-01-18 22:56:20.108 2641 ERROR nova.virt.ironic.driver [req-f5f61ebf-db6a-4365-ac06-e0bb272c7ab6 d36c91db26434f0d9f2bf7f8c5f036f0 330d5e1c76a1470fa083438510c1630c - - -] (u'Failed to request Ironic to provision instance %(inst)s: %(reason)s', {'reason': u'The requested action "active" can not be performed on node "38240f0d-2bf8-45df-91b2-0ba255a85fc1" while it is in state "active". (HTTP 400)', 'inst': '44503803-6417-4fbf-b3fb-22f8775887eb'})
2016-01-18 22:56:20.238 2641 TRACE nova.compute.manager [instance: 44503803-6417-4fbf-b3fb-22f8775887eb] BadRequest: The requested action "active" can not be performed on node "38240f0d-2bf8-45df-91b2-0ba255a85fc1" while it is in state "active". (HTTP 400)

So now we retry on another problem; I'm not sure why. I would really like to get the ironic-conductor logs for the same deployment to track the timeline. Could you grab them, please?
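The Conflict traceback above comes from nova updating a node while Ironic holds a state-transition lock. A minimal sketch of the fail-fast retry pattern this bug converges on (retry only on 409 Conflict, with a small retry budget so a losing build fails quickly and can be rescheduled); all names here are illustrative, not nova's actual code:

```python
import time

class Conflict(Exception):
    """Stand-in for ironicclient's HTTP 409 Conflict exception."""

def call_with_retries(func, max_retries=5, retry_interval=0.01):
    """Retry an Ironic API call only on 409 Conflict, then give up.

    Keeping max_retries small (cf. ironic.api_max_retries=5 in this bug)
    means a node claimed by another instance surfaces an error quickly,
    so the scheduler's RetryFilter can pick a different node instead of
    the build blocking for minutes on compounded retries.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Conflict:
            if attempt == max_retries:
                raise          # budget exhausted: let the caller reschedule
            time.sleep(retry_interval)

# Simulate a node whose state transition is in progress for two calls,
# then completes, so the third update succeeds.
state = {"busy": 2}

def update_node():
    if state["busy"] > 0:
        state["busy"] -= 1
        raise Conflict("Node can not be updated while a state "
                       "transition is in progress. (HTTP 409)")
    return "updated"

print(call_with_retries(update_node))  # prints "updated"
```

The design trade-off mirrored here is the one debated in the thread: a large retry budget hides the race but stalls the whole deployment, while a small budget converts the race into a fast failure that the scheduler's normal retry path can absorb.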
This should be resolved by openstack-nova-2015.1.2-13.el7ost, available via the OpenStack 7 repository.

Marking a bunch of comments that were private as public. Folks, let's keep the comments public unless they have internal URLs or customer info, etc.

(In reply to Lon Hohberger from comment #27)
> This should be resolved by openstack-nova-2015.1.2-13.el7ost, available via
> the OpenStack 7 repository.

Yes, but @ggillies has concerns that it is not completely fixed, as per the now-public comment #24. So it could be that -13 fixes the issue completely, partially, or not at all. In any case, it would be good for people to try -13 to see if it resolves the issue, and we can compare to what @ggillies has seen in testing.

Lack of HW to test it. From comment #20 it looks like it is working.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0261.html