Bug 1235696
| Summary: | nova fails to schedule Instances when compute node is dead but not "down" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Fabio Massimo Di Nitto <fdinitto> |
| Component: | openstack-nova | Assignee: | Eoghan Glynn <eglynn> |
| Status: | CLOSED WONTFIX | QA Contact: | nlevinki <nlevinki> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.0 (Kilo) | CC: | abeekhof, berrange, cluster-maint, dasmith, eglynn, kchamart, sbauza, sferdjao, sgordon, srevivo, vromanso |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 8.0 (Liberty) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-06-05 17:05:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1185030, 1251948, 1261487 | | |
Description
Fabio Massimo Di Nitto
2015-06-25 14:06:21 UTC
Clarifying comment #0: node exploding == dying a horrible death == `$ echo c > /proc/sysrq-trigger`

This is a well-understood gap at the moment. There are several efforts underway to close this, all of which we hope to land in Liberty. None of them will be candidates for backporting to Kilo, IMHO. First, service groups based on tooz will massively shorten the delay between a node going down and nova noticing. Second, the mark-host-down functionality will allow things like pacemaker to be in the *middle* of decisions about host failure, evacuation, etc. Third, the additional information about in-process evacuations provided by the robustify-evacuation work, in the form of recording evacuations as migrations plus new notifications about progress, will allow things like pacemaker to make much more informed decisions about restarting the process if it does race with a secondary node failure.

Dan, I understand this is a work in progress and probably fixed in Liberty. The issue here is that it defeats the work of Instance HA when certain failures happen at certain times (basically it's not real HA). The problem is not a workaround for when (or how) nova notices the host is down. The problem is that nova tries to schedule something without even checking whether the other end has received the request, and lets things die there. This could happen even when booting a new instance, I guess. There needs to be some level of "hey node X, can you boot an instance?"; if no reply comes back from node X, nova should take action and ask node Y instead. It's not even a matter of whether nova knows that X is down or not. Anyway, we will need some level of fix in Kilo here, since Instance HA is going to be a flagship feature for OSP7.

Realistically there is nothing that can be done about this for 7.0, so I'm marking it for 8.0/7.0.z.

I believe we need to enhance the fence agent for Nova to use the new mark-host-down API, where it is available, to handle this.

(In reply to Stephen Gordon from comment #13)
> I believe we need to enhance the fence agent for Nova to use the new
> mark-host-down API, where it is available, to handle this.

Can you please not bounce bugs around if you are not sure what they are for? If you are in doubt, please ask before proceeding. With or without the new nova mark-host-down API, there is still a race condition within nova. This bug is about that specific race condition, which has to be fixed as pointed out in the comments above.

The mark-host-down API was merged in Liberty (OSP8). It's an API and internals change, so it won't be backported to anything before that. But it sounds like that API doesn't really even help your use case. Nova's boot operation is fundamentally a cast, and that's not really ever going to change, AFAIK. Specific work items that make it easier to externally detect that an instance boot has been dropped on the floor are certainly up for discussion. I don't think this generalized bug has any specific work to do, so +1 for closing.
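
For context on the "boot operation is fundamentally a cast" point: in oslo.messaging, which nova uses for RPC, a cast is fire-and-forget, so a request dispatched to a dead compute node is never acknowledged and no error reaches the caller. Below is a minimal sketch of that distinction, assuming a configured transport; the topic, server, and method names are illustrative placeholders, not nova's actual compute RPC API.

```python
# Minimal sketch of cast vs. call in oslo.messaging. The topic, server,
# and method names are illustrative placeholders, not Nova's RPC API.
import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_transport(cfg.CONF)
target = messaging.Target(topic='compute', server='node-x')
client = messaging.RPCClient(transport, target)
ctxt = {}  # request context (simplified)

# cast(): drop the message on the bus and return immediately.
# If 'node-x' is dead, nothing consumes it and the caller gets no
# error -- the "dropped on the floor" scenario described above.
client.cast(ctxt, 'boot_instance', instance_uuid='abc-123')

# call(): block until the remote side replies, or raise
# MessagingTimeout if no consumer answers in time, at which point
# the caller could retry against another node.
try:
    client.prepare(timeout=10).call(ctxt, 'boot_instance',
                                    instance_uuid='abc-123')
except messaging.MessagingTimeout:
    print('node-x never answered; reschedule elsewhere')
```

And a hedged sketch of how an external HA agent, such as the fence agent discussed above, might use the mark-host-down (force-down) API that landed in Liberty, via python-novaclient and compute API microversion 2.11. The keystone credentials and host name are placeholder assumptions.

```python
# Sketch: marking a compute service down from an external HA agent
# using the force-down API (compute microversion 2.11, Liberty).
# The auth settings and host name below are placeholder assumptions.
from keystoneauth1 import loading, session
from novaclient import client as nova_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='admin', password='secret',
                                project_name='admin',
                                user_domain_name='Default',
                                project_domain_name='Default')
sess = session.Session(auth=auth)

nova = nova_client.Client('2.11', session=sess)
# Tell nova the compute service on 'compute-1' is down, instead of
# waiting for the service-group heartbeat to time out.
nova.services.force_down('compute-1', 'nova-compute', True)
```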