Once again, this came up while testing Instance HA, but the issue can be reproduced without pacemaker or HA. It is time sensitive, so pay close attention to the sequence of events:

1) compute node1, with X instances on it, dies a horrible death
2) compute node1 is marked "down" in nova
3) compute node2 explodes (but is still marked "up" in nova)
4) evacuation of the instances from node1 is requested immediately after #3
5) nova can potentially schedule instances on node2 (already dead, but nova doesn't know it yet) and on node3 (perfectly functional)
6) the instances go into REBUILD state (expected)
7) the instances scheduled on node2 (dead) go to ERROR after some time
8) the instances scheduled on node3 boot fine

The problem here is that nova attempts to schedule onto node2 (which is fine, node2 is still marked alive) but never checks for a reply from node2 confirming that the command succeeded (it looks like one-way communication), and once node2 is marked down, nova takes no action to reschedule those instances onto node3 (as one would expect it to). The instances scheduled on node2 enter ERROR state once node2 rejoins the compute cluster.
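For reference, a rough sketch of how I drive the reproduction (hostnames and instance names are placeholders; timing matters, and step 4 has to follow step 3 immediately):

# on node1, and later on node2: crash the node hard
$ echo c > /proc/sysrq-trigger

# on a controller: wait until node1 is reported down, then crash node2
# the same way while it is still reported "up"
$ nova service-list --binary nova-compute

# immediately request evacuation of node1's instances, letting the
# scheduler pick the destination
$ nova evacuate instance-0001
$ nova evacuate instance-0002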
Clarifying comment #0: node exploding == dying a horrible death == $ echo c > /proc/sysrq-trigger
This is a well-understood gap at the moment. There are several efforts underway to close it, all of which we hope to land in Liberty. None of them will be candidates for backporting to Kilo, IMHO. First, service groups based on tooz will massively shorten the delay between a node going down and nova noticing. Second, the mark-host-down functionality will allow things like pacemaker to be in the *middle* of decisions about host failure, evacuation decisions, etc. Third, the robustify-evacuation work will record evacuations as migrations and emit new notifications about their progress; that additional information about in-process evacuations will allow things like pacemaker to make much more informed decisions about restarting the process if it does race with a secondary node failure.
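To illustrate the third point: assuming the robustify-evacuation work lands and evacuations start showing up as migration records, external tooling could inspect them with something like the following (sketch only; the hostname is a placeholder and the exact filters and output depend on the novaclient and API versions in use):

# list migration records involving the failed host, so a watcher can tell
# whether an evacuation is still in flight, finished, or never picked up
$ nova migration-list --host overcloud-compute-1.localdomain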
Dan, I understand this is a work in progress and will probably be fixed in Liberty. The issue here is that it defeats the purpose of Instance HA when certain failures happen at particular times (basically it's not real HA). The problem here is not working around when (or how) nova notices the host is down. The problem is that nova tries to schedule something, never checks whether the other end has received the request, and lets things die there. This could happen even when booting a new instance, I guess. There should be some level of: "hey node X, can you boot an instance?" and, if no reply comes back from node X, nova should take action and ask node Y instead. It's not even a matter of whether nova knows that X is down or not. Anyway, we will need some level of fix in Kilo here, since Instance HA is going to be a flagship feature for OSP7.
Realistically there is nothing that can be done about this for 7.0, so I'm marking for 8.0/7.0.z.
I believe we need to enhance the fence agent for Nova to use the new mark host down API, where it is available, to handle this.
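For what it's worth, a minimal sketch of what that could look like, assuming the API is available and exposed through python-novaclient (the hostname and instance name are placeholders; this is not the actual fence agent change):

# tell nova the compute service on the fenced host is down so the
# scheduler stops considering it, then request evacuation as usual
$ nova service-force-down overcloud-compute-2.localdomain nova-compute
$ nova evacuate instance-0001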
(In reply to Stephen Gordon from comment #13)
> I believe we need to enhance the fence agent for Nova to use the new mark
> host down API, where it is available, to handle this.

Can you please not bounce bugs around if you are not sure what they are for? If you are in doubt, please ask before proceeding.
With or without the new nova mark-host-down API, there is still a race condition within nova. This bug is about that specific race condition, which has to be fixed, as pointed out in the comments above.
The mark-host-down API was merged in Liberty (OSP8). It's an API and internals change, so it won't be backported to anything before that. But, it sounds like that API doesn't really even help your use case. Nova's boot operation is fundamentally a cast and that's not really going to ever change, AFAIK. Specific work items that make it easier to externally detect that an instance boot has been dropped on the floor are certainly up for discussion. I don't think this generalized bug has any specific work to do, so +1 for closing.
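As one example of what such an external check might look like (a sketch only, nothing here is an agreed-upon work item; the hostname is a placeholder and the output handling is deliberately naive):

# if the compute service on the failed host is down, list instances that
# are still stuck in REBUILD there; each of those is a candidate for
# re-requesting evacuation with "nova evacuate <uuid>"
$ host=overcloud-compute-2.localdomain
$ nova service-list --host "$host" --binary nova-compute | grep -q ' down ' && \
      nova list --all-tenants --host "$host" --status REBUILD --minimal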