This appears to be a bug in pacemaker's scheduling, specifically that the probe of the resource does not wait for the unfencing to successfully complete. More investigation will be needed to confirm and devise a fix.
The bug is in Pacemaker's handling of a deprecated syntax that is present in this configuration (most likely created by the OpenStack installer). The following will correct the issue: pcs resource meta compute-unfence-trigger requires=unfencing The legacy syntax (deprecated for as long as Pacemaker has been in RHEL) is: <op id="compute-unfence-trigger-start-interval-0s" interval="0s" name="start" requires="unfencing"/> The current syntax is to set "requires" in the resource meta-attributes instead of the start operation. There appears to be a bug in pacemaker's support for the legacy syntax; I am still investigating whether that is a recent regression and how to fix it. Also, we'll need to confirm whether the OpenStack installer introduced the deprecated syntax, and clone this bz to address that issue if so.
I have confirmed that this is not a regression, but the intentional difference between the deprecated and current syntax: the deprecated syntax supports only resources that require unfencing before the start operation, while the current syntax was added to also support resources (such as here) that require unfencing before probes.
(In reply to Ken Gaillot from comment #1) > This appears to be a bug in pacemaker's scheduling, specifically that the > probe of the resource does not wait for the unfencing to successfully > complete. More investigation will be needed to confirm and devise a fix. Just for completion, the timing of probing the resource has no baring on the unfencing operation. The unfencing is just clearing a flag in nova, the machine is already powered on.
Instruction for testing (based on https://review.opendev.org/#/c/656256): 1. Deploy an HA overcloud with Instance HA 2. Make sure that the compute-unfence-trigger resource has the requires=unfencing flag set as a meta attribute: # pcs cluster cib [...] <clone id="compute-unfence-trigger-clone"> <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy"> <meta_attributes id="compute-unfence-trigger-meta_attributes"> <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/> </meta_attributes> <operations> [...] 3. run regular instance HA tests
(In reply to Damien Ciabrini from comment #19) > Instruction for testing (based on https://review.opendev.org/#/c/656256): > > 1. Deploy an HA overcloud with Instance HA > > 2. Make sure that the compute-unfence-trigger resource has the > requires=unfencing flag set as a meta attribute: > > # pcs cluster cib > [...] > <clone id="compute-unfence-trigger-clone"> > <primitive class="ocf" id="compute-unfence-trigger" > provider="pacemaker" type="Dummy"> > <meta_attributes id="compute-unfence-trigger-meta_attributes"> > <nvpair id="compute-unfence-trigger-meta_attributes-requires" > name="requires" value="unfencing"/> > </meta_attributes> > <operations> > [...] > > 3. run regular instance HA tests deployed puddle 2019-06-13.2 attribute is there: <clone id="compute-unfence-trigger-clone"> <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy"> <meta_attributes id="compute-unfence-trigger-meta_attributes"> <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/> </meta_attributes> tested booting a vm, crashing the compute where it was running. compute was fenced, vm migrated, host nova service disabled then re-enabled after a short while. looks ok to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1738