Bug 1701256 - [osp13] A pacemaker_remoted node fails monitor (probe) and stop operations on a resource because it returns "rc=189"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z7
Target Release: 13.0 (Queens)
Assignee: Michele Baldessari
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-18 13:54 UTC by Shane Bradley
Modified: 2019-12-29 00:57 UTC (History)

Fixed In Version: puppet-tripleo-8.4.1-7.el7ost
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-10 13:05:11 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 656256 | 0 | None | MERGED | Move unfencing to meta_params | 2021-01-29 00:38:52 UTC
Red Hat Knowledge Base (Solution) | 3390831 | 0 | None | None | None | 2019-04-18 13:57:31 UTC
Red Hat Knowledge Base (Solution) | 4056221 | 0 | None | None | None | 2019-04-18 13:56:55 UTC
Red Hat Product Errata | RHBA-2019:1738 | 0 | None | None | None | 2019-07-10 13:05:23 UTC

Comment 1 Ken Gaillot 2019-04-22 18:20:58 UTC
This appears to be a bug in pacemaker's scheduling, specifically that the probe of the resource does not wait for the unfencing to successfully complete. More investigation will be needed to confirm and devise a fix.

Comment 3 Ken Gaillot 2019-04-22 22:32:07 UTC
The bug is in Pacemaker's handling of a deprecated syntax that is present in this configuration (most likely created by the OpenStack installer). The following will correct the issue:

  pcs resource meta compute-unfence-trigger requires=unfencing

The legacy syntax (deprecated for as long as Pacemaker has been in RHEL) is:

    <op id="compute-unfence-trigger-start-interval-0s" interval="0s" name="start" requires="unfencing"/>

The current syntax is to set "requires" in the resource meta-attributes instead of the start operation. There appears to be a bug in pacemaker's support for the legacy syntax; I am still investigating whether that is a recent regression and how to fix it. Also, we'll need to confirm whether the OpenStack installer introduced the deprecated syntax, and clone this bz to address that issue if so.
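
To make the distinction concrete, here is a minimal sketch (not part of the original report) that scans a CIB XML document for `requires` set on an `<op>` element, which is the legacy form shown above. The function name `find_legacy_requires` and the sample fragment are hypothetical; the sample is trimmed to the elements quoted in this bug, and the sketch assumes the CIB has been saved as text, for example from `pcs cluster cib`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a trimmed CIB fragment using the deprecated op-level syntax.
SAMPLE_CIB = """\
<resources>
  <clone id="compute-unfence-trigger-clone">
    <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
      <operations>
        <op id="compute-unfence-trigger-start-interval-0s" interval="0s" name="start" requires="unfencing"/>
      </operations>
    </primitive>
  </clone>
</resources>
"""


def find_legacy_requires(cib_xml):
    """Return (resource id, op id, requires value) for each <op> that sets 'requires'."""
    root = ET.fromstring(cib_xml)
    hits = []
    for primitive in root.iter("primitive"):
        for op in primitive.iter("op"):
            if "requires" in op.attrib:
                hits.append((primitive.get("id"), op.get("id"), op.get("requires")))
    return hits


if __name__ == "__main__":
    for rsc, op_id, value in find_legacy_requires(SAMPLE_CIB):
        # Suggest the meta-attribute form given earlier in this comment.
        print(f"{rsc}: op {op_id} sets requires={value}; "
              f"consider: pcs resource meta {rsc} requires={value}")
```

Any resource this reports is a candidate for the `pcs resource meta` command above; the sketch only reads the XML and changes nothing in the cluster.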

Comment 5 Ken Gaillot 2019-04-23 15:07:42 UTC
I have confirmed that this is not a regression, but the intentional difference between the deprecated and current syntax: the deprecated syntax supports only resources that require unfencing before the start operation, while the current syntax was added to also support resources (such as here) that require unfencing before probes.

Comment 10 Andrew Beekhof 2019-05-10 08:57:52 UTC
(In reply to Ken Gaillot from comment #1)
> This appears to be a bug in pacemaker's scheduling, specifically that the
> probe of the resource does not wait for the unfencing to successfully
> complete. More investigation will be needed to confirm and devise a fix.

Just for completeness: the timing of probing the resource has no bearing on the unfencing operation.
The unfencing just clears a flag in nova; the machine is already powered on.

Comment 19 Damien Ciabrini 2019-06-20 12:56:44 UTC
Instructions for testing (based on https://review.opendev.org/#/c/656256):

1. Deploy an HA overcloud with Instance HA

2. Make sure that the compute-unfence-trigger resource has the requires=unfencing flag set as a meta attribute:

# pcs cluster cib
[...]
      <clone id="compute-unfence-trigger-clone">
        <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
          <meta_attributes id="compute-unfence-trigger-meta_attributes">
            <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
          </meta_attributes>
          <operations>
[...]

3. Run the regular Instance HA tests
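
Step 2 can also be checked by script rather than by eye. The following is a minimal sketch under stated assumptions: the helper name `has_requires_unfencing` is hypothetical, the sample is the fragment from step 2 with the elided parts dropped and the tags closed so that it parses, and only meta attributes set directly on the primitive are inspected (attributes set on the enclosing clone are not considered).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the fragment from step 2, trimmed and closed so it parses.
SAMPLE_CIB = """\
<resources>
  <clone id="compute-unfence-trigger-clone">
    <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
      <meta_attributes id="compute-unfence-trigger-meta_attributes">
        <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
      </meta_attributes>
    </primitive>
  </clone>
</resources>
"""


def has_requires_unfencing(cib_xml, resource_id="compute-unfence-trigger"):
    """True if the primitive has a requires=unfencing nvpair in its own meta_attributes."""
    root = ET.fromstring(cib_xml)
    for primitive in root.iter("primitive"):
        if primitive.get("id") != resource_id:
            continue
        for meta in primitive.findall("meta_attributes"):
            for nvpair in meta.findall("nvpair"):
                if nvpair.get("name") == "requires" and nvpair.get("value") == "unfencing":
                    return True
    return False


if __name__ == "__main__":
    print(has_requires_unfencing(SAMPLE_CIB))
```

On a deployed cluster, the same function could be fed the saved output of `pcs cluster cib` instead of the sample string.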

Comment 20 Luca Miccini 2019-06-20 19:55:43 UTC
(In reply to Damien Ciabrini from comment #19)

Deployed puddle 2019-06-13.2.

The attribute is there:

      <clone id="compute-unfence-trigger-clone">
        <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
          <meta_attributes id="compute-unfence-trigger-meta_attributes">
            <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
          </meta_attributes>

Tested by booting a VM and then crashing the compute node it was running on.
The compute node was fenced, the VM was migrated, and the host's nova service was disabled and then re-enabled after a short while.

Looks OK to me.

Comment 23 errata-xmlrpc 2019-07-10 13:05:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738

