1701256 – [osp13] A pacemaker_remoted node fails monitor (probe) and stop operations on a resource because it returns "rc=189"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-tripleo
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	z7
Target Release:	13.0 (Queens)
Assignee:	Michele Baldessari
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-18 13:54 UTC by Shane Bradley
Modified:	2019-12-29 00:57 UTC (History)
CC List:	12 users
Fixed In Version:	puppet-tripleo-8.4.1-7.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-10 13:05:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	656256	None	MERGED	Move unfencing to meta_params	2021-01-29 00:38:52 UTC
Red Hat Knowledge Base (Solution)	3390831	None	None	None	2019-04-18 13:57:31 UTC
Red Hat Knowledge Base (Solution)	4056221	None	None	None	2019-04-18 13:56:55 UTC
Red Hat Product Errata	RHBA-2019:1738	None	None	None	2019-07-10 13:05:23 UTC

Comment 1 Ken Gaillot 2019-04-22 18:20:58 UTC

This appears to be a bug in pacemaker's scheduling, specifically that the probe of the resource does not wait for the unfencing to successfully complete. More investigation will be needed to confirm and devise a fix.

Comment 3 Ken Gaillot 2019-04-22 22:32:07 UTC

The bug is in Pacemaker's handling of a deprecated syntax that is present in this configuration (most likely created by the OpenStack installer). The following will correct the issue:

  pcs resource meta compute-unfence-trigger requires=unfencing

The legacy syntax (deprecated for as long as Pacemaker has been in RHEL) is:

    <op id="compute-unfence-trigger-start-interval-0s" interval="0s" name="start" requires="unfencing"/>

The current syntax is to set "requires" in the resource meta-attributes instead of the start operation. There appears to be a bug in pacemaker's support for the legacy syntax; I am still investigating whether that is a recent regression and how to fix it. Also, we'll need to confirm whether the OpenStack installer introduced the deprecated syntax, and clone this bz to address that issue if so.
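
As a rough illustration of the distinction above, the following Python sketch scans CIB XML for operations that still carry a "requires" attribute, i.e. the legacy placement. The sample fragment and the function name find_legacy_requires are made up for this example and are not part of Pacemaker or pcs; only the <op> line is taken from this comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment; only the <op> element is quoted from the bug report.
LEGACY_SAMPLE = """
<resources>
  <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
    <operations>
      <op id="compute-unfence-trigger-start-interval-0s" interval="0s" name="start" requires="unfencing"/>
    </operations>
  </primitive>
</resources>
"""


def find_legacy_requires(cib_xml):
    """Return (resource id, op id, requires value) for every <op> that sets 'requires'."""
    root = ET.fromstring(cib_xml)
    hits = []
    for primitive in root.iter("primitive"):
        for op in primitive.iter("op"):
            # In the legacy syntax, 'requires' sits on the operation itself
            # rather than in the resource's meta-attributes.
            if "requires" in op.attrib:
                hits.append((primitive.get("id"), op.get("id"), op.get("requires")))
    return hits


if __name__ == "__main__":
    for resource_id, op_id, value in find_legacy_requires(LEGACY_SAMPLE):
        print(f"{resource_id}: op {op_id} sets requires={value} (legacy placement)")
```

Any resource reported by such a scan would be a candidate for the "pcs resource meta ... requires=unfencing" correction shown earlier in this comment.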

Comment 5 Ken Gaillot 2019-04-23 15:07:42 UTC

I have confirmed that this is not a regression, but the intentional difference between the deprecated and current syntax: the deprecated syntax supports only resources that require unfencing before the start operation, while the current syntax was added to also support resources (such as here) that require unfencing before probes.

Comment 10 Andrew Beekhof 2019-05-10 08:57:52 UTC

(In reply to Ken Gaillot from comment #1)
> This appears to be a bug in pacemaker's scheduling, specifically that the
> probe of the resource does not wait for the unfencing to successfully
> complete. More investigation will be needed to confirm and devise a fix.

Just for completeness, the timing of probing the resource has no bearing on the unfencing operation.
The unfencing is just clearing a flag in nova, the machine is already powered on.

Comment 19 Damien Ciabrini 2019-06-20 12:56:44 UTC

Instructions for testing (based on https://review.opendev.org/#/c/656256):

1. Deploy an HA overcloud with Instance HA

2. Make sure that the compute-unfence-trigger resource has the requires=unfencing flag set as a meta attribute:

# pcs cluster cib
[...]
      <clone id="compute-unfence-trigger-clone">
        <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
          <meta_attributes id="compute-unfence-trigger-meta_attributes">
            <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
          </meta_attributes>
          <operations>
[...]

3. Run the regular Instance HA tests
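
The check in step 2 could also be scripted. The sketch below is only an illustration: it parses CIB XML (for example the output of "pcs cluster cib" saved to a string) and returns the value of the "requires" meta attribute for a given primitive. The function name requires_meta and the trimmed sample document are assumptions made for this example; the element and attribute names follow the excerpt in step 2.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the excerpt in step 2; a real CIB contains much more.
FIXED_SAMPLE = """
<resources>
  <clone id="compute-unfence-trigger-clone">
    <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
      <meta_attributes id="compute-unfence-trigger-meta_attributes">
        <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
      </meta_attributes>
    </primitive>
  </clone>
</resources>
"""


def requires_meta(cib_xml, resource_id):
    """Return the 'requires' meta-attribute value of a primitive, or None if it is unset."""
    root = ET.fromstring(cib_xml)
    for primitive in root.iter("primitive"):
        if primitive.get("id") != resource_id:
            continue
        # Only look at nvpairs directly under the primitive's own meta_attributes.
        for nvpair in primitive.findall("./meta_attributes/nvpair"):
            if nvpair.get("name") == "requires":
                return nvpair.get("value")
    return None


if __name__ == "__main__":
    value = requires_meta(FIXED_SAMPLE, "compute-unfence-trigger")
    print("requires =", value)
```

A result of "unfencing" corresponds to the expected configuration; None would suggest the meta attribute is missing and the deployment does not contain the fix.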

Comment 20 Luca Miccini 2019-06-20 19:55:43 UTC

(In reply to Damien Ciabrini from comment #19)
> Instruction for testing (based on https://review.opendev.org/#/c/656256):
> 
> 1. Deploy an HA overcloud with Instance HA
> 
> 2. Make sure that the compute-unfence-trigger resource has the
> requires=unfencing flag set as a meta attribute:
> 
> # pcs cluster cib
> [...]
>       <clone id="compute-unfence-trigger-clone">
>         <primitive class="ocf" id="compute-unfence-trigger"
> provider="pacemaker" type="Dummy">
>           <meta_attributes id="compute-unfence-trigger-meta_attributes">
>             <nvpair id="compute-unfence-trigger-meta_attributes-requires"
> name="requires" value="unfencing"/>
>           </meta_attributes>
>           <operations>
> [...]
> 
> 3. run regular instance HA tests

deployed puddle 2019-06-13.2

attribute is there:

      <clone id="compute-unfence-trigger-clone">
        <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
          <meta_attributes id="compute-unfence-trigger-meta_attributes">
            <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
          </meta_attributes>

Tested by booting a VM and then crashing the compute node it was running on.
The compute node was fenced, the VM was migrated, and the host's nova service was disabled and then re-enabled after a short while.

looks ok to me.

Comment 23 errata-xmlrpc 2019-07-10 13:05:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738
