1881114 – galera resource agent fails promotion during a rolling restart

Summary: galera resource agent fails promotion during a rolling restart

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	8.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	8.0
Assignee:	Oyvind Albrigtsen
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-21 14:43 UTC by Damien Ciabrini
Modified:	2021-05-18 15:11 UTC
CC List:	8 users
Fixed In Version:	resource-agents-4.1.1-70.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-05-18 15:11:15 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	ClusterLabs resource-agents pull 1557	0	None	closed	galera: recover from joining a non existing cluster	2021-01-20 03:47:43 UTC

Description Damien Ciabrini 2020-09-21 14:43:46 UTC

Description of problem:
Because galera is a master/slave (M/S) resource, the resource agent decides
when and how to promote a resource replica (i.e. start a galera process locally)
based on the current state of the entire galera cluster:

  . If there's no galera cluster, the replica is promoted as the
    bootstrap node: it will start a galera process that will bootstrap
    a new galera cluster.

  . If there's a running galera cluster: the replica is promoted as a
    joiner node: it will join the running galera cluster, and will
    never try to bootstrap a new cluster.
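
The two cases above can be summarized as a small decision function. This is only an illustrative sketch, not the agent's actual code: the function name choose_promote_role and its single argument (the number of galera nodes currently seen running) are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the galera resource agent.
# $1: number of galera nodes currently observed running (assumed input).
choose_promote_role() {
    running_masters="$1"
    if [ "$running_masters" -eq 0 ]; then
        # No galera cluster exists: this replica starts a new one.
        echo "bootstrap"
    else
        # A cluster is running: join it, never bootstrap.
        echo "joiner"
    fi
}

choose_promote_role 0   # prints "bootstrap"
choose_promote_role 2   # prints "joiner"
```

The point to note is that the answer depends only on the state observed at the time the function is called.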

When one changes a property of a pacemaker resource, pacemaker must restart
the resource on all the nodes. For instance for galera:

[root@controller-0 ~]# pcs resource show galera
 Resource: galera (class=ocf provider=heartbeat type=galera)
  Attributes: additional_parameters=--open-files-limit=16384 cluster_host_map=controller-0:controller-0.internalapi.redhat.local;controller-1:controller-1.internalapi.redhat.local;controller-2:controller-2.internalapi.redhat.local enable_creation=true log=/var/log/mysql/mysqld.log wsrep_cluster_address=gcomm://controller-0.internalapi.redhat.local,controller-1.internalapi.redhat.local,controller-2.internalapi.redhat.local
  Meta Attrs: container-attribute-target=host master-max=3 ordered=true
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              monitor interval=10s role=Master timeout=30s (galera-monitor-interval-10s)
              monitor interval=30s role=Slave timeout=30s (galera-monitor-interval-30s)
              promote interval=0s on-fail=block timeout=300s (galera-promote-interval-0s)
              start interval=0s timeout=120s (galera-start-interval-0s)
              stop interval=0s timeout=120s (galera-stop-interval-0s)

[root@controller-0 ~]# pcs resource update galera additional_parameters=--open-files-limit=20000

For all galera replicas, pacemaker will trigger a demote, then the resource agent
will eventually request a promote operation, and configure the local node as a
bootstrap node or a joiner node depending on the state of the galera cluster
when the agent was called.

During such a rolling restart, one galera node can request a promotion
as a joiner node because, when the agent was called, some replicas were
still running as Master.

However, there can be some time between the moment when a promotion is
requested for a node and the moment when the promote operation actually
takes place. If a node is promoted to join a cluster, and all the running
galera nodes are stopped before the promote operation starts, the joining
node won't be able to join the cluster. It can't bootstrap a new one either,
because it doesn't have the most recent copy of the DB.

This promotion window makes the resource agent fail its promotion, and
blocks the replica on this node until a manual "pcs resource cleanup galera"
is executed.


Version-Release number of selected component (if applicable):
resource-agents-4.1.1-61.el7.x86_64

How reproducible:
Timing-dependent, but happens almost always in an OpenStack HA control plane

Steps to Reproduce:
1. Deploy an HA overcloud
2. On a controller node, change a resource parameter as shown above

Actual results:
One replica failed to restart because all Masters were stopped before
the promotion actually took place:

* galera_promote_0 on galera-bundle-2 'unknown error' (1): call=998, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon Sep 21 14:30:34 2020', queued=0ms, exec=1535ms
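
When reproducing, the failure can be spotted by searching the cluster status output for a failed promote action like the one above. A minimal sketch, assuming the status text is piped in on standard input: the helper name has_failed_galera_promote is hypothetical, and the pattern is based only on the line format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if stdin contains a failed galera promote
# action formatted like the one quoted in this report.
has_failed_galera_promote() {
    grep -q "galera_promote_0 on .* 'unknown error'"
}

# Sample input shortened from the failed action shown in this report.
sample="* galera_promote_0 on galera-bundle-2 'unknown error' (1): call=998, status=complete"
if printf '%s\n' "$sample" | has_failed_galera_promote; then
    echo "failed galera promote found"
fi
```

On a live cluster the same helper could be fed with the output of "pcs status".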


Expected results:
Promotion should no longer be attempted once the Masters are gone; this shouldn't be fatal.
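
One way to picture this expected behaviour: the promote operation re-checks the cluster state when it actually runs, instead of trusting a decision made earlier. The sketch below is hypothetical and is not the code of the fix: promote_as_joiner, its argument, and the echoed actions are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the resource agent's implementation.
# $1: number of galera nodes still running when promote actually executes
#     (assumed input).
promote_as_joiner() {
    running_masters="$1"
    if [ "$running_masters" -gt 0 ]; then
        # A cluster still exists: join it as planned.
        echo "join"
    else
        # The cluster vanished between the decision and the promote:
        # step back instead of reporting a fatal promote failure.
        echo "demote"
    fi
    return 0
}

promote_as_joiner 2   # prints "join"
promote_as_joiner 0   # prints "demote"
```

In both branches the function returns 0, which is the sketch's way of saying that the second case is treated as recoverable rather than as an error.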

Comment 8 Damien Ciabrini 2020-12-04 08:57:39 UTC

Steps to verify the fix:

with the old resource agent:

. deploy an HA overcloud

. on a controller, update a random parameter in the galera resource

pcs resource update galera additional_parameters=--open-files-limit=20000

. observe that pacemaker restarts the resource on all nodes due to the config change.

. one of the nodes should fail to restart. This is a racy failure, but odds are high that the failure will happen.


with the new resource agent:

. deploy an HA overcloud

. on a controller, update a random parameter in the galera resource

pcs resource update galera additional_parameters=--open-files-limit=20000

. observe that pacemaker restarts the resource on all nodes due to the config change.

. the resource will be promoted back to Master on all nodes. Some nodes may have retried the promotion, in which case some logs will be present in the journal:

"There is no running cluster to join, demoting ourself"

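To check whether a node retried its promotion, the journal can be searched for the message quoted above. A minimal sketch, assuming the log text arrives on standard input: the helper name count_promotion_retries is hypothetical, and this report does not say which journal (host or container) holds the message.

```shell
#!/bin/sh
# Hypothetical helper: counts log lines on stdin that contain the retry
# message quoted in this report.
count_promotion_retries() {
    grep -c "There is no running cluster to join, demoting ourself"
}

# Made-up sample input: one matching line and one unrelated line.
printf '%s\n' \
    "There is no running cluster to join, demoting ourself" \
    "unrelated line" | count_promotion_retries
# prints 1
```

On a real node the input would come from something like "journalctl | count_promotion_retries".
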
Comment 9 pkomarov 2020-12-05 00:09:02 UTC

Verified,
[stack@undercloud-0 ~]$ ansible database -b -mshell -a'podman exec `podman ps -f name=galera-bundle -q`  sh -c "rpm -q resource-agents";rpm -q resource-agents'
[WARNING]: Found both group and host with same name: undercloud
database-0 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64
database-2 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64
database-1 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64

#pcs resource update galera additional_parameters=--open-files-limit=20000
    * galera-bundle-2   (ocf::heartbeat:galera):         Demoting database-0

    * galera-bundle-0   (ocf::heartbeat:galera):         Master database-2
    * galera-bundle-1   (ocf::heartbeat:galera):         Demoting database-1
    * galera-bundle-2   (ocf::heartbeat:galera):         Slave database-0


    * galera-bundle-0   (ocf::heartbeat:galera):         Promoting database-2
    * galera-bundle-1   (ocf::heartbeat:galera):         Slave database-1
    * galera-bundle-2   (ocf::heartbeat:galera):         Master database-0


    * galera-bundle-0   (ocf::heartbeat:galera):         Master database-2
    * galera-bundle-1   (ocf::heartbeat:galera):         Master database-1
    * galera-bundle-2   (ocf::heartbeat:galera):         Master database-0
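
The final state above can also be checked mechanically by counting the replicas reported as Master. A minimal sketch, assuming status lines in the format pasted in this comment are supplied on standard input: the helper name all_replicas_master and its argument (the expected number of replicas) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if exactly $1 galera replicas on stdin
# are reported in the Master role.
all_replicas_master() {
    expected="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    masters=$(grep -c "(ocf::heartbeat:galera):[[:space:]]*Master " || true)
    [ "$masters" -eq "$expected" ]
}

all_replicas_master 3 <<'EOF' && echo "all galera replicas are Master"
    * galera-bundle-0   (ocf::heartbeat:galera):         Master database-2
    * galera-bundle-1   (ocf::heartbeat:galera):         Master database-1
    * galera-bundle-2   (ocf::heartbeat:galera):         Master database-0
EOF
```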

Comment 11 errata-xmlrpc 2021-05-18 15:11:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1736
