Bug 1652752 - Master/Slave bundle resource does not failover Master state across replicas
Summary: Master/Slave bundle resource does not failover Master state across replicas
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.7
Assignee: Ken Gaillot
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks: 1652613 1654602 1658631
 
Reported: 2018-11-22 21:57 UTC by Damien Ciabrini
Modified: 2019-08-06 12:54 UTC
CC List: 7 users

Fixed In Version: pacemaker-1.1.20-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, a clone notification scheduled for a Pacemaker Remote node or bundle that was disconnected sometimes blocked Pacemaker from all further cluster actions. With this update, notifications are scheduled correctly, and a notification on a disconnected remote connection does not prevent the cluster from taking further actions. As a result, the cluster continues to manage resources correctly.
Clone Of:
Cloned to: 1654602
Environment:
Last Closed: 2019-08-06 12:53:44 UTC
Target Upstream Version:
Embargoed:


Attachments
crm_report taken after ban (430.92 KB, application/x-bzip) - 2018-11-22 21:57 UTC, Damien Ciabrini
redis config (1.37 KB, text/plain) - 2018-11-22 22:01 UTC, Damien Ciabrini


Links
Red Hat Product Errata RHBA-2019:2129 - last updated 2019-08-06 12:54:11 UTC

Description Damien Ciabrini 2018-11-22 21:57:15 UTC
Created attachment 1508125 [details]
crm_report taken after ban

Description of problem:
Given a Master/Slave resource configured with 1 Master, 2 Slaves;

When the Master resource is stopped (e.g. before a cluster node
reboot, or when the resource is banned on the node that currently
hosts the Master), no Slave takes over the Master role, and the
resource is left with no Master indefinitely.

This problem was initially observed in [1] and it affects every
Master/Slave bundle resource that we use in our OpenStack pacemaker
clusters.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1652613


Version-Release number of selected component (if applicable):
pacemaker-1.1.19-8.el7_6.1.x86_64

How reproducible:
Always

Steps to Reproduce:
For the sake of simplicity this bugzilla reproduces the original
problem with the Redis resource agent.

0. create a 3-node cluster on three hosts called ra1, ra2, ra3

1. install the redis config provided in the attachment on all cluster nodes
mkdir -p /var/lib/redis /var/log/redis /var/run/redis
# copy redis.conf attachment in /etc/redis.conf
# set file ownership for kolla container user
chown 42460:42460 /etc/redis.conf
chown 42460:42460 /var/log/redis/redis.log 
chown -R 42460:42460 /var/lib/redis
chown -R 42460:42460 /var/run/redis/
chown -R 42460:42460 /var/log/redis/

2. create a bundled redis master/slave resource on a cluster
pcs resource bundle create redis-bundle \
  container docker image=docker.io/tripleoqueens/centos-binary-redis:current-tripleo-rdo \
    network=host options="--user=root --log-driver=journald" replicas=3 masters=1 \
    run-command="/usr/sbin/pacemaker_remoted" \
  network control-port=3123 \
  storage-map id=map0 source-dir=/dev/log target-dir=/dev/log \
  storage-map id=map1 source-dir=/dev/zero target-dir=/etc/libqb/force-filesystem-sockets options=ro \
  storage-map id=map2 source-dir=/etc/hosts target-dir=/etc/hosts options=ro \
  storage-map id=map3 source-dir=/etc/localtime target-dir=/etc/localtime options=ro \
  storage-map id=map4 source-dir=/etc/redis.conf target-dir=/etc/redis.conf options=ro \
  storage-map id=map5 source-dir=/var/lib/redis target-dir=/var/lib/redis options=rw \
  storage-map id=map6 source-dir=/var/log/redis target-dir=/var/log/redis options=rw \
  storage-map id=map7 source-dir=/var/run/redis target-dir=/var/run/redis options=rw \
  storage-map id=map8 source-dir=/usr/lib/ocf target-dir=/usr/lib/ocf options=rw \
  storage-map id=pcmk1 source-dir=/var/log/pacemaker target-dir=/var/log/pacemaker options=rw \
  --disabled

pcs resource create redis ocf:heartbeat:redis wait_last_known_master=true \
  op stop interval=0s timeout=200s \
  meta container-attribute-target=host notify=true \
  bundle redis-bundle

pcs resource enable redis-bundle

3. ban the resource on the node with the master replica, e.g.:
pcs resource ban redis-bundle ra1    # assuming ra1 is the node currently hosting the Master replica
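
# optional sanity check right after step 2: the bundle should come up with one
# Master and two Slave replicas (a sketch, assuming the resource names created above)
pcs status | grep -A4 'Docker container set: redis-bundle'

# after the ban in step 3, watch whether one of the remaining replicas gets
# promoted; on the affected version no promotion ever happens
crm_mon -1 | grep -A4 'Docker container set: redis-bundle'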

Actual results:
None of the Slave replicas is promoted to Master.

Expected results:
One of the Slave replicas should have been promoted to Master.

Comment 2 Damien Ciabrini 2018-11-22 22:01:33 UTC
Created attachment 1508126 [details]
redis config

Comment 7 Ken Gaillot 2018-11-26 23:54:26 UTC
In both 7.5 and 7.6, notify actions are scheduled to run on a bundle node that has just been stopped, and so the cluster node that had the bundle connection fakes the result of the notify.

The regression in behavior is a side effect of unrelated bug fixes improving fail-safe checking of faked results.

In 7.5 in this situation, the faked result would be unconditionally processed.

In 7.6, we first check whether the node has resource info for the result being faked. Since the node was stopped, that info doesn't exist, and the fake result is not processed. This leads to the notify action being lost, so the transition is restarted, which gets into a loop doing the same thing again.

Beekhof's patch addresses the underlying (and pre-existing) issue, so the result does not need to be faked.

Comment 10 Ken Gaillot 2018-11-29 03:49:06 UTC
Fixed in upstream master branch by commit be5d23c1 (which will make it into RHEL 8), backported to upstream 1.1 branch as commit 32fac002 (which will make it into RHEL 7.7 as part of this bz)
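
For reference, a quick way to check whether an installed build already carries the backport is to compare the installed package version against the "Fixed In Version" field above, e.g.:
rpm -q pacemaker    # pacemaker-1.1.20-1.el7 or later includes the backported fix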

Comment 13 pkomarov 2019-02-25 07:19:28 UTC
Verified. Initial state:

(undercloud) [stack@undercloud-0 ~]$ ansible controller -m shell -b -a 'rpm -qa|grep pacemaker'
 [WARNING]: Found both group and host with same name: undercloud

 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm.  If you need to use command because
yum, dnf or zypper is insufficient you can add warn=False to this command task or set command_warnings=False in
ansible.cfg to get rid of this message.

controller-0 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-2 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-1 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'pcs status'

controller-1 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 06:57:02 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-1
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-1	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-1
 ip-192.168.24.14	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.12	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.21	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.23	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.4.30	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-1
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'cat /etc/*release*'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"

#check that the Master role fails over to another replica:

[root@controller-1 ~]# pcs status|grep redis
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-1	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-1



[root@controller-0 ~]# pcs resource ban redis-bundle controller-0;crm_mon
...
   redis-bundle-1	(ocf::heartbeat:redis): Demoting controller-0
...
   redis-bundle-1	(ocf::heartbeat:redis): Slave controller-0
...
   redis-bundle-1	(ocf::heartbeat:redis): Stopped controller-0
...
...	   redis-bundle-1	(ocf::heartbeat:redis): Stopped

...   redis-bundle-0	(ocf::heartbeat:redis): Master controller-2
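
Cleanup note: the ban above leaves a location constraint behind; to remove it afterwards one would typically run something like:
pcs resource clear redis-bundle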

Comment 15 errata-xmlrpc 2019-08-06 12:53:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129
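
To pick up the fixed packages from that advisory on a RHEL 7 system, something like the following should be enough (a sketch, assuming the relevant RHEL 7 repositories are enabled):
yum update pacemaker pacemaker-cli pacemaker-libs pacemaker-cluster-libs pacemaker-remote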

