Bug 1459503 - OpenStack is not compatible with pcs management of remote and guest nodes
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assigned To: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
Reported: 2017-06-07 06:11 EDT by Michele Baldessari
Modified: 2017-08-10 12:59 EDT
Fixed In Version: pcs-0.9.158-5.el7
Last Closed: 2017-08-01 14:26:07 EDT
Type: Bug


Attachments
proposed fix (6.68 KB, patch)
2017-06-08 04:46 EDT, Ivan Devat

Description Michele Baldessari 2017-06-07 06:11:39 EDT
Currently in OSP we create pacemaker remote resources the 'old' way with puppet:
1) We generate a /etc/pacemaker/authkey key on all nodes (full nodes and remote ones)
2) We start and enable pacemaker_remote on the remote nodes
3) We create the remote resource via:
pcs resource create <remote-name> ocf:pacemaker:remote server=1.2.3.4 reconnect_interval=30 op monitor interval=20
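
For reference, steps 1 and 2 boil down to something like the following (a minimal sketch; the hostname and exact key size are illustrative, puppet does the equivalent for us):

# 1) generate a shared key and copy it to every node, full and remote alike
mkdir -p /etc/pacemaker
dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1
scp /etc/pacemaker/authkey remote-node-1:/etc/pacemaker/authkey

# 2) start and enable the pacemaker_remote daemon on the remote nodes
ssh remote-node-1 'systemctl start pacemaker_remote && systemctl enable pacemaker_remote'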

Starting with pcs-0.9.158 this no longer works. Here is the list of issues we have found so far:
1) [minor] # pcs resource create test ocf:pacemaker:remote server=1.2.3.4
Error: this command is not sufficient for creating a remote connection, use 'pcs cluster node add-remote', use --force to override

So if we add --force to this command it will work.

2) [major] If the /etc/pacemaker/authkey file was precreated via puppet, pcs appears to rewrite it unconditionally during cluster setup.
2.1) Here the file was created with puppet as usual:
2017-06-07 08:45:14 +0000 /Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]/ensure (notice): defined content as '{md5}a0674b1598979cb2c8d9f7d3d5014f01'
2.2) pcsd went ahead and created its own (shorter) key:
I, [2017-06-07T08:45:18.151270 #15638]  INFO -- : {"files":{"pacemaker_remote authkey":{"code":"written","message":""}}}

So at this stage the keys on the cluster nodes (rewritten by pcsd) no longer match the keys on the remotes (untouched, since pcsd is not running there), and the remote resources stay in Stopped state due to authentication failures.
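
A quick way to see the mismatch (hostnames hypothetical) is to compare checksums of the key across all nodes; they must be identical everywhere for the remote connection to authenticate:

# the md5 sums must match on every node for the remote connection to work
for h in controller-0 overcloud-novacompute-0; do
    ssh "$h" 'md5sum /etc/pacemaker/authkey'
done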

It seems to me that cluster_setup in pcs/cluster.py unconditionally generates and distributes a new authkey on every run, even when one already exists: https://github.com/ClusterLabs/pcs/blob/master/pcs/cluster.py#L461

We're fine with working around 1) but 2) is quite problematic at this stage.
In the longer term we can move to the new commands for setting up remote nodes, but that is a larger chunk of work at this stage and rather risky, given that already-released OSP versions are free to update to the latest RHEL (7.4) packages once they ship.
Comment 2 Tomas Jelinek 2017-06-07 07:05:39 EDT
Pcs does not break pcmk remote setup. Pcs implements management of pcmk remote nodes. This is a feature requested among others by OpenStack: bz1176018

There are new commands "pcs cluster node add-remote" and "pcs cluster node add-guest". These not only edit the CIB but also distribute the pcmk authkey to the new nodes and start and enable the pcmk remote daemon, as requested. For the commands to work, pcsd must be running on the remote / guest nodes.

Also "pcs cluster setup" creates a pcmk authkey and sends it to all nodes. So later when a remote node is added the key is only sent to the new node. This way there is no need for all the nodes to be online when adding a remote node.
Comment 3 Tomas Jelinek 2017-06-07 07:09:07 EDT
We can do a downstream patch for 7.4 which will:
* automatically force pcs resource create ocf:pacemaker:remote
* not generate new pcmk authkey in cluster setup if one already exists
Comment 4 Michele Baldessari 2017-06-07 07:53:04 EDT
(In reply to Tomas Jelinek from comment #2)
> Pcs does not break pcmk remote setup. Pcs implements management of pcmk
> remote nodes. This is a feature requested among others by OpenStack:
> bz1176018

Thanks Tomas, I realize the new feature is what prompted this change.
While it does not break the new way of creating remote nodes, it does break the older, documented way of setting up pacemaker remote nodes (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/High_Availability_Add-On_Reference/index.html#pacemaker_remote). These are not really workarounds; it's what we have documented for quite some time.

Don't get me wrong, I like the change and the overall direction, but we are
just not able to change everything under the hood in one pcs release (OSP cycle).
This would break anyone who uses puppet or ansible with pcs to set up a cluster with remotes; it's not really OSP-specific as such.

> There are new commands "pcs cluster node add-remote" and "pcs cluster node
> add-guest". These not only edit cib but also distribute pcmk authkey to new
> nodes and start and enable pcmk remote daemon as requested. For the commands
> to work pcsd must run on remote / guest nodes.
> 
> Also "pcs cluster setup" creates a pcmk authkey and sends it to all nodes.
> So later when a remote node is added the key is only sent to the new node.
> This way there is no need for all the nodes to be online when adding a
> remote node.

Ack, yes, the feature is very nice in itself. I think if pcs just doesn't rewrite /etc/pacemaker/authkey when it already exists, that should do it (at least for us) for the OSP case. If you're super swamped I can give it a shot as well, just ping me.

Thanks for all your help as usual,
Michele
Comment 5 Tomas Jelinek 2017-06-07 08:19:58 EDT
(In reply to Michele Baldessari from comment #4)
> These are not really workarounds, it's what we have documented for quite
> some time.

By "workaround" I meant workaround for a state when pcs does not provide full support for remote nodes.

> This would break anyone that uses puppet or ansible with pcs to set up a
> cluster with remotes, it's not really OSP specific as such.

Not necessarily. If the authkey is distributed by puppet or ansible after cluster setup is done, everything should work as before.
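
For example, an arrangement like this keeps working (an illustrative sketch, not anyone's actual automation; the ansible invocation is an assumption):

# cluster setup first -- pcsd writes its own authkey at this point ...
pcs cluster setup --name devcluster node1 node2

# ... then push the externally managed key afterwards, so that all nodes,
# including the remotes, end up with the same key
ansible all -m copy -a 'src=authkey dest=/etc/pacemaker/authkey owner=hacluster group=haclient mode=0400'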

Ad --force in resource create:
If it only emitted a warning, the user would have to delete the new node just to create it with the new command.
Comment 6 Michele Baldessari 2017-06-07 10:10:35 EDT
(In reply to Tomas Jelinek from comment #5)
> (In reply to Michele Baldessari from comment #4)
> > These are not really workarounds, it's what we have documented for quite
> > some time.
> 
> By "workaround" I meant workaround for a state when pcs does not provide
> full support for remote nodes.

Right, the subject implies that we're doing something hacky, which is not the case (this time ;).
 
> > This would break anyone that uses puppet or ansible with pcs to set up a
> > cluster with remotes, it's not really OSP specific as such.
> 
> Not necessarily. If the authkey is distributed by puppet or ansible after
> cluster setup is done, everything should work as before.

Right, but it's definitely a change in requirement / behaviour that does break existing automation.

> Ad --force in resource create:
> If it only emitted a warning, the user would have to delete the new node
> just to create it with the new command.

I am just saying that if you fail the remote creation without --force, we're fine with that.

I have tested the patch Ivan gave me; it works as expected and I am able to create a pacemaker remote resource.

Thanks again for your quick help!
Comment 7 Ivan Devat 2017-06-08 04:46 EDT
Created attachment 1286082 [details]
proposed fix
Comment 8 Ivan Devat 2017-06-08 04:50:09 EDT
Tests:

After Fix:

> 1) setup reuses existing pacemaker authkey

[vm-rhel72-1 ~] $ cat /etc/pacemaker/authkey
existing authkey content

[vm-rhel72-1 ~] $ pcs cluster setup --name=devcluster vm-rhel72-1 vm-rhel72-3
Destroying cluster on nodes: vm-rhel72-1, vm-rhel72-3...
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Successfully destroyed cluster
vm-rhel72-3: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'vm-rhel72-1', 'vm-rhel72-3'
vm-rhel72-1: successful distribution of the file 'pacemaker_remote authkey'
vm-rhel72-3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
vm-rhel72-1: Succeeded
vm-rhel72-3: Succeeded

Synchronizing pcsd certificates on nodes vm-rhel72-1, vm-rhel72-3...
vm-rhel72-3: Success
vm-rhel72-1: Success
Restarting pcsd on the nodes in order to reload the certificates...
vm-rhel72-1: Success
vm-rhel72-3: Success

[vm-rhel72-1 ~] $ cat /etc/pacemaker/authkey
existing authkey content
[vm-rhel72-1 ~] $ ssh vm-rhel72-3 'cat /etc/pacemaker/authkey'
existing authkey content


> 2) allow creating a remote / guest resource without --force

[vm-rhel72-1 ~] $ pcs resource create RN ocf:pacemaker:remote
Warning: this command is not sufficient for creating a remote connection, use 'pcs cluster node add-remote'
[vm-rhel72-1 ~] $ echo $?
0

[vm-rhel72-1 ~] $ pcs resource create R ocf:heartbeat:Dummy meta remote-node="vm-rhel72-2"
Warning: this command is not sufficient for creating a guest node, use 'pcs cluster node add-guest'
[vm-rhel72-1 ~] $ echo $?
0

[vm-rhel72-1 ~] $ pcs resource update R meta remote-node=
Warning: this command is not sufficient for removing a guest node, use 'pcs cluster node remove-guest'
[vm-rhel72-1 ~] $ echo $?
0

[vm-rhel72-1 ~] $ pcs resource meta R remote-node="vm-rhel72-2"
Warning: this command is not sufficient for creating a guest node, use 'pcs cluster node add-guest'
[vm-rhel72-1 ~] $ echo $?
0
Comment 11 Damien Ciabrini 2017-06-21 12:05:27 EDT
Additionally, Michele Baldessari and I are using the features from this build for OpenStack upstream, so I can say that it's working as expected for us.

We have a puppet-based scenario that relies on puppet-pacemaker [1] to deploy an HA OpenStack overcloud on pacemaker remote nodes.

After the fix, the deployment passes as expected, and we can validate that the existing key generated by puppet in /etc/pacemaker/authkey is the one used to initialize the pacemaker remote nodes in the cluster.

We also validate that we don't need the --force flag to successfully create a remote resource.


[1] https://github.com/openstack/puppet-pacemaker/blob/master/manifests/resource/remote.pp
Comment 13 errata-xmlrpc 2017-08-01 14:26:07 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1958
