Bug 1459503
| Summary: | OpenStack is not compatible with pcs management of remote and guest nodes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Michele Baldessari <michele> |
| Component: | pcs | Assignee: | Tomas Jelinek <tojeline> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.4 | CC: | cfeist, chjones, cluster-maint, dciabrin, fdinitto, idevat, mkrcmari, omular, royoung, tojeline |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | pcs-0.9.158-5.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 18:26:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Michele Baldessari, 2017-06-07 10:11:39 UTC)
Tomas Jelinek (comment #2):

Pcs does not break pcmk remote setup. Pcs implements management of pcmk remote nodes. This is a feature requested, among others, by OpenStack: bz1176018.

There are new commands "pcs cluster node add-remote" and "pcs cluster node add-guest". These not only edit the CIB but also distribute the pcmk authkey to new nodes and start and enable the pcmk remote daemon as requested. For the commands to work, pcsd must run on the remote / guest nodes.

Also, "pcs cluster setup" creates a pcmk authkey and sends it to all nodes, so when a remote node is added later, the key only needs to be sent to the new node. This way there is no need for all the nodes to be online when adding a remote node.

We can do a downstream patch for 7.4 which will:
* automatically force "pcs resource create ocf:pacemaker:remote"
* not generate a new pcmk authkey in "pcs cluster setup" if one already exists

---

Michele Baldessari (comment #4):

(In reply to Tomas Jelinek from comment #2)
> Pcs does not break pcmk remote setup. Pcs implements management of pcmk
> remote nodes. This is a feature requested among others by OpenStack:
> bz1176018

Thanks Tomas, I realize the new feature is what prompted this change. While it does not break the new way of creating remote nodes, it does break the older, documented way of setting up pacemaker remote nodes (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/High_Availability_Add-On_Reference/index.html#pacemaker_remote). These are not really workarounds; it's what we have documented for quite some time. Don't get me wrong, I like the change and the overall direction, but we are just not able to change everything under the hood in one pcs release (OSP cycle). This would break anyone that uses puppet or ansible with pcs to set up a cluster with remotes; it's not really OSP specific as such.

> There are new commands "pcs cluster node add-remote" and "pcs cluster node
> add-guest". These not only edit the CIB but also distribute the pcmk authkey
> to new nodes and start and enable the pcmk remote daemon as requested. For
> the commands to work, pcsd must run on the remote / guest nodes.
>
> Also, "pcs cluster setup" creates a pcmk authkey and sends it to all nodes.
> So later, when a remote node is added, the key is only sent to the new node.
> This way there is no need for all the nodes to be online when adding a
> remote node.

Ack, yes, the feature is very nice in itself. I think if we can just not rewrite /etc/pacemaker/authkey when it already exists, that should do it (at least for us) for the OSP case. If you're super swamped I can give it a shot as well, just ping me.

Thanks for all your help as usual,
Michele

---

Tomas Jelinek (comment #5):

(In reply to Michele Baldessari from comment #4)
> These are not really workarounds, it's what we have documented for quite
> some time.

By "workaround" I meant a workaround for a state in which pcs did not provide full support for remote nodes.

> This would break anyone that uses puppet or ansible with pcs to set up a
> cluster with remotes, it's not really OSP specific as such.

Not necessarily. If the authkey is distributed by puppet or ansible after cluster setup is done, everything should work as before.

Regarding --force in resource create: if it only emitted a warning, the user would have to delete the new node just to create it with the new command.

---

Michele Baldessari (comment #6):

(In reply to Tomas Jelinek from comment #5)
> By "workaround" I meant a workaround for a state in which pcs did not
> provide full support for remote nodes.

Right, the subject implies that we're doing something hacky, which is not the case (this time ;).

> Not necessarily. If the authkey is distributed by puppet or ansible after
> cluster setup is done, everything should work as before.

Right, but it's definitely a change in requirements / behaviour that does break existing automation.

> Regarding --force in resource create: if it only emitted a warning, the
> user would have to delete the new node just to create it with the new
> command.

I am just saying that if you fail the remote creation without --force, we're fine with that.

---

Michele Baldessari:

I have tested the patch Ivan gave me and it works as expected; I am able to create a pacemaker remote resource. Thanks again for your quick help!

---

Created attachment 1286082 [details]
proposed fix
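The behaviour the fix is described to implement, reusing an existing pacemaker authkey rather than overwriting it during "pcs cluster setup", can be sketched roughly as follows. This is an illustration only, not the actual pcs implementation (pcs is written in Python); the demo path and the 384-byte key size are arbitrary choices here so the sketch runs unprivileged, whereas the real path discussed in this bug is /etc/pacemaker/authkey.

```shell
#!/bin/sh
# Sketch of the fixed authkey handling: generate a new random key only
# when none exists, otherwise reuse the key already placed there (e.g.
# by puppet or ansible, as in the OSP scenario above).
AUTHKEY="${AUTHKEY:-./pacemaker-demo/authkey}"

if [ -s "$AUTHKEY" ]; then
    echo "reusing existing authkey at $AUTHKEY"
else
    mkdir -p "$(dirname "$AUTHKEY")"
    # 384 random bytes; the size is arbitrary for this sketch.
    dd if=/dev/urandom of="$AUTHKEY" bs=384 count=1 2>/dev/null
    # The key is a shared secret, so keep permissions restrictive.
    chmod 600 "$AUTHKEY"
    echo "generated new authkey at $AUTHKEY"
fi
# Either way, the key that now exists is what gets distributed to the
# cluster nodes, and later to any remote / guest node that is added.
```

Running it a second time takes the first branch and leaves the key untouched, which is exactly why automation that pre-creates the key keeps working.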
Tests after fix:

1) setup reuses existing pacemaker authkey

```
[vm-rhel72-1 ~] $ cat /etc/pacemaker/authkey
existing authkey content
[vm-rhel72-1 ~] $ pcs cluster setup --name=devcluster vm-rhel72-1 vm-rhel72-3
Destroying cluster on nodes: vm-rhel72-1, vm-rhel72-3...
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Successfully destroyed cluster
vm-rhel72-3: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'vm-rhel72-1', 'vm-rhel72-3'
vm-rhel72-1: successful distribution of the file 'pacemaker_remote authkey'
vm-rhel72-3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
vm-rhel72-1: Succeeded
vm-rhel72-3: Succeeded
Synchronizing pcsd certificates on nodes vm-rhel72-1, vm-rhel72-3...
vm-rhel72-3: Success
vm-rhel72-1: Success
Restarting pcsd on the nodes in order to reload the certificates...
vm-rhel72-1: Success
vm-rhel72-3: Success
[vm-rhel72-1 ~] $ cat /etc/pacemaker/authkey
existing authkey content
[vm-rhel72-1 ~] $ ssh vm-rhel72-3 'cat /etc/pacemaker/authkey'
existing authkey content
```

2) allow creating a remote / guest resource without --force

```
[vm-rhel72-1 ~] $ pcs resource create RN ocf:pacemaker:remote
Warning: this command is not sufficient for creating a remote connection, use 'pcs cluster node add-remote'
[vm-rhel72-1 ~] $ echo $?
0
[vm-rhel72-1 ~] $ pcs resource create R ocf:heartbeat:Dummy meta remote-node="vm-rhel72-2"
Warning: this command is not sufficient for creating a guest node, use 'pcs cluster node add-guest'
[vm-rhel72-1 ~] $ echo $?
0
[vm-rhel72-1 ~] $ pcs resource update R meta remote-node=
Warning: this command is not sufficient for removing a guest node, use 'pcs cluster node remove-guest'
[vm-rhel72-1 ~] $ echo $?
0
[vm-rhel72-1 ~] $ pcs resource meta R remote-node="vm-rhel72-2"
Warning: this command is not sufficient for creating a guest node, use 'pcs cluster node add-guest'
[vm-rhel72-1 ~] $ echo $?
0
```

---

Additionally, Michele Baldessari and myself are using the features from this build for OpenStack upstream, so I can say that it's working as expected for us. We have a puppet-based scenario that relies on puppet-pacemaker [1] to deploy an HA OpenStack overcloud on pacemaker remote nodes. After the fix, the deploy passes as expected, and we can validate that the existing key generated by puppet in /etc/pacemaker/authkey is the one used to initialize the pacemaker remote nodes in the cluster. We also validate that we don't need the --force flag to successfully create a remote resource.

[1] https://github.com/openstack/puppet-pacemaker/blob/master/manifests/resource/remote.pp

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1958
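As a recap of the automation scenario discussed in this thread: tooling such as puppet-pacemaker writes the authkey itself before creating the remote resource, and with the fixed pcs that pre-seeded key is reused and the resource can be created without --force. A hedged sketch of that sequence follows; the demo directory stands in for /etc/pacemaker so it runs unprivileged, dd from /dev/urandom is one common way to generate the key (not necessarily what puppet-pacemaker does), and the steps that need a live cluster are shown only as comments.

```shell
#!/bin/sh
# Illustrative pre-seeding step run by automation before "pcs cluster setup".
set -e

AUTHKEY_DIR="${AUTHKEY_DIR:-./pacemaker-demo/etc/pacemaker}"
mkdir -p "$AUTHKEY_DIR"

if [ ! -s "$AUTHKEY_DIR/authkey" ]; then
    # Generate a key only if the automation has not placed one already.
    dd if=/dev/urandom of="$AUTHKEY_DIR/authkey" bs=384 count=1 2>/dev/null
    chmod 600 "$AUTHKEY_DIR/authkey"
fi

# On a real deployment the same key is then copied to the remote node, e.g.:
#   scp /etc/pacemaker/authkey vm-rhel72-2:/etc/pacemaker/authkey
# and the remote resource is created without --force (per this fix):
#   pcs resource create RN ocf:pacemaker:remote
```

With pcs-0.9.158-5.el7, "pcs cluster setup" distributes this existing key instead of generating a new one, so the key placed by automation and the key used by the cluster stay identical.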