Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1726680

Summary:	OSP 14->15: puppet-pacemaker seems to not run `pcs host auth` during upgrade
Product:	Red Hat OpenStack	Reporter:	Jiri Stransky <jstransk>
Component:	puppet-pacemaker	Assignee:	Michele Baldessari <michele>
Status:	CLOSED EOL	QA Contact:	nlevinki <nlevinki>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	15.0 (Stein)	CC:	jjoyce, jschluet, michele, slinaber, stchen, tvignaud
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1817702		Environment:
Last Closed:	2020-09-30 19:48:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1727807, 1817702

Description Jiri Stransky 2019-07-03 12:21:25 UTC

When upgrading controller-0 from 14 to 15, upgrade fails on step 1 of deployment tasks:

TASK [Debug output for task: Run puppet host configuration for step 1] *********

.. snip ..

        "<13>Jun 24 11:38:27 puppet-user: Error: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: change from 'notrun' to ['0'] failed: '/sbin/pcs cluster setup tripleo_cluster controller-0 addr=172.17.1.41' returned 1 instead of one of [0]",

When run manually on the node:

[root@controller-0 ~]# /sbin/pcs cluster setup tripleo_cluster controller-0 addr=172.17.1.41
Error: Host 'controller-0' is not known to pcs, try to authenticate the host using 'pcs host auth controller-0' command
Error: None of hosts is known to pcs.
Error: Errors have occurred, therefore pcs is unable to continue

After running `pcs host auth controller-0` and providing the user+password taken from hiera and re-running the upgrade step, the cluster gets created.
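
The manual workaround could be scripted roughly as follows. This is only a sketch: the hiera key `hacluster_pwd`, the config path `/etc/puppet/hiera.yaml`, the `hacluster` user name, and the helper name `auth_host` are assumptions about this deployment, not details confirmed in this report.

```shell
#!/bin/sh
# Sketch of the manual workaround. Assumptions (not confirmed in this bug):
# the password lives under the hiera key "hacluster_pwd", hiera is configured
# in /etc/puppet/hiera.yaml, and the pcs user is "hacluster".

HIERA_CONFIG="${HIERA_CONFIG:-/etc/puppet/hiera.yaml}"
HIERA_KEY="${HIERA_KEY:-hacluster_pwd}"

# Authenticate the local pcs against pcsd on the given host.
auth_host() {
    host="$1"
    # Look up the password; give up if the lookup fails.
    password="$(hiera -c "$HIERA_CONFIG" "$HIERA_KEY")" || return 1
    pcs host auth "$host" -u hacluster -p "$password"
}

# Usage (as root on the controller):
#   auth_host controller-0
```

Passing the password with `-p` exposes it in the process list, so this is probably only acceptable for a short-lived workaround.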


Even later, when I started scaling the upgraded controllers up to 2 and 3 nodes, I still had to run `pcs host auth <node>` manually.

Our WIP upgrade workflow is:

https://gitlab.cee.redhat.com/osp15/osp-upgrade-el8

Comment 1 Jiri Stransky 2019-07-03 16:16:33 UTC

I think in this BZ it's not immediately obvious why exactly the auth doesn't happen. I should upload more logs when I manage to reproduce it.

Comment 4 Jiri Stransky 2019-07-09 16:13:35 UTC

I think I can see why it's failing. The auth resources in Puppet are "refreshonly" and they're triggered by creation of the "hacluster" user. However, on upgrade this user most likely already exists, so the auth is never triggered. I'll check whether deleting the "hacluster" user before `upgrade run` is enough to work around it.
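
A quick way to test this hypothesis on a node is to check whether the account already exists before `upgrade run`. The sketch below is only illustrative: the helper names `user_exists` and `report_auth_trigger` are made up here, and the messages simply restate the assumption that user creation is the only trigger for the auth resources.

```shell
#!/bin/sh
# Returns 0 if the account exists, non-zero otherwise.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Print what the hypothesis above predicts for the given account.
report_auth_trigger() {
    if user_exists "$1"; then
        echo "$1 exists: user creation will not trigger the refreshonly auth"
    else
        echo "$1 missing: user creation would trigger the refreshonly auth"
    fi
}

# Usage:
#   report_auth_trigger hacluster
```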

Comment 5 Michele Baldessari 2019-07-10 07:01:05 UTC

Right, but since hacluster is already authenticated, I wonder why pcs is not happy with that and needs to rerun the auth. I guess it might be related to pcmk2/corosync3 requiring some info around interfaces due to the introduction of knet?

No matter the reason, let's assume that we need indeed to run some 'pcs host auth'. If I rewrite the init code to use that as an authentication system (which is probably a good idea no matter what), I think we still hit this issue because:
- RHEL7/OSP < 15 will have had the cluster set up the old way
- RHEL8/OSP >=15 with the new way
- The upgrade is still likely to hit this issue.

So am thinking that we need to:
A) Indeed rewrite the init code to use pcs host auth
B) somehow add an additional trigger on updates that refreshes the pcs host auth resource in puppet.

Comment 6 Jiri Stransky 2019-07-10 08:43:05 UTC

Re being already authenticated -- before running the upgrade from RHEL 7 to 8 via Leapp, we run not only `pcs cluster stop` on the upgraded node, but also `pcs cluster destroy`, otherwise Leapp will refuse to perform the upgrade. That is probably the reason we need to re-run the auth?

Comment 7 Michele Baldessari 2019-07-10 09:13:43 UTC

Ah boom yes, if we do that in leapp that explains it perfectly (sorry I am still a complete leapp newbie). Then we 'just' need to retrigger the auth code in this case somehow. I don't think that rewriting the init code to use pcs host auth will do much here, we want to retrigger the current auth codepath.
The easiest way is, as you did, to remove the hacluster user (or tweak it). I wonder how we could be smarter in this situation: is there a simple way to detect that we just started after a Leapp upgrade, which we could leverage for triggering the pcs auth codepaths? Or should we use a variable, maybe? Other ideas?

Comment 8 Jiri Stransky 2019-07-10 09:32:39 UTC

Can we somehow ask pacemaker if it knows about a particular node? If we don't have a CLI for it maybe we could come up with some way to look at some pacemaker files directly, if there's a relatively dependable way to do it? (The CIB XML or something in that sense...) That would probably lead to the best (most reliably idempotent) solution.
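
One possible shape for such a check, sketched under assumptions: it asks pcs rather than pacemaker, and it assumes pcs 0.10 records authenticated hosts as JSON keys in `/var/lib/pcsd/known-hosts`. The helper names `host_known_to_pcs` and `ensure_host_auth` are hypothetical, and the match is a crude text search, not a JSON parser.

```shell
#!/bin/sh
# Assumption: pcs 0.10 stores authenticated hosts in this JSON file.
KNOWN_HOSTS_FILE="${KNOWN_HOSTS_FILE:-/var/lib/pcsd/known-hosts}"

# Returns 0 if the host name appears as a quoted JSON key in the file.
# Crude: the name is used as a regex, so a "." in it matches any character.
host_known_to_pcs() {
    [ -r "$KNOWN_HOSTS_FILE" ] &&
        grep -q "\"$1\"[[:space:]]*:" "$KNOWN_HOSTS_FILE"
}

# Run the given command only when the host is not yet known to pcs.
ensure_host_auth() {
    host="$1"
    shift
    if host_known_to_pcs "$host"; then
        echo "$host already known to pcs, skipping auth"
    else
        "$@"
    fi
}

# Usage (as root):
#   ensure_host_auth controller-0 pcs host auth controller-0 -u hacluster -p "$password"
```

Because the guard looks at current state instead of a "just ran Leapp" flag, re-running it after a partial failure should be harmless.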

I'm not very keen to inject a variable for "are we running right after Leapp now" because that will disincentivize coming up with idempotent solutions, which can in turn increase the occurrence of bugs like "this command failed due to an env issue, I've fixed the root cause, but on re-run of the command I now keep hitting an unrelated unfixable error".

As a last resort, we could really just delete the hacluster user at the same time we're running `pcs cluster destroy`, and rely on the current partially-idempotent solution?

Comment 9 Jiri Stransky 2019-07-11 12:04:59 UTC

Correction of the workaround as debugged and discussed today with Michele: We should work around *in dev/test environments only* by setting the hacluster password to something other than what Puppet set. This will force the refreshonly Puppet resources to run, but will not result in changing the hacluster UID/GID. Deleting the user would change UID/GID and cause bug 1728678.
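
A minimal sketch of that dev/test-only workaround, assuming the standard `chpasswd` tool is available on the node. The helper name `scramble_password` and the throwaway value are illustrative; the existing account is kept, so its UID/GID are not touched.

```shell
#!/bin/sh
# DEV/TEST ONLY: set the account password to a throwaway value.
# The existing account is kept, so its UID/GID stay the same.
scramble_password() {
    user="$1"
    new_password="$2"
    # chpasswd reads "user:password" pairs from standard input.
    printf '%s:%s\n' "$user" "$new_password" | chpasswd
}

# Usage (as root, before re-running the upgrade step):
#   scramble_password hacluster "some-throwaway-value"
```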

Comment 10 Michele Baldessari 2019-07-16 11:20:46 UTC

(In reply to Michele Baldessari from comment #5)
> Right but since hacluster is already authenticated, I wonder why pcs is not
> happy with that and needs to rerun the auth. I guess it might be related to
> pcmk2/corosync3 requiring some info around interfaces due to the introduction
> of knet?
> 
> No matter the reason, let's assume that we need indeed to run some 'pcs host
> auth'. If I rewrite the init code to use that as an authentication system
> (which is probably a good idea no matter what), I think we still hit this
> issue because:
> - RHEL7/OSP < 15 will have had the cluster set up the old way
> - RHEL8/OSP >=15 with the new way
> - The upgrade is still likely to hit this issue.
> 
> So am thinking that we need to:
> A) Indeed rewrite the init code to use pcs host auth

Exhibit number 1) for bandini having gone completely senile. I had done this a few months ago already:
https://github.com/openstack/puppet-pacemaker/commit/c7b21ee0cbe0c48c678c6b4c63bcf314fb9a4f32

> B) somehow add an additional trigger on updates that refreshes the pcs host
> auth resource in puppet.

I guess here we either add a comment to the user resource that will trigger the resource refresh (and does not
change the password) or, since we are in the presence of scaling up, we try to use the knowledge of the number
of existing nodes to trigger this as well. I am a bit worried that the latter could be a bit more fragile given all the
cases this has to work in and be idempotent (on the other hand, I don't think triggering an extra spurious host auth
is the end of the world).

Comment 11 Shelley Dunne 2019-09-19 18:29:50 UTC

Re-setting Target Milestone z1 to --- to begin the 15z1 Maintenance Release.

Comment 14 stchen 2020-09-30 19:48:52 UTC

Closing EOL, OSP 15 has been retired as of Sept 19, 2020