Bug 1726680
| Summary: | OSP 14->15: puppet-pacemaker seems to not run `pcs host auth` during upgrade | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jiri Stransky <jstransk> | |
| Component: | puppet-pacemaker | Assignee: | Michele Baldessari <michele> | |
| Status: | CLOSED EOL | QA Contact: | nlevinki <nlevinki> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 15.0 (Stein) | CC: | jjoyce, jschluet, michele, slinaber, stchen, tvignaud | |
| Target Milestone: | --- | Keywords: | Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1817702 | Environment: | ||
| Last Closed: | 2020-09-30 19:48:52 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1727807, 1817702 | |||
I think in this BZ it's not immediately obvious why exactly the auth doesn't happen. I should upload more logs when I manage to reproduce it.

I think I can see why it's failing. The auth resources in Puppet are "refreshonly" and they're triggered by creation of the "hacluster" user. However, on upgrade this user most likely already exists, so the auth is never triggered. I'll check whether deleting the "hacluster" user before `upgrade run` is enough to work around it.

Right, but since hacluster is already authenticated, I wonder why pcs is not happy with that and needs to rerun the auth. I guess it might be related to pcmk2/corosync3 requiring some info around interfaces due to the introduction of knet?

No matter the reason, let's assume that we do indeed need to run some 'pcs host auth'. If I rewrite the init code to use that as an authentication system (which is probably a good idea no matter what), I think we still hit this issue because:
- RHEL7/OSP < 15 will have had the cluster set up the old way
- RHEL8/OSP >= 15 with the new way
- The upgrade is still likely to hit this issue.

So I am thinking that we need to:
A) Indeed rewrite the init code to use pcs host auth
B) Somehow add an additional trigger on updates that refreshes the pcs host auth resource in puppet.

Re being already authenticated: before running the upgrade from RHEL 7 to 8 via Leapp, we run not only `pcs cluster stop` on the upgraded node, but also `pcs cluster destroy`, otherwise Leapp will refuse to perform the upgrade. That is probably the reason we need to re-run the auth?

Ah boom yes, if we do that in leapp that explains it perfectly (sorry, I am still a complete leapp newbie). Then we 'just' need to retrigger the auth code in this case somehow. I don't think that rewriting the init code to use pcs host auth will do much here; we want to retrigger the current auth codepath. The easiest way is as you did, by removing the hacluster user (or tweaking it).

I wonder how we could be smarter in this situation. Is there a simple way to detect that we just started after a leapp upgrade that we could leverage for triggering the pcs auth codepaths? Or should we use a variable maybe? Other ideas?

Can we somehow ask pacemaker if it knows about a particular node? If we don't have a CLI for it, maybe we could come up with some way to look at some pacemaker files directly, if there's a relatively dependable way to do it? (The CIB XML or something in that sense...) That would probably lead to the best (most reliably idempotent) solution. I'm not very keen to inject a variable for "are we running right after Leapp now", because that will disincentivize coming up with idempotent solutions, which can in turn increase the occurrence of bugs like "this command failed due to an env issue, I've fixed the root cause, but on re-run of the command I now keep hitting an unrelated unfixable error".

As a last resort, we could really just delete the hacluster user at the same time we're running `pcs cluster destroy`, and rely on the current partially-idempotent solution?

Correction of the workaround as debugged and discussed today with Michele: we should work around this *in dev/test environments only* by setting the hacluster password to something other than what Puppet set. This will force the refreshonly Puppet resources to run, but will not result in changing the hacluster UID/GID. Deleting the user would change the UID/GID and cause bug 1728678.

(In reply to Michele Baldessari from comment #5)
> Right but since hacluster is already authenticated, I wonder why pcs is not
> happy with that and needs rerun the auth. I guess it might be related to
> pcmk2/corosync3 requiring some info around interfaces due to the introduction
> of knet?
>
> No matter the reason, let's assume that we need indeed to run some 'pcs host
> auth'. If I rewrite the init code to use that as an authentication system
> (which is probably a good idea no matter what), I think we still hit this
> issue because:
> - RHEL7/OSP < 15 will have had the cluster set up the old way
> - RHEL8/OSP >=15 with the new way
> - The upgrade is still likely to hit this issue.
>
> So am thinking that we need to:
> A) Indeed rewrite the init code to use pcs host auth

Exhibit number 1) for bandini having gone completely senile. I had done this a few months ago already: https://github.com/openstack/puppet-pacemaker/commit/c7b21ee0cbe0c48c678c6b4c63bcf314fb9a4f32

> B) somehow add an additional trigger on updates that refreshes the pcs host
> auth resource in puppet.

I guess here we either add a comment to the user resource that will trigger the resource refresh (and does not change the password) or, since we are in the presence of scaling up, we try to use the knowledge of the number of existing nodes to trigger this as well. I am a bit worried that the latter could be a bit more fragile given all the cases this has to work in and be idempotent (on the other hand, I don't think triggering an extra spurious host auth is the end of the world).

Re-setting Target Milestone z1 to --- to begin the 15z1 Maintenance Release.

Closing EOL, OSP 15 has been retired as of Sept 19, 2020
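On the question raised in the comments of checking directly whether a node is already known before re-running the auth, a rough sketch of such an idempotent check might look like the following. It rests on an assumption this report does not confirm: that pcsd keeps the hosts it holds tokens for in a JSON file keyed by host name (the default path used here, `/var/lib/pcsd/known-hosts`, is that assumption). The function name `host_known_to_pcs` and the `KNOWN_HOSTS_FILE` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: pcsd stores authenticated hosts in a JSON file
# in which each host name appears as an object key. The path and layout
# are not confirmed by this bug report; override KNOWN_HOSTS_FILE as needed.
KNOWN_HOSTS_FILE="${KNOWN_HOSTS_FILE:-/var/lib/pcsd/known-hosts}"

# Returns 0 if the host name appears as a JSON key in the file, 1 otherwise.
# This is a plain fixed-string match, not a real JSON parse.
host_known_to_pcs() {
    local host="$1"
    [ -r "$KNOWN_HOSTS_FILE" ] || return 1
    grep -Fq "\"${host}\":" "$KNOWN_HOSTS_FILE"
}
```

A caller could then run `host_known_to_pcs controller-0 || pcs host auth controller-0`, which stays safe to repeat. A string match can give false positives (for example, a host named like another JSON key), so a real implementation would probably want a proper JSON parser.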
When upgrading controller-0 from 14 to 15, the upgrade fails on step 1 of the deployment tasks:

    TASK [Debug output for task: Run puppet host configuration for step 1] *********
    .. snip ..
    "<13>Jun 24 11:38:27 puppet-user: Error: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: change from 'notrun' to ['0'] failed: '/sbin/pcs cluster setup tripleo_cluster controller-0 addr=172.17.1.41' returned 1 instead of one of [0]",

When run manually on the node:

    [root@controller-0 ~]# /sbin/pcs cluster setup tripleo_cluster controller-0 addr=172.17.1.41
    Error: Host 'controller-0' is not known to pcs, try to authenticate the host using 'pcs host auth controller-0' command
    Error: None of hosts is known to pcs.
    Error: Errors have occurred, therefore pcs is unable to continue

After running `pcs host auth controller-0`, providing the user and password taken from hiera, and re-running the upgrade step, the cluster gets created. Even later, when I started scaling up the upgraded controllers to 2 and 3, I still had to manually run `pcs host auth <node>`.

Our WIP upgrade workflow is: https://gitlab.cee.redhat.com/osp15/osp-upgrade-el8
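The manual recovery described above (see the "not known to pcs" error, authenticate, re-run) could be scripted roughly as follows. This is only a sketch of the manual workaround, not what puppet-pacemaker does. The function names `unknown_pcs_hosts` and `setup_with_auth_retry` and the variables `PCS_USER` and `PCS_PASSWORD` are hypothetical; the credentials are assumed to come from hiera, as in the report.

```shell
#!/usr/bin/env bash
# Sketch only: re-run `pcs host auth` for hosts that pcs reports as unknown,
# then retry the original command once.

# Reads pcs output on stdin and prints each unknown host name once.
unknown_pcs_hosts() {
    sed -n "s/^Error: Host '\([^']*\)' is not known to pcs.*/\1/p" | sort -u
}

# Runs the given command; if it fails because hosts are unknown to pcs,
# authenticates those hosts and runs the command again.
setup_with_auth_retry() {
    local output
    local -a hosts
    if output=$("$@" 2>&1); then
        printf '%s\n' "$output"
        return 0
    fi
    mapfile -t hosts < <(printf '%s\n' "$output" | unknown_pcs_hosts)
    if [ "${#hosts[@]}" -eq 0 ]; then
        # Some other failure: pass the output through and give up.
        printf '%s\n' "$output" >&2
        return 1
    fi
    # PCS_USER / PCS_PASSWORD are assumed to be set by the caller.
    pcs host auth "${hosts[@]}" -u "$PCS_USER" -p "$PCS_PASSWORD" || return 1
    "$@"
}
```

For example: `setup_with_auth_retry /sbin/pcs cluster setup tripleo_cluster controller-0 addr=172.17.1.41`. Passing the password with `-p` exposes it in the process list, so this is only suitable for dev/test environments.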