Description of problem:

In several CI jobs for the in-place upgrade from OSP13 to OSP16.2, the upgrade run step fails when upgrading the third controller. The upgrade fails with the following log:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/undercloud-0/home/stack/overcloud_upgrade_run-controller-2,controller-1,controller-0.log.gz

Config: 1625086516",
 "<13>Jun 30 20:58:56 puppet-user: Puppet: 5.5.10",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Applying settings catalog for sections reporting, metrics",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Finishing transaction 47050753925080",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Received report to process from controller-0.redhat.local",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Evicting cache entry for environment 'production'",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Caching environment 'production' (ttl = 0 sec)",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Processing report from controller-0.redhat.local with processor Puppet::Reports::Store"], "stdout": "", "stdout_lines": []}

2021-06-30 20:59:05 | 2021-06-30 20:59:00.308673 | 52540047-4641-c51f-edaa-000000002370 | TIMING | Wait for puppet host configuration to finish | controller-0 | 0:08:35.493348 | 228.82s
2021-06-30 20:59:05 | 2021-06-30 20:59:02.782783 | 52540047-4641-c51f-edaa-0000000025d6 | WAITING | Wait for puppet host configuration to finish | controller-2 | 1167 retries left
2021-06-30 21:01:49 | 2021-06-30 20:59:06.054651 | 52540047-4641-c51f-edaa-0000000025d6 | WAITING | Wait for puppet host configuration to finish | controller-2 | 1166 retries left
.....
2021-06-30 22:02:24 | 2021-06-30 22:02:20.594308 | 52540047-4641-c51f-edaa-0000000025d6 | WAITING | Wait for puppet host configuration to finish | controller-2 | 1 retries left
2021-06-30 22:02:24 | 2021-06-30 22:02:23.843845 | 52540047-4641-c51f-edaa-0000000025d6 | FATAL | Wait for puppet host configuration to finish | controller-2 | error={"ansible_job_id": "733311726782.57253", "attempts": 1200, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}
2021-06-30 22:02:24 | 2021-06-30 22:02:23.844525 | 52540047-4641-c51f-edaa-0000000025d6 | TIMING | Wait for puppet host configuration to finish | controller-2 | 1:11:59.029217 | 3909.28s

PLAY RECAP *********************************************************************
2021-06-30 22:02:24 | controller-0 : ok=233  changed=82  unreachable=0  failed=1  skipped=116  rescued=0  ignored=0
2021-06-30 22:02:24 | controller-1 : ok=250  changed=96  unreachable=0  failed=0  skipped=131  rescued=0  ignored=0
2021-06-30 22:02:24 | controller-2 : ok=234  changed=133  unreachable=0  failed=1  skipped=107  rescued=0  ignored=0

Looking at the controller-2 logs, we can see the following error:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/controller-2/var/log/messages.gz

Jun 30 20:58:40 controller-2 podman[9691]: 2021-06-30 20:58:40.095916 7ff3a1fb4700  1 mgr send_beacon standby
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Firewall[003 accept ssh from all ipv6]: Nothing to manage: no ensure and the resource doesn't exist
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[check-for-local-authentication](provider=posix): Executing check '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/onlyif: controller-2: Unable to authenticate
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[check-for-local-authentication](provider=posix): Executing '/bin/echo 'local pcsd auth failed, triggering a reauthentication''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/bin/echo 'local pcsd auth failed, triggering a reauthentication''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/returns: executed successfully
Jun 30 20:58:40 controller-2 puppet-user[57261]: Info: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: Scheduling refresh of Exec[reauthenticate-across-all-nodes]
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: The container Class[Pacemaker::Corosync] will propagate my refresh event
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns: Exec try 1/360
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[reauthenticate-across-all-nodes](provider=posix): Executing '/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p Y74kTbraVcE3Kvwb'
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p Y74kTbraVcE3Kvwb'

It seems that the third controller cannot authenticate with the cluster: pcsd on controller-2 reports "Unable to authenticate", and the subsequent `pcs host auth` reauthentication loop keeps retrying (Exec try 1/360) until the ansible wait task exhausts its 1200 attempts.
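For reference, the check-for-local-authentication exec shown above simply greps the `pcs status pcsd` output for the failure string and, on a match, notifies the reauthentication exec. A minimal sketch of that detection logic follows; the `pcs_output` variable here is a simulated stand-in (taken verbatim from this bug's log) for the real `/sbin/pcs status pcsd controller-2 2>&1` output, so the snippet can be run without a cluster:

```shell
#!/bin/sh
# Simulated output of '/sbin/pcs status pcsd controller-2 2>&1';
# on a real node you would capture the actual command output instead.
pcs_output="controller-2: Unable to authenticate"

# Same grep check puppet-pacemaker uses as the exec's onlyif condition:
# a match means local pcsd auth is broken and a reauthentication is needed.
if echo "$pcs_output" | grep -q 'Unable to authenticate'; then
    echo "local pcsd auth failed, triggering a reauthentication"
    # puppet then runs (retrying up to 360 times):
    #   /sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p <password>
fi
```

In this bug the grep matches and the `pcs host auth` step is triggered, but the reauthentication itself never succeeds on controller-2, so the retry loop runs until the surrounding ansible task times out.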
CI job logs:

1st job failing: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/
2nd job failing: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-enterprise-baremetal-upgrade-16.2_from_13-3control_2compute_externalceph-regression/46/

Version-Release number of selected component (if applicable):

[root@controller-1 ~]# sudo rpm -qa | grep pcs
pcs-0.10.8-1.el8.x86_64
[root@controller-1 ~]# sudo rpm -qa | grep corosync
corosync-3.1.0-3.el8.x86_64
corosynclib-3.1.0-3.el8.x86_64
puppet-corosync-6.0.2-2.20210528025812.961add3.el8ost.2.noarch
[root@controller-1 ~]# sudo rpm -qa | grep pacemaker
pacemaker-libs-2.0.5-9.el8.x86_64
puppet-pacemaker-1.1.0-2.20210528101831.6e272bf.el8ost.2.noarch
pacemaker-2.0.5-9.el8.x86_64
pacemaker-schemas-2.0.5-9.el8.noarch
pacemaker-cli-2.0.5-9.el8.x86_64
pacemaker-cluster-libs-2.0.5-9.el8.x86_64

How reproducible:

Trigger CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1001
*** Bug 2053018 has been marked as a duplicate of this bug. ***