Description of problem:

In several CI jobs for the in-place upgrade from OSP13 to OSP16.2, the upgrade run step fails when upgrading the third controller. The upgrade fails with the following log:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/undercloud-0/home/stack/overcloud_upgrade_run-controller-2,controller-1,controller-0.log.gz

Config: 1625086516",
 "<13>Jun 30 20:58:56 puppet-user: Puppet: 5.5.10",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Applying settings catalog for sections reporting, metrics",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Finishing transaction 47050753925080",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Received report to process from controller-0.redhat.local",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Evicting cache entry for environment 'production'",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Caching environment 'production' (ttl = 0 sec)",
 "<13>Jun 30 20:58:56 puppet-user: Debug: Processing report from controller-0.redhat.local with processor Puppet::Reports::Store"], "stdout": "", "stdout_lines": []}

2021-06-30 20:59:05 | 2021-06-30 20:59:00.308673 | 52540047-4641-c51f-edaa-000000002370 | TIMING | Wait for puppet host configuration to finish | controller-0 | 0:08:35.493348 | 228.82s
2021-06-30 20:59:05 | 2021-06-30 20:59:02.782783 | 52540047-4641-c51f-edaa-0000000025d6 | WAITING | Wait for puppet host configuration to finish | controller-2 | 1167 retries left
2021-06-30 21:01:49 | 2021-06-30 20:59:06.054651 | 52540047-4641-c51f-edaa-0000000025d6 | WAITING | Wait for puppet host configuration to finish | controller-2 | 1166 retries left
.....
2021-06-30 22:02:24 | 2021-06-30 22:02:20.594308 | 52540047-4641-c51f-edaa-0000000025d6 | WAITING | Wait for puppet host configuration to finish | controller-2 | 1 retries left
2021-06-30 22:02:24 | 2021-06-30 22:02:23.843845 | 52540047-4641-c51f-edaa-0000000025d6 | FATAL | Wait for puppet host configuration to finish | controller-2 | error={"ansible_job_id": "733311726782.57253", "attempts": 1200, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}
2021-06-30 22:02:24 | 2021-06-30 22:02:23.844525 | 52540047-4641-c51f-edaa-0000000025d6 | TIMING | Wait for puppet host configuration to finish | controller-2 | 1:11:59.029217 | 3909.28s

PLAY RECAP *********************************************************************
2021-06-30 22:02:24 | controller-0 : ok=233  changed=82  unreachable=0  failed=1  skipped=116  rescued=0  ignored=0
2021-06-30 22:02:24 | controller-1 : ok=250  changed=96  unreachable=0  failed=0  skipped=131  rescued=0  ignored=0
2021-06-30 22:02:24 | controller-2 : ok=234  changed=133  unreachable=0  failed=1  skipped=107  rescued=0  ignored=0

Looking at the controller-2 logs, we can see the following error:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/controller-2/var/log/messages.gz

Jun 30 20:58:40 controller-2 podman[9691]: 2021-06-30 20:58:40.095916 7ff3a1fb4700  1 mgr send_beacon standby
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Firewall[003 accept ssh from all ipv6]: Nothing to manage: no ensure and the resource doesn't exist
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[check-for-local-authentication](provider=posix): Executing check '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/onlyif: controller-2: Unable to authenticate
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[check-for-local-authentication](provider=posix): Executing '/bin/echo 'local pcsd auth failed, triggering a reauthentication''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/bin/echo 'local pcsd auth failed, triggering a reauthentication''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/returns: executed successfully
Jun 30 20:58:40 controller-2 puppet-user[57261]: Info: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: Scheduling refresh of Exec[reauthenticate-across-all-nodes]
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: The container Class[Pacemaker::Corosync] will propagate my refresh event
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns: Exec try 1/360
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[reauthenticate-across-all-nodes](provider=posix): Executing '/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p Y74kTbraVcE3Kvwb'
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p Y74kTbraVcE3Kvwb'

It seems that the third controller cannot authenticate with the cluster: pcsd on controller-2 reports "Unable to authenticate", and the subsequent `pcs host auth` reauthentication loop keeps retrying (Exec try 1/360) until the ansible wait task exhausts its 1200 attempts.
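For reference, the check-for-local-authentication exec shown above simply greps the `pcs status pcsd` output for the failure string and, on a match, notifies the reauthentication exec. A minimal sketch of that detection logic follows; the `pcs_output` variable here is a simulated stand-in (taken verbatim from this bug's log) for the real `/sbin/pcs status pcsd controller-2 2>&1` output, so the snippet can be run without a cluster:

```shell
#!/bin/sh
# Simulated output of '/sbin/pcs status pcsd controller-2 2>&1';
# on a real node you would capture the actual command output instead.
pcs_output="controller-2: Unable to authenticate"

# Same grep check puppet-pacemaker uses as the exec's onlyif condition:
# a match means local pcsd auth is broken and a reauthentication is needed.
if echo "$pcs_output" | grep -q 'Unable to authenticate'; then
    echo "local pcsd auth failed, triggering a reauthentication"
    # puppet then runs (retrying up to 360 times):
    #   /sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p <password>
fi
```

In this bug the grep matches and the `pcs host auth` step is triggered, but the reauthentication itself never succeeds on controller-2, so the retry loop runs until the surrounding ansible task times out.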
CI job logs:

1st job failing: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/
2nd job failing: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-enterprise-baremetal-upgrade-16.2_from_13-3control_2compute_externalceph-regression/46/

Version-Release number of selected component (if applicable):

[root@controller-1 ~]# sudo rpm -qa | grep pcs
pcs-0.10.8-1.el8.x86_64
[root@controller-1 ~]# sudo rpm -qa | grep corosync
corosync-3.1.0-3.el8.x86_64
corosynclib-3.1.0-3.el8.x86_64
puppet-corosync-6.0.2-2.20210528025812.961add3.el8ost.2.noarch
[root@controller-1 ~]# sudo rpm -qa | grep pacemaker
pacemaker-libs-2.0.5-9.el8.x86_64
puppet-pacemaker-1.1.0-2.20210528101831.6e272bf.el8ost.2.noarch
pacemaker-2.0.5-9.el8.x86_64
pacemaker-schemas-2.0.5-9.el8.noarch
pacemaker-cli-2.0.5-9.el8.x86_64
pacemaker-cluster-libs-2.0.5-9.el8.x86_64

How reproducible:

Trigger CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1001
*** Bug 2053018 has been marked as a duplicate of this bug. ***