Bug 1978696

Summary: [FFWD13 -> 16.2] Third controller not joining the cluster during the overcloud upgrade, fails to authenticate
Product: Red Hat OpenStack Reporter: Jose Luis Franco <jfrancoa>
Component: puppet-pacemaker Assignee: Michele Baldessari <michele>
Status: CLOSED ERRATA QA Contact: Jason Grosso <jgrosso>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.2 (Train) CC: elicohen, enothen, jgrosso, jjoyce, jpretori, jschluet, lmiccini, michele, omcgonag, slinaber, tvignaud
Target Milestone: z2 Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: puppet-pacemaker-1.2.1-2.20210810224808.90be0f9.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-23 22:10:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jose Luis Franco 2021-07-02 14:12:22 UTC
Description of problem:

It has been found in a few CI jobs for the in-place upgrade from OSP13 to OSP16.2 that the upgrade run step fails when trying to upgrade the third controller. The upgrade fails with the following log:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/undercloud-0/home/stack/overcloud_upgrade_run-controller-2,controller-1,controller-0.log.gz
            Config: 1625086516", "<13>Jun 30 20:58:56 puppet-user:            Puppet: 5.5.10", "<13>Jun 30 20:58:56 puppet-user: Debug: Applying settings catalog for sections reporting, metrics", "<13>Jun 30 20:58:56 puppet-user: Debug: Finishing transaction 47050753925080", "<13>Jun 30 20:58:56 puppet-user: Debug: Received report to process from controller-0.redhat.local", "<13>Jun 30 20:58:56 puppet-user: Debug: Evicting cache entry for environment 'production'", "<13>Jun 30 20:58:56 puppet-user: Debug: Caching environment 'production' (ttl = 0 sec)", "<13>Jun 30 20:58:56 puppet-user: Debug: Processing report from controller-0.redhat.local with processor Puppet::Reports::Store"], "stdout": "", "stdout_lines": []}
2021-06-30 20:59:05 | 2021-06-30 20:59:00.308673 | 52540047-4641-c51f-edaa-000000002370 |     TIMING | Wait for puppet host configuration to finish | controller-0 | 0:08:35.493348 | 228.82s
2021-06-30 20:59:05 | 2021-06-30 20:59:02.782783 | 52540047-4641-c51f-edaa-0000000025d6 |    WAITING | Wait for puppet host configuration to finish | controller-2 | 1167 retries left
2021-06-30 21:01:49 | 
2021-06-30 21:01:49 | 2021-06-30 20:59:06.054651 | 52540047-4641-c51f-edaa-0000000025d6 |    WAITING | Wait for puppet host configuration to finish | controller-2 | 1166 retries left
.....
2021-06-30 22:02:24 | 2021-06-30 22:02:20.594308 | 52540047-4641-c51f-edaa-0000000025d6 |    WAITING | Wait for puppet host configuration to finish | controller-2 | 1 retries left
2021-06-30 22:02:24 | 2021-06-30 22:02:23.843845 | 52540047-4641-c51f-edaa-0000000025d6 |      FATAL | Wait for puppet host configuration to finish | controller-2 | error={"ansible_job_id": "733311726782.57253", "attempts": 1200, "changed": false, "failed_when_result": true, "finished": 0, "started": 1}
2021-06-30 22:02:24 | 
2021-06-30 22:02:24 | 2021-06-30 22:02:23.844525 | 52540047-4641-c51f-edaa-0000000025d6 |     TIMING | Wait for puppet host configuration to finish | controller-2 | 1:11:59.029217 | 3909.28s
2021-06-30 22:02:24 | 
2021-06-30 22:02:24 | PLAY RECAP *********************************************************************
2021-06-30 22:02:24 | controller-0               : ok=233  changed=82   unreachable=0    failed=1    skipped=116  rescued=0    ignored=0   
2021-06-30 22:02:24 | controller-1               : ok=250  changed=96   unreachable=0    failed=0    skipped=131  rescued=0    ignored=0   
2021-06-30 22:02:24 | controller-2               : ok=234  changed=133  unreachable=0    failed=1    skipped=107  rescued=0    ignored=0   

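For reference, the failing step corresponds to the controller upgrade run launched from the undercloud. A minimal sketch of the invocation, assuming the node ordering inferred from the log file name above (the exact options used by the CI job may differ):

(undercloud) [stack@undercloud ~]$ openstack overcloud upgrade run --limit controller-2,controller-1,controller-0
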

Looking at the controller-2 logs, we can see the following error:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/controller-2/var/log/messages.gz

Jun 30 20:58:40 controller-2 podman[9691]: 2021-06-30 20:58:40.095916 7ff3a1fb4700  1 mgr send_beacon standby
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Firewall[003 accept ssh from all ipv6]: Nothing to manage: no ensure and the resource doesn't exist
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[check-for-local-authentication](provider=posix): Executing check '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/onlyif:   controller-2: Unable to authenticate
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[check-for-local-authentication](provider=posix): Executing '/bin/echo 'local pcsd auth failed, triggering a reauthentication''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/bin/echo 'local pcsd auth failed, triggering a reauthentication''
Jun 30 20:58:40 controller-2 puppet-user[57261]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]/returns: executed successfully
Jun 30 20:58:40 controller-2 puppet-user[57261]: Info: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: Scheduling refresh of Exec[reauthenticate-across-all-nodes]
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: The container Class[Pacemaker::Corosync] will propagate my refresh event
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns: Exec try 1/360
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Exec[reauthenticate-across-all-nodes](provider=posix): Executing '/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p Y74kTbraVcE3Kvwb'
Jun 30 20:58:40 controller-2 puppet-user[57261]: Debug: Executing: '/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p Y74kTbraVcE3Kvwb'


It seems that the third controller cannot authenticate to the cluster: the local pcsd authentication check fails, so puppet-pacemaker keeps retrying the re-authentication across all nodes.
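
Based on the puppet-pacemaker exec commands visible in the log above, the check and the re-authentication can be reproduced manually on the affected node roughly as follows (a sketch; the hacluster password is the one generated for the deployment):

# onlyif check of Exec[check-for-local-authentication]: a match means the local pcsd is not authenticated
/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate'

# if the check above matches, Exec[reauthenticate-across-all-nodes] is triggered and retried (up to 360 tries in the log)
/sbin/pcs host auth controller-0 controller-1 controller-2 -u hacluster -p <hacluster password>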

CI job logs:

1st job failing: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/57/

2nd job failing: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-enterprise-baremetal-upgrade-16.2_from_13-3control_2compute_externalceph-regression/46/

Version-Release number of selected component (if applicable):

[root@controller-1 ~]# sudo rpm -qa | grep pcs
pcs-0.10.8-1.el8.x86_64
[root@controller-1 ~]# sudo rpm -qa | grep corosync
corosync-3.1.0-3.el8.x86_64
corosynclib-3.1.0-3.el8.x86_64
puppet-corosync-6.0.2-2.20210528025812.961add3.el8ost.2.noarch
[root@controller-1 ~]# sudo rpm -qa | grep pacemaker
pacemaker-libs-2.0.5-9.el8.x86_64
puppet-pacemaker-1.1.0-2.20210528101831.6e272bf.el8ost.2.noarch
pacemaker-2.0.5-9.el8.x86_64
pacemaker-schemas-2.0.5-9.el8.noarch
pacemaker-cli-2.0.5-9.el8.x86_64
pacemaker-cluster-libs-2.0.5-9.el8.x86_64


How reproducible:

Trigger CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.2-from-13-passed_phase2-3cont_3hci-ipv4-ovs_dvr/

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 22 errata-xmlrpc 2022-03-23 22:10:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001

Comment 23 Luca Miccini 2022-06-30 15:14:00 UTC
*** Bug 2053018 has been marked as a duplicate of this bug. ***