Created attachment 1835573 [details]
openstack upgrade log

Description of problem:

The command

openstack overcloud upgrade run --yes --stack overcloud --limit overcloud-controller-0,overcloud-controller-1,overcloud-controller-2 --playbook all

fails with the following error:

puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: Cluster configuration files found, the host seems to be in a cluster already, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Some nodes are already in a cluster. Enforcing this will destroy existing cluster on those nodes. You should remove the nodes from their clusters instead to keep the clusters working properly, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unable to continue
<13>Oct 20 23:48:23 puppet-user: Error: '/sbin/pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait' returned 1 instead of one of [0]

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The openstack upgrade command should be idempotent. Whether the node is already part of the cluster should be checked before adding it; the task should then be skipped, or ansible should use --force, or remove the node and add it back (a sketch of that manual path follows below). It should not cause a failure.

Additional info:
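As an illustration of the "remove the node and add it back" path mentioned above, a minimal manual recovery could look like the following shell sketch. This is only a sketch under assumptions, not a verified procedure: it presumes the leftover cluster state on overcloud-controller-2 is stale and safe to discard.

# Hypothetical recovery sketch -- assumes the old cluster state on
# overcloud-controller-2 is stale and disposable.

# On overcloud-controller-2: wipe the leftover cluster configuration that
# pcs complained about ("Cluster configuration files found"). Destructive.
pcs cluster destroy

# On an existing member of the new cluster: re-run the add that failed.
pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait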
The error you got:

puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: Cluster configuration files found, the host seems to be in a cluster already, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Some nodes are already in a cluster. Enforcing this will destroy existing cluster on those nodes. You should remove the nodes from their clusters instead to keep the clusters working properly, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unable to continue
<13>Oct 20 23:48:23 puppet-user: Error: '/sbin/pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait' returned 1 instead of one of [0]

Corresponds to the following scale-up code in puppet-pacemaker:

if count($nodes_added) > 0 {
  $nodes_added.each |$node_to_add| {
    $node_name = split($node_to_add, ' ')[0]
    if $::pacemaker::pcs_010 {
      exec {"Authenticating new cluster node: ${node_to_add}":
        command   => "${::pacemaker::pcs_bin} host auth ${node_name} -u hacluster -p ${::pacemaker::hacluster_pwd}",
        timeout   => $cluster_start_timeout,
        tries     => $cluster_start_tries,
        try_sleep => $cluster_start_try_sleep,
        require   => [Service['pcsd'], User['hacluster']],
        tag       => 'pacemaker-auth',
      }
    }
    exec {"Adding Cluster node: ${node_to_add} to Cluster ${cluster_name}":
      unless    => "${::pacemaker::pcs_bin} status 2>&1 | grep -e \"^Online:.* ${node_name} .*\"",
      command   => "${::pacemaker::pcs_bin} cluster node add ${node_to_add} ${node_add_start_part} --wait",
      timeout   => $cluster_start_timeout,
      tries     => $cluster_start_tries,
      try_sleep => $cluster_start_try_sleep,
      notify    => Exec["node-cluster-start-${node_name}"],
      tag       => 'pacemaker-scaleup',
    }
    ...

So this code already has some idempotency to it, namely:

A) It only gets invoked when the number of current nodes differs from the nodes listed in hiera (the count($nodes_added))
B) We call "cluster node add" only when the node to add is not already part of the newly formed RHEL8-based cluster

Without more logs or info, the best hypothesis is that the code was indeed triggered because overcloud-controller-2 (10.1.0.22) was not part of the upgraded overcloud-controller-0 + overcloud-controller-1 cluster, but that another (older, possibly still RHEL7-based) cluster was running on overcloud-controller-2, and that is why things failed in this case.

Can we get the ffu logs plus sosreports from the three nodes please?
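To make the suspected gap concrete, here is a shell sketch of the two checks involved. The /etc/corosync/corosync.conf path is an assumption about which configuration file pcs detects; this illustrates the hypothesis above, not a confirmed root cause.

# Guard B runs on an existing member of the new cluster, and only skips the
# add when the node is already Online in *that* cluster:
/sbin/pcs status 2>&1 | grep -e "^Online:.* overcloud-controller-2 .*"

# Nothing checks the node being added for leftover configuration from an
# *old* cluster, which is what pcs itself refuses to overwrite without --force:
test -f /etc/corosync/corosync.conf   # run on overcloud-controller-2

If that reading is right, the scale-up would need either to clear the stale state up front or to pass --force once doing so is known to be safe.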
Uploaded the log file here, since it is too big to attach to BZ. I have shared the link with you, Michele and Jesse:

https://junipernetworks.sharepoint.com/:f:/s/CTOContrailSolutionTest/EvUELy2y8S5EtqngkPy7l38B4g_kMFZ2_usDxvraewGJQA?e=Q4Bth8
(In reply to shaju from comment #2)
> Updated the log file here since the file size is very big and can't upload
> to BZ: I have shared the link with you Michele and Jesse.
>
> https://junipernetworks.sharepoint.com/:f:/s/CTOContrailSolutionTest/
> EvUELy2y8S5EtqngkPy7l38B4g_kMFZ2_usDxvraewGJQA?e=Q4Bth8

Hello,
could you please upload a sosreport from controller-1 and controller-2 (and please add me to the share if you don't mind)?

Thanks.
(In reply to Luca Miccini from comment #4)
> (In reply to shaju from comment #2)
> > Updated the log file here since the file size is very big and can't upload
> > to BZ: I have shared the link with you Michele and Jesse.
> >
> > https://junipernetworks.sharepoint.com/:f:/s/CTOContrailSolutionTest/
> > EvUELy2y8S5EtqngkPy7l38B4g_kMFZ2_usDxvraewGJQA?e=Q4Bth8
>
> Hello,
> could you please upload a sosreport from controller-1 and controller-2 (and
> please add me to the share if you don't mind).
>
> Thanks.

Bump. Thanks for sharing the files, but without the sosreports from controller-1 and controller-2 we cannot root-cause this.
Hi Luca,

Apologies for the delay. I've added the sosreports from controller-1 and controller-2 at:

https://junipernetworks.sharepoint.com/:f:/r/sites/CTOContrailSolutionTest/Shared%20Documents/RH-FFU-debug?csf=1&web=1&e=JLlQGO

Please let me know if you have any access issues. I've added your email ID to the shared list.

Regards,
Shaju
Tentative fix in https://review.opendev.org/c/openstack/puppet-pacemaker/+/817184
Tested with:

$ rpm -qa|grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost.noarch

Before scale down:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
| f66b055d-b7d8-4ccf-a2b1-958497a79855 | compute-0    | ACTIVE | ctlplane=192.168.24.42 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

After scale down:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

During scale up:

[root@controller-0 ~]# podman restart haproxy-bundle-podman-0
b40a511b3a094e0f6cb12d81393c7a44bb1f88a2cff5c0047925d5e9a60fa04c

After scale up:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 81e8bc2f-a845-4ecb-a0f4-ab423aa5ecce | compute-2    | ACTIVE | ctlplane=192.168.24.34 | overcloud-full | compute    |
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
Please ignore the previous comment, made by mistake.
Tested in CI:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_2comp_3ceph_1ipa-ipv4-ovn_dvr/

RPM used: puppet-pacemaker-1.0.1-1.20220112144821.b3596d1.el8ost.noarch
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_2comp_3ceph_1ipa-ipv4-ovn_dvr/205/undercloud-0/var/log/extra/rpm-list.txt.gz
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986