Created attachment 1835573 [details]
openstack upgrade log

Description of problem:

The command

openstack overcloud upgrade run --yes --stack overcloud --limit overcloud-controller-0,overcloud-controller-1,overcloud-controller-2 --playbook all

fails with the following error:

puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: Cluster configuration files found, the host seems to be in a cluster already, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Some nodes are already in a cluster. Enforcing this will destroy existing cluster on those nodes. You should remove the nodes from their clusters instead to keep the clusters working properly, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unable to continue
<13>Oct 20 23:48:23 puppet-user: Error: '/sbin/pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait' returned 1 instead of one of [0]

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The openstack upgrade command should be idempotent. Whether the node is already part of the cluster should be checked before adding it; the task should then be skipped, or ansible should use --force, or remove the node and add it back (a sketch of that manual path follows below). It should not cause a failure.

Additional info:
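As an illustration of the "remove the node and add it back" path mentioned above, a minimal manual recovery could look like the following shell sketch. This is only a sketch under assumptions, not a verified procedure: it presumes the leftover cluster state on overcloud-controller-2 is stale and safe to discard.

# Hypothetical recovery sketch -- assumes the old cluster state on
# overcloud-controller-2 is stale and disposable.

# On overcloud-controller-2: wipe the leftover cluster configuration that
# pcs complained about ("Cluster configuration files found"). Destructive.
pcs cluster destroy

# On an existing member of the new cluster: re-run the add that failed.
pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait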
The error you got:

puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: Cluster configuration files found, the host seems to be in a cluster already, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Some nodes are already in a cluster. Enforcing this will destroy existing cluster on those nodes. You should remove the nodes from their clusters instead to keep the clusters working properly, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unable to continue
<13>Oct 20 23:48:23 puppet-user: Error: '/sbin/pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait' returned 1 instead of one of [0]

Corresponds to the following scale-up code in puppet-pacemaker:

if count($nodes_added) > 0 {
  $nodes_added.each |$node_to_add| {
    $node_name = split($node_to_add, ' ')[0]
    if $::pacemaker::pcs_010 {
      exec {"Authenticating new cluster node: ${node_to_add}":
        command   => "${::pacemaker::pcs_bin} host auth ${node_name} -u hacluster -p ${::pacemaker::hacluster_pwd}",
        timeout   => $cluster_start_timeout,
        tries     => $cluster_start_tries,
        try_sleep => $cluster_start_try_sleep,
        require   => [Service['pcsd'], User['hacluster']],
        tag       => 'pacemaker-auth',
      }
    }
    exec {"Adding Cluster node: ${node_to_add} to Cluster ${cluster_name}":
      unless    => "${::pacemaker::pcs_bin} status 2>&1 | grep -e \"^Online:.* ${node_name} .*\"",
      command   => "${::pacemaker::pcs_bin} cluster node add ${node_to_add} ${node_add_start_part} --wait",
      timeout   => $cluster_start_timeout,
      tries     => $cluster_start_tries,
      try_sleep => $cluster_start_try_sleep,
      notify    => Exec["node-cluster-start-${node_name}"],
      tag       => 'pacemaker-scaleup',
    }
    ...

So this code already has some idempotency to it, namely:

A) It only gets invoked when the number of current nodes differs from the nodes listed in hiera (the count($nodes_added))
B) We call "cluster node add" only when the node to add is not already part of the newly formed RHEL8-based cluster

Without more logs or info, the best hypothesis is that the code was indeed triggered because overcloud-controller-2 (10.1.0.22) was not part of the upgraded overcloud-controller-0 + overcloud-controller-1 cluster, but that another (older, possibly still RHEL7-based) cluster was running on overcloud-controller-2, and that is why things failed in this case.

Can we get the ffu logs plus sosreports from the three nodes please?
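To make the suspected gap concrete, here is a shell sketch of the two checks involved. The /etc/corosync/corosync.conf path is an assumption about which configuration file pcs detects; this illustrates the hypothesis above, not a confirmed root cause.

# Guard B runs on an existing member of the new cluster, and only skips the
# add when the node is already Online in *that* cluster:
/sbin/pcs status 2>&1 | grep -e "^Online:.* overcloud-controller-2 .*"

# Nothing checks the node being added for leftover configuration from an
# *old* cluster, which is what pcs itself refuses to overwrite without --force:
test -f /etc/corosync/corosync.conf   # run on overcloud-controller-2

If that reading is right, the scale-up would need either to clear the stale state up front or to pass --force once doing so is known to be safe.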
Uploaded the log file here, since it is too big to attach to BZ. I have shared the link with you, Michele and Jesse:

https://junipernetworks.sharepoint.com/:f:/s/CTOContrailSolutionTest/EvUELy2y8S5EtqngkPy7l38B4g_kMFZ2_usDxvraewGJQA?e=Q4Bth8
(In reply to shaju from comment #2)
> Updated the log file here since the file size is very big and can't upload
> to BZ: I have shared the link with you Michele and Jesse.
>
> https://junipernetworks.sharepoint.com/:f:/s/CTOContrailSolutionTest/
> EvUELy2y8S5EtqngkPy7l38B4g_kMFZ2_usDxvraewGJQA?e=Q4Bth8

Hello,
could you please upload a sosreport from controller-1 and controller-2 (and please add me to the share if you don't mind)?

Thanks.
(In reply to Luca Miccini from comment #4)
> (In reply to shaju from comment #2)
> > Updated the log file here since the file size is very big and can't upload
> > to BZ: I have shared the link with you Michele and Jesse.
> >
> > https://junipernetworks.sharepoint.com/:f:/s/CTOContrailSolutionTest/
> > EvUELy2y8S5EtqngkPy7l38B4g_kMFZ2_usDxvraewGJQA?e=Q4Bth8
>
> Hello,
> could you please upload a sosreport from controller-1 and controller-2 (and
> please add me to the share if you don't mind).
>
> Thanks.

Bump. Thanks for sharing the files, but without the sosreports from controller-1 and controller-2 we cannot root-cause this.
Hi Luca,

Apologies for the delay. I've added the sosreports from controller-1 and controller-2 at:

https://junipernetworks.sharepoint.com/:f:/r/sites/CTOContrailSolutionTest/Shared%20Documents/RH-FFU-debug?csf=1&web=1&e=JLlQGO

Please let me know if you have any access issues. I've added your email ID to the shared list.

Regards,
Shaju
Tentative fix in https://review.opendev.org/c/openstack/puppet-pacemaker/+/817184
Tested with:

$ rpm -qa|grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost.noarch

Before scale down:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
| f66b055d-b7d8-4ccf-a2b1-958497a79855 | compute-0    | ACTIVE | ctlplane=192.168.24.42 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

After scale down:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

During scale up:

[root@controller-0 ~]# podman restart haproxy-bundle-podman-0
b40a511b3a094e0f6cb12d81393c7a44bb1f88a2cff5c0047925d5e9a60fa04c

After scale up:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 81e8bc2f-a845-4ecb-a0f4-ab423aa5ecce | compute-2    | ACTIVE | ctlplane=192.168.24.34 | overcloud-full | compute    |
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
Please ignore the previous comment, made by mistake.
Tested in CI:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_2comp_3ceph_1ipa-ipv4-ovn_dvr/

RPM used: puppet-pacemaker-1.0.1-1.20220112144821.b3596d1.el8ost.noarch
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_2comp_3ceph_1ipa-ipv4-ovn_dvr/205/undercloud-0/var/log/extra/rpm-list.txt.gz
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986