Bug 1843469 - [OSP13->OSP16.1] Overcloud upgrade run fails trying to stop cluster when the cluster is already down.
Keywords:
Status: CLOSED DUPLICATE of bug 1846444
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Sergii Golovatiuk
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-03 11:27 UTC by Sergii Golovatiuk
Modified: 2020-06-23 18:34 UTC
CC List: 5 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200616081526.396affd.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-22 18:52:06 UTC
Target Upstream Version:
Embargoed:




Links
  OpenStack gerrit 732842 (MERGED): Check transfer data flag to skip pacemaker normal upgrade. Last updated 2020-06-22 14:23:22 UTC
  OpenStack gerrit 733077 (MERGED): Revert "Only enable leapp tasks when distribution is correct". Last updated 2020-06-22 16:30:05 UTC
  OpenStack gerrit 734824 (MERGED): Only enable leapp tasks when distribution is correct. Last updated 2020-06-22 14:23:22 UTC

Description Sergii Golovatiuk 2020-06-03 11:27:47 UTC
Description of problem:

https://review.opendev.org/#/c/728390/ broke FFWD because it does not evaluate the controllers properly. As a result, pcs cannot create the pacemaker cluster because it does not receive the proper list of servers.


How reproducible:

Always.

Steps to Reproduce:
1. Install an OSP13 environment and start the FFWD upgrade of controller-0.
2. Run leapp, then run the actual upgrade:

 openstack overcloud upgrade run \
        --limit controller-0 \
        | tee oc-c0-upgrade-run.log

Actual results:

The upgrade fails while creating the pacemaker cluster: pcs cluster setup included controller-0, controller-1, and controller-2 instead of only controller-0, because the node list was not evaluated properly.
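
For illustration, the difference is roughly the following (the cluster name and the exact pcs invocation generated by the upgrade tasks are assumptions here, not taken from the failing run):

 # What the upgrade tasks effectively attempted (all three controllers at once):
 pcs cluster setup tripleo_cluster controller-0 controller-1 controller-2

 # What this stage should run (bootstrap the new RHEL8 cluster with one node):
 pcs cluster setup tripleo_cluster controller-0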


Expected results:

A pacemaker cluster is created with a single controller (controller-0).


Additional info:

Reverting https://review.opendev.org/#/c/733077/ should help.

Comment 1 Jose Luis Franco 2020-06-03 12:02:23 UTC
Logs for the failure:

2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Set fact galera_pcs_res] *************************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"galera_pcs_res": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.180)       0:00:15.389 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [set is_mysql_bootstrap_node fact] ****************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"is_mysql_bootstrap_node": true}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.181)       0:00:15.570 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Gather missing facts] ****************************************************
2020-06-02 06:04:05 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.178)       0:00:15.748 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Set fact upgrade_leapp_enabled] ******************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"upgrade_leapp_enabled": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:59 -0400 (0:00:00.175)       0:00:15.924 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Check pacemaker cluster running before upgrade] **************************
2020-06-02 06:04:05 | fatal: [controller-0]: FAILED! => {"ansible_job_id": "200500165647.9038", "changed": false, "cmd": "pcs cluster status", "finished": 1, "msg": "[Errno 2] No such file or directory: 'pcs': 'pcs'", "rc": 2}
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | PLAY RECAP *********************************************************************
2020-06-02 06:04:05 | controller-0               : ok=25   changed=5    unreachable=0    failed=1    skipped=13   rescued=0    ignored=0   
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:04:04 -0400 (0:00:05.661)       0:00:21.585 ********** 
2020-06-02 06:04:05 | =============================================================================== 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | Ansible failed, check log at /var/log/mistral/package_update.log.

Log file: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/undercloud-0.tar.gz?undercloud-0/home/stack/overcloud_upgrade_run_controller-0.log

Full job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/

Currently, the control plane upgrade when jumping from OSP13 to OSP16 consists of three phases per controller node (an illustrative command sequence follows the list):
1. RHEL upgrade via leapp. The patch reverted in https://review.opendev.org/#/c/733077/ tried to automatically detect whether the leapp upgrade was required or not.
2. A transfer data step, to persist database information.
3. The controller upgrade.
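
For reference, a sketch of the per-controller command sequence behind these three phases, following the 16.1 FFU workflow; treat the exact tags and options as illustrative rather than authoritative:

 # Phase 1: RHEL7 -> RHEL8 operating system upgrade via leapp
 openstack overcloud upgrade run --stack overcloud \
        --tags system_upgrade --limit controller-0

 # Phase 2: transfer data from the old (RHEL7) cluster; run once, for the
 # first controller being upgraded
 openstack overcloud external-upgrade run --stack overcloud \
        --tags system_upgrade_transfer_data

 # Phase 3: the controller upgrade itself (the step failing in this bug)
 openstack overcloud upgrade run --stack overcloud \
        --limit controller-0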

In the third stage, a new pacemaker cluster gets created with the nodes newly upgraded to RHEL8. However, this stage also depends heavily on the same parameter as stage 1 (upgrade_leapp_enabled). Because this stage runs once the node is already on RHEL8, upgrade_leapp_enabled gets set to false due to https://review.opendev.org/#/c/728390/.
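
The check in question looks roughly like this (a paraphrase of the logic introduced by review 728390, not the verbatim template code):

 - name: Set fact upgrade_leapp_enabled
   set_fact:
     # Leapp only applies while the node is still on RHEL7, so after the
     # operating system upgrade this evaluates to false on a RHEL8 node.
     upgrade_leapp_enabled: "{{ ansible_facts['distribution_major_version'] == '7' }}"

This matches the log in comment 1, where upgrade_leapp_enabled is false on the already-upgraded node.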

If upgrade_leapp_enabled is set to false, the upgrade workflow tries to upgrade the pacemaker services as if this were an N to N+1 upgrade (https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/pacemaker/pacemaker-baremetal-puppet.yaml#L239-L243), and instead of creating the new pacemaker cluster by adding one node at a time, it adds all of them at the same time, causing errors.

The code needs to decouple stages 1 and 3 so that they can run independently and a change in one does not impact the other.
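
The merged fix (gerrit 732842, linked above) goes in this direction by gating the pacemaker upgrade path on the transfer-data flag rather than on upgrade_leapp_enabled. A minimal sketch of that idea, where the flag path and the fact name are assumptions, not the exact patch contents:

 - name: Check whether the FFU data transfer flag exists
   stat:
     path: /var/lib/tripleo/transfer-flags/var-lib-mysql  # assumed flag location
   register: transfer_data_flag

 - name: Set fact pacemaker_ffu_upgrade  # hypothetical fact name
   set_fact:
     # If the transfer flag is present, we are in the OSP13->16 FFU flow:
     # bootstrap a new cluster one node at a time instead of running the
     # normal N to N+1 in-place pacemaker upgrade.
     pacemaker_ffu_upgrade: "{{ transfer_data_flag.stat.exists | bool }}"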

Comment 2 Jesse Pretorius 2020-06-22 18:52:06 UTC

*** This bug has been marked as a duplicate of bug 1846444 ***

