Bug 1843469

Summary:	[OSP13->OSP16.1] Overcloud upgrade run fails trying to stop cluster when the cluster is already down.
Product:	Red Hat OpenStack	Reporter:	Sergii Golovatiuk <sgolovat>
Component:	openstack-tripleo-heat-templates	Assignee:	Sergii Golovatiuk <sgolovat>
Status:	CLOSED DUPLICATE	QA Contact:	David Rosenfeld <drosenfe>
Severity:	high	Docs Contact:
Priority:	high
Version:	16.1 (Train)	CC:	jfrancoa, jpretori, lbezdick, mburns, morazi
Target Milestone:	rc	Keywords:	Triaged
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-11.3.2-0.20200616081526.396affd.el8ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-06-22 18:52:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Sergii Golovatiuk 2020-06-03 11:27:47 UTC

Description of problem:

https://review.opendev.org/#/c/728390/ broke FFWD because the controllers are no longer evaluated properly. As a result, pcs cannot create the pacemaker cluster, since it does not get the proper list of servers.


How reproducible:

Always.

Steps to Reproduce:
1. Install an OSP13 environment and start the FFWD upgrade of controller-0.

2. Run leapp, then run the actual upgrade:

 openstack overcloud upgrade run \
        --limit controller-0 \
        | tee oc-c0-upgrade-run.log

Actual results:

The upgrade fails while creating the pacemaker cluster: pcs cluster setup included controller-0, controller-1 and controller-2 instead of only controller-0, because the node list was not evaluated properly.


Expected results:

A pacemaker cluster is created with a single controller (controller-0).


Additional info:

The revert in https://review.opendev.org/#/c/733077/ should help.

Comment 1 Jose Luis Franco 2020-06-03 12:02:23 UTC

Logs for the failure:

2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Set fact galera_pcs_res] *************************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"galera_pcs_res": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.180)       0:00:15.389 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [set is_mysql_bootstrap_node fact] ****************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"is_mysql_bootstrap_node": true}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.181)       0:00:15.570 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Gather missing facts] ****************************************************
2020-06-02 06:04:05 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.178)       0:00:15.748 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Set fact upgrade_leapp_enabled] ******************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"upgrade_leapp_enabled": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:59 -0400 (0:00:00.175)       0:00:15.924 ********** 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | TASK [Check pacemaker cluster running before upgrade] **************************
2020-06-02 06:04:05 | fatal: [controller-0]: FAILED! => {"ansible_job_id": "200500165647.9038", "changed": false, "cmd": "pcs cluster status", "finished": 1, "msg": "[Errno 2] No such file or directory: 'pcs': 'pcs'", "rc": 2}
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | PLAY RECAP *********************************************************************
2020-06-02 06:04:05 | controller-0               : ok=25   changed=5    unreachable=0    failed=1    skipped=13   rescued=0    ignored=0   
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:04:04 -0400 (0:00:05.661)       0:00:21.585 ********** 
2020-06-02 06:04:05 | =============================================================================== 
2020-06-02 06:04:05 | 
2020-06-02 06:04:05 | Ansible failed, check log at /var/log/mistral/package_update.log.

Log file: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/undercloud-0.tar.gz?undercloud-0/home/stack/overcloud_upgrade_run_controller-0.log

Full job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/
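
The "Check pacemaker cluster running before upgrade" task above fails hard because the pcs binary is not present on the node. A minimal Python sketch of a more tolerant check follows. It is not the fix that was merged; the function name and the injectable `which` parameter are hypothetical, and it assumes that `pcs cluster status` exits non-zero when the cluster is not running.

```python
import shutil
import subprocess
from typing import Callable, Optional


def pacemaker_cluster_running(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True only if pcs exists and reports a running cluster."""
    pcs = which("pcs")
    if pcs is None:
        # No pcs binary on this node: treat the cluster as not running
        # instead of failing the whole play.
        return False
    result = subprocess.run(
        [pcs, "cluster", "status"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    # Assumption: a non-zero exit code means the cluster is down.
    return result.returncode == 0
```

With a lookup that finds no pcs binary, as in the log above, the function returns False rather than raising.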

Currently, the control plane upgrade when jumping from OSP13 to OSP16 consists of three phases per controller node:
1. RHEL upgrade via leapp. The patch reverted in https://review.opendev.org/#/c/733077/ tried to automatically detect whether the leapp upgrade was required.
2. A transfer data step, to persist database information.
3. The controller upgrade.

In the third stage, a new pacemaker cluster is created from the nodes that have just been upgraded to RHEL8. However, this stage depends heavily on the same parameter as stage 1 (upgrade_leapp_enabled). Because this stage runs once the node is already on RHEL8, upgrade_leapp_enabled is set to false as a result of https://review.opendev.org/#/c/728390/.

If upgrade_leapp_enabled is set to false, the upgrade workflow tries to upgrade the pacemaker services as if this were an N to N+1 upgrade: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/pacemaker/pacemaker-baremetal-puppet.yaml#L239-L243. Instead of creating the new pacemaker cluster by adding one node at a time, it adds all of them at once, which causes errors.
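
The effect described above can be modelled with a small Python sketch. This is only an illustration of the decision, not the actual tripleo-heat-templates logic; the function name `cluster_setup_nodes` and its arguments are hypothetical.

```python
from typing import List


def cluster_setup_nodes(
    all_controllers: List[str],
    limit: List[str],
    upgrade_leapp_enabled: bool,
) -> List[str]:
    """Return the nodes that would be handed to `pcs cluster setup`."""
    if upgrade_leapp_enabled:
        # FFWD path: the new cluster is built only from the nodes
        # being upgraded in this run (the --limit list).
        return [node for node in all_controllers if node in limit]
    # N to N+1 path: every controller is included at once.
    return list(all_controllers)


controllers = ["controller-0", "controller-1", "controller-2"]

# Expected during FFWD: only the limited node is used.
expected = cluster_setup_nodes(controllers, ["controller-0"], True)

# Actual once the node is on RHEL8 and the flag is detected as false.
actual = cluster_setup_nodes(controllers, ["controller-0"], False)

print(expected)
print(actual)
```

The second call reproduces the reported symptom: all three controllers end up in the node list even though only controller-0 was requested.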

The code needs to decouple stages 1 and 3 so that they can run independently and a change in one does not impact the other.
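
One possible shape of that decoupling, sketched in Python under stated assumptions: stage 1 keys off the detected OS version, while stage 3 keys off a separate operator-set flag that does not change during the run. The names `UpgradeContext`, `ffwd_upgrade`, `needs_leapp` and `create_cluster_node_by_node` are hypothetical and do not come from the templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeContext:
    os_major: int       # detected on the node: 7 before leapp, 8 after
    ffwd_upgrade: bool  # hypothetical operator-set flag, fixed for the run


def needs_leapp(ctx: UpgradeContext) -> bool:
    """Stage 1: run leapp only while the node is still below RHEL8."""
    return ctx.ffwd_upgrade and ctx.os_major < 8


def create_cluster_node_by_node(ctx: UpgradeContext) -> bool:
    """Stage 3: depends only on the operator flag, never on the OS version."""
    return ctx.ffwd_upgrade


before_leapp = UpgradeContext(os_major=7, ffwd_upgrade=True)
after_leapp = UpgradeContext(os_major=8, ffwd_upgrade=True)
```

In this sketch, auto-detecting the OS version changes the stage 1 answer between the two contexts, but the stage 3 answer stays the same.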

Comment 2 Jesse Pretorius 2020-06-22 18:52:06 UTC


*** This bug has been marked as a duplicate of bug 1846444 ***