Bug 1843469
| Summary: | [OSP13->OSP16.1] Overcloud upgrade run fails trying to stop cluster when the cluster is already down. | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sergii Golovatiuk <sgolovat> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Sergii Golovatiuk <sgolovat> |
| Status: | CLOSED DUPLICATE | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 16.1 (Train) | CC: | jfrancoa, jpretori, lbezdick, mburns, morazi |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-0.20200616081526.396affd.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-06-22 18:52:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Sergii Golovatiuk
2020-06-03 11:27:47 UTC
Logs for the failure:
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Set fact galera_pcs_res] *************************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"galera_pcs_res": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020 06:03:58 -0400 (0:00:00.180) 0:00:15.389 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [set is_mysql_bootstrap_node fact] ****************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"is_mysql_bootstrap_node": true}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020 06:03:58 -0400 (0:00:00.181) 0:00:15.570 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Gather missing facts] ****************************************************
2020-06-02 06:04:05 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-06-02 06:04:05 | Tuesday 02 June 2020 06:03:58 -0400 (0:00:00.178) 0:00:15.748 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Set fact upgrade_leapp_enabled] ******************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"upgrade_leapp_enabled": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020 06:03:59 -0400 (0:00:00.175) 0:00:15.924 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Check pacemaker cluster running before upgrade] **************************
2020-06-02 06:04:05 | fatal: [controller-0]: FAILED! => {"ansible_job_id": "200500165647.9038", "changed": false, "cmd": "pcs cluster status", "finished": 1, "msg": "[Errno 2] No such file or directory: 'pcs': 'pcs'", "rc": 2}
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | PLAY RECAP *********************************************************************
2020-06-02 06:04:05 | controller-0 : ok=25 changed=5 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | Tuesday 02 June 2020 06:04:04 -0400 (0:00:05.661) 0:00:21.585 **********
2020-06-02 06:04:05 | ===============================================================================
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | Ansible failed, check log at /var/log/mistral/package_update.log.
Log file: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/undercloud-0.tar.gz?undercloud-0/home/stack/overcloud_upgrade_run_controller-0.log
Full job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/
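The failing task in the log above runs `pcs cluster status` on a node where the `pcs` executable cannot be found ("[Errno 2] No such file or directory: 'pcs'"). As a minimal sketch only, and not the actual tripleo-heat-templates task, a guarded check could treat a missing binary as "cluster not running" rather than as a fatal error. The function name `pacemaker_cluster_running` is hypothetical, and the sketch assumes that `pcs cluster status` exits with 0 only when the cluster is running on the node:

```python
import shutil
import subprocess


def pacemaker_cluster_running(pcs_binary: str = "pcs") -> bool:
    """Return True only if pcs is installed and reports a running cluster.

    Hypothetical helper for illustration; the real check is an Ansible task.
    """
    # A missing binary means there is nothing to query, so report "not running"
    # instead of raising the "[Errno 2] No such file or directory" error.
    if shutil.which(pcs_binary) is None:
        return False

    # Assumption: a zero exit code means the cluster is up on this node.
    result = subprocess.run(
        [pcs_binary, "cluster", "status"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

With a check of this shape, a step that stops the cluster could be skipped when the function returns False.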
Currently, the control plane upgrade from OSP13 to OSP16 consists of three phases per controller node:
1. RHEL upgrade via leapp. The patch reverted in https://review.opendev.org/#/c/733077/ tried to detect automatically whether the leapp upgrade was required.
2. A data transfer step, to persist database information.
3. The controller upgrade.
In the third stage, a new pacemaker cluster is created from the nodes newly upgraded to RHEL8. However, this stage relies heavily on the same parameter as stage 1 (upgrade_leapp_enabled). Because stage 3 runs once the node is already on RHEL8, upgrade_leapp_enabled is set to false due to https://review.opendev.org/#/c/728390/.
If upgrade_leapp_enabled is set to false, the upgrade workflow tries to upgrade the pacemaker services as if this were an N to N+1 upgrade: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/pacemaker/pacemaker-baremetal-puppet.yaml#L239-L243. Instead of creating the new pacemaker cluster by adding one node at a time, it adds all of them at the same time, which causes errors.
The code needs to decouple stages 1 and 3, so that they can run independently and a change in one does not impact the other.
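The coupling described above can be modelled in a few lines. This is an illustrative sketch, not the template logic itself: the class, the function names, the path labels, and the `recreate_cluster` input are all hypothetical, and it assumes that upgrade_leapp_enabled is derived from the RHEL major version the node is currently running.

```python
from dataclasses import dataclass

RHEL8 = 8


@dataclass(frozen=True)
class ControllerNode:
    name: str
    rhel_major: int  # RHEL major version currently running on the node


def upgrade_leapp_enabled(node: ControllerNode) -> bool:
    """Stage 1 decision (assumed): leapp is needed only below RHEL 8."""
    return node.rhel_major < RHEL8


def stage3_path_coupled(node: ControllerNode) -> str:
    """Stage 3 as described in this report: it reuses the stage 1 flag."""
    if upgrade_leapp_enabled(node):
        return "recreate_cluster_node_by_node"
    return "n_to_n_plus_1_upgrade"


def stage3_path_decoupled(recreate_cluster: bool) -> str:
    """Stage 3 driven by its own (hypothetical) input, not the OS-derived flag."""
    if recreate_cluster:
        return "recreate_cluster_node_by_node"
    return "n_to_n_plus_1_upgrade"


# By the time stage 3 runs, controller-0 is already on RHEL 8.
node = ControllerNode("controller-0", rhel_major=8)
print(stage3_path_coupled(node))                      # takes the N to N+1 path
print(stage3_path_decoupled(recreate_cluster=True))   # recreates the cluster
```

In the coupled version, the outcome of stage 1 silently changes what stage 3 does. In the decoupled version, stage 3 takes an explicit input, so a change to leapp detection no longer affects cluster creation.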
*** This bug has been marked as a duplicate of bug 1846444 ***