Description of problem:
https://review.opendev.org/#/c/728390/ broke FFWD because it does not evaluate the controllers properly. As a result, pcs cannot create the pacemaker cluster, since it does not get the proper list of servers.

How reproducible:
Always.

Steps to Reproduce:
1. Install an OSP 13 environment.
2. Start the FFWD upgrade of controller-0 and run Leapp.
3. Run the actual upgrade:
   openstack overcloud upgrade run \
     --limit controller-0 \
     | tee oc-c0-upgrade-run.log

Actual results:
The upgrade fails while creating the pacemaker cluster: "pcs cluster setup" included controller-0,1,2 instead of only controller-0, because the node list was not evaluated properly.

Expected results:
A pacemaker cluster created with a single controller (controller-0).

Additional info:
Reverting https://review.opendev.org/#/c/733077/ should help.
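A minimal sketch of the node-list bug described above (this is illustrative Python, not the actual tripleo-heat-templates code; the function name and filtering condition are assumptions): during FFU the new cluster must be created with only the node(s) already upgraded, but the broken evaluation passed every controller to "pcs cluster setup".

```python
# Hypothetical illustration of the faulty vs. expected node-list evaluation.
all_controllers = ["controller-0", "controller-1", "controller-2"]

def cluster_setup_nodes(controllers, upgraded):
    # Expected FFU behaviour: only nodes already upgraded to RHEL 8
    # (here, only controller-0, per "--limit controller-0") join the
    # newly created pacemaker cluster.
    return [c for c in controllers if c in upgraded]

# Broken behaviour: the list was not filtered, so pcs received all three
# controllers even though controller-1 and controller-2 were still on RHEL 7.
broken = all_controllers

# Expected behaviour with the limit applied:
expected = cluster_setup_nodes(all_controllers, {"controller-0"})
assert expected == ["controller-0"]
```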
Logs for the failure:

2020-06-02 06:04:05 | TASK [Set fact galera_pcs_res] *************************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"galera_pcs_res": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.180)       0:00:15.389 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [set is_mysql_bootstrap_node fact] ****************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"is_mysql_bootstrap_node": true}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.181)       0:00:15.570 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Gather missing facts] ****************************************************
2020-06-02 06:04:05 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:58 -0400 (0:00:00.178)       0:00:15.748 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Set fact upgrade_leapp_enabled] ******************************************
2020-06-02 06:04:05 | ok: [controller-0] => {"ansible_facts": {"upgrade_leapp_enabled": false}, "changed": false}
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:03:59 -0400 (0:00:00.175)       0:00:15.924 **********
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | TASK [Check pacemaker cluster running before upgrade] **************************
2020-06-02 06:04:05 | fatal: [controller-0]: FAILED! => {"ansible_job_id": "200500165647.9038", "changed": false, "cmd": "pcs cluster status", "finished": 1, "msg": "[Errno 2] No such file or directory: 'pcs': 'pcs'", "rc": 2}
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | PLAY RECAP *********************************************************************
2020-06-02 06:04:05 | controller-0 : ok=25 changed=5 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | Tuesday 02 June 2020  06:04:04 -0400 (0:00:05.661)       0:00:21.585 **********
2020-06-02 06:04:05 | ===============================================================================
2020-06-02 06:04:05 |
2020-06-02 06:04:05 | Ansible failed, check log at /var/log/mistral/package_update.log.

Log file: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/undercloud-0.tar.gz?undercloud-0/home/stack/overcloud_upgrade_run_controller-0.log

Full job logs: http://cougar11.scl.lab.tlv.redhat.com/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv4-vxlan-HA-no-ceph/33/

Currently, the control plane upgrade when jumping from OSP13 to OSP16 consists of three phases, per controller node:

1. RHEL upgrade via Leapp. The patch reverted in https://review.opendev.org/#/c/733077/ tried to automatically detect whether the Leapp upgrade was required or not.
2. A transfer-data step, to persist database information.
3. The controller upgrade.

In the third stage, a new pacemaker cluster gets created with the nodes newly upgraded to RHEL 8. However, this stage also depends heavily on the same parameter as stage 1 (upgrade_leapp_enabled). Since this stage runs once the node is already on RHEL 8, upgrade_leapp_enabled gets set to false due to https://review.opendev.org/#/c/728390/.
If upgrade_leapp_enabled is set to false, the upgrade workflow tries to upgrade the pacemaker services as if this were an N to N+1 upgrade (https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/pacemaker/pacemaker-baremetal-puppet.yaml#L239-L243), and it does not create the new pacemaker cluster by adding one node at a time; instead it adds all of them at once, causing errors. The code needs to decouple stages 1 and 3 so that they can run independently and a change in one does not impact the other.
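One possible shape of the decoupling suggested above, as an illustrative Python sketch (the function names and the explicit FFU marker are assumptions, not the actual fix): give stages 1 and 3 independent inputs instead of letting both derive from upgrade_leapp_enabled.

```python
# Hypothetical sketch: independent decisions for stage 1 and stage 3.
def stage1_run_leapp(os_major: int) -> bool:
    # Stage 1 decision: based on the OS currently on the node.
    return os_major < 8

def stage3_ffu_cluster_rebuild(ffu_in_progress: bool) -> bool:
    # Stage 3 decision: based on an explicit FFU marker rather than the OS
    # version, so running on an already-upgraded RHEL 8 node cannot flip it
    # to the N-to-N+1 pacemaker path.
    return ffu_in_progress

# Stage 1 still triggers Leapp only on RHEL 7 nodes:
assert stage1_run_leapp(7) and not stage1_run_leapp(8)
# Stage 3 still rebuilds the cluster node-by-node during FFU,
# regardless of the node now being on RHEL 8:
assert stage3_ffu_cluster_rebuild(True) is True
```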
*** This bug has been marked as a duplicate of bug 1846444 ***