Bug 2089512

Summary: Multi-stack v2 with TLS-E - reimplement to fix all regressions as of upstream Wallaby
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Keywords: Regression, Triaged
Reporter: James Parker <jparker>
Assignee: Bogdan Dobrelya <bdobreli>
QA Contact: James Parker <jparker>
CC: alee, alifshit, bdobreli, bshephar, ggrasza, hbrock, hjensas, igallagh, jgrosso, johfulto, jschluet, jslagle, lmiccini, mburns, mschuppe, owalsh, ramishra, skaplons, smooney
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-1.20221226005336.9ada52b.el9ost
Doc Type: Bug Fix
Doc Text:
The multi-cell and multi-stack overcloud features were not available in RHOSP 17.0 due to regressions. The regressions have been fixed, and multi-cell and multi-stack deployments are supported in RHOSP 17.1.
Last Closed: 2023-08-16 01:11:11 UTC
Type: Bug
Bug Depends On: 2126157, 2137904    
Bug Blocks: 1759007, 1760256    

Comment 1 Brendan Shephard 2022-05-25 03:14:07 UTC
Hey, do we have the logs from that controller?
2022-05-23 16:05:54.796051 | 525400f4-e180-641e-7ecb-000000005e12 |    WAITING | Wait for puppet host configuration to finish | cell1-controller-0 | 204 retries left

Puppet failed on the controller, so we need to see what failed in /var/log/messages there.
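
In case it helps whoever grabs them first, something along these lines should surface the puppet errors (the grep pattern is just a suggestion):

[heat-admin@cell1-controller-0 ~]$ sudo grep -iE 'puppet-user.*(error|warning|fail)' /var/log/messages | tail -n 50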

Comment 4 Brendan Shephard 2022-05-26 11:46:09 UTC
Looks like no pacemaker stuff has been installed or configured:
[heat-admin@cell1-controller-0 ~]$ sudo pcs status
Error: error running crm_mon, is pacemaker running?
  crm_mon: Error: cluster is not available on this node


Which is why that job is failing:
May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 280/360
May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
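
For reference, the check that exec is retrying can be reproduced by hand on the controller; a non-zero exit status means the cluster never reached quorum:

[heat-admin@cell1-controller-0 ~]$ sudo /sbin/pcs status | grep 'partition with quorum'
[heat-admin@cell1-controller-0 ~]$ echo $?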

The roles_data file for the cell0 deployment is missing quite a few of the services we have for standard Controller roles, so I'm not sure whether that's contributing to the issue.

But the pacemaker service is failing to start because corosync is failing:
[heat-admin@cell1-controller-0 ~]$ sudo systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.
[heat-admin@cell1-controller-0 ~]$ ^start^status
sudo systemctl status pacemaker
○ pacemaker.service - Pacemaker High Availability Cluster Manager
     Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
     Active: inactive (dead)
       Docs: man:pacemakerd
             https://clusterlabs.org/pacemaker/doc/

May 26 11:38:12 cell1-controller-0.redhat.local systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
May 26 11:38:12 cell1-controller-0.redhat.local systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
May 26 11:43:27 cell1-controller-0.redhat.local systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
May 26 11:43:27 cell1-controller-0.redhat.local systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
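
The status output doesn't name the failed dependency, so to chase it down I'd walk the unit's dependency tree and list any failed units, e.g.:

[heat-admin@cell1-controller-0 ~]$ systemctl list-dependencies pacemaker
[heat-admin@cell1-controller-0 ~]$ systemctl --failed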

corosync is failing because there is no corosync.conf file:
[heat-admin@cell1-controller-0 ~]$ sudo systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xeu corosync.service" for details.
[heat-admin@cell1-controller-0 ~]$ ^start^status
sudo systemctl status corosync
× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Thu 2022-05-26 11:44:46 UTC; 7s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 49367 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
   Main PID: 49367 (code=exited, status=8)
        CPU: 4ms

May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: Starting Corosync Cluster Engine...
May 26 11:44:46 cell1-controller-0.redhat.local corosync[49367]: Can't read file /etc/corosync/corosync.conf: No such file or directory
May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: corosync.service: Failed with result 'exit-code'.
May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: Failed to start Corosync Cluster Engine.


Adding dfg:pidone to assist on this one.

Comment 10 Bogdan Dobrelya 2022-06-03 11:34:19 UTC
It seems this is related to multi-stack export/import rather than to multi-cells. Adding DFG:DF.

Comment 12 James Slagle 2022-06-03 12:18:23 UTC
The export files created from "overcloud export" are now created automatically during the deployment. You can find them in the working dir for each stack. E.g., ~/overcloud-deploy/overcloud/overcloud-export.yaml and ~/overcloud-deploy/cell1/cell1-export.yaml.

We checked the environment and those files contain the pacemaker data.
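
A quick sanity check, assuming the export data lives under keys mentioning pacemaker (paths as above):

    $ grep -i pacemaker ~/overcloud-deploy/overcloud/overcloud-export.yaml
    $ grep -i pacemaker ~/overcloud-deploy/cell1/cell1-export.yaml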

I'm not sure of the intended difference between "overcloud export" and "cell export", or which one should be used when. That will need to be confirmed with Compute; it looks like Martin Schuppert wrote most of "cell export" (overcloud_cell.py). overcloud_cell.py has not been updated for the ephemeral Heat changes, so it's not going to work: it is still trying to pass a Heat client to the export function.

Comment 13 Bogdan Dobrelya 2022-06-03 12:21:03 UTC
@jparker would you try to use that auto-generated file in

./overcloud_cell_deployment.sh

instead of /home/stack/cell1/cell-input.yaml?

I think the latter is no longer needed for OSP 17? I'm not sure; maybe a cells export CLI is, or should, still be a thing there.
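
Something like this is what I have in mind, as a sketch only (the stack name and the extra environment files are whatever your existing script already passes):

    $ openstack overcloud deploy --stack cell1 \
        --templates \
        -e ~/overcloud-deploy/overcloud/overcloud-export.yaml \
        -e <your existing cell environment files>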

Comment 15 James Slagle 2022-06-03 18:53:06 UTC
The linked patch may be needed as well:
844636: Update cell export for ephemeral heat | https://review.opendev.org/c/openstack/python-tripleoclient/+/844636

Comment 18 Artom Lifshitz 2022-06-08 15:48:39 UTC
Tentatively assigning Rajesh to this, as we suspect the mechanics of the fix will overlap with https://bugzilla.redhat.com/show_bug.cgi?id=1759007, which he is working on.

Comment 22 Bogdan Dobrelya 2022-06-23 13:58:50 UTC
Pushing this out of 17.0 for 17.1

Comment 55 Bogdan Dobrelya 2022-08-01 15:13:44 UTC
Let me clarify: this is a regression that would block multi-cell v2 upgrades.

Comment 69 Artom Lifshitz 2022-08-22 19:46:44 UTC
Known issue for 17.0 GA is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=2120398; this will need a bug fix release note.

Comment 72 Bogdan Dobrelya 2022-09-12 15:45:52 UTC
*** Bug 2126157 has been marked as a duplicate of this bug. ***

Comment 115 Ollie Walsh 2022-10-24 22:23:35 UTC
Export was silently failing:

    $ openstack overcloud cell export -f --output-file /tmp/cell-input.yaml 
    File /home/stack/overcloud-deploy/overcloud/config-download/overcloud/overcloud/group_vars/overcloud.json was not found during export
    No data returned to export AllNodesExtraMapData from.
    Cell information exported to /tmp/cell-input.yaml.
    ...
    ...
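
A minimal way to confirm the "silent" part, assuming the exit code is the only machine-readable failure signal (it presumably still returns 0 despite the missing group_vars):

    $ openstack overcloud cell export -f --output-file /tmp/cell-input.yaml; echo $?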

Comment 132 errata-xmlrpc 2023-08-16 01:11:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577