Bug 2089512 - Multi-stack v2 with TLS-E - reimplement to fix all regressions as of upstream Wallaby
Summary: Multi-stack v2 with TLS-E - reimplement to fix all regressions as of upstream...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 17.1
Assignee: Bogdan Dobrelya
QA Contact: James Parker
URL:
Whiteboard:
Duplicates: 2126157
Depends On: 2126157 2137904
Blocks: 1759007 1760256
 
Reported: 2022-05-23 20:22 UTC by James Parker
Modified: 2023-08-16 01:11 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-1.20221226005336.9ada52b.el9ost
Doc Type: Bug Fix
Doc Text:
The multi-cell and multi-stack overcloud features were not available in RHOSP 17.0 due to regressions. These regressions have been fixed, and multi-cell and multi-stack deployments are supported in RHOSP 17.1.
Clone Of:
Environment:
Last Closed: 2023-08-16 01:11:11 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 541747 0 None MERGED OSP17: Multi-cell post tasks to embrace nova-less UC 2023-01-11 15:22:21 UTC
OpenStack gerrit 845831 0 None MERGED Add GlobalConfig to saved stack outputs 2022-09-29 14:28:49 UTC
OpenStack gerrit 851241 0 None MERGED Update cell export for ephemeral heat 2022-09-29 14:28:54 UTC
OpenStack gerrit 851853 0 None MERGED Add cell export to overcloud deploy 2022-09-29 14:28:55 UTC
OpenStack gerrit 861306 0 None MERGED Run nova_wait_for_compute_service only for the default cell deployment 2022-10-17 17:45:56 UTC
OpenStack gerrit 861618 0 None MERGED Fix insertafter for managing cell host entries 2023-01-11 15:22:00 UTC
OpenStack gerrit 861890 0 None MERGED Support ansible inventory merging 2023-01-11 15:22:04 UTC
OpenStack gerrit 862144 0 None MERGED Create entry for empty groups in inventory_rolemap 2023-01-11 15:22:07 UTC
OpenStack gerrit 862645 0 None MERGED Fix overcloud cell export paths 2023-01-11 15:22:11 UTC
OpenStack gerrit 864475 0 None MERGED Use python to template cell urls 2023-01-11 15:22:14 UTC
Red Hat Issue Tracker OSP-15359 0 None None None 2022-05-23 20:28:30 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:11:50 UTC

Comment 1 Brendan Shephard 2022-05-25 03:14:07 UTC
Hey, do we have the logs from that controller?
2022-05-23 16:05:54.796051 | 525400f4-e180-641e-7ecb-000000005e12 |    WAITING | Wait for puppet host configuration to finish | cell1-controller-0 | 204 retries left

Puppet failed on the controller, so we need to see what failed in /var/log/messages there.

Comment 4 Brendan Shephard 2022-05-26 11:46:09 UTC
Looks like no pacemaker stuff has been installed or configured:
[heat-admin@cell1-controller-0 ~]$ sudo pcs status
Error: error running crm_mon, is pacemaker running?
  crm_mon: Error: cluster is not available on this node


Which is why that job is failing:
May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 280/360
May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
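
The exec in the log above polls `pcs status` until the output contains 'partition with quorum', for up to 360 tries. A minimal Python sketch of that polling pattern follows. It is an illustration only, not the puppet-pacemaker implementation: the names `wait_for_settle` and `pcs_status`, the injectable `run_status` callable, and the 10-second interval are assumptions (the log does not show the interval).

```python
import subprocess
import time
from typing import Callable


def pcs_status() -> str:
    """Return whatever `pcs status` prints; empty string if it cannot be run."""
    try:
        result = subprocess.run(
            ["/sbin/pcs", "status"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


def wait_for_settle(
    run_status: Callable[[], str] = pcs_status,
    tries: int = 360,          # matches "Exec try 280/360" in the log
    try_sleep: float = 10.0,   # assumed interval, not shown in the log
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the status output reports quorum.

    Returns the attempt number that succeeded and raises TimeoutError
    once all tries are used up.
    """
    for attempt in range(1, tries + 1):
        if "partition with quorum" in run_status():
            return attempt
        if attempt < tries:
            sleep(try_sleep)
    raise TimeoutError(f"no quorum after {tries} tries")
```

On a node where corosync never starts, as here, such a loop can only end in the timeout, which is consistent with the long "retries left" wait seen in the deployment output.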

The roles_data file for the cell0 deployment is missing quite a few of the services we have for standard Controller roles. So I'm not sure if that's causing any issue.

But the pacemaker service is failing to start because corosync is failing:
[heat-admin@cell1-controller-0 ~]$ sudo systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.
[heat-admin@cell1-controller-0 ~]$ ^start^status
sudo systemctl status pacemaker
○ pacemaker.service - Pacemaker High Availability Cluster Manager
     Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
     Active: inactive (dead)
       Docs: man:pacemakerd
             https://clusterlabs.org/pacemaker/doc/

May 26 11:38:12 cell1-controller-0.redhat.local systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
May 26 11:38:12 cell1-controller-0.redhat.local systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
May 26 11:43:27 cell1-controller-0.redhat.local systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
May 26 11:43:27 cell1-controller-0.redhat.local systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.

corosync is failing because there is no corosync.conf file:
[heat-admin@cell1-controller-0 ~]$ sudo systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xeu corosync.service" for details.
[heat-admin@cell1-controller-0 ~]$ ^start^status
sudo systemctl status corosync
× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Thu 2022-05-26 11:44:46 UTC; 7s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 49367 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
   Main PID: 49367 (code=exited, status=8)
        CPU: 4ms

May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: Starting Corosync Cluster Engine...
May 26 11:44:46 cell1-controller-0.redhat.local corosync[49367]: Can't read file /etc/corosync/corosync.conf: No such file or directory
May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: corosync.service: Failed with result 'exit-code'.
May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: Failed to start Corosync Cluster Engine.


Adding dfg:pidone to assist on this one.

Comment 10 Bogdan Dobrelya 2022-06-03 11:34:19 UTC
It seems that this is rather related to multi-stack export/import than to multi-cells. Adding DFG:DF.

Comment 12 James Slagle 2022-06-03 12:18:23 UTC
The export files created from "overcloud export" are now created automatically during the deployment. You can find them in the working dir for each stack. E.g., ~/overcloud-deploy/overcloud/overcloud-export.yaml and ~/overcloud-deploy/cell1/cell1-export.yaml.
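
The path convention described above can be sketched as a small Python helper. This is an illustration only: the function name `export_file_path` is hypothetical, and the pattern `<working dir>/<stack>/<stack>-export.yaml` is inferred from the two example paths in this comment, not taken from the tripleoclient source.

```python
from pathlib import Path
from typing import Optional


def export_file_path(stack: str, base_dir: Optional[Path] = None) -> Path:
    """Build the expected location of a stack's auto-generated export file.

    Assumes the layout <base_dir>/<stack>/<stack>-export.yaml, with
    ~/overcloud-deploy as the default working directory.
    """
    if base_dir is None:
        base_dir = Path.home() / "overcloud-deploy"
    return base_dir / stack / f"{stack}-export.yaml"
```

For example, `export_file_path("cell1", Path("/home/stack/overcloud-deploy"))` yields `/home/stack/overcloud-deploy/cell1/cell1-export.yaml`.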

We checked the environment and those files contain the pacemaker data.

I'm not sure of the intended difference between "overcloud export" and "cell export", or which one should be used when. That will need to be confirmed with Compute; it looks like Martin Schupper wrote most of "cell export" in overcloud_cell.py. overcloud_cell.py has not been updated for the ephemeral Heat changes, so it's not going to work: it still tries to pass a Heat client to the export function.

Comment 13 Bogdan Dobrelya 2022-06-03 12:21:03 UTC
@jparker would you try to use that auto-generated file in

./overcloud_cell_deployment.sh

instead of /home/stack/cell1/cell-input.yaml ?

I think the latter one is no longer needed for OSP 17? I'm not sure; maybe the cell export CLI is, or should still be, a thing there.

Comment 15 James Slagle 2022-06-03 18:53:06 UTC
The linked patch may be needed as well:
844636: Update cell export for ephemeral heat | https://review.opendev.org/c/openstack/python-tripleoclient/+/844636

Comment 18 Artom Lifshitz 2022-06-08 15:48:39 UTC
Tentatively assigning Rajesh to this, as we suspect the mechanics of the fix will overlap with https://bugzilla.redhat.com/show_bug.cgi?id=1759007, which he is working on.

Comment 22 Bogdan Dobrelya 2022-06-23 13:58:50 UTC
Pushing this out of 17.0 for 17.1

Comment 55 Bogdan Dobrelya 2022-08-01 15:13:44 UTC
Let me clarify, this is a regression that would block multi-cell v2 upgrades.

Comment 69 Artom Lifshitz 2022-08-22 19:46:44 UTC
The known issue for 17.0 GA is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=2120398. This will need a bug fix release note.

Comment 72 Bogdan Dobrelya 2022-09-12 15:45:52 UTC
*** Bug 2126157 has been marked as a duplicate of this bug. ***

Comment 115 Ollie Walsh 2022-10-24 22:23:35 UTC
Export was silently failing:

    $ openstack overcloud cell export -f --output-file /tmp/cell-input.yaml 
    File /home/stack/overcloud-deploy/overcloud/config-download/overcloud/overcloud/group_vars/overcloud.json was not found during export
    No data returned to export AllNodesExtraMapData from.
    Cell information exported to /tmp/cell-input.yaml.
    ...
    ...
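
One way to keep an export like the one above from failing silently is to treat a missing input file as a hard error instead of a warning. The Python sketch below is hypothetical: it is neither the tripleoclient code nor the fix that was merged, and `ExportError` and `load_group_vars` are invented names. It only assumes, based on the output above, that the export reads a JSON group_vars file and looks up a key such as `AllNodesExtraMapData`.

```python
import json
from pathlib import Path


class ExportError(RuntimeError):
    """Raised when data required for the export is missing."""


def load_group_vars(path: Path, required_key: str) -> dict:
    """Load a JSON group_vars file and insist that required_key is present.

    Raises ExportError instead of printing a warning, so the caller
    cannot go on to report a successful export with incomplete data.
    """
    if not path.is_file():
        raise ExportError(f"{path} was not found during export")
    data = json.loads(path.read_text())
    if required_key not in data:
        raise ExportError(f"No data returned to export {required_key} from.")
    return data
```

With this shape, the "Cell information exported" message would only be printed after every required input has been read successfully.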

Comment 132 errata-xmlrpc 2023-08-16 01:11:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

