Bug 2089512
Summary: | Multi-stack v2 with TLS-E - reimplement to fix all regressions as of upstream Wallaby | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | James Parker <jparker> |
Component: | openstack-tripleo-heat-templates | Assignee: | Bogdan Dobrelya <bdobreli> |
Status: | CLOSED ERRATA | QA Contact: | James Parker <jparker> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 17.1 (Wallaby) | CC: | alee, alifshit, bdobreli, bshephar, ggrasza, hbrock, hjensas, igallagh, jgrosso, johfulto, jschluet, jslagle, lmiccini, mburns, mschuppe, owalsh, ramishra, skaplons, smooney |
Target Milestone: | rc | Keywords: | Regression, Triaged |
Target Release: | 17.1 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-heat-templates-14.3.1-1.20221226005336.9ada52b.el9ost | Doc Type: | Bug Fix |
Doc Text: |
The multi-cell and multi-stack overcloud features were not available in RHOSP 17.0 due to regressions. The regressions have been fixed, and multi-cell and multi-stack deployments are supported in RHOSP 17.1.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2023-08-16 01:11:11 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 2126157, 2137904 | ||
Bug Blocks: | 1759007, 1760256 |
Comment 1
Brendan Shephard
2022-05-25 03:14:07 UTC
Looks like no pacemaker stuff has been installed or configured:

    [heat-admin@cell1-controller-0 ~]$ sudo pcs status
    Error: error running crm_mon, is pacemaker running?
      crm_mon: Error: cluster is not available on this node

Which is why that job is failing:

    May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 280/360
    May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
    May 23 15:27:16 cell1-controller-0 puppet-user[16280]: Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'

The roles_data file for the cell0 deployment is missing quite a few of the services we have for standard Controller roles, so I'm not sure if that's causing any issue. But the pacemaker service is failing to start because corosync is failing:

    [heat-admin@cell1-controller-0 ~]$ sudo systemctl start pacemaker
    A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.
    [heat-admin@cell1-controller-0 ~]$ ^start^status
    sudo systemctl status pacemaker
    ○ pacemaker.service - Pacemaker High Availability Cluster Manager
         Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
         Active: inactive (dead)
           Docs: man:pacemakerd
                 https://clusterlabs.org/pacemaker/doc/
    May 26 11:38:12 cell1-controller-0.redhat.local systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
    May 26 11:38:12 cell1-controller-0.redhat.local systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
    May 26 11:43:27 cell1-controller-0.redhat.local systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
    May 26 11:43:27 cell1-controller-0.redhat.local systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
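The wait-for-settle Exec that puppet keeps retrying in the debug log above boils down to polling `pcs status` until the node reports quorum. A minimal sketch of that loop, with `STATUS_CMD` and `MAX_TRIES` as hypothetical parameters so the logic can be exercised without a running cluster (the real Exec also sleeps between tries):

```shell
#!/bin/sh
# Sketch of puppet's wait-for-settle check: poll cluster status until
# the output reports "partition with quorum" or we run out of tries.
# STATUS_CMD and MAX_TRIES are parameterised here (hypothetical) so the
# loop can be tested without pacemaker; on a controller STATUS_CMD
# would be the real `/sbin/pcs status`.
STATUS_CMD="${STATUS_CMD:-/sbin/pcs status}"
MAX_TRIES="${MAX_TRIES:-360}"

wait_for_settle() {
    i=1
    while [ "$i" -le "$MAX_TRIES" ]; do
        if $STATUS_CMD 2>/dev/null | grep -q 'partition with quorum'; then
            echo "quorum reached after $i tries"
            return 0
        fi
        # The real puppet Exec sleeps between retries; omitted here.
        i=$((i + 1))
    done
    echo "no quorum after $MAX_TRIES tries" >&2
    return 1
}
```

With corosync never having started, this check can only exhaust its 360 tries, which matches the `Exec try 280/360` lines in the log.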
corosync is failing because there is no corosync.conf file:

    [heat-admin@cell1-controller-0 ~]$ sudo systemctl start corosync
    Job for corosync.service failed because the control process exited with error code.
    See "systemctl status corosync.service" and "journalctl -xeu corosync.service" for details.
    [heat-admin@cell1-controller-0 ~]$ ^start^status
    sudo systemctl status corosync
    × corosync.service - Corosync Cluster Engine
         Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
         Active: failed (Result: exit-code) since Thu 2022-05-26 11:44:46 UTC; 7s ago
           Docs: man:corosync
                 man:corosync.conf
                 man:corosync_overview
        Process: 49367 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
       Main PID: 49367 (code=exited, status=8)
            CPU: 4ms
    May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: Starting Corosync Cluster Engine...
    May 26 11:44:46 cell1-controller-0.redhat.local corosync[49367]: Can't read file /etc/corosync/corosync.conf: No such file or directory
    May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
    May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: corosync.service: Failed with result 'exit-code'.
    May 26 11:44:46 cell1-controller-0.redhat.local systemd[1]: Failed to start Corosync Cluster Engine.

Adding dfg:pidone to assist on this one.

It seems that this relates more to multi-stack export/import than to multi-cell. Adding DFG:DF.

The export files created by "overcloud export" are now created automatically during the deployment. You can find them in the working dir for each stack, e.g. ~/overcloud-deploy/overcloud/overcloud-export.yaml and ~/overcloud-deploy/cell1/cell1-export.yaml. We checked the environment and those files contain the pacemaker data. I'm not sure of the intended difference between "overcloud export" and "cell export" and which one should be used when.
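The failure chain above is: pacemaker.service depends on corosync, and corosync exits with status 8 because its configuration file is absent. A minimal sketch of a precondition check for that exact condition, with `COROSYNC_CONF` parameterised (a hypothetical knob, not a real corosync option) so it can be run outside a real controller:

```shell
#!/bin/sh
# Sketch: verify corosync's prerequisite before trying to start the
# cluster. Missing /etc/corosync/corosync.conf is exactly what produces
# the "status=8" exit seen in the systemctl output above.
# COROSYNC_CONF is parameterised here (hypothetical) for testability.
COROSYNC_CONF="${COROSYNC_CONF:-/etc/corosync/corosync.conf}"

check_corosync_conf() {
    if [ -f "$COROSYNC_CONF" ]; then
        echo "corosync config present: $COROSYNC_CONF"
        return 0
    fi
    echo "corosync config MISSING: $COROSYNC_CONF" >&2
    return 1
}

# Example usage:
#   check_corosync_conf || echo "fix the config before starting pacemaker" >&2
```

On the failing cell1 controller this would report the config as missing, which points at the deployment never having run the puppet/pacemaker setup rather than at a corosync bug.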
That will need to be confirmed with Compute; it looks like Martin Schuppert wrote most of "cell export" in overcloud_cell.py. overcloud_cell.py has not been updated for the ephemeral Heat changes, so it's not going to work: it still tries to pass a Heat client in to the export function.

@jparker would you try using that auto-generated file in ./overcloud_cell_deployment.sh instead of /home/stack/cell1/cell-input.yaml? I think the latter one is no longer needed for OSP 17. I'm not sure; maybe the cells export CLI is/should still be a thing there.

The linked patch may be needed as well: 844636: Update cell export for ephemeral heat | https://review.opendev.org/c/openstack/python-tripleoclient/+/844636

Tentatively assigning Rajesh to this, as we suspect the mechanics of the fix will overlap with https://bugzilla.redhat.com/show_bug.cgi?id=1759007, which he is working on.

Pushing this out of 17.0 for 17.1.

Let me clarify: this is a regression that would block multi-cell v2 upgrades. The known issue for 17.0 GA is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=2120398; this will need a bug-fix release note.

*** Bug 2126157 has been marked as a duplicate of this bug. ***

The export was silently failing:

    $ openstack overcloud cell export -f --output-file /tmp/cell-input.yaml
    File /home/stack/overcloud-deploy/overcloud/config-download/overcloud/overcloud/group_vars/overcloud.json was not found during export
    No data returned to export AllNodesExtraMapData from.
    Cell information exported to /tmp/cell-input.yaml.
    ...
    ...

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577
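The "silently failing" export above prints a warning about the missing group_vars JSON and still writes an (effectively empty) export file. A validation step like the following, a sketch with hypothetical names (`load_group_vars` is not a real tripleoclient function), would turn that silent failure into a hard error before anything gets written:

```python
#!/usr/bin/env python3
"""Sketch: fail loudly when export input data is missing.

`load_group_vars` is a hypothetical helper, not tripleoclient API; the
path layout mirrors the one in the error message above:
<working_dir>/config-download/<stack>/<stack>/group_vars/<stack>.json
"""
import json
from pathlib import Path


def load_group_vars(working_dir: str, stack: str) -> dict:
    """Load the per-stack group_vars JSON, raising if it is absent."""
    path = (
        Path(working_dir)
        / "config-download"
        / stack
        / stack
        / "group_vars"
        / f"{stack}.json"
    )
    if not path.is_file():
        # Refuse to continue instead of emitting an empty export,
        # which is what the real command did in the log above.
        raise FileNotFoundError(
            f"{path} was not found; refusing to write an empty export"
        )
    return json.loads(path.read_text())
```

With a check like this, the missing overcloud.json would abort `cell export` immediately instead of producing a cell-input.yaml with no AllNodesExtraMapData, which is much easier to diagnose.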