Bug 1666804 - [RFE] Support for cinder-backup Active / Active
Summary: [RFE] Support for cinder-backup Active / Active
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: beta
Target Release: 17.1
Assignee: Alan Bishop
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 1783210 2167954
Blocks: 2128012
 
Reported: 2019-01-16 15:33 UTC by Alan Bishop
Modified: 2024-10-01 16:13 UTC
CC: 13 users

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-1.20230423001017.004ef6e.el9ost
Doc Type: Enhancement
Doc Text:
With this update, the `cinder-backup` service can now be deployed in Active/Active mode.
Clone Of: 1665191
Environment:
Last Closed: 2023-08-16 01:09:21 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1849668 0 None None None 2019-10-24 13:00:15 UTC
OpenStack gerrit 786722 0 None MERGED Support removing cinder-backup from pcmk control 2021-04-22 20:41:35 UTC
OpenStack gerrit 879673 0 None MERGED Fix cinder-backup when running active/active 2023-04-12 17:13:46 UTC
Red Hat Issue Tracker OSP-2843 0 None None None 2022-03-13 17:10:57 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:10:48 UTC

Description Alan Bishop 2019-01-16 15:33:39 UTC
As per a comment in the BZ from which this bug was cloned, we should consider deploying cinder-backup in an active/active configuration, or change the a/p deployment under pacemaker so that a separate cinder service entry isn't created for each pacemaker node that happens to run the service.

+++ This bug was initially created as a clone of Bug #1665191 +++

Description of problem:

In a three controller environment, when the cinder-backup service moves from one controller to another, a cinder-backup service entry is left behind for the old controller host and is in a DOWN state.  This stale entry can be misleading and can be flagged up by service monitoring software.

The stale entry can be seen by running:
openstack volume service list --service cinder-backup
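
For illustration, the output after a failover looks roughly like the following (hostnames and timestamps are made up; the exact host strings depend on the deployment):

  +---------------+--------------+------+---------+-------+----------------------------+
  | Binary        | Host         | Zone | Status  | State | Updated At                 |
  +---------------+--------------+------+---------+-------+----------------------------+
  | cinder-backup | controller-1 | nova | enabled | up    | 2019-01-16T15:30:00.000000 |
  | cinder-backup | controller-0 | nova | enabled | down  | 2019-01-16T15:10:00.000000 |
  +---------------+--------------+------+---------+-------+----------------------------+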

Version-Release number of selected component (if applicable):


How reproducible:

Easy to reproduce

Steps to Reproduce:
1. Move the cinder-backup service by rebooting the node hosting it (or by using pcs to move it)
2. The cinder-backup service will start running on another controller in the cluster
3. Viewing the cinder-backup service list will show a stale entry for the previous controller
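
For step 1, moving the service with pcs could look like this (run on a controller as root; the resource name shown is the usual one in a TripleO pacemaker deployment and may differ, and exact pcs behaviour varies between releases):

  pcs resource move openstack-cinder-backup controller-1
  # afterwards, remove the location constraint created by the move
  pcs resource clear openstack-cinder-backup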

Actual results:

When listing the volume services, a cinder-backup entry is shown for both the active controller and the previously active controller (the latter reported as down)

Expected results:

Viewing the volume services should show only the active cinder-backup service

--- Additional comment from Jon Bernard on 2019-01-10 19:51:24 UTC ---

This has to be manually removed from the database; cinder does not do this automatically. I believe this is by design, so that DB entries remain consistent.
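
For reference, a stale entry can be cleared by hand with cinder-manage rather than raw SQL (the host name below is illustrative; in a containerized deployment the command would be run inside the cinder API container):

  cinder-manage service list
  cinder-manage service remove cinder-backup controller-0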

--- Additional comment from Jon Bernard on 2019-01-10 23:47:55 UTC ---

Closing for now; there’s no engineering solution to resolve this in the short term. Scripts could be provided, but nothing within cinder proper. Please reopen if you strongly disagree.

--- Additional comment from Kellen Gattis on 2019-01-11 15:21:11 UTC ---

Apologies, I should have put this against rhosp-director as I agree that it's not a cinder-specific issue, but rather a consequence of rhosp director's HA implementation of cinder-backup.


Thanks.

--- Additional comment from Alan Bishop on 2019-01-11 16:16:27 UTC ---

Well, I don't think this will yield what you want. The director takes care of the deployment, but isn't involved in what goes on when pacemaker starts cinder-backup on another node. In fact, while I understand it feels wrong to see the cinder-backup service "down" on the prior node, neither cinder nor pacemaker has any basis for treating this as a stale service entry. If pacemaker were to restart the service on the original node, the service would report itself "up" again (although the other one would then report "down").

cinder-backup under pacemaker has always behaved this way, and it's more of a side effect of that model than a bug. But again, I'm not trying to diminish the fact that the behaviour is less than ideal.

I think what we really need is an RFE that improves the cinder-backup service's deployment model. One approach would have the service use a common identifier (like the cinder-volume service's use of "hostgroup") instead of each node's hostname. Another would be to run cinder-backup active/active (i.e. NOT under pacemaker). Note: cinder-backup supports a/a, whereas the cinder-volume service will only have limited a/a support in OSP-15.
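
A conceptual sketch of the first (common identifier) approach, for illustration only and not the change that was eventually merged: point every node's cinder-backup at a shared identifier via the standard [DEFAULT] host option in the configuration read by cinder-backup, so the service always registers under one name regardless of where pacemaker runs it:

  [DEFAULT]
  host = hostgroup

The override would have to be scoped to cinder-backup's own config, since changing host globally would also rename the other cinder services on that node.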

To summarize, I think this BZ should be recast as an RFE.

Comment 27 Luigi Toscano 2023-05-02 12:46:29 UTC
The feature is enabled by changing the command line of `openstack overcloud deploy` from

  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \

to 

  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup-active-active.yaml \


With the new environment file, all cinder-backup tempest tests (and a few others) show the same results as in the non-A/A deployment for all three cinder-backup backends (Swift, the default; Ceph; and file/NFS).
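
A quick sanity check after deploying with the A/A environment file is to list the backup services; each controller's instance reports under its own host, and all of them should be up (hostnames illustrative):

  openstack volume service list --service cinder-backup
  | cinder-backup | controller-0 | nova | enabled | up | ... |
  | cinder-backup | controller-1 | nova | enabled | up | ... |
  | cinder-backup | controller-2 | nova | enabled | up | ... |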

Undercloud:
openstack-tripleo-common-15.4.1-1.20230421001014.85a3a44.el9ost.noarch
openstack-tripleo-common-containers-15.4.1-1.20230421001014.85a3a44.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-1.20230423001017.004ef6e.el9ost.noarch
python3-tripleo-common-15.4.1-1.20230421001014.85a3a44.el9ost.noarch
python3-tripleoclient-16.5.1-1.20230421001504.78730a3.el9ost.noarch
tripleo-ansible-3.3.1-1.20230423001017.a5af4ea.el9ost.noarch

Overcloud (cinder containers):
openstack-cinder-18.2.2-1.20230411050850.109f91a.el9ost
python3-os-brick-4.3.4-1.20230128060810.cf69f92.el9ost

Comment 38 errata-xmlrpc 2023-08-16 01:09:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

