Bug 1666804

Summary: [RFE] Support for cinder-backup Active/Active
Product: Red Hat OpenStack
Reporter: Alan Bishop <abishop>
Component: openstack-tripleo-heat-templates
Assignee: Alan Bishop <abishop>
Status: CLOSED ERRATA
QA Contact: Evelina Shames <eshames>
Severity: medium
Priority: high
Version: 13.0 (Queens)
CC: adhingra, aefrat, gcharot, geguileo, gfidente, kgilliga, ltoscano, mariel, mburns, morazi, rheslop, sandyada, yrabl
Target Milestone: beta
Keywords: FutureFeature, Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-1.20230423001017.004ef6e.el9ost
Doc Type: Enhancement
Doc Text:
With this update, the `cinder-backup` service can now be deployed in Active/Active mode.
Clone Of: 1665191
Last Closed: 2023-08-16 01:09:21 UTC
Bug Depends On: 1783210, 2167954
Bug Blocks: 2128012

Description Alan Bishop 2019-01-16 15:33:39 UTC
As per a comment in the BZ from which this bug was cloned, we should consider deploying cinder-backup in an active/active configuration, or change the active/passive (a/p) deployment under pacemaker so a separate cinder service isn't created for each pacemaker node that happens to run the service.

+++ This bug was initially created as a clone of Bug #1665191 +++

Description of problem:

In a three controller environment, when the cinder-backup service moves from one controller to another, a cinder-backup service entry is left behind for the old controller host and is in a DOWN state.  This stale entry can be misleading and can be flagged up by service monitoring software.

The stale entry can be seen by running:
openstack volume service list --service cinder-backup
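
An illustrative listing (controller hostnames and timestamps are made up) showing the stale entry:

  +---------------+--------------+------+---------+-------+----------------------------+
  | Binary        | Host         | Zone | Status  | State | Updated At                 |
  +---------------+--------------+------+---------+-------+----------------------------+
  | cinder-backup | controller-1 | nova | enabled | up    | 2019-01-10T19:00:00.000000 |
  | cinder-backup | controller-0 | nova | enabled | down  | 2019-01-10T18:00:00.000000 |
  +---------------+--------------+------+---------+-------+----------------------------+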

Version-Release number of selected component (if applicable):


How reproducible:

Easy to reproduce

Steps to Reproduce:
1. Move the cinder-backup service by rebooting the node hosting it, or by using pcs to move it (see the example after these steps)
2. The cinder-backup service will start running on another controller in the cluster
3. Viewing the cinder-backup service list will show a stale entry for the previous controller
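
For step 1, moving the resource with pcs looks roughly like the following; the resource and node names are illustrative (in director deployments the pacemaker resource for cinder-backup is typically named openstack-cinder-backup):

  pcs resource move openstack-cinder-backup controller-1
  pcs resource clear openstack-cinder-backup   # remove the location constraint created by the move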

Actual results:

When listing the volume services, an entry for cinder-backup is shown for both the active controller and the previously active controller, with the latter reported as down.

Expected results:

Viewing the volume services should show only the active cinder-backup service

--- Additional comment from Jon Bernard on 2019-01-10 19:51:24 UTC ---

The stale entry has to be removed from the database manually; cinder does not do this automatically. I believe this is by design, so that DB entries remain consistent.
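
For reference, rather than editing the database by hand, a stale entry can usually be removed with cinder-manage (the hostname here is illustrative):

  cinder-manage service remove cinder-backup overcloud-controller-0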

--- Additional comment from Jon Bernard on 2019-01-10 23:47:55 UTC ---

Closing for now; there's no engineering solution to resolve this in the short term. Scripts could be provided, but nothing within cinder proper. Please reopen if you strongly disagree.

--- Additional comment from Kellen Gattis on 2019-01-11 15:21:11 UTC ---

Apologies, I should have filed this against rhosp-director, as I agree it's not a cinder-specific issue but rather a consequence of the director's HA implementation of cinder-backup.


Thanks.

--- Additional comment from Alan Bishop on 2019-01-11 16:16:27 UTC ---

Well, I don't think this will yield what you want. The director takes care of the deployment, but isn't involved in what goes on when pacemaker starts cinder-backup on another node. In fact, while I understand it feels wrong to see the cinder-backup service "down" on the prior node, neither cinder nor pacemaker has any basis for treating this as a stale service entry. If pacemaker were to restart the service on the original node, that service would report itself "up" again (and the other one would then report "down").

cinder-backup under pacemaker has always behaved this way, and it's more of a side effect of that model than a bug. But again, I'm not trying to diminish the fact that the behaviour is less than ideal.

I think what we really need is an RFE that improves the cinder-backup service's deployment model. One approach would have the service use a common identifier (like the cinder-volume service's use of "hostgroup") instead of each node's hostname. Another would be to run cinder-backup active/active (i.e. NOT under pacemaker). Note: cinder-backup supports a/a, whereas the cinder-volume service will only have limited a/a support in OSP-15.
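
A rough cinder.conf sketch of the two models; backend_host and cluster are real cinder options, but the values are only illustrative and this is not what director actually renders:

  [DEFAULT]
  # Model 1: every node reports under one shared identifier, the way
  # cinder-volume uses "hostgroup" under pacemaker
  backend_host = hostgroup

  # Model 2: active/active; each node keeps its own hostname but joins a
  # common cluster, so backups keep running as long as any member is up
  cluster = tripleo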

To summarize, I think this BZ should be recast as an RFE.

Comment 27 Luigi Toscano 2023-05-02 12:46:29 UTC
The feature is enabled by changing the command line of `openstack overcloud deploy` from

  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \

to 

  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup-active-active.yaml \
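
For context, only that one argument changes; in a full command it sits alongside the deployment's other environment files, which are just placeholders below:

  openstack overcloud deploy --templates \
    --environment-file /home/stack/containers-prepare-parameter.yaml \
    --environment-file /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup-active-active.yaml \
    --environment-file /home/stack/custom-overrides.yaml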


With the new environment file, all cinder-backup tempest tests (and a few others) show the same results as the non-A/A deployment for all three cinder-backup backends (Swift, the default; Ceph; and file/NFS).
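
With A/A, every controller runs cinder-backup at the same time, so a healthy cluster should look roughly like this (hostnames are illustrative):

  openstack volume service list --service cinder-backup -c Binary -c Host -c State
  +---------------+--------------+-------+
  | Binary        | Host         | State |
  +---------------+--------------+-------+
  | cinder-backup | controller-0 | up    |
  | cinder-backup | controller-1 | up    |
  | cinder-backup | controller-2 | up    |
  +---------------+--------------+-------+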

Undercloud:
openstack-tripleo-common-15.4.1-1.20230421001014.85a3a44.el9ost.noarch
openstack-tripleo-common-containers-15.4.1-1.20230421001014.85a3a44.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-1.20230423001017.004ef6e.el9ost.noarch
python3-tripleo-common-15.4.1-1.20230421001014.85a3a44.el9ost.noarch
python3-tripleoclient-16.5.1-1.20230421001504.78730a3.el9ost.noarch
tripleo-ansible-3.3.1-1.20230423001017.a5af4ea.el9ost.noarch

Overcloud (cinder containers):
openstack-cinder-18.2.2-1.20230411050850.109f91a.el9ost
python3-os-brick-4.3.4-1.20230128060810.cf69f92.el9ost

Comment 38 errata-xmlrpc 2023-08-16 01:09:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577