Description of problem:
In a three-controller environment, when the cinder-backup service moves from one controller to another, a cinder-backup service entry is left behind for the old controller host and remains in a DOWN state. This stale entry can be misleading and may be flagged by service monitoring software. The stale entry can be seen by running:

  openstack volume service list --service cinder-backup

Version-Release number of selected component (if applicable):

How reproducible:
Easy to reproduce

Steps to Reproduce:
1. Move the cinder-backup service by rebooting the node hosting it (or by using pcs to move it)
2. The cinder-backup service will start running on another controller in the cluster
3. View the cinder-backup service list; it will show a stale entry for the previous controller

Actual results:
When listing the volume services, an entry for cinder-backup is listed for both the active controller and the previously active controller (the latter shows as down).

Expected results:
Viewing the volume services should show only the active cinder-backup service.
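For illustration only, the output looks roughly like the following (hostnames and timestamps are made up for this example); the row for the previously active controller is the stale one, reporting State "down":

  +---------------+------------------------------------+------+---------+-------+----------------------------+
  | Binary        | Host                               | Zone | Status  | State | Updated At                 |
  +---------------+------------------------------------+------+---------+-------+----------------------------+
  | cinder-backup | overcloud-controller-0.localdomain | nova | enabled | down  | 2019-01-10T10:00:00.000000 |
  | cinder-backup | overcloud-controller-1.localdomain | nova | enabled | up    | 2019-01-10T10:05:00.000000 |
  +---------------+------------------------------------+------+---------+-------+----------------------------+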
The stale entry has to be removed from the database manually; cinder does not do this automatically. I believe this is by design, so that DB entries remain consistent.
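For reference, a minimal sketch of the manual cleanup (run on a node where cinder-manage and the cinder configuration are available; the host name below is an example and must match the stale entry in your environment):

  # Identify the stale host (the row with State "down")
  openstack volume service list --service cinder-backup

  # Remove the stale cinder-backup record for the old host
  cinder-manage service remove cinder-backup overcloud-controller-0.localdomain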
Closing for now; there's no engineering solution to resolve this in the short term. Scripts could be provided, but nothing within cinder proper. Please reopen if you strongly disagree.
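As a rough example of the kind of script mentioned above (illustrative only, not something shipped with cinder; it assumes the openstack CLI column names "Host" and "State", and that cinder-manage is available with access to the cinder database):

  #!/bin/bash
  # Remove cinder-backup service records whose State is "down".
  # Review the service list before running this against a real deployment.
  openstack volume service list --service cinder-backup -f value -c Host -c State \
    | while read -r host state; do
        if [ "$state" = "down" ]; then
          echo "Removing stale cinder-backup entry for $host"
          cinder-manage service remove cinder-backup "$host"
        fi
      done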
Apologies, I should have put this against rhosp-director as I agree that it's not a cinder-specific issue, but rather a consequence of rhosp director's HA implementation of cinder-backup. Thanks.
Well, I don't think this will yield what you want. The director takes care of the deployment, but it isn't involved in what happens when pacemaker starts cinder-backup on another node.

In fact, while I understand it feels wrong to see the cinder-backup service "down" on the prior node, neither cinder nor pacemaker has any basis for treating this as a stale service entry. If pacemaker were to restart the service on the original node, the service would report itself "up" again (although the other one would then report "down"). cinder-backup under pacemaker has always behaved this way, and it's more a side effect of that model than a bug.

But again, I'm not trying to diminish the fact that the behaviour is less than ideal. I think what we really need is an RFE that improves the cinder-backup service's deployment model. One approach would have the service use a common identifier (like the cinder-volume service's use of "hostgroup") instead of each node's hostname. Another would be to run cinder-backup active/active (i.e. NOT under pacemaker). Note: cinder-backup supports a/a, whereas the cinder-volume service will only have limited a/a support in OSP-15.

To summarize, I think this BZ should be recast as an RFE.
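To make the first approach a bit more concrete, a very rough sketch of the idea follows. It assumes the generic "host" option that cinder services use to report their name, applied only to the configuration read by the cinder-backup service on each controller (the value "hostgroup" mirrors what the director already does for cinder-volume); the actual mechanism and option handling would be worked out in the RFE, so treat this purely as an illustration:

  # Illustrative only: give cinder-backup the same service name on every controller,
  # so a failover re-uses the existing DB row instead of creating a new one.
  # crudini is used here just as a convenient way to edit the ini file.
  crudini --set /etc/cinder/cinder.conf DEFAULT host hostgroup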
I'm closing this BZ out in favor of an RFE I created (bug #1666804). The titles (and topics) are different enough to warrant tracking them in separate BZs.
Close-loop-wise, there is nothing to test/automate since this BZ is WONTFIX. However, Alan's RFE will include QE coverage/testing once it lands. Eventually I'll use the RFE to track the close-loop process for this customer request.
*** Bug 1899160 has been marked as a duplicate of this bug. ***