Bug 2106647 - Fix Cinder's pacemaker services restart due to config changes
Summary: Fix Cinder's pacemaker services restart due to config changes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Alan Bishop
QA Contact: Tzach Shefi
URL:
Whiteboard:
Depends On: 2106643
Blocks:
 
Reported: 2022-07-13 08:21 UTC by Tzach Shefi
Modified: 2022-12-07 19:23 UTC
CC List: 6 users

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20220821010130.b1e9bfe.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, in overcloud deployments that enabled the Block Storage (cinder) backup service, a stack update affecting the Block Storage configuration did not restart the Block Storage service. This caused the Block Storage service to use the old configuration. With this update, the stack update procedure ensures that both the Block Storage backup service and the Block Storage service restart when the Block Storage configuration changes. This ensures that the Block Storage service always uses the latest configuration.
Clone Of: 2106643
Environment:
Last Closed: 2022-12-07 19:23:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1981591 0 None None None 2022-07-13 14:38:18 UTC
OpenStack gerrit 849710 0 None MERGED Fix restarting cinder HA services on config change 2022-07-19 14:39:04 UTC
OpenStack gerrit 850354 0 None MERGED Fix restarting cinder HA services on config change 2022-07-25 13:05:31 UTC
Red Hat Issue Tracker OSP-17593 0 None None None 2022-07-13 08:27:31 UTC
Red Hat Product Errata RHBA-2022:8794 0 None None None 2022-12-07 19:23:57 UTC

Description Tzach Shefi 2022-07-13 08:21:01 UTC
Unfortunately, I no longer have my OSP 16 deployment, but I tested it and hit the same issue. I think the version I was using was:
openstack-tripleo-heat-templates-11.6.1-2.20220409014852.el8ost.noarch.rpm



+++ This bug was initially created as a clone of Bug #2106643 +++

Description of problem: Adding a second Ceph RBD pool to Cinder via an overcloud update may fail to restart the cinder-volume container with the updated cinder.conf, causing the expected new Cinder backend to be missing.

If we add the second pool yaml in the initial overcloud deployment, the pool/backend is always added as expected.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-14.3.1-0.20220628111342.7c969c5.el9ost.noarch

This also hits 16.2, so I will clone the BZ; it shouldn't happen on 16.1.

How reproducible:
Every time; tested at least twice, on OSP 17 and 16.2.

Steps to Reproduce:
1. On an existing deployment, add a yaml which creates/adds a new RBD pool, for example:
$ cat extra_templates.yaml
parameter_defaults:
    CephPools:
      - {"name": vol2, "pg_num": 32, "pgp_num": 32, "application": rbd} 
    CinderRbdExtraPools:
      - vol2

2. After the overcloud update completes,
we confirm on the Ceph node that the new pool has been created.
cinder.conf also includes the new pool/backend.
However, when running cinder service-list, the new pool/backend isn't listed as either up or down; it's missing altogether.


3. If you then restart the c-vol container, the backend will show up in cinder service-list as expected with an "up" status, after which a volume can be successfully created on it.
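
For reference, a minimal sketch of that manual restart on a controller node. The heat-admin login user and the openstack-cinder-volume pacemaker resource name are the usual OSP defaults and are assumptions here, not taken from this bug; confirm the actual names with pcs status first.

[heat-admin@controller-0 ~]$ sudo pcs status | grep -i cinder-volume      # confirm the resource name
[heat-admin@controller-0 ~]$ sudo pcs resource restart openstack-cinder-volume
(overcloud) [stack@undercloud-0 ~]$ cinder service-list                   # new backend should now report "up"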

Quoting Alan's RCA from an email:
the issue happens only when cinder-backup is enabled.
The tripleo-ansible role that handles restarting pacemaker services isn't aware that there are two separate pcmk cinder services to consider. It detects that cinder.conf has changed and uses that information to restart the cinder-backup service. But when it then checks again for the cinder-volume service, it reaches the wrong conclusion: it doesn't restart the cinder-volume service, because restarting c-bak tricked it into thinking c-vol was OK.
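
For context, a deployment with cinder-backup enabled has two separate pacemaker-managed cinder bundles, and each needs its own restart decision when cinder.conf changes. A hedged way to list them on a controller node (the heat-admin user and openstack-cinder-* resource names are the typical defaults, assumed here):

[heat-admin@controller-0 ~]$ sudo pcs status --full | grep -i cinder
# Expect two container bundles, one for cinder-backup and one for cinder-volume;
# both must be restarted by the role when cinder.conf changes.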


Actual results:
The new pool is created on the Ceph side,
but c-vol doesn't notice the change, so the second backend isn't added
until we restart the c-vol container, which then pulls in the config changes and adds the missing second backend.


Expected results:
The second Ceph pool should show up as a second Cinder backend pool
without having to manually restart the c-vol container.

Additional info:

Comment 4 Tzach Shefi 2022-09-04 14:35:45 UTC
Verified on:
openstack-tripleo-heat-templates-11.6.1-2.20220821010130.b1e9bfe.el8ost.noarch

Deployed a basic Ceph system; it just so happens that the job I used had created two backends:

(overcloud) [stack@undercloud-0 ~]$ cinder service-list
+------------------+---------------------------------+------+---------+-------+----------------------------+-----------------+                                                                                    
| Binary           | Host                            | Zone | Status  | State | Updated_at                 | Disabled Reason |                                                                                    
+------------------+---------------------------------+------+---------+-------+----------------------------+-----------------+                                                                                    
| cinder-backup    | controller-1                    | nova | enabled | up    | 2022-09-04T12:56:02.000000 | -               |                                                                                    
..                                                                                  
| cinder-volume    | hostgroup@tripleo_ceph          | nova | enabled | up    | 2022-09-04T12:56:03.000000 | -               |                                                                                    
| cinder-volume    | hostgroup@tripleo_ceph_fastpool | nova | enabled | up    | 2022-09-04T12:56:03.000000 | -               | 


Again, I created a yaml to add a third Cinder Ceph pool/backend:
(overcloud) [stack@undercloud-0 ~]$ cat extra_templates.yaml 
parameter_defaults:
    CephPools:
      - {"name": vol2, "pg_num": 32, "pgp_num": 32, "application": rbd} 
    CinderRbdExtraPools:
      - vol2

Added the above yaml to the overcloud_deploy.sh command and updated the overcloud. 
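
For clarity, the extra environment file is passed with an additional -e argument; a minimal sketch, assuming overcloud_deploy.sh wraps the standard deploy command and that the file lives under /home/stack (path assumed):

openstack overcloud deploy --templates \
  <existing environment files> \
  -e /home/stack/extra_templates.yaml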
As expected, the result was that the new third Ceph pool/backend now exists:


(overcloud) [stack@undercloud-0 ~]$ cinder service-list
+------------------+---------------------------------+------+---------+-------+----------------------------+-----------------+
| Binary           | Host                            | Zone | Status  | State | Updated_at                 | Disabled Reason |
+------------------+---------------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-backup    | controller-1                    | nova | enabled | up    | 2022-09-04T14:30:03.000000 | -               |
..
| cinder-volume    | hostgroup@tripleo_ceph          | nova | enabled | up    | 2022-09-04T14:30:04.000000 | -               |
| cinder-volume    | hostgroup@tripleo_ceph_fastpool | nova | enabled | down  | 2022-09-04T13:36:24.000000 | -               |
| cinder-volume    | hostgroup@tripleo_ceph_vol2     | nova | enabled | up    | 2022-09-04T14:30:04.000000 | -               |


Before the fix I had to manually restart c-vol to get the added pool/backend to show up.
Now the new backend tripleo_ceph_vol2 was created/added and is up without a manual c-vol restart, so this is good to verify.


I fear there is a new issue/bug here, which might explain why tripleo_ceph_fastpool is now down (it was up before the update); I'll consult with dev about it.

Comment 14 errata-xmlrpc 2022-12-07 19:23:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8794

