Bug 2033467

Summary: "deployed ceph" defaults set-require-min-compat-client mimic causing glance problems

Product: Red Hat OpenStack
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Target Milestone: ga
Target Release: 17.0
Reporter: John Fulton <johfulto>
Assignee: John Fulton <johfulto>
QA Contact: Yogev Rabl <yrabl>
CC: afazekas, alfrgarc, fhubik, gfidente, senrique
Severity: high
Priority: high
Status: CLOSED ERRATA
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tripleo-ansible-3.3.1-0.20220706140824.fa5422f.el9ost
Last Closed: 2022-09-21 12:18:08 UTC
Type: Bug

Description John Fulton 2021-12-16 22:43:58 UTC
When using "deployed ceph" [1] to deploy the overcloud the ceph cluster is configured with "set-require-min-compat-client mimic". 

Glance images created while the ceph cluster was in this state resulted in `glance image-delete` timing out with a 504. When directly making the same call Glance makes [2] via the RBD CLI, the following hangs:

time rbd -n client.openstack -k /etc/ceph/ceph.client.openstack.keyring --conf /etc/ceph/ceph.conf snap unprotect images/d7e638c0-3030-4ac0-a9a9-e9bd340e993c@snap
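To confirm the cluster state behind the hang, a diagnostic sketch (an assumption on my part, not from the report; it presumes admin access to the cluster, e.g. from a cephadm or ceph-admin shell) would be:

```shell
# Show the release the cluster currently requires clients to support;
# on an affected deployment this reports "mimic"
ceph osd get-require-min-compat-client

# List the feature sets of currently connected clients, to see whether
# any of them (e.g. the librbd used by the glance container) predates
# that required release
ceph features
```

Both commands require a reachable Ceph cluster and an admin keyring, so they are shown for illustration only.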

This seems connected to what is described in the following:

 https://bugs.launchpad.net/tripleo/+bug/1951433
 https://bugzilla.redhat.com/show_bug.cgi?id=2032457

We should ensure deployed ceph does not force "set-require-min-compat-client mimic". Redeploying the same cluster without this setting in the same environment resulted in being able to delete glance images.
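As an illustrative manual workaround (this is not the tripleo-ansible fix itself, and it assumes admin access and that luminous is an acceptable floor for the connected clients), the requirement could be relaxed by hand:

```shell
# Lowering the required minimum compat client is treated as a downgrade
# and therefore needs explicit confirmation
ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it

# Verify the new setting took effect
ceph osd get-require-min-compat-client
```

This only sketches the operational escape hatch; the actual fix is to stop "deployed ceph" from forcing the mimic setting in the first place.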


[1] https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/deployed_ceph.html

[2] https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/rbd.py#L456

Comment 2 Filip Hubík 2021-12-17 13:44:00 UTC
I wonder whether we should also address the secondary issue here with the Glance service itself. When the event described above happens, the service is still reported as active:
[root@controller-0 log]# systemctl status tripleo_glance_api
● tripleo_glance_api.service - glance_api container
   Loaded: loaded (/etc/systemd/system/tripleo_glance_api.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-12-15 11:06:16 UTC; 1h 25min ago
 Main PID: 108642 (conmon)
    Tasks: 0 (limit: 203963)
   Memory: 0B
   CGroup: /system.slice/tripleo_glance_api.service
           ‣ 108642 /usr/bin/conmon --api-version 1 -c d034cadf5b93055ecefe6a23f3eb647fac6e762787838f41cd2b36218bf51b35 -u d034cadf5b93055ecefe6a23f3eb647fac6e762787838f41cd2b36218bf51b35 -r /usr/bin/runc -b /var/lib/containers/storage/ov>
Dec 15 11:06:16 controller-0 systemd[1]: Starting glance_api container...
Dec 15 11:06:16 controller-0 systemd[1]: Started glance_api container.

Though it really broke down catastrophically and stopped logging, any request from Tempest's side now gets a 503 (presumably from httpd one step before?). The container still seems to be running; only "systemctl restart tripleo_glance_api" recovers the original "usable" state. I'd expect the service to go down and the container to crash in this case.

Maybe we need to address this as another BZ against Glance?

Comment 3 John Fulton 2021-12-17 13:51:57 UTC
(In reply to Filip Hubík from comment #2)
> I wonder whether we should also address the secondary issue here with the
> Glance service itself. When the event described above happens, the service
> is still reported as active:
> [systemctl status output and follow-up quoted verbatim from comment 2 snipped]
> 
> Maybe we need to address this as another BZ against Glance?

We already have that bug: https://bugzilla.redhat.com/show_bug.cgi?id=2032457

Comment 7 John Fulton 2022-01-04 12:12:58 UTC
*** Bug 2036868 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-09-21 12:18:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543