Bug 1947786

Summary: Glance Multi Store: Missing ceph configuration for a DCN site causes glance_api containers to get stuck in a restart loop
Product:          Red Hat OpenStack
Component:        openstack-glance
Version:          16.1 (Train)
Status:           NEW
Severity:         medium
Priority:         unspecified
Reporter:         Sadique Puthen <sputhenp>
Assignee:         Cyril Roelandt <cyril>
Docs Contact:     Andy Stillman <astillma>
CC:               athomas, eglynn, ekuvaja, gfidente, johfulto, pdeore, udesale
Flags:            cyril: needinfo? (pdeore)
Target Milestone: ---
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         If docs needed, set a value
Type:             Bug
Regression:       ---
Attachments:
  glance-api.log from /var/log/containers/stdout directory
  /var/log/containers/glance/api.log

Description Sadique Puthen 2021-04-09 09:15:09 UTC
Description of problem:

I added my second glance store's details, for DCN2, to my central cluster [1]. During this process I forgot to pass the Ceph details for the second Ceph cluster, which are normally generated by running the command below.

sudo -E openstack overcloud export ceph --stack edge-1,edge-ceph --config-download-dir /var/lib/mistral --output-file /home/stack/templates/osp-16-1/edge-ceph.yaml

The end result is that the glance_api containers on the central site get stuck in a restart loop, which takes Glance completely down for the central site and for all DCN sites that were already working properly. When I run a stack update again with the Ceph details for the new DCN site, everything is back to normal. Glance is obviously stuck in a restart loop because it cannot read the new Ceph cluster's details, but the impact is a total cluster outage.

We should solve this. Either we need to add validation in tripleo that rejects a multi-store config if the corresponding Ceph details are not provided, or glance_api should be fixed so that it remains up and rejects image copies or VM provisioning only for the new site whose Ceph configuration details are missing. In any case, a total cluster outage is not acceptable for a customer.
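As a rough illustration of the first option, a pre-start check could refuse to launch glance-api when a backend references a ceph configuration file that does not exist on disk. This is only a sketch of the idea, not an existing TripleO validation; it assumes the standard glance_store RBD option rbd_store_ceph_conf and the usual /etc/glance/glance-api.conf path.

"""
# Hypothetical pre-start check, not an existing TripleO validation:
# refuse to start glance-api if any backend section references a
# ceph configuration file that is missing on disk.
import configparser
import os
import sys

cfg = configparser.ConfigParser(strict=False, interpolation=None)
cfg.read('/etc/glance/glance-api.conf')
for section in cfg.sections():
    path = cfg.get(section, 'rbd_store_ceph_conf', fallback=None)
    if path and not os.path.isfile(path):
        sys.exit(f"store [{section}]: ceph config {path} does not exist")
"""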

The error in the log file is shown below.

2021-04-09T09:07:09.131937653+00:00 stderr F + echo 'Running command: '\''/usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf --config-file /etc/glance/glance-image-import.conf'\'''
2021-04-09T09:07:09.131951286+00:00 stdout F Running command: '/usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf --config-file /etc/glance/glance-image-import.conf'
2021-04-09T09:07:09.131985651+00:00 stderr F + exec /usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf --config-file /etc/glance/glance-image-import.conf
2021-04-09T09:07:11.102336424+00:00 stderr F ERROR: [errno 2] error calling conf_read_file
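
This message appears to come from the python3-rados binding, which raises when librados cannot read the configuration file it was given. A minimal sketch of the failure, assuming python3-rados is installed and /etc/ceph/edge1.conf is absent:

"""
# Minimal sketch reproducing the error above, assuming python3-rados is
# installed and /etc/ceph/edge1.conf does not exist.
import rados

try:
    # Rados.__init__ reads the conf file immediately; a missing path
    # surfaces as errno 2 (ENOENT).
    rados.Rados(conffile='/etc/ceph/edge1.conf')
except rados.Error as exc:
    print(exc)  # e.g. "[errno 2] error calling conf_read_file"
"""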


[1]  https://gitlab.cee.redhat.com/sputhenp/lab/-/blob/master/templates/osp-16-1/glance-update.yaml#L13-18

Comment 1 Sadique Puthen 2021-04-09 09:16:26 UTC
Created attachment 1770521 [details]
glance-api.log from /var/log/containers/stdout directory

Comment 2 Sadique Puthen 2021-04-09 10:22:05 UTC
Created attachment 1770532 [details]
/var/log/containers/glance/api.log

Comment 3 John Fulton 2021-04-09 12:12:05 UTC
Can you confirm the version of python-glance-store and the version of the glance container on your system?

Looks similar to bug 1832667.

Comment 4 Sadique Puthen 2021-04-09 12:47:09 UTC
# podman exec -it glance_api rpm -qa | grep glance-store
python3-glance-store-1.0.2-1.20201114020939.bc62bb4.el8ost.noarch

Comment 5 John Fulton 2021-04-09 12:52:25 UTC
So he has a version newer than the one that fixed bug 1832667 (python-glance-store-1.0.2-0.20200511193428.a622766.el8ost).

Also, here's the container tag he used.

# podman images | grep glance
satellite.redhat.local:5000/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-glance-api           16.1         3281c69acf43   2 weeks ago    972 MB

Comment 6 Erno Kuvaja 2021-04-09 13:07:58 UTC
There is no reason why glance-api wouldn't come up. Even the logs show that the service comes up and merely reports that one of the stores is misconfigured and will be accessed read-only. Do we have pacemaker logs available to explain why the service gets killed after a while? (We can see from the logs that it happily answers healthcheck polls for a while before getting killed and restarted.)

Comment 7 Sadique Puthen 2021-04-09 13:33:06 UTC
I was under the impression that pacemaker is not used for making glance_api HA.

Comment 8 Erno Kuvaja 2021-04-09 14:15:20 UTC
Not on the DCN edge sites, but it's in control of the central site services.

Comment 9 Sadique Puthen 2021-04-09 14:19:23 UTC
pcs status on central does not have glance_api under its control. It is a standalone container.

# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-1 (version 2.0.3-5.el8_2.3-4b1f869f0f) - partition with quorum
  * Last updated: Fri Apr  9 14:18:39 2021
  * Last change:  Fri Apr  9 06:58:41 2021 by root via cibadmin on controller-1
  * 15 nodes configured
  * 47 resource instances configured

Node List:
  * Online: [ controller-1 controller-2 controller-3 ]
  * GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-3 ovn-dbs-bundle-0@controller-1 ovn-dbs-bundle-1@controller-2 ovn-dbs-bundle-2@controller-3 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-3 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-3 ]

Full List of Resources:
  * Container bundle set: galera-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-mariadb:pcmklatest]:
    * galera-bundle-0	(ocf::heartbeat:galera):	Master controller-1
    * galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
    * galera-bundle-2	(ocf::heartbeat:galera):	Master controller-3
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
    * rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
    * rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-3
  * Container bundle set: redis-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-redis:pcmklatest]:
    * redis-bundle-0	(ocf::heartbeat:redis):	Master controller-1
    * redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-2
    * redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-3
  * ip-172.16.0.150	(ocf::heartbeat:IPaddr2):	Started controller-1
  * ip-172.16.200.150	(ocf::heartbeat:IPaddr2):	Started controller-2
  * ip-172.20.0.151	(ocf::heartbeat:IPaddr2):	Started controller-3
  * ip-172.20.0.150	(ocf::heartbeat:IPaddr2):	Started controller-1
  * ip-172.18.0.150	(ocf::heartbeat:IPaddr2):	Started controller-2
  * ip-172.19.0.150	(ocf::heartbeat:IPaddr2):	Started controller-3
  * Container bundle set: haproxy-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0	(ocf::heartbeat:podman):	Started controller-1
    * haproxy-bundle-podman-1	(ocf::heartbeat:podman):	Started controller-2
    * haproxy-bundle-podman-2	(ocf::heartbeat:podman):	Started controller-3
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master controller-1
    * ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-2
    * ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave controller-3
  * ip-172.20.0.110	(ocf::heartbeat:IPaddr2):	Started controller-1
  * Container bundle: openstack-cinder-volume [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0	(ocf::heartbeat:podman):	Started controller-2

Comment 10 John Fulton 2021-04-09 14:38:54 UTC
It is expected [1] that glance is not under pcs control in 16. Let's look into why glance is failing, since it should be able to handle one misconfigured store.

[1] https://specs.openstack.org/openstack/tripleo-specs/specs/newton/pacemaker-next-generation-architecture.html

Comment 13 John Fulton 2021-04-16 13:19:16 UTC
Suggested reproducer:

Configure two glance backends with RBD [1] but do not provide the ceph configuration files for one of the backends [2].

[1]
"""
[central]
rbd_store_ceph_conf=/etc/ceph/central.conf
rbd_store_user=openstack
rbd_store_pool=images
store_description=central rbd glance store

[edge-1]
rbd_store_ceph_conf=/etc/ceph/edge1.conf
rbd_store_user=openstack
rbd_store_pool=images
store_description=edge-1 rbd glance store
"""

[2]
"""
[root@controller-2 ~]# ls -l /etc/ceph/
total 8
-rw-------. 1  167  167 201 Apr 15 16:18 central.client.openstack.keyring
-rw-r--r--. 1 root root 658 Apr 15 16:18 central.conf
[root@controller-2 ~]# 
"""

Comment 14 Cyril Roelandt 2021-04-26 19:41:57 UTC
*** Bug 1947784 has been marked as a duplicate of this bug. ***

Comment 18 Cyril Roelandt 2021-06-02 23:03:47 UTC
@Pranali: Do you think ooo could check that the configuration is good enough before starting the services?