Description of problem:

I added the glance store details for a second DCN site (DCN2) to my central stack [1]. During this process I failed to pass the Ceph details for the second Ceph cluster, which are normally generated by running:

sudo -E openstack overcloud export ceph --stack edge-1,edge-ceph --config-download-dir /var/lib/mistral --output-file /home/stack/templates/osp-16-1/edge-ceph.yaml

The end result is that the glance_api containers on the central site are stuck in a restart loop, which takes glance completely down for the central site as well as for all DCN sites that were already working properly. When I run a stack update again with the Ceph details for the new DCN site, everything returns to normal.

Glance is obviously stuck in a restart loop because it cannot read the details of the new Ceph cluster, but the impact is a total cluster outage. We should solve this in one of two ways: either add validation in TripleO so that a multi-store configuration is rejected when the corresponding Ceph details are not provided, or fix glance_api so that it stays up and rejects image copies or VM provisioning only for the new site whose Ceph configuration is missing. In any case, a total cluster outage is not an acceptable scenario for a customer.

The error in the log file is below.
2021-04-09T09:07:09.131937653+00:00 stderr F + echo 'Running command: '\''/usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf --config-file /etc/glance/glance-image-import.conf'\'''
2021-04-09T09:07:09.131951286+00:00 stdout F Running command: '/usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf --config-file /etc/glance/glance-image-import.conf'
2021-04-09T09:07:09.131985651+00:00 stderr F + exec /usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf --config-file /etc/glance/glance-image-import.conf
2021-04-09T09:07:11.102336424+00:00 stderr F ERROR: [errno 2] error calling conf_read_file

[1] https://gitlab.cee.redhat.com/sputhenp/lab/-/blob/master/templates/osp-16-1/glance-update.yaml#L13-18
Created attachment 1770521 [details] glance-api.log from /var/log/containers/stdout directory
Created attachment 1770532 [details] /var/log/containers/glance/api.log
Can you confirm the version of python-glance-store and the version of the glance container on your system? This looks similar to bug 1832667.
# podman exec -it glance_api rpm -qa | grep glance-store
python3-glance-store-1.0.2-1.20201114020939.bc62bb4.el8ost.noarch
So he has a version newer than the one that fixes bug 1832667 (python-glance-store-1.0.2-0.20200511193428.a622766.el8ost). Also, here's the container tag he used.

# podman images | grep glance
satellite.redhat.local:5000/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-glance-api 16.1 3281c69acf43 2 weeks ago 972 MB
There is no reason why glance-api wouldn't be coming up. Even the logs show that the service starts and merely reports that one of the stores is misconfigured and will be accessed read-only. Do we have pacemaker logs available to explain why the service gets killed after a while? (We can see from the logs that it happily answers healthcheck polls for a while before getting killed and restarted.)
I was under the impression that pacemaker is not used for making glance_api HA.
Not on the DCN edge sites, but it's in control of the central site services.
pcs status on central does not have glance_api under its control; it is a standalone container.

# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: controller-1 (version 2.0.3-5.el8_2.3-4b1f869f0f) - partition with quorum
  * Last updated: Fri Apr  9 14:18:39 2021
  * Last change:  Fri Apr  9 06:58:41 2021 by root via cibadmin on controller-1
  * 15 nodes configured
  * 47 resource instances configured

Node List:
  * Online: [ controller-1 controller-2 controller-3 ]
  * GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-3 ovn-dbs-bundle-0@controller-1 ovn-dbs-bundle-1@controller-2 ovn-dbs-bundle-2@controller-3 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-3 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-3 ]

Full List of Resources:
  * Container bundle set: galera-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master controller-1
    * galera-bundle-1 (ocf::heartbeat:galera): Master controller-2
    * galera-bundle-2 (ocf::heartbeat:galera): Master controller-3
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-3
  * Container bundle set: redis-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Master controller-1
    * redis-bundle-1 (ocf::heartbeat:redis): Slave controller-2
    * redis-bundle-2 (ocf::heartbeat:redis): Slave controller-3
  * ip-172.16.0.150 (ocf::heartbeat:IPaddr2): Started controller-1
  * ip-172.16.200.150 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-172.20.0.151 (ocf::heartbeat:IPaddr2): Started controller-3
  * ip-172.20.0.150 (ocf::heartbeat:IPaddr2): Started controller-1
  * ip-172.18.0.150 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-172.19.0.150 (ocf::heartbeat:IPaddr2): Started controller-3
  * Container bundle set: haproxy-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-1
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-2
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-3
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-1
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-2
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-3
  * ip-172.20.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
  * Container bundle: openstack-cinder-volume [cluster.common.tag/sadique_openstack-openstack-16-1-osp_16_1_containers-osp16_containers-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-2
It is expected [1] that glance is not under pcs control in 16. Let's look into why glance is failing, since it should be able to handle one misconfigured store.

[1] https://specs.openstack.org/openstack/tripleo-specs/specs/newton/pacemaker-next-generation-architecture.html
Suggested reproducer: configure two glance backends with RBD [1] but do not provide the Ceph configuration files for one of the backends [2].

[1]
"""
[central]
rbd_store_ceph_conf=/etc/ceph/central.conf
rbd_store_user=openstack
rbd_store_pool=images
store_description=central rbd glance store

[edge-1]
rbd_store_ceph_conf=/etc/ceph/edge1.conf
rbd_store_user=openstack
rbd_store_pool=images
store_description=edge-1 rbd glance store
"""

[2]
"""
[root@controller-2 ~]# ls -l /etc/ceph/
total 8
-rw-------. 1 167  167  201 Apr 15 16:18 central.client.openstack.keyring
-rw-r--r--. 1 root root 658 Apr 15 16:18 central.conf
[root@controller-2 ~]#
"""
*** Bug 1947784 has been marked as a duplicate of this bug. ***
@Pranali: Do you think ooo could check that the configuration is good enough before starting the services?
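For what it's worth, such a pre-start check could be fairly small. The sketch below is illustrative only (it is not existing TripleO code, and the function name is made up): it parses a glance-api style multi-store config, like the one quoted in the reproducer, and reports every backend section whose rbd_store_ceph_conf file is absent, which is exactly the condition that sends glance_api into its restart loop here.

```python
#!/usr/bin/env python3
"""Sketch of a pre-start sanity check for glance multi-store RBD backends.

Hypothetical helper, not part of TripleO: it parses a glance-api style
INI config and verifies that every rbd_store_ceph_conf file referenced
by a backend section actually exists on disk.
"""
import configparser
import os


def missing_ceph_confs(config_text, exists=os.path.exists):
    """Return (section, path) pairs whose rbd_store_ceph_conf is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        ceph_conf = parser.get(section, "rbd_store_ceph_conf", fallback=None)
        if ceph_conf and not exists(ceph_conf):
            missing.append((section, ceph_conf))
    return missing


if __name__ == "__main__":
    # Backend config taken from the reproducer above.
    sample = """
[central]
rbd_store_ceph_conf=/etc/ceph/central.conf
rbd_store_user=openstack
rbd_store_pool=images

[edge-1]
rbd_store_ceph_conf=/etc/ceph/edge1.conf
rbd_store_user=openstack
rbd_store_pool=images
"""
    # Simulate the reproducer's /etc/ceph, where only central.conf exists.
    present = {"/etc/ceph/central.conf"}
    for section, path in missing_ceph_confs(sample, exists=present.__contains__):
        print(f"backend [{section}]: missing ceph conf {path}")
    # Prints: backend [edge-1]: missing ceph conf /etc/ceph/edge1.conf
```

Run against the reproducer's layout it flags [edge-1] before the service ever starts; a deployment step could fail fast on a non-empty result instead of letting glance_api crash-loop.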