Bug 2188477

Summary: DCN: GlanceApiEdge fails to deploy on DistributedComputeHCI nodes
Product: Red Hat OpenStack Reporter: Marian Krcmarik <mkrcmari>
Component: openstack-tripleo-heat-templates    Assignee: Alan Bishop <abishop>
Status: CLOSED ERRATA QA Contact: msava
Severity: high Docs Contact:
Priority: high    
Version: 17.1 (Wallaby)    CC: abishop, eharney, eshames, jschluet, mburns, pgrist, tkajinam
Target Milestone: rc    Keywords: AutomationBlocker, Regression, Triaged
Target Release: 17.1   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-1.20230519151004.f602c2b.el9ost Doc Type: No Doc Update
Last Closed: 2023-08-16 01:14:48 UTC Type: Bug

Description Marian Krcmarik 2023-04-20 21:19:28 UTC
Description of problem:
The glance-api service fails to deploy in a DCN deployment at the edge site, specifically on the DistributedComputeHCI nodes.

The actual error message is:
2023-04-19 19:57:45.936127 |                                      |    WARNING | ERROR: Can't run container glance_api_internal

stderr: Error: statfs /var/lib/kolla/config_files/glance_api.json: no such file or directory
2023-04-19 19:57:45.942951 |                                      |    WARNING | ERROR: Can't run container glance_api_internal_tls_proxy

stderr: Error: statfs /var/lib/kolla/config_files/glance_api_tls_proxy.json: no such file or directory
2023-04-19 19:57:45.948045 | 52540056-ef02-9140-41c7-00000000c797 |      FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_4 | dcn1-computehci-1 | error={"changed": false, "msg": "Failed containers: glance_api_internal, glance_api_internal_tls_proxy"}
2023-04-19 19:57:45.964199 | 52540056-ef02-9140-41c7-00000000c797 |     TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_4 | dcn1-computehci-1 | 0:42:21.014482 | 13.12s
2023-04-19 19:57:46.003735 |                                      |    WARNING | ERROR: Can't run container glance_api_internal

stderr: Error: statfs /var/lib/kolla/config_files/glance_api.json: no such file or directory
2023-04-19 19:57:46.005840 |                                      |    WARNING | ERROR: Can't run container glance_api_internal_tls_proxy

stderr: Error: statfs /var/lib/kolla/config_files/glance_api_tls_proxy.json: no such file or directory
2023-04-19 19:57:46.007860 | 52540056-ef02-9140-41c7-00000000c700 |      FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_4 | dcn1-computehci-0 | error={"changed": false, "msg": "Failed containers: glance_api_internal, glance_api_internal_tls_proxy"}

The problem appears to be that the files specified as the source of the container volume mounts are missing, specifically:
/var/lib/kolla/config_files/glance_api_tls_proxy.json
/var/lib/kolla/config_files/glance_api.json

Logging in on the node shows the following:
[tripleo-admin@dcn1-computehci-0 ~]$ ls /var/lib/kolla/config_files/ | grep glance
glance_api_internal.json
glance_api_internal_tls_proxy.json

But the container startup config references the following volumes:
[root@dcn1-computehci-0 tripleo-admin]# cat /var/lib/tripleo-config/container-startup-config/step_4/glance_api_internal_tls_proxy.json 
{
  "environment": {
    "KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
    "TRIPLEO_CONFIG_HASH": "5ea6953245003848ac83fd667e3a957c"
  },
  "image": "site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp17-openstack-glance-api:17.1_20230404.1",
  "net": "host",
  "restart": "always",
  "start_order": 3,
  "user": "root",
  "volumes": [
    "/etc/hosts:/etc/hosts:ro",
    "/etc/localtime:/etc/localtime:ro",
    "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro",
    "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro",
    "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro",
    "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro",
    "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro",
    "/dev/log:/dev/log",
    "/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro",
    "/etc/puppet:/etc/puppet:ro",
    "/var/log/containers/glance:/var/log/glance:z",
    "/var/log/containers/httpd/glance:/var/log/httpd:z",
    "/var/lib/kolla/config_files/glance_api_tls_proxy.json:/var/lib/kolla/config_files/config.json:ro",
    "/var/lib/config-data/puppet-generated/glance_api_internal:/var/lib/kolla/config_files/src:ro",
    "/etc/pki/tls/certs/httpd:/etc/pki/tls/certs/httpd:ro",
    "/etc/pki/tls/private/httpd:/etc/pki/tls/private/httpd:ro"
  ]
}

When the container is started, it looks for /var/lib/kolla/config_files/glance_api_tls_proxy.json and not /var/lib/kolla/config_files/glance_api_internal_tls_proxy.json, which is the file that was actually generated.

This works on Controller nodes because the THT Controller role includes OS::TripleO::Services::GlanceApi, which causes /var/lib/kolla/config_files/glance_api_tls_proxy.json to be generated. The DistributedComputeHCI role includes only the GlanceApiEdge service.
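
For reference, a minimal sketch of how the two role definitions differ for the Glance services named above (illustrative only, trimmed to those services; everything else in roles_data.yaml is omitted):

- name: Controller
  ServicesDefault:
    # GlanceApi writes /var/lib/kolla/config_files/glance_api.json and
    # glance_api_tls_proxy.json, so the mounts in the startup config resolve.
    - OS::TripleO::Services::GlanceApi
    # ...
- name: DistributedComputeHCI
  ServicesDefault:
    # Only the edge Glance service is included, so only the
    # glance_api_internal*.json files end up being generated on these nodes.
    - OS::TripleO::Services::GlanceApiEdge
    # ...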

To work around the problem I manually made the following change in THT:
diff --git a/deployment/glance/glance-api-internal-container-puppet.yaml b/deployment/glance/glance-api-internal-container-puppet.yaml
--- a/deployment/glance/glance-api-internal-container-puppet.yaml	(revision 1393d39be367db3acb02508e0e858395a4e4fefa)
+++ b/deployment/glance/glance-api-internal-container-puppet.yaml	(date 1682024903117)
@@ -152,7 +152,7 @@
                   - get_attr: [GlanceApi, role_data, docker_config, step_4, glance_api]
                   - volumes:
                       yaql:
-                        expression: $.data.vols.select($.replace('puppet-generated/glance_api', 'puppet-generated/glance_api_internal'))
+                        expression: $.data.vols.select($.replace('glance_api', 'glance_api_internal'))
                         data:
                           vols: {get_attr: [GlanceApi, role_data, docker_config, step_4, glance_api, volumes]}
               glance_api_internal_tls_proxy:
@@ -162,7 +162,7 @@
                       - get_attr: [GlanceApi, role_data, docker_config, step_4, glance_api_tls_proxy]
                       - volumes:
                           yaql:
-                            expression: $.data.vols.select($.replace('puppet-generated/glance_api', 'puppet-generated/glance_api_internal'))
+                            expression: $.data.vols.select($.replace('glance_api', 'glance_api_internal'))
                             data:
                               vols: {get_attr: [GlanceApi, role_data, docker_config, step_4, glance_api_tls_proxy, volumes]}
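
The effect of widening the replace pattern, shown on the kolla json volume entry from the startup config above (illustrative; only this entry is treated differently by the two expressions):

# Original expression only rewrites the puppet-generated source directory, so the
# kolla json mount keeps the GlanceApi file name, which is never generated on
# DistributedComputeHCI nodes:
- /var/lib/kolla/config_files/glance_api_tls_proxy.json:/var/lib/kolla/config_files/config.json:ro
# Widened expression also rewrites the json file name, matching the file that is
# actually generated there:
- /var/lib/kolla/config_files/glance_api_internal_tls_proxy.json:/var/lib/kolla/config_files/config.json:ro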

But I am not sure what the right way to fix it is. I can provide an environment if needed, and I may be wrong about the root cause.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-14.3.1-1.20230402010807.563f2cd.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a site in a DCN environment with Glance multistore, i.e. with glance-api deployed at the edge site on DistributedComputeHCI nodes.

Actual results:
The site deployment fails because the glance-api containers fail to start.

Expected results:
Successful deployment of the DCN site

Additional info:
The Glance internal API service was introduced downstream in 17.1; upstream release note: https://github.com/openstack/tripleo-heat-templates/blob/master/releasenotes/notes/glance-internal-service-86274f56712ffaac.yaml

Comment 2 Alan Bishop 2023-05-04 14:16:04 UTC
I'll take this, though I'm surprised to see this failing, because we did exhaustive downstream testing of this very scenario before submitting the patches upstream.

Comment 3 Alan Bishop 2023-05-04 20:51:22 UTC
Marian, your analysis is correct, but I have a slightly different fix in mind. It turns out the kolla json file contents are the same for both the public and internal API services, which means they should both be able to use the same file. Here's my patch:

diff --git a/deployment/glance/glance-api-internal-container-puppet.yaml b/deployment/glance/glance-api-internal-container-puppet.yaml
index 15fab9d14..b6469fce5 100644
--- a/deployment/glance/glance-api-internal-container-puppet.yaml
+++ b/deployment/glance/glance-api-internal-container-puppet.yaml
@@ -133,14 +133,6 @@ outputs:
                   - {get_attr: [MySQLClient, role_data, step_config]}
             config_image: {get_attr: [RoleParametersValue, value, ContainerGlanceApiInternalConfigImage]}
 
-          kolla_config:
-            # The kolla_config are essentially the same as the GlanceApi service.
-            # The only difference is the json file names.
-            /var/lib/kolla/config_files/glance_api_internal.json:
-              {get_attr: [GlanceApi, role_data, kolla_config, /var/lib/kolla/config_files/glance_api.json]}
-            /var/lib/kolla/config_files/glance_api_internal_tls_proxy.json:
-              {get_attr: [GlanceApi, role_data, kolla_config, /var/lib/kolla/config_files/glance_api_tls_proxy.json]}
-
           docker_config:
             step_2:
               get_attr: [GlanceLogging, docker_config, step_2]

The patch works in my own test environment, and *should* also fix it in a DCN deployment. It would be great if you could verify it works for you, in which case I'll submit the patch upstream.
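
To illustrate the intent of the change (this is a hypothetical sketch of the end state, not part of the patch itself): with the override removed, the internal containers are meant to consume the same kolla files the GlanceApi template already defines, i.e. something along these lines:

          kolla_config:
            # Same file names as the GlanceApi service, so they match the
            # /var/lib/kolla/config_files/glance_api*.json mounts already
            # present in the docker_config volumes.
            /var/lib/kolla/config_files/glance_api.json:
              {get_attr: [GlanceApi, role_data, kolla_config, /var/lib/kolla/config_files/glance_api.json]}
            /var/lib/kolla/config_files/glance_api_tls_proxy.json:
              {get_attr: [GlanceApi, role_data, kolla_config, /var/lib/kolla/config_files/glance_api_tls_proxy.json]}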

Comment 4 Marian Krcmarik 2023-05-05 16:45:23 UTC
> The patch works in my own test environment, and *should* also fix it in a
> DCN deployment. It would be great if you could verify it works for you, in
> which case I'll submit the patch upstream.

It works in the downstream DCN CI as well, and it fixes the problem neatly. I think we can submit it upstream, thanks!

Comment 18 errata-xmlrpc 2023-08-16 01:14:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577