Created attachment 1881400 [details]
Overcloud deploy log

Description of problem:
Overcloud deployment of a DCN site with storage and the following topology: 3 DistributedComputeHCI nodes, 1 DistributedComputeHCIScaleOut node and 1 DistributedComputeScaleOut node fails with the following error:

FATAL | Ensure provider packages are installed | dcn2-computescaleout2-0 | error={"msg": "{% set unique_providers = [] %} {% for item in certificate_requests %} {% set _ = unique_providers.append( item.provider | d(__certificate_provider_default) ) %} {% endfor %} {{ unique_providers | unique }} : [{'ca': 'ipa', 'dns': ['{{fqdn_ctlplane}}', '{{cloud_names.cloud_name_ctlplane}}'], 'key_size': '2048', 'name': 'haproxy-ctlplane-cert', 'principal': 'haproxy/{{fqdn_ctlplane}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\ cp \"/etc/pki/tls/certs/haproxy-ctlplane-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt\"\ cp \"/etc/pki/tls/private/haproxy-ctlplane-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key\"\ \ ca_path=\"/etc/ipa/ca.crt\"\ service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt\"\ service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key\"\ service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.pem\"\ \ cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\ \ container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\ # Inject the new pem into the running container\ if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\ # lp#1917868: Do not use podman cp with HA containers as they get\ # frozen temporarily and that can make pacemaker operation fail.\ tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\ # no need to update the mount point, because pacemaker\ # recreates the container when it\\'s restarted\ else\ # Refresh the pem at the mount-point\ {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\ # Copy the new pem from the mount-point to the real path\ {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\ fi\ # Set appropriate permissions\ {{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\ # Trigger a reload for HAProxy to read the new certificates\ {{container_cli}} kill --signal HUP \"$container_name\"\ '}, {'ca': 'ipa', 'dns': ['{{fqdn_storage}}', '{{cloud_names.cloud_name_storage}}'], 'key_size': '2048', 'name': 'haproxy-storage-cert', 'principal': 'haproxy/{{fqdn_storage}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\ cp \"/etc/pki/tls/certs/haproxy-storage-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.crt\"\ cp \"/etc/pki/tls/private/haproxy-storage-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage.key\"\ \ ca_path=\"/etc/ipa/ca.crt\"\ service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.crt\"\ service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage.key\"\ service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.pem\"\ \ cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\ \ container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\ # Inject the new pem into the running container\ if echo \"$container_name\" | grep -q 
\"^haproxy-bundle\"; then\ # lp#1917868: Do not use podman cp with HA containers as they get\ # frozen temporarily and that can make pacemaker operation fail.\ tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\ # no need to update the mount point, because pacemaker\ # recreates the container when it\\'s restarted\ else\ # Refresh the pem at the mount-point\ {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\ # Copy the new pem from the mount-point to the real path\ {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\ fi\ # Set appropriate permissions\ {{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\ # Trigger a reload for HAProxy to read the new certificates\ {{container_cli}} kill --signal HUP \"$container_name\"\ '}, {'ca': 'ipa', 'dns': ['{{fqdn_storage_mgmt}}', '{{cloud_names.cloud_name_storage_mgmt}}'], 'key_size': '2048', 'name': 'haproxy-storage_mgmt-cert', 'principal': 'haproxy/{{fqdn_storage_mgmt}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\ cp \"/etc/pki/tls/certs/haproxy-storage_mgmt-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.crt\"\ cp \"/etc/pki/tls/private/haproxy-storage_mgmt-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage_mgmt.key\"\ \ ca_path=\"/etc/ipa/ca.crt\"\ service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.crt\"\ service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage_mgmt.key\"\ service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.pem\"\ \ cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\ \ container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\ # Inject the new pem into the running container\ if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\ # lp#1917868: Do not use podman cp with HA containers as they get\ # frozen temporarily and that can make pacemaker operation fail.\ tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\ # no need to update the mount point, because pacemaker\ # recreates the container when it\\'s restarted\ else\ # Refresh the pem at the mount-point\ {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\ # Copy the new pem from the mount-point to the real path\ {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\ fi\ # Set appropriate permissions\ {{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\ # Trigger a reload for HAProxy to read the new certificates\ {{container_cli}} kill --signal HUP \"$container_name\"\ '}, {'ca': 'ipa', 'dns': ['{{fqdn_internal_api}}', '{{cloud_names.cloud_name_internal_api}}'], 'key_size': '2048', 'name': 'haproxy-internal_api-cert', 'principal': 'haproxy/{{fqdn_internal_api}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\ cp \"/etc/pki/tls/certs/haproxy-internal_api-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt\"\ cp \"/etc/pki/tls/private/haproxy-internal_api-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key\"\ \ ca_path=\"/etc/ipa/ca.crt\"\ service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt\"\ 
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key\"\ service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem\"\ \ cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\ \ container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\ # Inject the new pem into the running container\ if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\ # lp#1917868: Do not use podman cp with HA containers as they get\ # frozen temporarily and that can make pacemaker operation fail.\ tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\ # no need to update the mount point, because pacemaker\ # recreates the container when it\\'s restarted\ else\ # Refresh the pem at the mount-point\ {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\ # Copy the new pem from the mount-point to the real path\ {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\ fi\ # Set appropriate permissions\ {{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\ # Trigger a reload for HAProxy to read the new certificates\ {{container_cli}} kill --signal HUP \"$container_name\"\ '}]: 'fqdn_storage_mgmt' is undefined"}

This is the overcloud deploy command line:

openstack overcloud deploy \
  --timeout 240 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/barbican-backend-simple-crypto.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/services/barbican-edge.yaml \
  --stack dcn2 \
  --libvirt-type kvm \
  --ntp-server clock1.rdu2.redhat.com \
  -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
  --deployed-server \
  -e /home/stack/templates/overcloud-vip-deployed.yaml \
  -e /home/stack/templates/overcloud-networks-deployed.yaml \
  -e /home/stack/templates/overcloud-baremetal-deployed-dcn2.yaml \
  -e /home/stack/templates/overcloud-ceph-deployed-dcn2.yaml \
  --networks-file /home/stack/dcn2/network/network_data_v2.yaml \
  -e /home/stack/dcn2/internal.yaml \
  -r /home/stack/dcn2/roles/roles_data.yaml \
  -e /home/stack/dcn2/network/network-environment_v2.yaml \
  -e /home/stack/dcn2/enable-tls.yaml \
  -e /home/stack/dcn2/inject-trust-anchor.yaml \
  -e /home/stack/dcn2/hostnames.yml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
  -e /home/stack/dcn2/nodes_data.yaml \
  -e /home/stack/dcn2/glance.yaml \
  -e /home/stack/dcn2/debug.yaml \
  -e /home/stack/dcn2/use-dns-for-vips.yaml \
  -e /home/stack/dcn2/glance.yaml \
  -e /home/stack/central_ceph_external.yaml \
  -e /home/stack/central-export.yaml \
  -e /home/stack/dcn2/config_heat.yaml \
  -e ~/containers-prepare-parameter.yaml \
  -e /home/stack/dcn2/barbican.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
  -e /home/stack/dcn2/cloud-names.yaml \
  -e /home/stack/dcn2/ipaservices-baremetal-ansible.yaml

and the baremetal-deployment.yaml content is:

- name: DistributedComputeHCI
  count: 3
  hostname_format: '%stackname%-computehci2-%index%'
  defaults:
    profile: computehci2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computehci2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt
      subnet: storage_mgmt_leaf2
- name: DistributedComputeHCIScaleOut
  count: 1
  hostname_format: '%stackname%-computehciscaleout2-%index%'
  defaults:
    profile: computehciscaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computehciscaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt
      subnet: storage_mgmt_leaf2
- name: DistributedComputeScaleOut
  count: 1
  hostname_format: '%stackname%-computescaleout2-%index%'
  defaults:
    profile: computescaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computescaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2

If I remove the dcn-storage.yaml template from the overcloud deploy command line, the error is not thrown.

NOTE: To be able to deploy I am using the following upstream patches:
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/840443
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841743
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/840534

Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
puppet-tripleo-14.2.3-0.20220407012437.87240e8.el9ost.noarch
python3-tripleo-common-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
tripleo-ansible-3.3.1-0.20220418220931.1557dde.el9ost.noarch
openstack-tripleo-validations-14.2.2-0.20220422030814.62e6da1.el9ost.noarch
openstack-tripleo-common-containers-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-0.20220417231018.c3cee34.el9ost.noarch
python3-tripleoclient-16.4.1-0.20220416001259.ccc329b.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a DCN site with storage and the described topology, in particular with a DistributedComputeScaleOut node.

Actual results:
The deployment fails
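
For illustration only, a minimal standalone Ansible sketch (hostnames, variable names and values invented, not taken from the real tripleo-ansible role) of why templating a per-network certificate request on a node that does not carry that network fails with this kind of "is undefined" error:

# Sketch: fqdn_<network> variables only exist for networks configured on the
# host, so rendering a request that references a missing network fails with
# "'fqdn_storage_mgmt' is undefined", just like the deploy does.
- hosts: localhost
  gather_facts: false
  vars:
    # On a DistributedComputeScaleOut-like node only this would be defined:
    fqdn_ctlplane: dcn2-computescaleout2-0.ctlplane.example.com
    certificate_requests:
      - name: haproxy-ctlplane-cert
        dns: "{{ fqdn_ctlplane }}"
      - name: haproxy-storage_mgmt-cert
        dns: "{{ fqdn_storage_mgmt }}"   # undefined on this node
  tasks:
    - name: Render the per-network certificate requests
      debug:
        msg: "{{ item.name }} -> {{ item.dns }}"
      loop: "{{ certificate_requests }}"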
Looks like dcn-storage.yaml has ManageNetworks: False. Maybe the networks were not in the network_data of the main stack, or the deployment is missing fixes such as https://review.opendev.org/c/openstack/tripleo-heat-templates/+/781572.
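
For reference, the setting mentioned above would look roughly like this in environments/dcn-storage.yaml (a sketch based on this comment, not a verbatim copy of the shipped file):

# With ManageNetworks set to false the DCN stack reuses the networks created
# by the main (central) stack instead of creating its own, so the main
# stack's network_data must already contain them.
parameter_defaults:
  ManageNetworks: false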
In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut networks list is missing the storage_mgmt network. Feel free to close this if that fixes the issue.
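
If that route is taken, the change to baremetal-deployment.yaml would look roughly like this (a sketch; subnet names copied from the HCI roles in the report, the added entry is marked):

- name: DistributedComputeScaleOut
  count: 1
  hostname_format: '%stackname%-computescaleout2-%index%'
  defaults:
    profile: computescaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computescaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt          # added; not present in the original file
      subnet: storage_mgmt_leaf2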
(In reply to Steve Baker from comment #2)
> In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut
> networks list is missing the storage_mgmt network. Feel free to close this
> if that fixes the issue.

That's intentional, since the predefined role in tht does not have the storage_mgmt network defined:
https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml

So my assumption is that there is no need for the storage_mgmt network to be deployed on the DistributedComputeScaleOut node.
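
For context, the networks stanza of that predefined role looks roughly as follows (paraphrased from the linked file; see the repository for the authoritative content). There is no StorageMgmt entry, which is why the custom baremetal-deployment.yaml above also leaves it out:

- name: DistributedComputeScaleOut
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
    # no StorageMgmt entry for this role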
I'm not an expert in that specific architecture configuration, but wouldn't you still need the storage network so you don't burden your primary network interfaces with the storage networking traffic?
(In reply to Julia Kreger from comment #4)
> I'm not an expert in that specific architecture configuration, but wouldn't
> you still need storage network so you don't burden your primary network
> interfaces with the storage networking traffic?

Both roles have the "storage network" to unburden the primary interface. We're talking about which roles should have the storage_mgmt network, which is another network for storage.

(In reply to Marian Krcmarik from comment #3)
> (In reply to Steve Baker from comment #2)
> > In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut
> > networks list is missing the storage_mgmt network. Feel free to close this
> > if that fixes the issue.
>
> That's intentional since the predefined role in tht does not have
> storage_mgmt network defined:
> https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml
> So my assumption is that there is no need for storage_mgmt network to be
> deployed on DistributedComputeScaleOut node.

- DistributedComputeScaleOut has the storage network [1]
- DistributedComputeHCIScaleOut has the storage network and the storage_mgmt network [2]
- In Ceph terms, the "storage network" is the "ceph public_network" and the storage_mgmt network is the "ceph cluster_network" [3]
- You only need the storage_mgmt network if you are hosting OSDs, so that they can replicate.
- The main difference between DistributedComputeScaleOut and DistributedComputeHCIScaleOut is that the HCI one hosts OSDs.
- Since DistributedComputeScaleOut does not host OSDs, it shouldn't need the storage_mgmt network.

Unless the storage_mgmt network is being used for something other than just OSDs. The ServiceNetMap in the config-download directory could confirm which services are using which networks.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml#L15-L16
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeHCIScaleOut.yaml#L16-L17
[3] https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
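
As a concrete way to do that check, the rendered ServiceNetMap contains entries along these lines (an illustrative excerpt only; the actual keys and values depend on the deployment). The point is to confirm that nothing other than the Ceph OSD replication traffic is mapped onto storage_mgmt for this role:

parameter_defaults:
  ServiceNetMap:
    CephClusterNetwork: storage_mgmt   # OSD replication (ceph cluster_network)
    CephMonNetwork: storage            # ceph public_network
    NovaApiNetwork: internal_api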
Marian confirmed the following:

- If the deployment is run without DistributedComputeScaleOut, then it doesn't hit this bug.
- If the DistributedComputeScaleOut node has the storage_mgmt network added, then it doesn't hit this bug (this can be a workaround until this BZ is closed).

Also, we see TLS-e provided a certificate for each network, e.g. /etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.pem

The script that failed is run on the DistributedComputeScaleOut node for each network in HAProxyNetworks [1], even if that network is not configured on the node. Maybe the line that provides the list for the loop [1] needs to be intersected with the list of networks that are actually configured on the running node.

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/haproxy/haproxy-internal-tls-certmonger.j2.yaml#L172
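
A rough sketch of the kind of intersection described above, with illustrative variable names rather than the template's actual ones (the upstream change linked later in this BZ is the authoritative fix):

# Hypothetical j2 fragment: only networks the role actually carries drive the
# per-network certificate entries, so a node without storage_mgmt never
# renders anything that references fqdn_storage_mgmt.
{%- for network in haproxy_networks if network in role_networks %}
        haproxy-{{ network }}-cert:
          service_pem: /etc/pki/tls/certs/haproxy/overcloud-haproxy-{{ network }}.pem
{%- endfor %}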
I think John has explained the root cause; assigning to DFG:Security to look at changing "Certificate generation" to skip networks which are not configured on that node.
This is similar to, but not a duplicate of, BZ 2081698; solved by https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841930/1/deployment/apache/apache-baremetal-puppet.j2.yaml
Any updates on this BZ? The upstream Gerrit change has been merged for quite some time, but this BZ is still in NEW despite its high severity.
Upstream patch has merged on stable/wallaby.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:6543