Bug 2088526 - DistributedComputeScaleOut node fails to deploy on a DCN site with storage
Summary: DistributedComputeScaleOut node fails to deploy on a DCN site with storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.0
Assignee: Grzegorz Grasza
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-19 15:50 UTC by Marian Krcmarik
Modified: 2022-09-21 12:22 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-0.20220719171711.feca772.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:21:38 UTC
Target Upstream Version:
Embargoed:


Attachments
Overcloud deploy log (3.39 MB, text/plain)
2022-05-19 15:50 UTC, Marian Krcmarik


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 841930 0 None stable/wallaby: MERGED tripleo-heat-templates: Stop generating certificate requests for disabled networks (I05ba5fb48c617a5bbedebb8b74c23bec9ab... 2022-07-06 12:41:52 UTC
OpenStack gerrit 848550 0 None stable/wallaby: MERGED tripleo-heat-templates: Stop generating certificate requests for disabled networks (I0293e019f3a2c4c8ffbf8258214d8522957... 2022-07-06 12:41:57 UTC
Red Hat Issue Tracker OSP-15318 0 None None None 2022-05-19 15:59:44 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:22:06 UTC

Description Marian Krcmarik 2022-05-19 15:50:23 UTC
Created attachment 1881400 [details]
Overcloud deploy log

Description of problem:
Overcloud deployment of a DCN site with storage and the following topology: 3 DistributedComputeHCI nodes, 1 DistributedComputeHCIScaleOut node, and 1 DistributedComputeScaleOut node fails with the following error:
FATAL | Ensure provider packages are installed | dcn2-computescaleout2-0 | error={"msg": "{% set unique_providers = [] %}
{% for item in certificate_requests %}
{%   set _ = unique_providers.append(
       item.provider |
       d(__certificate_provider_default)
     ) %}
{% endfor %}
{{ unique_providers | unique }}
: [{'ca': 'ipa', 'dns': ['{{fqdn_ctlplane}}', '{{cloud_names.cloud_name_ctlplane}}'], 'key_size': '2048', 'name': 'haproxy-ctlplane-cert', 'principal': 'haproxy/{{fqdn_ctlplane}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-ctlplane-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt\"\
cp \"/etc/pki/tls/private/haproxy-ctlplane-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}, {'ca': 'ipa', 'dns': ['{{fqdn_storage}}', '{{cloud_names.cloud_name_storage}}'], 'key_size': '2048', 'name': 'haproxy-storage-cert', 'principal': 'haproxy/{{fqdn_storage}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-storage-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.crt\"\
cp \"/etc/pki/tls/private/haproxy-storage-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}, {'ca': 'ipa', 'dns': ['{{fqdn_storage_mgmt}}', '{{cloud_names.cloud_name_storage_mgmt}}'], 'key_size': '2048', 'name': 'haproxy-storage_mgmt-cert', 'principal': 'haproxy/{{fqdn_storage_mgmt}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-storage_mgmt-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.crt\"\
cp \"/etc/pki/tls/private/haproxy-storage_mgmt-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage_mgmt.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage_mgmt.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}, {'ca': 'ipa', 'dns': ['{{fqdn_internal_api}}', '{{cloud_names.cloud_name_internal_api}}'], 'key_size': '2048', 'name': 'haproxy-internal_api-cert', 'principal': 'haproxy/{{fqdn_internal_api}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-internal_api-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt\"\
cp \"/etc/pki/tls/private/haproxy-internal_api-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}]: 'fqdn_storage_mgmt' is undefined"}
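
For context, each item in the certificate_requests list above is generated per network by the HAProxy TLS-everywhere template in tripleo-heat-templates, and the per-network values are left as Ansible expressions that are rendered later on the node. A minimal sketch of the storage_mgmt item as the node sees it (paraphrased from the error above; nothing beyond what the error shows is assumed):

- name: haproxy-storage_mgmt-cert
  dns:
    - "{{ fqdn_storage_mgmt }}"                      # set only when the storage_mgmt network is configured on the node
    - "{{ cloud_names.cloud_name_storage_mgmt }}"
  principal: "haproxy/{{ fqdn_storage_mgmt }}@{{ idm_realm }}"

Because the DistributedComputeScaleOut node does not have the storage_mgmt network configured, fqdn_storage_mgmt is never defined there, so rendering the request list fails with the error shown.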

This is the overcloud deploy command line:
openstack overcloud deploy \
--timeout 240 \
--templates /usr/share/openstack-tripleo-heat-templates \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/barbican-backend-simple-crypto.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/services/barbican-edge.yaml \
--stack dcn2 \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
--deployed-server \
-e /home/stack/templates/overcloud-vip-deployed.yaml \
-e /home/stack/templates/overcloud-networks-deployed.yaml \
-e /home/stack/templates/overcloud-baremetal-deployed-dcn2.yaml \
-e /home/stack/templates/overcloud-ceph-deployed-dcn2.yaml \
--networks-file /home/stack/dcn2/network/network_data_v2.yaml \
-e /home/stack/dcn2/internal.yaml \
-r /home/stack/dcn2/roles/roles_data.yaml \
-e /home/stack/dcn2/network/network-environment_v2.yaml \
-e /home/stack/dcn2/enable-tls.yaml \
-e /home/stack/dcn2/inject-trust-anchor.yaml \
-e /home/stack/dcn2/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
-e /home/stack/dcn2/nodes_data.yaml \
-e /home/stack/dcn2/glance.yaml \
-e /home/stack/dcn2/debug.yaml \
-e /home/stack/dcn2/use-dns-for-vips.yaml \
-e /home/stack/dcn2/glance.yaml \
-e /home/stack/central_ceph_external.yaml \
-e /home/stack/central-export.yaml \
-e /home/stack/dcn2/config_heat.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/dcn2/barbican.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
-e /home/stack/dcn2/cloud-names.yaml \
-e /home/stack/dcn2/ipaservices-baremetal-ansible.yaml

and baremetal-deployment.yaml content is:
- name: DistributedComputeHCI
  count: 3
  hostname_format: '%stackname%-computehci2-%index%'
  defaults:
    profile: computehci2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computehci2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt
      subnet: storage_mgmt_leaf2
- name: DistributedComputeHCIScaleOut
  count: 1
  hostname_format: '%stackname%-computehciscaleout2-%index%'
  defaults:
    profile: computehciscaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computehciscaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt
      subnet: storage_mgmt_leaf2
- name: DistributedComputeScaleOut
  count: 1
  hostname_format: '%stackname%-computescaleout2-%index%'
  defaults:
    profile: computescaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computescaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2

If I remove the dcn-storage.yaml template from the overcloud deploy command line, the error is not thrown.

NOTE: To be able to deploy, I am using the following upstream patches:
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/840443
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841743
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/840534

Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
puppet-tripleo-14.2.3-0.20220407012437.87240e8.el9ost.noarch
python3-tripleo-common-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
tripleo-ansible-3.3.1-0.20220418220931.1557dde.el9ost.noarch
openstack-tripleo-validations-14.2.2-0.20220422030814.62e6da1.el9ost.noarch
openstack-tripleo-common-containers-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-0.20220417231018.c3cee34.el9ost.noarch
python3-tripleoclient-16.4.1-0.20220416001259.ccc329b.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a DCN site with storage and the topology described above, in particular with a DistributedComputeScaleOut node.

Actual results:
The deployment fails

Comment 1 Rabi Mishra 2022-05-23 06:54:24 UTC
Looks like dcn-storage.yaml has ManageNetworks: False. Maybe the networks were not in the network_data of the main stack, or some fixes are missing, such as https://review.opendev.org/c/openstack/tripleo-heat-templates/+/781572.
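
For reference, that setting in the dcn-storage.yaml environment file would look roughly like this (illustrative excerpt, not the full file):

parameter_defaults:
  ManageNetworks: false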

Comment 2 Steve Baker 2022-05-23 19:52:40 UTC
In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut networks list is missing the storage_mgmt network. Feel free to close this if that fixes the issue.

Comment 3 Marian Krcmarik 2022-05-23 20:44:41 UTC
(In reply to Steve Baker from comment #2)
> In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut
> networks list is missing the storage_mgmt network. Feel free to close this
> if that fixes the issue.

That's intentional, since the predefined role in tripleo-heat-templates (tht) does not have the storage_mgmt network defined:
https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml
So my assumption is that the storage_mgmt network does not need to be deployed on the DistributedComputeScaleOut node.
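
For reference, a paraphrased sketch of the networks that predefined role declares (see the linked roles file for the exact definition; the subnet names here are assumptions):

- name: DistributedComputeScaleOut
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
  # note: no StorageMgmt entry, unlike DistributedComputeHCIScaleOut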

Comment 4 Julia Kreger 2022-05-25 17:31:16 UTC
I'm not an expert in that specific architecture configuration, but wouldn't you still need the storage network so you don't burden your primary network interfaces with the storage traffic?

Comment 5 John Fulton 2022-05-25 20:40:11 UTC
(In reply to Julia Kreger from comment #4)
> I'm not an expert in that specific architecture configuration, but wouldn't
> you still need storage network so you don't burden your primary network
> interfaces with the storage networking traffic?

Both roles have the "storage network" to unburden the primary interface. We're talking about which roles should have the storage_mgmt network, which is another network for storage.

(In reply to Marian Krcmarik from comment #3)
> (In reply to Steve Baker from comment #2)
> > In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut
> > networks list is missing the storage_mgmt network. Feel free to close this
> > if that fixes the issue.
> 
> That's intentional since the predefined role in tht does not have
> storage_mgmt network defined:
> https://github.com/openstack/tripleo-heat-templates/blob/master/roles/
> DistributedComputeScaleOut.yaml
> So my assumption is that there is no need for storage_mgmt network to be
> deployed on DistributedComputeScaleOut node.

- DistributedComputeScaleOut has the storage network [1]
- DistributedComputeHCIScaleOut has the storage network and the storage_mgmt network [2]
- In Ceph terms, "storage network" is the "ceph public_network" and the storage_mgmt network is the "ceph cluster_network" [3]

- You only need the storage_mgmt network if you are hosting OSDs so that they can replicate.
- The main difference between DistributedComputeScaleOut and DistributedComputeHCIScaleOut is that the HCI one hosts OSDs. 
- Since DistributedComputeScaleOut does not host OSDs, it shouldn't need the storage_mgmt network.

Unless the storage_mgmt network is being used for something other than just OSDs. The ServiceNetMap in the config-download directory could confirm which services are using which networks.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml#L15-L16
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeHCIScaleOut.yaml#L16-L17
[3] https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
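
Regarding the ServiceNetMap check suggested above, an illustrative excerpt of the relevant mappings (key names follow the TripleO defaults; the values are not taken from this deployment and are shown only as an example):

parameter_defaults:
  ServiceNetMap:
    CephMonNetwork: storage            # ceph public_network
    CephClusterNetwork: storage_mgmt   # ceph cluster_network, used for OSD replication
    GlanceApiNetwork: internal_api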

Comment 6 John Fulton 2022-05-25 21:14:57 UTC
Marian confirmed the following: 
- If the deployment is run without DistributedComputeScaleOut, then it doesn't hit this bug
- If the DistributedComputeScaleOut node has the storage_mgmt network added, then it doesn't hit this bug (this can be a workaround until this BZ is closed)
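
A sketch of that workaround in baremetal-deployment.yaml, simply mirroring the storage_mgmt entry the HCI roles already carry:

- name: DistributedComputeScaleOut
  count: 1
  hostname_format: '%stackname%-computescaleout2-%index%'
  defaults:
    profile: computescaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computescaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt        # workaround: add the missing network
      subnet: storage_mgmt_leaf2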

Also, we see that TLS-e provided a certificate for each network, e.g. /etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.pem

The failing script runs on the DistributedComputeScaleOut node for each network in HAProxyNetworks [2], even if that network is not configured on the node.

Maybe the line that provides the list for the loop [2] needs to be intersected with the list of networks that are actually configured on the running node.
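
One way such an intersection could look in the j2 template (illustrative only; the variable names are assumptions and this is not the merged upstream change verbatim):

{# only emit a certificate request for networks enabled on this role #}
{% for network in haproxy_networks if network in enabled_networks %}
- name: haproxy-{{ network }}-cert
  # per-network dns/principal/run_after entries as in the error above
{% endfor %}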


[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/haproxy/haproxy-internal-tls-certmonger.j2.yaml#L172

Comment 7 Steve Baker 2022-05-25 22:16:08 UTC
I think John has explained the root cause. Assigning to DFG:Security to look at changing "Certificate generation" to skip networks which are not configured on that node.

Comment 8 John Fulton 2022-05-26 12:36:10 UTC
This is similar to, but not a duplicate of, BZ 2081698; solved by https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841930/1/deployment/apache/apache-baremetal-puppet.j2.yaml

Comment 10 Yaniv Kaul 2022-06-29 09:06:42 UTC
Any updates on this BZ? The upstream Gerrit change has been merged for quite some time, yet this BZ is still in NEW despite its high severity.

Comment 13 Alan Bishop 2022-07-06 03:29:24 UTC
Upstream patch has merged on stable/wallaby.

Comment 27 errata-xmlrpc 2022-09-21 12:21:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

