Bug 2088526 - DistributedComputeScaleOut node fails to deploy on a DCN site with storage
Summary: DistributedComputeScaleOut node fails to deploy on a DCN site with storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.0
Assignee: Grzegorz Grasza
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-19 15:50 UTC by Marian Krcmarik
Modified: 2022-09-21 12:22 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-0.20220719171711.feca772.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:21:38 UTC
Target Upstream Version:
Embargoed:


Attachments
Overcloud deploy log (3.39 MB, text/plain)
2022-05-19 15:50 UTC, Marian Krcmarik


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 841930 0 None stable/wallaby: MERGED tripleo-heat-templates: Stop generating certificate requests for disabled networks (I05ba5fb48c617a5bbedebb8b74c23bec9ab... 2022-07-06 12:41:52 UTC
OpenStack gerrit 848550 0 None stable/wallaby: MERGED tripleo-heat-templates: Stop generating certificate requests for disabled networks (I0293e019f3a2c4c8ffbf8258214d8522957... 2022-07-06 12:41:57 UTC
Red Hat Issue Tracker OSP-15318 0 None None None 2022-05-19 15:59:44 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:22:06 UTC

Description Marian Krcmarik 2022-05-19 15:50:23 UTC
Created attachment 1881400 [details]
Overcloud deploy log

Description of problem:
Overcloud deployment of a DCN site with storage and the following topology: 3 DistributedComputeHCI nodes, 1 DistributedComputeHCIScaleOut node, and 1 DistributedComputeScaleOut node fails with the following error:
FATAL | Ensure provider packages are installed | dcn2-computescaleout2-0 | error={"msg": "{% set unique_providers = [] %}
{% for item in certificate_requests %}
{%   set _ = unique_providers.append(
       item.provider |
       d(__certificate_provider_default)
     ) %}
{% endfor %}
{{ unique_providers | unique }}
: [{'ca': 'ipa', 'dns': ['{{fqdn_ctlplane}}', '{{cloud_names.cloud_name_ctlplane}}'], 'key_size': '2048', 'name': 'haproxy-ctlplane-cert', 'principal': 'haproxy/{{fqdn_ctlplane}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-ctlplane-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt\"\
cp \"/etc/pki/tls/private/haproxy-ctlplane-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}, {'ca': 'ipa', 'dns': ['{{fqdn_storage}}', '{{cloud_names.cloud_name_storage}}'], 'key_size': '2048', 'name': 'haproxy-storage-cert', 'principal': 'haproxy/{{fqdn_storage}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-storage-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.crt\"\
cp \"/etc/pki/tls/private/haproxy-storage-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}, {'ca': 'ipa', 'dns': ['{{fqdn_storage_mgmt}}', '{{cloud_names.cloud_name_storage_mgmt}}'], 'key_size': '2048', 'name': 'haproxy-storage_mgmt-cert', 'principal': 'haproxy/{{fqdn_storage_mgmt}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-storage_mgmt-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.crt\"\
cp \"/etc/pki/tls/private/haproxy-storage_mgmt-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage_mgmt.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-storage_mgmt.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}, {'ca': 'ipa', 'dns': ['{{fqdn_internal_api}}', '{{cloud_names.cloud_name_internal_api}}'], 'key_size': '2048', 'name': 'haproxy-internal_api-cert', 'principal': 'haproxy/{{fqdn_internal_api}}@{{idm_realm}}', 'run_after': '# Copy crt and key for backward compatibility\
cp \"/etc/pki/tls/certs/haproxy-internal_api-cert.crt\" \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt\"\
cp \"/etc/pki/tls/private/haproxy-internal_api-cert.key\" \"/etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key\"\
\
ca_path=\"/etc/ipa/ca.crt\"\
service_crt=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt\"\
service_key=\"/etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key\"\
service_pem=\"/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem\"\
\
cat \"$service_crt\" \"$ca_path\" \"$service_key\" > \"$service_pem\"\
\
container_name=$({{container_cli}} ps --format=\\\\{\\\\{.Names\\\\}\\\\} | grep -w -E \\'haproxy(-bundle-.*-[0-9]+)?\\')\
# Inject the new pem into the running container\
if echo \"$container_name\" | grep -q \"^haproxy-bundle\"; then\
  # lp#1917868: Do not use podman cp with HA containers as they get\
  # frozen temporarily and that can make pacemaker operation fail.\
  tar -c \"$service_pem\" | {{container_cli}} exec -i \"$container_name\" tar -C / -xv\
  # no need to update the mount point, because pacemaker\
  # recreates the container when it\\'s restarted\
else\
  # Refresh the pem at the mount-point\
  {{container_cli}} cp $service_pem \"$container_name:/var/lib/kolla/config_files/src-tls/$service_pem\"\
  # Copy the new pem from the mount-point to the real path\
  {{container_cli}} exec \"$container_name\" cp \"/var/lib/kolla/config_files/src-tls$service_pem\" \"$service_pem\"\
fi\
# Set appropriate permissions\
{{container_cli}} exec \"$container_name\" chown haproxy:haproxy \"$service_pem\"\
# Trigger a reload for HAProxy to read the new certificates\
{{container_cli}} kill --signal HUP \"$container_name\"\
'}]: 'fqdn_storage_mgmt' is undefined"}
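
For context, each item in the certificate_requests list above is generated per network by the HAProxy TLS-everywhere template in tripleo-heat-templates, and the per-network values are left as Ansible expressions that are rendered later on the node. A minimal sketch of the storage_mgmt item as the node sees it (paraphrased from the error above; nothing beyond what the error shows is assumed):

- name: haproxy-storage_mgmt-cert
  dns:
    - "{{ fqdn_storage_mgmt }}"                      # set only when the storage_mgmt network is configured on the node
    - "{{ cloud_names.cloud_name_storage_mgmt }}"
  principal: "haproxy/{{ fqdn_storage_mgmt }}@{{ idm_realm }}"

Because the DistributedComputeScaleOut node does not have the storage_mgmt network configured, fqdn_storage_mgmt is never defined there, so rendering the request list fails with the error shown.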

This is the overcloud deploy command line:
openstack overcloud deploy \
--timeout 240 \
--templates /usr/share/openstack-tripleo-heat-templates \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/barbican-backend-simple-crypto.yaml \
  --environment-file /usr/share/openstack-tripleo-heat-templates/environments/services/barbican-edge.yaml \
--stack dcn2 \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /usr/share/openstack-tripleo-heat-templates/environments/dcn-storage.yaml \
--deployed-server \
-e /home/stack/templates/overcloud-vip-deployed.yaml \
-e /home/stack/templates/overcloud-networks-deployed.yaml \
-e /home/stack/templates/overcloud-baremetal-deployed-dcn2.yaml \
-e /home/stack/templates/overcloud-ceph-deployed-dcn2.yaml \
--networks-file /home/stack/dcn2/network/network_data_v2.yaml \
-e /home/stack/dcn2/internal.yaml \
-r /home/stack/dcn2/roles/roles_data.yaml \
-e /home/stack/dcn2/network/network-environment_v2.yaml \
-e /home/stack/dcn2/enable-tls.yaml \
-e /home/stack/dcn2/inject-trust-anchor.yaml \
-e /home/stack/dcn2/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml \
-e /home/stack/dcn2/nodes_data.yaml \
-e /home/stack/dcn2/glance.yaml \
-e /home/stack/dcn2/debug.yaml \
-e /home/stack/dcn2/use-dns-for-vips.yaml \
-e /home/stack/dcn2/glance.yaml \
-e /home/stack/central_ceph_external.yaml \
-e /home/stack/central-export.yaml \
-e /home/stack/dcn2/config_heat.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/dcn2/barbican.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
-e /home/stack/dcn2/cloud-names.yaml \
-e /home/stack/dcn2/ipaservices-baremetal-ansible.yaml

and baremetal-deployment.yaml content is:
- name: DistributedComputeHCI
  count: 3
  hostname_format: '%stackname%-computehci2-%index%'
  defaults:
    profile: computehci2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computehci2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt
      subnet: storage_mgmt_leaf2
- name: DistributedComputeHCIScaleOut
  count: 1
  hostname_format: '%stackname%-computehciscaleout2-%index%'
  defaults:
    profile: computehciscaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computehciscaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt
      subnet: storage_mgmt_leaf2
- name: DistributedComputeScaleOut
  count: 1
  hostname_format: '%stackname%-computescaleout2-%index%'
  defaults:
    profile: computescaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computescaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2

If I remove the dcn-storage.yaml template from the overcloud deploy command line, the error is not thrown.

NOTE: To be able to deploy, I am using the following upstream patches:
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/840443
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841743
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/840534

Version-Release number of selected component (if applicable):
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
puppet-tripleo-14.2.3-0.20220407012437.87240e8.el9ost.noarch
python3-tripleo-common-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
tripleo-ansible-3.3.1-0.20220418220931.1557dde.el9ost.noarch
openstack-tripleo-validations-14.2.2-0.20220422030814.62e6da1.el9ost.noarch
openstack-tripleo-common-containers-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220328184445.0c754c6.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-0.20220417231018.c3cee34.el9ost.noarch
python3-tripleoclient-16.4.1-0.20220416001259.ccc329b.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a DCN site with storage and the topology described above, in particular with a DistributedComputeScaleOut node.

Actual results:
The deployment fails

Comment 1 Rabi Mishra 2022-05-23 06:54:24 UTC
Looks like dcn-storage.yaml has ManageNetworks: False. Maybe the networks were not in the network_data of the main stack, or some fixes are missing, such as https://review.opendev.org/c/openstack/tripleo-heat-templates/+/781572.
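
For reference, that setting in the dcn-storage.yaml environment file would look roughly like this (illustrative excerpt, not the full file):

parameter_defaults:
  ManageNetworks: false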

Comment 2 Steve Baker 2022-05-23 19:52:40 UTC
In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut networks list is missing the storage_mgmt network. Feel free to close this if that fixes the issue.

Comment 3 Marian Krcmarik 2022-05-23 20:44:41 UTC
(In reply to Steve Baker from comment #2)
> In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut
> networks list is missing the storage_mgmt network. Feel free to close this
> if that fixes the issue.

That's intentional, since the predefined role in tripleo-heat-templates (tht) does not have the storage_mgmt network defined:
https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml
So my assumption is that the storage_mgmt network does not need to be deployed on the DistributedComputeScaleOut node.
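
For reference, a paraphrased sketch of the networks that predefined role declares (see the linked roles file for the exact definition; the subnet names here are assumptions):

- name: DistributedComputeScaleOut
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
  # note: no StorageMgmt entry, unlike DistributedComputeHCIScaleOut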

Comment 4 Julia Kreger 2022-05-25 17:31:16 UTC
I'm not an expert in that specific architecture configuration, but wouldn't you still need the storage network so you don't burden your primary network interfaces with the storage traffic?

Comment 5 John Fulton 2022-05-25 20:40:11 UTC
(In reply to Julia Kreger from comment #4)
> I'm not an expert in that specific architecture configuration, but wouldn't
> you still need storage network so you don't burden your primary network
> interfaces with the storage networking traffic?

Both roles have the "storage network" to unburden the primary interface. We're talking about which roles should have the storage_mgmt network, which is another network for storage.

(In reply to Marian Krcmarik from comment #3)
> (In reply to Steve Baker from comment #2)
> > In baremetal-deployment.yaml it looks like the DistributedComputeScaleOut
> > networks list is missing the storage_mgmt network. Feel free to close this
> > if that fixes the issue.
> 
> That's intentional since the predefined role in tht does not have
> storage_mgmt network defined:
> https://github.com/openstack/tripleo-heat-templates/blob/master/roles/
> DistributedComputeScaleOut.yaml
> So my assumption is that there is no need for storage_mgmt network to be
> deployed on DistributedComputeScaleOut node.

- DistributedComputeScaleOut has the storage network [1]
- DistributedComputeHCIScaleOut has the storage network and the storage_mgmt network [2]
- In Ceph terms, "storage network" is the "ceph public_network" and the storage_mgmt network is the "ceph cluster_network" [3]

- You only need the storage_mgmt network if you are hosting OSDs so that they can replicate.
- The main difference between DistributedComputeScaleOut and DistributedComputeHCIScaleOut is that the HCI one hosts OSDs. 
- Since DistributedComputeScaleOut does not host OSDs, it shouldn't need the storage_mgmt network.

Unless the storage_mgmt network is being used for something other than just OSDs. The ServiceNetMap in the config-download directory could confirm which services are using which networks.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeScaleOut.yaml#L15-L16
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/roles/DistributedComputeHCIScaleOut.yaml#L16-L17
[3] https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
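
Regarding the ServiceNetMap check suggested above, an illustrative excerpt of the relevant mappings (key names follow the TripleO defaults; the values are not taken from this deployment and are shown only as an example):

parameter_defaults:
  ServiceNetMap:
    CephMonNetwork: storage            # ceph public_network
    CephClusterNetwork: storage_mgmt   # ceph cluster_network, used for OSD replication
    GlanceApiNetwork: internal_api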

Comment 6 John Fulton 2022-05-25 21:14:57 UTC
Marian confirmed the following: 
- If the deployment is run without DistributedComputeScaleOut, then it doesn't hit this bug
- If the DistributedComputeScaleOut node has the storage_mgmt network added, then it doesn't hit this bug (this can be a workaround until this BZ is closed)
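
A sketch of that workaround in baremetal-deployment.yaml, simply mirroring the storage_mgmt entry the HCI roles already carry:

- name: DistributedComputeScaleOut
  count: 1
  hostname_format: '%stackname%-computescaleout2-%index%'
  defaults:
    profile: computescaleout2
    network_config:
      template: /home/stack/dcn2/network/spine-leaf-nics/computescaleout2.j2
    networks:
    - network: ctlplane
      vif: true
    - network: storage
      subnet: storage_leaf2
    - network: internal_api
      subnet: internal_api_leaf2
    - network: tenant
      subnet: tenant_leaf2
    - network: storage_mgmt        # workaround: add the missing network
      subnet: storage_mgmt_leaf2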

Also, we see that TLS-e provided a certificate for each network, e.g. /etc/pki/tls/certs/haproxy/overcloud-haproxy-storage_mgmt.pem

The failing script runs on the DistributedComputeScaleOut node for each network in HAProxyNetworks [2], even if that network is not configured on the node.

Maybe the line that provides the list for the loop [2] needs to be intersected with the list of networks that are actually configured on the running node.
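
One way such an intersection could look in the j2 template (illustrative only; the variable names are assumptions and this is not the merged upstream change verbatim):

{# only emit a certificate request for networks enabled on this role #}
{% for network in haproxy_networks if network in enabled_networks %}
- name: haproxy-{{ network }}-cert
  # per-network dns/principal/run_after entries as in the error above
{% endfor %}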


[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/deployment/haproxy/haproxy-internal-tls-certmonger.j2.yaml#L172

Comment 7 Steve Baker 2022-05-25 22:16:08 UTC
I think John has explained the root cause. Assigning to DFG:Security to look at changing "Certificate generation" to skip networks which are not configured on that node.

Comment 8 John Fulton 2022-05-26 12:36:10 UTC
This is similar to, but not a duplicate of, BZ 2081698; solved by https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841930/1/deployment/apache/apache-baremetal-puppet.j2.yaml

Comment 10 Yaniv Kaul 2022-06-29 09:06:42 UTC
Any updates on this BZ? The upstream Gerrit change has been merged for quite some time, yet this BZ is still in NEW despite its high severity.

Comment 13 Alan Bishop 2022-07-06 03:29:24 UTC
Upstream patch has merged on stable/wallaby.

Comment 27 errata-xmlrpc 2022-09-21 12:21:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

