Based on the debugging of bz #1868986, it turned out that once an instance is supposed to be spawned on a DCN site with local Glance, the Edge compute uses an IP address as the API endpoint and not the FQDN, which is a problem when TLS-E is used, since the cert in use has the FQDN as CN and the cert won't be validated.

+++ This bug was initially created as a clone of Bug #1868986 +++

Description of problem:

Deployment of DCN with distributed multibackend storage and TLS-E (tripleo-ipa) fails to deploy the DCN site on:

<LOG>
fatal: [dcn1-computehciscaleout1-0]: FAILED! => {"ansible_job_id": "552398882008.22056", "attempts": 11, "changed": true, "cmd": "set -o pipefail; puppet apply --debug --verbose --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --detailed-exitcodes --summarize --color=false /var/lib/tripleo-config/puppet_step_config.pp 2>&1 | logger -s -t puppet-user", "delta": "0:00:35.082779", "end": "2020-08-14 08:25:17.356108", "failed_when_result": true, "finished": 1, "msg": "non-zero return code", "rc": 6, "start": "2020-08-14 08:24:42.273329", "stderr": "<13>Aug 14 08:24:42 puppet-user: Debug: Runtime environment: puppet_version=5.5.10, ruby_version=2.5.5, run_mode=user, default_encoding=UTF-8
...skipped log ...
<13>Aug 13 21:14:32 puppet-user: Debug: Issuing getcert command with args: [\"request\", \"-I\", \"haproxy-external-cert\", \"-f\", \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt\", \"-c\", \"IPA\", \"-N\", \"CN=overcloud.redhat.local\", \"-K\", \"haproxy/overcloud.redhat.local\", \"-D\", \"overcloud.redhat.local\", \"-U\", \"id-kp-clientAuth\", \"-U\", \"id-kp-serverAuth\", \"-C\", \"/usr/bin/certmonger-haproxy-refresh.sh reload external\", \"-w\", \"-k\", \"/etc/pki/tls/private/haproxy/overcloud-haproxy-external.key\"]
<13>Aug 13 21:14:32 puppet-user: Debug: Executing: '/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=overcloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-external.key'
<13>Aug 13 21:14:33 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=overcloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-external.key' returned 2: New signing request \"haproxy-external-cert\" added.
<13>Aug 13 21:14:33 puppet-user: Debug: Executing: '/usr/bin/getcert list -i haproxy-external-cert'
<13>Aug 13 21:14:34 puppet-user: Error: /Stage[main]/Tripleo::Profile::Base::Certmonger_user/Tripleo::Certmonger::Haproxy[haproxy-external]/Certmonger_certificate[haproxy-external-cert]: Could not evaluate: Could not get certificate: Server at https://site-freeipa-0.redhat.local/ipa/xml denied our request, giving up: 2100 (RPC failed at server.
Insufficient access: Insufficient 'write' privilege to the 'userCertificate' attribute of entry 'krbprincipalname=haproxy/overcloud.redhat.local,cn=services,cn=accounts,dc=redhat,dc=local'.).
</LOG>

The failing command on dcn1-computehciscaleout1-0 is:

/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=overcloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-external.key

with the error:

Could not get certificate: Server at https://site-freeipa-0.redhat.local/ipa/xml denied our request, giving up: 2100 (RPC failed at server. Insufficient access: Insufficient 'write' privilege to the 'userCertificate' attribute of entry 'krbprincipalname=haproxy/overcloud.redhat.local,cn=services,cn=accounts,dc=redhat,dc=local'.).

I searched for a solution, and I can get the cert issued if I run the following commands:

[heat-admin@dcn1-computehciscaleout1-0 ~]$ ipa service-add-host --hosts=dcn1-computehciscaleout1-0.redhat.local haproxy/overcloud.redhat.local
[heat-admin@dcn1-computehciscaleout1-0 ~]$ sudo ipa-getcert resubmit -i haproxy-external-cert

and then the cert is issued successfully:

[heat-admin@dcn1-computehciscaleout1-0 ~]$ sudo /usr/bin/getcert list -i haproxy-external-cert
Number of certificates and requests being tracked: 11.
Request ID 'haproxy-external-cert':
  status: MONITORING
  stuck: no
  key pair storage: type=FILE,location='/etc/pki/tls/private/haproxy/overcloud-haproxy-external.key'
  certificate: type=FILE,location='/etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt'
  CA: IPA
  issuer: CN=Certificate Authority,O=REDHAT.LOCAL
  subject: CN=overcloud.redhat.local,O=REDHAT.LOCAL
  expires: 2022-08-14 22:47:09 UTC
  dns: overcloud.redhat.local
  principal name: haproxy/overcloud.redhat.local
  key usage: digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
  eku: id-kp-serverAuth,id-kp-clientAuth
  pre-save command:
  post-save command: /usr/bin/certmonger-haproxy-refresh.sh reload external
  track: yes
  auto-renew: yes

I do not have a clear idea of how this type of topology is deployed or what the scaleout node is for, but this is the deploy command line:

openstack overcloud deploy \
 --timeout 240 \
 --templates /usr/share/openstack-tripleo-heat-templates \
 --stack dcn1 \
 --libvirt-type kvm \
 --ntp-server clock1.rdu2.redhat.com \
 -e /usr/share/openstack-tripleo-heat-templates/environments/dcn-hci.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/net-multiple-nics.yaml \
 -e /home/stack/dcn1/internal.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
 -n /home/stack/dcn1/network/network_data.yaml \
 -r /home/stack/dcn1/roles/roles_data.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovs.yaml \
 -e /home/stack/dcn1/network/network-environment.yaml \
 -e /home/stack/dcn1/enable-tls.yaml \
 -e /home/stack/dcn1/inject-trust-anchor.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
 -e /home/stack/dcn1/hostnames.yml \
 -e
/usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
 -e /home/stack/dcn1/dcn1_ceph_keys.yaml \
 -e /home/stack/dcn1/nodes_data.yaml \
 -e /home/stack/dcn1/debug.yaml \
 -e /home/stack/dcn1/docker-images.yaml \
 -e /home/stack/dcn1/glance.yaml \
 -e /home/stack/central_ceph_external.yaml \
 -e /home/stack/central-export.yaml \
 -e /home/stack/dcn1/config_heat.yaml \
 -e ~/containers-prepare-parameter.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
 -e /home/stack/dcn1/cloud-names.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \
 -e /home/stack/dcn1/ipaservices-baremetal-ansible.yaml \
 --log-file dcn1_overcloud_deployment_22.log

More info about how to deploy such an env:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/distributed_compute_node_and_storage_deployment/assembly_deploying-storage-at-the-edge

How reproducible:
Always

Steps to Reproduce:
1. Deploy a DCN topology with distributed multibackend storage on the DCN site and TLS-Everywhere deployed in tripleo-ipa mode.

Actual results:
The DCN site fails to deploy.

Expected results:
Successful deployment.

Additional info:
I am submitting under ansible-tripleo-ipa only as a placeholder for triage; I have no idea which component to choose. I will provide more info about logs and the env in the comments.
--- Additional comment from Marian Krcmarik on 2020-08-14 22:17:24 UTC ---

The deploy command lines can be found in the following tar from the undercloud:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-deployment-16.1-rhel-virthost-ipv4-3cont-3hci-2leafs-x-4hci-ovs-dmb-storage/48/artifact/site-undercloud-0.tar.gz

The full console log from Jenkins, which includes the failure and all other log output from deploying this topology (central, dcn1 and dcn2 sites), can be found at:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-deployment-16.1-rhel-virthost-ipv4-3cont-3hci-2leafs-x-4hci-ovs-dmb-storage/48/artifact/.sh/edge-oc-deploy-spine-leaf.log

--- Additional comment from Ade Lee on 2020-08-18 18:48:24 UTC ---

The workaround you found points to the fact that the relevant haproxy service was not added to IPA for the host. This is supposed to be done when tripleo_ipa is invoked, based on the service_metadata for the role.
Looking at the group_vars for the node, I see the following:

group_vars/ComputeHCIScaleOut1:

service_metadata_settings:
  compact_service_etcd:
  - internalapi
  compact_service_haproxy:
  - ctlplane
  - storage
  - storagemgmt
  - internalapi
  compact_service_libvirt:
  - internalapi
  compact_service_libvirt-vnc:
  - internalapi
  compact_service_qemu:
  - internalapi
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.redhat.local
  managed_service_haproxyinternal_api: haproxy/overcloud.internalapi.redhat.local
  managed_service_haproxystorage: haproxy/overcloud.storage.redhat.local
  managed_service_haproxystorage_mgmt: haproxy/overcloud.storagemgmt.redhat.local

Compare this to the one for the controller at the central site:

central/Controller0:

service_metadata_settings:
  compact_service_HTTP:
  - ctlplane
  - external
  - storage
  - storagemgmt
  - internalapi
  compact_service_haproxy:
  - ctlplane
  - storage
  - storagemgmt
  - internalapi
  compact_service_libvirt-vnc:
  - internalapi
  compact_service_mysql:
  - internalapi
  compact_service_neutron:
  - internalapi
  compact_service_novnc-proxy:
  - internalapi
  compact_service_rabbitmq:
  - internalapi
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.redhat.local
  managed_service_haproxyexternal: haproxy/overcloud.redhat.local
  managed_service_haproxyinternal_api: haproxy/overcloud.internalapi.redhat.local
  managed_service_haproxystorage: haproxy/overcloud.storage.redhat.local
  managed_service_haproxystorage_mgmt: haproxy/overcloud.storagemgmt.redhat.local
  managed_service_mysqlinternal_api: mysql/overcloud.internalapi.redhat.local

The important part that is missing in the service metadata for ComputeHCIScaleOut1 is:

  managed_service_haproxyexternal: haproxy/overcloud.redhat.local

The addition of that metadata would result in the service being added. We'd need to look to see why that is not being added.
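The gap described above can be confirmed mechanically. A minimal Python sketch, with the managed_service_* keys transcribed from the group_vars quoted in this comment (variable names are mine, not tripleo code):

```python
# Managed-service entries transcribed from the group_vars quoted above;
# the managed_service_* keys are what drives service creation in IPA.
scaleout_role = {
    "managed_service_haproxyctlplane": "haproxy/overcloud.ctlplane.redhat.local",
    "managed_service_haproxyinternal_api": "haproxy/overcloud.internalapi.redhat.local",
    "managed_service_haproxystorage": "haproxy/overcloud.storage.redhat.local",
    "managed_service_haproxystorage_mgmt": "haproxy/overcloud.storagemgmt.redhat.local",
}
controller_role = {
    **scaleout_role,
    "managed_service_haproxyexternal": "haproxy/overcloud.redhat.local",
    "managed_service_mysqlinternal_api": "mysql/overcloud.internalapi.redhat.local",
}

# Keys the central controller has that ComputeHCIScaleOut1 lacks; the
# haproxyexternal entry is the one whose absence causes the getcert failure.
missing = sorted(set(controller_role) - set(scaleout_role))
print(missing)  # ['managed_service_haproxyexternal', 'managed_service_mysqlinternal_api']
```

Of the two missing keys, only managed_service_haproxyexternal matters here: without it, tripleo-ipa never creates the haproxy/overcloud.redhat.local service entry the getcert request needs write access to.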
--- Additional comment from Ade Lee on 2020-08-18 18:55:47 UTC ---

That metadata seems to be defined in ./deployment/haproxy/haproxy-public-tls-certmonger.yaml, which is referenced in /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml, which is in the deploy script for DCN. So I'm not sure why it's not being included in the metadata. Maybe because the external network is not defined there?

bandini --- any thoughts?

--- Additional comment from Michele Baldessari on 2020-08-19 07:01:58 UTC ---

If I recall correctly, while I was playing with IPv6+TLS-E I had noticed that by default FreeIPA accepts certificate requests only from hosts that have an IP address within a subnet where FreeIPA has an IP configured, and will actively deny other requests (also DNS requests). I wonder if that could be related as well.

Marian, do you have an env with this issue somewhere that Ade and I can poke at?

--- Additional comment from Marian Krcmarik on 2020-08-24 13:44:12 UTC ---

(In reply to Michele Baldessari from comment #4)
> If I recall correctly while I was playing with IPv6+TLS-E, I had noticed
> that by default freeipa accepts certificate requests only from hosts that
> have an IP address within any subnets where FreeIPA has an IP configured and
> will actively deny other requests (also DNS requests). I wonder if that
> could be related as well.
>
> Marian do you have an env with this issue somewhere Ade and I can poke at?

I do have a setup, feel free to ping me once you have time.

--- Additional comment from Marian Krcmarik on 2020-08-25 00:57:21 UTC ---

(In reply to Ade Lee from comment #3)
> That metadata seems to be defined in
> ./deployment/haproxy/haproxy-public-tls-certmonger.yaml,
> which is referenced in
> /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml,
> which is in the deploy script for DCN.
>
> So not sure why its not being included in the metadata.
> Maybe coz external network not defined there?
>
> bandini --- any thoughts?

If I specify the external network to be used for the ComputeHCIScaleOut1 role, then the DCN stack gets deployed properly. It seems that the service metadata for the external haproxy service is (as you said) created here:
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-certmonger.yaml#L81-L84
and PublicNetwork is:
https://opendev.org/openstack/tripleo-heat-templates/src/commit/d58efb58e0c39b2ca1585d87fe6d542484b33ad0/network/service_net_map.j2.yaml#L80
so it is only created if the external network exists. The question now is whether the external network should be added to the https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/roles/DistributedComputeHCI.yaml role, or whether the external haproxy service should not be created. I have no idea; maybe Alan would know?

--- Additional comment from Alan Bishop on 2020-08-25 21:05:00 UTC ---

At DCN (edge) sites, haproxy is only used on the internal_api network, by the DistributedComputeScaleOut and DistributedComputeHCIScaleOut roles. That's so internal glance_api requests can be forwarded to the (internal) endpoints on the DistributedCompute (or DistributedComputeHCI) nodes.

I think the issue is that the metadata_settings [1] specify "service: haproxy," but at the DCN site the service is named "haproxy_edge" [2]. The service must be named differently at the DCN site to avoid mixing it up with the "haproxy" service running in the control plane.

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-certmonger.yaml#L81-L84
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-edge-container-puppet.yaml#L80

But I'm not sure the answer is figuring out a way to create metadata_settings for the haproxy_edge service. Given what I stated above, I'm not sure why DCN sites need anything related to public TLS.
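The external-network condition Marian points to above can be sketched in a few lines. This is an illustration only, not actual tripleo-heat-templates logic; the function and parameter names are mine, and the network names are taken from the group_vars in this bug:

```python
def haproxy_external_metadata(role_networks, cloud_fqdn="overcloud.redhat.local"):
    """Illustrative sketch: the haproxy external managed-service entry is
    only emitted when the role actually has the External network, mirroring
    the behaviour referenced in haproxy-public-tls-certmonger.yaml."""
    if "External" not in role_networks:
        return {}
    return {"managed_service_haproxyexternal": f"haproxy/{cloud_fqdn}"}

# A DCN scale-out role has no External network, so no metadata is emitted
# and tripleo-ipa never creates the haproxy/overcloud.redhat.local service:
print(haproxy_external_metadata(["ctlplane", "InternalApi", "Storage", "StorageMgmt"]))  # {}
# Adding External to the role (Marian's successful experiment) emits the entry:
print(haproxy_external_metadata(["ctlplane", "External", "InternalApi"]))
```

This matches both observations in the bug: the deployment succeeds when the external network is added to the role, and also when the public-TLS environment files are dropped so the entry is never needed.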
I'd be curious to know if things work if you dropped these two env files from the DCN site's deployment command:

-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \

But now I fear we'll end up with a similar problem with the internal TLS stuff at [3].

[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-internal-tls-certmonger.j2.yaml#L100-L105

I don't know how this stuff works, so I'll let you folks digest this info and see where it leads next. If something is truly necessary for haproxy, then I think the key is understanding that the service is actually named haproxy_edge at DCN sites.

--- Additional comment from Marian Krcmarik on 2020-08-27 17:33:04 UTC ---

> But I'm not sure the answer is figuring out a way to create
> metadata_settings for the haproxy_edge service. Given what I stated above,
> I'm not sure why DCN sites need anything related to public TLS. I'd be
> curious to know of things work if you dropped these two env files from the
> DCN site's deployment command:
>
> -e
> /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
> -e
> /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \

I was able to get it successfully deployed once I removed these two templates from the DCN deploy command line, which is a little surprising to me.
Anyway, I am hitting another problem (not sure if it is related to the way it is deployed, especially the things discussed here) - it fails to create any instance on the DCN site, with the following error:

/var/log/containers/nova/nova-conductor.log:2020-08-27 02:30:01.722 21 WARNING nova.scheduler.utils [req-95f02425-49a0-4a92-8037-d9f4acf27b5f d0ed4b0b6cab45d98e76e4b3b061040d c4d8a45d49904b1c8c0f4115c3812e13 - default default] [instance: 00628f62-6048-4763-b35f-247bcea57804] Setting instance to ERROR state.: nova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 00628f62-6048-4763-b35f-247bcea57804. Last exception: SSL exception connecting to https://172.25.3.55:9292/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71: HTTPSConnectionPool(host='172.25.3.55', port=9292): Max retries exceeded with url: /v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71 (Caused by SSLError(CertificateError("hostname '172.25.3.55' doesn't match 'dcn2-computehci2-1.internalapi.redhat.local'",),))

dcn2-computehci2-1.internalapi.redhat.local resolves to 172.25.3.55 and vice versa, but the SSL cert in use has dcn2-computehci2-1.internalapi.redhat.local as CN, while the connection is made to the IP address 172.25.3.55. Should it try to connect to https://dcn2-computehci2-1.internalapi.redhat.local:9292/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71 instead?

--- Additional comment from Alan Bishop on 2020-08-27 17:46:40 UTC ---

AFAIK (Ade should confirm) it should be using the FQDN and not the IP address. But what's "it" in this instance? Is it a scale-out node, which is accessing glance via haproxy?
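As background on the CertificateError quoted above: when a TLS client connects by IP address, hostname verification compares the literal string '172.25.3.55' against the names in the certificate, and reverse DNS is deliberately never consulted. A toy sketch of that comparison (real verification per RFC 6125 also handles wildcards and IP-address SANs, which this omits):

```python
# Toy model of the check that produced the nova-conductor error above:
# the name the client dialed must literally appear among the cert's names.
def matches_cert(connect_host, cert_names):
    # Exact-match case only; the point is that DNS resolution of the IP
    # back to the FQDN plays no part in the comparison.
    return connect_host in cert_names

cert_names = ["dcn2-computehci2-1.internalapi.redhat.local"]
print(matches_cert("172.25.3.55", cert_names))  # False -> CertificateError
print(matches_cert("dcn2-computehci2-1.internalapi.redhat.local", cert_names))  # True
```

So even though the IP and the FQDN resolve to each other, only connecting via the FQDN (or issuing a cert with the IP in its SANs) can satisfy the check; that is why the glance endpoint must be addressed by FQDN under TLS-E.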
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284