1873329 – Edge compute uses IP address instead of fqdn for glance api endpoint at edge site so the TLS cert isn't valid.


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	16.1 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	z2
Target Release:	16.1 (Train on RHEL 8.2)
Assignee:	Alan Bishop
QA Contact:	David Rosenfeld
Docs Contact:
URL:
Whiteboard:
Depends On:	1868986
Blocks:	1858851

Reported:	2020-08-27 21:32 UTC by Marian Krcmarik
Modified:	2020-10-28 15:39 UTC
CC List:	7 users
Fixed In Version:	openstack-tripleo-heat-templates-11.3.2-1.20200904013443.7e37115.el8ost
Doc Type:	Bug Fix
Doc Text:	This update fixes a bug that prevented the distributed compute node (DCN) compute service from accessing the glance service. + Previously, distributed compute nodes were configured with a glance endpoint URI that specified an IP address, even when deployed with internal transport layer security (TLS). Because TLS requires the endpoint URI to specify a fully qualified domain name (FQDN), the compute service could not access the glance service. + Now, when deployed with internal TLS, DCN services are configured with a glance endpoint URI that specifies an FQDN, and the DCN compute service can access the glance service.
Clone Of:	1868986
Environment:
Last Closed:	2020-10-28 15:39:26 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1893453	None	None	None	2020-08-28 15:58:12 UTC
OpenStack gerrit	748736	None	MERGED	DCN: use FQDN in glance endpoint with internal TLS	2021-02-05 10:59:52 UTC
Red Hat Product Errata	RHEA-2020:4284	None	None	None	2020-10-28 15:39:55 UTC

Description Marian Krcmarik 2020-08-27 21:32:00 UTC

While debugging bz #1868986, it turned out that when an instance is spawned on a DCN site with a local glance, the edge compute node uses an IP address rather than the FQDN as the glance API endpoint. This is a problem when TLS-E is used, because the certificate uses the FQDN as its CN and therefore cannot be validated.

+++ This bug was initially created as a clone of Bug #1868986 +++

Description of problem:
Deployment of DCN with Distributed multibackend storage and TLS-E (tripleo-ipa) fails to deploy DCN site on:
<LOG>
fatal: [dcn1-computehciscaleout1-0]: FAILED! => {"ansible_job_id": "552398882008.22056", "attempts": 11, "changed": true, "cmd": "set -o pipefail; puppet apply --debug --verbose --modulepath=/etc/puppet/modules:
/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --detailed-exitcodes --summarize --color=false   /var/lib/tripleo-config/puppet_step_config.pp 2>&1 | logger -s -t puppet-user", "delta": "0:00:35.08
2779", "end": "2020-08-14 08:25:17.356108", "failed_when_result": true, "finished": 1, "msg": "non-zero return code", "rc": 6, "start": "2020-08-14 08:24:42.273329", "stderr": "<13>Aug 14 08:24:42 puppet-user: D
ebug: Runtime environment: puppet_version=5.5.10, ruby_version=2.5.5, run_mode=user, default_encoding=UTF-8

...skipped log ...

<13>Aug 13 21:14:32 puppet-user: Debug: Issuing getcert command with args: [\"request\", \"-I\", \"haproxy-external-cert\", \"-f\", \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt\", \"-c\", \"IPA\",
\
 \"-N\", \"CN=overcloud.redhat.local\", \"-K\", \"haproxy/overcloud.redhat.local\", \"-D\", \"overcloud.redhat.local\", \"-U\", \"id-kp-clientAuth\", \"-U\", \"id-kp-serverAuth\", \"-C\", \"/usr/bin/certmonger-h
aproxy-refresh.sh reload external\", \"-w\", \"-k\", \"/etc/pki/tls/private/haproxy/overcloud-haproxy-external.key\"]
<13>Aug 13 21:14:32 puppet-user: Debug: Executing: '/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=overcloud.redhat.local -K haproxy/o
vercloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-external.key
<13>Aug 13 21:14:33 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=over
cloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy
/overcloud-haproxy-external.key' returned 2: New signing request \"haproxy-external-cert\" added.
<13>Aug 13 21:14:33 puppet-user: Debug: Executing: '/usr/bin/getcert list -i haproxy-external-cert'
<13>Aug 13 21:14:34 puppet-user: Error: /Stage[main]/Tripleo::Profile::Base::Certmonger_user/Tripleo::Certmonger::Haproxy[haproxy-external]/Certmonger_certificate[haproxy-external-cert]: Could not evaluate: Could not get certificate: Server at https://site-freeipa-0.redhat.local/ipa/xml denied our request, giving up: 2100 (RPC failed at server.  Insufficient access: Insufficient 'write' privilege to the 'userCertificat
e' attribute of entry 'krbprincipalname=haproxy/overcloud.redhat.local,cn=services,cn=accounts,dc=redhat,dc=local'.).
</LOG>

The failing command on dcn1-computehciscaleout1-0 is:
/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=over
cloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy
/overcloud-haproxy-external.key

with error:
Could not get certificate: Server at https://site-freeipa-0.redhat.local/ipa/xml denied our request, giving up: 2100 (RPC failed at server.  Insufficient access: Insufficient 'write' privilege to the 'userCertificat
e' attribute of entry 'krbprincipalname=haproxy/overcloud.redhat.local,cn=services,cn=accounts,dc=redhat,dc=local'.).

After searching for a solution, I found that the certificate request succeeds if I run the following commands:
[heat-admin@dcn1-computehciscaleout1-0 ~]$ ipa service-add-host --hosts=dcn1-computehciscaleout1-0.redhat.local haproxy/overcloud.redhat.local
[heat-admin@dcn1-computehciscaleout1-0 ~]$ sudo ipa-getcert resubmit -i haproxy-external-cert
after which the certificate is issued successfully:
[heat-admin@dcn1-computehciscaleout1-0 ~]$ sudo /usr/bin/getcert list -i haproxy-external-cert
Number of certificates and requests being tracked: 11.
Request ID 'haproxy-external-cert':
	status: MONITORING
	stuck: no
	key pair storage: type=FILE,location='/etc/pki/tls/private/haproxy/overcloud-haproxy-external.key'
	certificate: type=FILE,location='/etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt'
	CA: IPA
	issuer: CN=Certificate Authority,O=REDHAT.LOCAL
	subject: CN=overcloud.redhat.local,O=REDHAT.LOCAL
	expires: 2022-08-14 22:47:09 UTC
	dns: overcloud.redhat.local
	principal name: haproxy/overcloud.redhat.local
	key usage: digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
	eku: id-kp-serverAuth,id-kp-clientAuth
	pre-save command: 
	post-save command: /usr/bin/certmonger-haproxy-refresh.sh reload external
	track: yes
	auto-renew: yes
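
When checking many tracked requests, it can help to turn the `getcert list` output into a structure that is easy to query. The following is only a minimal sketch: the function name `parse_getcert_list` is hypothetical, the sample text is a subset of the output shown above, and the parser assumes the simple `key: value` layout seen in this report rather than any documented output format.

```python
def parse_getcert_list(text):
    """Parse `getcert list` style output into {request_id: {field: value}}.

    Assumes the layout shown in this bug report: a "Request ID '<id>':"
    line followed by indented "key: value" lines.
    """
    requests = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Request ID '") and stripped.endswith("':"):
            current = stripped[len("Request ID '"):-2]
            requests[current] = {}
        elif current is not None and ":" in stripped:
            # Split on the first colon only; values such as timestamps
            # contain further colons.
            key, _, value = stripped.partition(":")
            requests[current][key.strip()] = value.strip()
    return requests


# Subset of the output quoted above.
sample = (
    "Number of certificates and requests being tracked: 11.\n"
    "Request ID 'haproxy-external-cert':\n"
    "\tstatus: MONITORING\n"
    "\tstuck: no\n"
    "\tsubject: CN=overcloud.redhat.local,O=REDHAT.LOCAL\n"
    "\texpires: 2022-08-14 22:47:09 UTC\n"
    "\tdns: overcloud.redhat.local\n"
    "\tprincipal name: haproxy/overcloud.redhat.local\n"
)

parsed = parse_getcert_list(sample)
cert = parsed["haproxy-external-cert"]
print(cert["status"], cert["dns"])
```

A request whose `status` is `MONITORING` and whose `stuck` field is `no` corresponds to the successful state shown above.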

I do not have a clear idea of how this type of topology is deployed or what the scaleout node is for, but this is the deploy command line:
openstack overcloud deploy \
--timeout 240 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack dcn1 \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /usr/share/openstack-tripleo-heat-templates/environments/dcn-hci.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/net-multiple-nics.yaml \
-e /home/stack/dcn1/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-n /home/stack/dcn1/network/network_data.yaml \
-r /home/stack/dcn1/roles/roles_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovs.yaml \
-e /home/stack/dcn1/network/network-environment.yaml \
-e /home/stack/dcn1/enable-tls.yaml \
-e /home/stack/dcn1/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/dcn1/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/dcn1/dcn1_ceph_keys.yaml \
-e /home/stack/dcn1/nodes_data.yaml \
-e /home/stack/dcn1/debug.yaml \
-e /home/stack/dcn1/docker-images.yaml \
-e /home/stack/dcn1/glance.yaml \
-e /home/stack/central_ceph_external.yaml \
-e /home/stack/central-export.yaml \
-e /home/stack/dcn1/config_heat.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
-e /home/stack/dcn1/cloud-names.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \
-e /home/stack/dcn1/ipaservices-baremetal-ansible.yaml \
--log-file dcn1_overcloud_deployment_22.log


More info about how to deploy such env:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/distributed_compute_node_and_storage_deployment/assembly_deploying-storage-at-the-edge

How reproducible:
Always

Steps to Reproduce:
1. Deploy DCN topology with Distributed Multibackend storage on the DCN site and TLS-Everywhere deployed by tripleo-ipa mode.

Actual results:
DCN site fails to deploy

Expected results:
Successful deployment

Additional info:
I am filing this under ansible-tripleo-ipa only as a placeholder for triage; I have no idea which component to choose.
I will provide more info about logs and the environment in the comments.

--- Additional comment from Marian Krcmarik on 2020-08-14 22:17:24 UTC ---

The deploy command lines can be found in following tar from undercloud:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-deployment-16.1-rhel-virthost-ipv4-3cont-3hci-2leafs-x-4hci-ovs-dmb-storage/48/artifact/site-undercloud-0.tar.gz

The full console log from Jenkins which includes failure and all other log output from deploying such topology (central, dcn1 and dcn2 sites), can be found at:
https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/edge/job/DFG-edge-deployment-16.1-rhel-virthost-ipv4-3cont-3hci-2leafs-x-4hci-ovs-dmb-storage/48/artifact/.sh/edge-oc-deploy-spine-leaf.log

--- Additional comment from Ade Lee on 2020-08-18 18:48:24 UTC ---

The workaround you found points to the fact that the relevant haproxy service was not added to IPA for the host.
This is supposed to be done when tripleo_ipa is invoked, based on the service_metadata for the role.

Looking at the group_vars for the node, I see the following:

group_vars/ComputeHCIScaleOut1:

service_metadata_settings:
  compact_service_etcd:
  - internalapi
  compact_service_haproxy:
  - ctlplane
  - storage
  - storagemgmt
  - internalapi
  compact_service_libvirt:
  - internalapi
  compact_service_libvirt-vnc:
  - internalapi
  compact_service_qemu:
  - internalapi
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.redhat.local
  managed_service_haproxyinternal_api: haproxy/overcloud.internalapi.redhat.local
  managed_service_haproxystorage: haproxy/overcloud.storage.redhat.local
  managed_service_haproxystorage_mgmt: haproxy/overcloud.storagemgmt.redhat.local

Compare this to the one for the controller at the central site:

central/Controller0:
service_metadata_settings:
  compact_service_HTTP:
  - ctlplane
  - external
  - storage
  - storagemgmt
  - internalapi
  compact_service_haproxy:
  - ctlplane
  - storage
  - storagemgmt
  - internalapi
  compact_service_libvirt-vnc:
  - internalapi
  compact_service_mysql:
  - internalapi
  compact_service_neutron:
  - internalapi
  compact_service_novnc-proxy:
  - internalapi
  compact_service_rabbitmq:
  - internalapi
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.redhat.local
  managed_service_haproxyexternal: haproxy/overcloud.redhat.local
  managed_service_haproxyinternal_api: haproxy/overcloud.internalapi.redhat.local
  managed_service_haproxystorage: haproxy/overcloud.storage.redhat.local
  managed_service_haproxystorage_mgmt: haproxy/overcloud.storagemgmt.redhat.local
  managed_service_mysqlinternal_api: mysql/overcloud.internalapi.redhat.local

The important part that is missing in the service metadata for ComputeHCIScaleOut1 is:
managed_service_haproxyexternal: haproxy/overcloud.redhat.local

The addition of that metadata would result in the service being added.

We'd need to look to see why that is not being added.
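
The comparison above can be reproduced mechanically. The sketch below is illustrative only: the helper `missing_managed_services` is a hypothetical name, the dictionaries are copied by hand from the two `service_metadata_settings` dumps in this comment, and the comparison is limited to `managed_service_haproxy*` keys on the assumption that other services (for example mysql) are legitimately absent from a compute role.

```python
def missing_managed_services(reference, candidate, prefix="managed_service_haproxy"):
    """Return reference entries with the given key prefix that candidate lacks."""
    return {
        key: value
        for key, value in reference.items()
        if key.startswith(prefix) and key not in candidate
    }


# Values copied from central/Controller0 above.
controller = {
    "managed_service_haproxyctlplane": "haproxy/overcloud.ctlplane.redhat.local",
    "managed_service_haproxyexternal": "haproxy/overcloud.redhat.local",
    "managed_service_haproxyinternal_api": "haproxy/overcloud.internalapi.redhat.local",
    "managed_service_haproxystorage": "haproxy/overcloud.storage.redhat.local",
    "managed_service_haproxystorage_mgmt": "haproxy/overcloud.storagemgmt.redhat.local",
    "managed_service_mysqlinternal_api": "mysql/overcloud.internalapi.redhat.local",
}

# Values copied from group_vars/ComputeHCIScaleOut1 above.
scaleout = {
    "managed_service_haproxyctlplane": "haproxy/overcloud.ctlplane.redhat.local",
    "managed_service_haproxyinternal_api": "haproxy/overcloud.internalapi.redhat.local",
    "managed_service_haproxystorage": "haproxy/overcloud.storage.redhat.local",
    "managed_service_haproxystorage_mgmt": "haproxy/overcloud.storagemgmt.redhat.local",
}

missing = missing_managed_services(controller, scaleout)
print(missing)
```

With the data from this comment, the only entry reported is `managed_service_haproxyexternal`, which matches the principal named in the failing `getcert request`.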

--- Additional comment from Ade Lee on 2020-08-18 18:55:47 UTC ---

That metadata seems to be defined in ./deployment/haproxy/haproxy-public-tls-certmonger.yaml,
which is referenced in 
/usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml, which is
in the deploy script for DCN.

So I'm not sure why it's not being included in the metadata.
Maybe because the external network is not defined there?

bandini --- any thoughts?

--- Additional comment from Michele Baldessari on 2020-08-19 07:01:58 UTC ---

If I recall correctly while I was playing with IPv6+TLS-E, I had noticed that by default freeipa accepts certificate requests only from hosts that have an IP address within any subnets where FreeIPA has an IP configured and will actively deny other requests (also DNS requests). I wonder if that could be related as well.

Marian do you have an env with this issue somewhere Ade and I can poke at?

--- Additional comment from Marian Krcmarik on 2020-08-24 13:44:12 UTC ---

(In reply to Michele Baldessari from comment #4)
> If I recall correctly while I was playing with IPv6+TLS-E, I had noticed
> that by default freeipa accepts certificate requests only from hosts that
> have an IP address within any subnets where FreeIPA has an IP configured and
> will actively deny other requests (also DNS requests). I wonder if that
> could be related as well.
> 
> Marian do you have an env with this issue somewhere Ade and I can poke at?

I do have a setup, feel free to ping me once you have time

--- Additional comment from Marian Krcmarik on 2020-08-25 00:57:21 UTC ---

(In reply to Ade Lee from comment #3)
> That metadata seems to be defined in
> ./deployment/haproxy/haproxy-public-tls-certmonger.yaml,
> which is referenced in 
> /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-
> public-tls-certmonger.yaml, which is
> in the deploy script for DCN.
> 
> So not sure why its not being included in the metadata.
> Maybe coz external network not defined there?
> 
> bandini --- any thoughts?

If I specify the external network for the ComputeHCIScaleOut1 role, then the DCN stack deploys properly. It seems that the service metadata for the external haproxy service is (as you said) created here:
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-certmonger.yaml#L81-L84
and PublicNetwork is:
https://opendev.org/openstack/tripleo-heat-templates/src/commit/d58efb58e0c39b2ca1585d87fe6d542484b33ad0/network/service_net_map.j2.yaml#L80

So it is only created if the external network exists.

The question now is whether the external network should be added to the https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/roles/DistributedComputeHCI.yaml role, or whether the external haproxy service should not be created at all. I have no idea; maybe Alan knows?

--- Additional comment from Alan Bishop on 2020-08-25 21:05:00 UTC ---

At DCN (edge) sites, haproxy is only used on the internal_api network by the DistributedComputeScaleOut and DistributedComputeHCIScaleOut roles. That's so internal glance_api requests can be forwarded to the (internal) endpoints on the DistributedCompute (or DistributedComputeHCI) nodes.

I think the issue is the metadata_settings [1] specify "service: haproxy," but at the DCN site the service is named "haproxy_edge" [2]. The service must be named differently at the DCN site to avoid mixing it up with the "haproxy" service running in the control plane.

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-certmonger.yaml#L81-L84
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-edge-container-puppet.yaml#L80

But I'm not sure the answer is figuring out a way to create metadata_settings for the haproxy_edge service. Given what I stated above, I'm not sure why DCN sites need anything related to public TLS. I'd be curious to know if things work if you dropped these two env files from the DCN site's deployment command:

  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \

But, now I fear we'll end up with a similar problem with the internal TLS stuff at [3]

[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-internal-tls-certmonger.j2.yaml#L100-L105

I don't know how this stuff works, so I'll let you folks digest this info and see where it leads next. If something is truly necessary for haproxy then I think the key is understanding the service is actually named haproxy_edge at DCN sites.
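
The experiment suggested in this comment (dropping the two public-TLS environment files from the DCN deploy command) can be sketched as a small argument filter. This is only an illustration: `drop_env_files` is a hypothetical helper, and the argument list is a shortened subset of the deploy command from the description, not the full command.

```python
# The two environment files named in this comment.
PUBLIC_TLS_ENVS = {
    "/usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml",
    "/usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml",
}


def drop_env_files(argv, unwanted):
    """Return argv without any '-e <path>' pair whose path is in unwanted."""
    result = []
    i = 0
    while i < len(argv):
        if argv[i] == "-e" and i + 1 < len(argv) and argv[i + 1] in unwanted:
            i += 2  # skip both the flag and its path
            continue
        result.append(argv[i])
        i += 1
    return result


# Shortened subset of the deploy command from the description.
deploy_argv = [
    "openstack", "overcloud", "deploy",
    "--stack", "dcn1",
    "-e", "/usr/share/openstack-tripleo-heat-templates/environments/dcn-hci.yaml",
    "-e", "/usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml",
    "-e", "/usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml",
    "-e", "/usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml",
]

filtered = drop_env_files(deploy_argv, PUBLIC_TLS_ENVS)
print(" ".join(filtered))
```

The internal TLS environment file is left in place, since only the public TLS pieces are in question here.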

--- Additional comment from Marian Krcmarik on 2020-08-27 17:33:04 UTC ---

> But I'm not sure the answer is figuring out a way to create
> metadata_settings for the haproxy_edge service. Given what I stated above,
> I'm not sure why DCN sites need anything related to public TLS. I'd be
> curious to know if things work if you dropped these two env files from the
> DCN site's deployment command:
> 
>   -e
> /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-
> public-ip.yaml \
>   -e
> /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-
> public-tls-certmonger.yaml \

I was able to deploy successfully once I removed these two templates from the DCN deploy command line, which is a little surprising to me. However, I am now hitting another problem (I am not sure whether it is related to how the deployment is done, especially the things discussed here): creating any instance on the DCN site fails with the following error:

/var/log/containers/nova/nova-conductor.log:2020-08-27 02:30:01.722 21 WARNING nova.scheduler.utils [req-95f02425-49a0-4a92-8037-d9f4acf27b5f d0ed4b0b6cab45d98e76e4b3b061040d c4d8a45d49904b1c8c0f4115c3812e13 - default default] [instance: 00628f62-6048-4763-b35f-247bcea57804] Setting instance to ERROR state.: nova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 00628f62-6048-4763-b35f-247bcea57804. Last exception: SSL exception connecting to https://172.25.3.55:9292/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71: HTTPSConnectionPool(host='172.25.3.55', port=9292): Max retries exceeded with url: /v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71 (Caused by SSLError(CertificateError("hostname '172.25.3.55' doesn't match 'dcn2-computehci2-1.internalapi.redhat.local'",),))

dcn2-computehci2-1.internalapi.redhat.local resolves to 172.25.3.55 and vice versa, but the SSL certificate in use has dcn2-computehci2-1.internalapi.redhat.local as its CN, while the connection is made to the IP address 172.25.3.55. Should it instead try to connect to https://dcn2-computehci2-1.internalapi.redhat.local:9292/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71?
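
The mismatch in the error above can be illustrated with a deliberately simplified hostname check. This is a sketch, not how nova or its HTTP libraries implement verification: the function names are hypothetical, wildcards are ignored, and the certificate is assumed to carry only the DNS name reported in the error (no IP address entries). The point it shows is that a client connecting by IP address is compared against IP entries in the certificate, so DNS resolution in either direction does not help.

```python
import ipaddress
from urllib.parse import urlsplit


def host_matches_cert(host, dns_names, ip_addresses=()):
    """Simplified check: an IP host must match an IP entry in the cert,
    a DNS host must match a DNS name exactly (case-insensitive)."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so compare against the DNS names.
        return host.lower() in {name.lower() for name in dns_names}
    return ip in {ipaddress.ip_address(addr) for addr in ip_addresses}


def endpoint_is_verifiable(url, dns_names, ip_addresses=()):
    """Apply the simplified check to the host part of an endpoint URL."""
    return host_matches_cert(urlsplit(url).hostname, dns_names, ip_addresses)


# Name taken from the CertificateError in the log above.
cert_dns_names = ["dcn2-computehci2-1.internalapi.redhat.local"]
image_path = "/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71"

ip_url = "https://172.25.3.55:9292" + image_path
fqdn_url = "https://dcn2-computehci2-1.internalapi.redhat.local:9292" + image_path

print(endpoint_is_verifiable(ip_url, cert_dns_names))    # False
print(endpoint_is_verifiable(fqdn_url, cert_dns_names))  # True
```

Under these assumptions, only the FQDN form of the glance endpoint passes, which is consistent with the fix described in the Doc Text field of this bug.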

--- Additional comment from Alan Bishop on 2020-08-27 17:46:40 UTC ---

AFAIK (Ade should confirm) it should be using FQDN and not the IP address.

But what's "it" in this instance? Is it a scale-out node, which is accessing glance via haproxy?

Comment 7 errata-xmlrpc 2020-10-28 15:39:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284
