Bug 1868986 - Docs: Do not use templates for Public TLS endpoints in deploy command line for EDGE site stacks
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: Roger Heslop
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks: 1873329
 
Reported: 2020-08-14 22:14 UTC by Marian Krcmarik
Modified: 2020-10-30 14:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1873329 (view as bug list)
Environment:
Last Closed: 2020-10-30 14:02:08 UTC
Target Upstream Version:
Embargoed:



Description Marian Krcmarik 2020-08-14 22:14:50 UTC
Updated description:
We should mention in the docs that if a multistack EDGE DCN deployment uses TLS-Everywhere, the DCN stacks should not be deployed with the following templates in the deploy command line:
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \
The public TLS endpoints do not need to be created at DCN sites, and including these templates actually causes the deployment to fail.
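
A minimal sketch of how a DCN deploy script could be checked for these two environment files before it is run. The function name `find_public_tls_envs` and the script path in the usage note are hypothetical; only the two template paths come from this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print any public-TLS environment files referenced
# in a deploy script. Only the two template names come from this bug report.
# Exit status is 0 when at least one is found, 1 when none is found.
find_public_tls_envs() {
    local script="$1"
    # -o prints only the matching part of each line; -E enables alternation.
    grep -oE 'environments/(ssl/tls-endpoints-public-ip|services/haproxy-public-tls-certmonger)\.yaml' "$script"
}
```

For example, `find_public_tls_envs /home/stack/dcn1/overcloud_deploy.sh && echo "remove these from the DCN deploy command"` (the script path is an assumption) would flag the offending `-e` arguments.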

Description of problem:
Deployment of DCN with Distributed multibackend storage and TLS-E (tripleo-ipa) fails to deploy DCN site on:
<LOG>
fatal: [dcn1-computehciscaleout1-0]: FAILED! => {"ansible_job_id": "552398882008.22056", "attempts": 11, "changed": true, "cmd": "set -o pipefail; puppet apply --debug --verbose --modulepath=/etc/puppet/modules:
/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --detailed-exitcodes --summarize --color=false   /var/lib/tripleo-config/puppet_step_config.pp 2>&1 | logger -s -t puppet-user", "delta": "0:00:35.08
2779", "end": "2020-08-14 08:25:17.356108", "failed_when_result": true, "finished": 1, "msg": "non-zero return code", "rc": 6, "start": "2020-08-14 08:24:42.273329", "stderr": "<13>Aug 14 08:24:42 puppet-user: D
ebug: Runtime environment: puppet_version=5.5.10, ruby_version=2.5.5, run_mode=user, default_encoding=UTF-8

...skipped log ...

<13>Aug 13 21:14:32 puppet-user: Debug: Issuing getcert command with args: [\"request\", \"-I\", \"haproxy-external-cert\", \"-f\", \"/etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt\", \"-c\", \"IPA\",
\
 \"-N\", \"CN=overcloud.redhat.local\", \"-K\", \"haproxy/overcloud.redhat.local\", \"-D\", \"overcloud.redhat.local\", \"-U\", \"id-kp-clientAuth\", \"-U\", \"id-kp-serverAuth\", \"-C\", \"/usr/bin/certmonger-h
aproxy-refresh.sh reload external\", \"-w\", \"-k\", \"/etc/pki/tls/private/haproxy/overcloud-haproxy-external.key\"]
<13>Aug 13 21:14:32 puppet-user: Debug: Executing: '/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=overcloud.redhat.local -K haproxy/o
vercloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-external.key
<13>Aug 13 21:14:33 puppet-user: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=over
cloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy
/overcloud-haproxy-external.key' returned 2: New signing request \"haproxy-external-cert\" added.
<13>Aug 13 21:14:33 puppet-user: Debug: Executing: '/usr/bin/getcert list -i haproxy-external-cert'
<13>Aug 13 21:14:34 puppet-user: Error: /Stage[main]/Tripleo::Profile::Base::Certmonger_user/Tripleo::Certmonger::Haproxy[haproxy-external]/Certmonger_certificate[haproxy-external-cert]: Could not evaluate: Could not get certificate: Server at https://site-freeipa-0.redhat.local/ipa/xml denied our request, giving up: 2100 (RPC failed at server.  Insufficient access: Insufficient 'write' privilege to the 'userCertificat
e' attribute of entry 'krbprincipalname=haproxy/overcloud.redhat.local,cn=services,cn=accounts,dc=redhat,dc=local'.).
</LOG>

The failing command on dcn1-computehciscaleout1-0 is:
/usr/bin/getcert request -I haproxy-external-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt -c IPA -N CN=overcloud.redhat.local -K haproxy/overcloud.redhat.local -D overcloud.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload external -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-external.key

with error:
Could not get certificate: Server at https://site-freeipa-0.redhat.local/ipa/xml denied our request, giving up: 2100 (RPC failed at server.  Insufficient access: Insufficient 'write' privilege to the 'userCertificate' attribute of entry 'krbprincipalname=haproxy/overcloud.redhat.local,cn=services,cn=accounts,dc=redhat,dc=local'.).

I searched for a solution and found that the certificate request succeeds if I run the following commands:
[heat-admin@dcn1-computehciscaleout1-0 ~]$ ipa service-add-host --hosts=dcn1-computehciscaleout1-0.redhat.local haproxy/overcloud.redhat.local
[heat-admin@dcn1-computehciscaleout1-0 ~]$ sudo ipa-getcert resubmit -i haproxy-external-cert
and then the certificate is issued successfully:
[heat-admin@dcn1-computehciscaleout1-0 ~]$ sudo /usr/bin/getcert list -i haproxy-external-cert
Number of certificates and requests being tracked: 11.
Request ID 'haproxy-external-cert':
	status: MONITORING
	stuck: no
	key pair storage: type=FILE,location='/etc/pki/tls/private/haproxy/overcloud-haproxy-external.key'
	certificate: type=FILE,location='/etc/pki/tls/certs/haproxy/overcloud-haproxy-external.crt'
	CA: IPA
	issuer: CN=Certificate Authority,O=REDHAT.LOCAL
	subject: CN=overcloud.redhat.local,O=REDHAT.LOCAL
	expires: 2022-08-14 22:47:09 UTC
	dns: overcloud.redhat.local
	principal name: haproxy/overcloud.redhat.local
	key usage: digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
	eku: id-kp-serverAuth,id-kp-clientAuth
	pre-save command: 
	post-save command: /usr/bin/certmonger-haproxy-refresh.sh reload external
	track: yes
	auto-renew: yes

I do not have a clear idea of how this type of topology is deployed or what the scale-out node is for, but this is the deploy command line:
openstack overcloud deploy \
--timeout 240 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack dcn1 \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /usr/share/openstack-tripleo-heat-templates/environments/dcn-hci.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/net-multiple-nics.yaml \
-e /home/stack/dcn1/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-n /home/stack/dcn1/network/network_data.yaml \
-r /home/stack/dcn1/roles/roles_data.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovs.yaml \
-e /home/stack/dcn1/network/network-environment.yaml \
-e /home/stack/dcn1/enable-tls.yaml \
-e /home/stack/dcn1/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/dcn1/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/dcn1/dcn1_ceph_keys.yaml \
-e /home/stack/dcn1/nodes_data.yaml \
-e /home/stack/dcn1/debug.yaml \
-e /home/stack/dcn1/docker-images.yaml \
-e /home/stack/dcn1/glance.yaml \
-e /home/stack/central_ceph_external.yaml \
-e /home/stack/central-export.yaml \
-e /home/stack/dcn1/config_heat.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml \
-e /home/stack/dcn1/cloud-names.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \
-e /home/stack/dcn1/ipaservices-baremetal-ansible.yaml \
--log-file dcn1_overcloud_deployment_22.log


More info about how to deploy such an environment:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/distributed_compute_node_and_storage_deployment/assembly_deploying-storage-at-the-edge

How reproducible:
Always

Steps to Reproduce:
1. Deploy a DCN topology with distributed multibackend storage on the DCN site and TLS-Everywhere deployed in tripleo-ipa mode.

Actual results:
DCN site fails to deploy

Expected results:
Successful deployment

Additional info:
I am submitting under ansible-tripleo-ipa only as a placeholder for triage; I have no idea which component to choose.
I will provide more info about logs and the environment in comments.

Comment 2 Ade Lee 2020-08-18 18:48:24 UTC
The workaround you found points to the fact that the relevant haproxy service was not added to IPA for the host.
This is supposed to be done when tripleo_ipa is invoked, based on the service_metadata for the role.

Looking at the group_vars for the node, I see the following:

group_vars/ComputeHCIScaleOut1:

service_metadata_settings:
  compact_service_etcd:
  - internalapi
  compact_service_haproxy:
  - ctlplane
  - storage
  - storagemgmt
  - internalapi
  compact_service_libvirt:
  - internalapi
  compact_service_libvirt-vnc:
  - internalapi
  compact_service_qemu:
  - internalapi
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.redhat.local
  managed_service_haproxyinternal_api: haproxy/overcloud.internalapi.redhat.local
  managed_service_haproxystorage: haproxy/overcloud.storage.redhat.local
  managed_service_haproxystorage_mgmt: haproxy/overcloud.storagemgmt.redhat.local

Compare this to the one for the controller at the central site:

central/Controller0:
service_metadata_settings:
  compact_service_HTTP:
  - ctlplane
  - external
  - storage
  - storagemgmt
  - internalapi
  compact_service_haproxy:
  - ctlplane
  - storage
  - storagemgmt
  - internalapi
  compact_service_libvirt-vnc:
  - internalapi
  compact_service_mysql:
  - internalapi
  compact_service_neutron:
  - internalapi
  compact_service_novnc-proxy:
  - internalapi
  compact_service_rabbitmq:
  - internalapi
  managed_service_haproxyctlplane: haproxy/overcloud.ctlplane.redhat.local
  managed_service_haproxyexternal: haproxy/overcloud.redhat.local
  managed_service_haproxyinternal_api: haproxy/overcloud.internalapi.redhat.local
  managed_service_haproxystorage: haproxy/overcloud.storage.redhat.local
  managed_service_haproxystorage_mgmt: haproxy/overcloud.storagemgmt.redhat.local
  managed_service_mysqlinternal_api: mysql/overcloud.internalapi.redhat.local

The important part that is missing in the service metadata for ComputeHCIScaleOut1 is:
managed_service_haproxyexternal: haproxy/overcloud.redhat.local

The addition of that metadata would result in the service being added.

We'd need to look to see why that is not being added.
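
The comparison above can be reproduced mechanically. This is a minimal sketch, assuming each role's group_vars has been saved to a file in the format shown; the function name `missing_managed_services` and the file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print managed_service_* entries that appear in the
# reference group_vars file ($1) but not in the other file ($2).
missing_managed_services() {
    reference="$1"
    other="$2"
    # Extract "managed_service_<name>: <principal>" pairs from the reference.
    grep -oE 'managed_service_[A-Za-z_]+: [^[:space:]]+' "$reference" |
    while IFS= read -r entry; do
        # Fixed-string search; report the entry if the other file lacks it.
        grep -qF -- "$entry" "$other" || printf '%s\n' "$entry"
    done
}
```

Run as `missing_managed_services central_Controller0.yaml dcn1_ComputeHCIScaleOut1.yaml` (hypothetical file names) against the two listings above, it should report the `managed_service_haproxyexternal` entry, along with `managed_service_mysqlinternal_api`, which is expected to be absent on a compute role.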

Comment 3 Ade Lee 2020-08-18 18:55:47 UTC
That metadata seems to be defined in ./deployment/haproxy/haproxy-public-tls-certmonger.yaml,
which is referenced in 
/usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml, which is
in the deploy script for DCN.

So not sure why its not being included in the metadata.
Maybe coz external network not defined there?

bandini --- any thoughts?

Comment 4 Michele Baldessari 2020-08-19 07:01:58 UTC
If I recall correctly while I was playing with IPv6+TLS-E, I had noticed that by default freeipa accepts certificate requests only from hosts that have an IP address within any subnets where FreeIPA has an IP configured and will actively deny other requests (also DNS requests). I wonder if that could be related as well.

Marian do you have an env with this issue somewhere Ade and I can poke at?

Comment 5 Marian Krcmarik 2020-08-24 13:44:12 UTC
(In reply to Michele Baldessari from comment #4)
> If I recall correctly while I was playing with IPv6+TLS-E, I had noticed
> that by default freeipa accepts certificate requests only from hosts that
> have an IP address within any subnets where FreeIPA has an IP configured and
> will actively deny other requests (also DNS requests). I wonder if that
> could be related as well.
> 
> Marian do you have an env with this issue somewhere Ade and I can poke at?

I do have a setup, feel free to ping me once you have time

Comment 6 Marian Krcmarik 2020-08-25 00:57:21 UTC
(In reply to Ade Lee from comment #3)
> That metadata seems to be defined in
> ./deployment/haproxy/haproxy-public-tls-certmonger.yaml,
> which is referenced in 
> /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-
> public-tls-certmonger.yaml, which is
> in the deploy script for DCN.
> 
> So not sure why its not being included in the metadata.
> Maybe coz external network not defined there?
> 
> bandini --- any thoughts?

If I specify the external network for the ComputeHCIScaleOut1 role, the DCN stack deploys properly. It seems that the service metadata for the external haproxy service is (as you said) created here:
https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-certmonger.yaml#L81-L84
and PublicNetwork is:
https://opendev.org/openstack/tripleo-heat-templates/src/commit/d58efb58e0c39b2ca1585d87fe6d542484b33ad0/network/service_net_map.j2.yaml#L80

So it is only created if the external network exists.

The question now is whether the external network should be added to the https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/roles/DistributedComputeHCI.yaml role, or whether the external haproxy service should not be created at all. I have no idea; maybe Alan knows?

Comment 7 Alan Bishop 2020-08-25 21:05:00 UTC
At DCN (edge) sites, haproxy is only used on the internal_api network by the DistributedComputeScaleOut and DistributedComputeHCIScaleOut roles. That's so internal glance_api requests can be forwarded to the (internal) endpoints on the DistributedCompute (or DistributedComputeHCI) nodes.

I think the issue is the metadata_settings [1] specify "service: haproxy," but at the DCN site the service is named "haproxy_edge" [2]. The service must be named differently at the DCN site to avoid mixing it up with the "haproxy" service running in the control plane.

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-certmonger.yaml#L81-L84
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-edge-container-puppet.yaml#L80

But I'm not sure the answer is figuring out a way to create metadata_settings for the haproxy_edge service. Given what I stated above, I'm not sure why DCN sites need anything related to public TLS. I'd be curious to know if things work if you dropped these two env files from the DCN site's deployment command:

  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml \

But, now I fear we'll end up with a similar problem with the internal TLS stuff at [3]

[3] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-internal-tls-certmonger.j2.yaml#L100-L105

I don't know how this stuff works, so I'll let you folks digest this info and see where it leads next. If something is truly necessary for haproxy then I think the key is understanding the service is actually named haproxy_edge at DCN sites.

Comment 8 Marian Krcmarik 2020-08-27 17:33:04 UTC
> But I'm not sure the answer is figuring out a way to create
> metadata_settings for the haproxy_edge service. Given what I stated above,
> I'm not sure why DCN sites need anything related to public TLS. I'd be
> curious to know of things work if you dropped these two env files from the
> DCN site's deployment command:
> 
>   -e
> /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-
> public-ip.yaml \
>   -e
> /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-
> public-tls-certmonger.yaml \

I was able to deploy successfully once I removed these two templates from the DCN deploy command line, which is a little surprising to me. However, I am hitting another problem (not sure whether it is related to how the deployment is done, especially the things discussed here): creating any instance on the DCN site fails with the following error:

/var/log/containers/nova/nova-conductor.log:2020-08-27 02:30:01.722 21 WARNING nova.scheduler.utils [req-95f02425-49a0-4a92-8037-d9f4acf27b5f d0ed4b0b6cab45d98e76e4b3b061040d c4d8a45d49904b1c8c0f4115c3812e13 - default default] [instance: 00628f62-6048-4763-b35f-247bcea57804] Setting instance to ERROR state.: nova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 00628f62-6048-4763-b35f-247bcea57804. Last exception: SSL exception connecting to https://172.25.3.55:9292/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71: HTTPSConnectionPool(host='172.25.3.55', port=9292): Max retries exceeded with url: /v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71 (Caused by SSLError(CertificateError("hostname '172.25.3.55' doesn't match 'dcn2-computehci2-1.internalapi.redhat.local'",),))

dcn2-computehci2-1.internalapi.redhat.local resolves to 172.25.3.55 and vice versa, but the SSL certificate in use has dcn2-computehci2-1.internalapi.redhat.local as its CN while the connection is made to the IP address 172.25.3.55. Should it instead connect to https://dcn2-computehci2-1.internalapi.redhat.local:9292/v2/images/cd91190f-92cf-40b5-bf78-30a30cd9ee71?

Comment 9 Alan Bishop 2020-08-27 17:46:40 UTC
AFAIK (Ade should confirm) it should be using FQDN and not the IP address.

But what's "it" in this instance? Is it a scale-out node, which is accessing glance via haproxy?

Comment 10 Marian Krcmarik 2020-08-27 21:35:26 UTC
Just for the record: the problem from comment #8 is a bug. The edge compute node uses the IP address instead of the FQDN of the glance endpoint at the EDGE site. I created a clone of this bug for that problem.

The problem with deploying DCN stacks is solved when the templates for public TLS endpoints are not used; if they are used, the deployment fails. I am changing this bug to a docs bug so that this is noted in our documentation.

