Bug 1822200 - CSRs are not approved on private AWS cluster deployment
Summary: CSRs are not approved on private AWS cluster deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Michael Gugino
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1833357 1833359 1833363
Reported: 2020-04-08 13:30 UTC by Florin Peter
Modified: 2021-06-03 03:16 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Support custom domain names in AWS VPC DHCP options. Reason: Needed for CSR approval of new nodes. Result: New nodes join the cluster appropriately.
Clone Of:
: 1833359 1833361
Environment:
Last Closed: 2020-07-13 17:26:16 UTC
Target Upstream Version:
agarcial: needinfo+


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 318 0 None closed Bug 1822200: Custom dns name support 2021-02-18 16:15:35 UTC
Github openshift machine-api-operator pull 572 0 None closed Bug 1822200: Add ec2:DescribeDhcpOptions permission request 2021-02-18 16:15:35 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:26:47 UTC

Description Florin Peter 2020-04-08 13:30:25 UTC
Description of problem:
We deployed a cluster into an existing AWS VPC (eu-central).
The VPC is enabled with enableDnsSupport, enableDnsHostnames and DHCP options are set to domain-name = aws.example.com; domain-name-servers = AmazonProvidedDNS;

After the deployment is ready, the CSRs are not approved by the machine-approver.

Version-Release number of selected component (if applicable):
4.4 rc.6

How reproducible:
After the deployment is ready check CSRs 

Steps to Reproduce:
1. Create VPC with all requirements https://docs.openshift.com/container-platform/4.3/installing/installing_aws/installing-aws-vpc.html#installation-custom-aws-vpc-requirements_installing-aws-vpc 
2. Enable options enableDnsSupport and enableDnsHostnames for the VPC
3. Set the DHCP options to domain-name = aws.example.com; domain-name-servers = AmazonProvidedDNS;
4. Create a Route53 private zone aws.example.com and attach it to the VPC
5. Deploy the cluster into the existing VPC 
 
Actual results:
CSRs are pending

Expected results:
CSRs are approved

Additional info:
We tracked down the issue to https://github.com/openshift/cluster-api-provider-aws/blob/release-4.4/pkg/actuators/machine/utils.go#L404-L408
The EC2 instance's PrivateDNS points to ip-xx-xx-xx-xx.eu-central-1.compute.internal, but the kubelet reads the hostname from the metadata service (http://169.254.169.254/latest/meta-data/hostname), which returns ip-xx-xx-xx-xx.eu-central-1.aws.example.com.
The problem is that the Machine object ends up with different addresses than the Node object, and this causes the machine approver to reject the CSR.

Comment 2 Alberto 2020-04-13 11:50:52 UTC
Hi, could we please get must gather logs?

Comment 3 Stephen Cuppett 2020-04-13 12:05:35 UTC
Setting target release to our active development branch (4.5) for debugging. If any fixes need to be backported, additional bugs (clones) will be created targeting those versions.

Comment 4 Alberto 2020-04-14 11:49:17 UTC
Also can we please get more context on the particular scenario and their needs for using a custom domain and dhcp config?

This is the check that is failing the machine approver for the serving CSR https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L194-L216
e.g. "Error syncing csr csr-6wqxx: DNS name 'ip-10-0-131-9.agl.com' not in machine names: ip-10-0-131-9.ec2.internal ip-10-0-131-9.ec2.internal".

This is because the kubelet serving CSR https://github.com/kubernetes/kubernetes/blob/03b7f272c8287fdaafa67b82f1c577a96c5a238a/pkg/kubelet/certificate/kubelet.go#L90-L100
fills its DNS/IP SANs with whatever values the host knows for itself, i.e. what the AWS cloud provider sets here https://github.com/kubernetes/kubernetes/blob/6b13befdfb3e30862dc86fd1c7739f58901f0bae/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L1400-L1479

We could reuse the logic in https://github.com/kubernetes/kubernetes/blob/6b13befdfb3e30862dc86fd1c7739f58901f0bae/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L1400-L1479 for the machine aws controller to infer the custom domain from the metadata.

With my setup I'm not seeing the node coming with the custom domain name. If that was the case this check for the client CSR would fail as well https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L269 as the nodename is used for the subject common name https://github.com/kubernetes/kubernetes/blob/03b7f272c8287fdaafa67b82f1c577a96c5a238a/pkg/kubelet/certificate/bootstrap/bootstrap.go#L318 in the CSR.

We'll need must-gather logs to validate all of the above and see how best to proceed.
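
The serving-CSR check described in this comment can be illustrated with a minimal Python sketch. It is not the machine approver's code (which is Go); the function name `unmatched_dns_sans`, the dictionary layout, and the set of address types treated as "machine names" are assumptions made for illustration. The sample values come from the error message quoted above.

```python
# Address types assumed to count as "machine names" when matching DNS SANs.
DNS_ADDRESS_TYPES = {"InternalDNS", "ExternalDNS", "Hostname"}


def unmatched_dns_sans(csr_dns_names, machine_addresses):
    """Return the CSR DNS SANs that match no DNS-type address on the Machine."""
    machine_names = {
        addr["address"]
        for addr in machine_addresses
        if addr["type"] in DNS_ADDRESS_TYPES
    }
    # Empty SAN entries are skipped; every other name must be a known machine name.
    return [name for name in csr_dns_names if name and name not in machine_names]


# Values taken from the error message quoted in this comment.
machine_addresses = [
    {"type": "InternalDNS", "address": "ip-10-0-131-9.ec2.internal"},
    {"type": "Hostname", "address": "ip-10-0-131-9.ec2.internal"},
]
csr_dns_names = ["ip-10-0-131-9.agl.com"]

rejected = unmatched_dns_sans(csr_dns_names, machine_addresses)
if rejected:
    print(f"DNS name(s) not in machine names: {rejected}")
```

In this sketch the CSR would only pass once the Machine status also carried an address with the custom-domain name.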

Comment 5 Florin Peter 2020-04-14 12:26:55 UTC
Hi Alberto,

We have a private link to AWS and a VPC that is shared internally within the company.

There is a custom domain like aws.example.com shared internally and attached to the VPC to be able to resolve the EC2 names.
Our AWS team followed the best practice from https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-support
---snip---
If you use custom DNS domain names defined in a private hosted zone in Amazon Route 53, or use private DNS with interface VPC endpoints (AWS PrivateLink), you must set the enableDnsHostnames and enableDnsSupport attributes to true.
---snip---...

I will deploy the cluster again and get the must gather logs for you.

Comment 6 Florin Peter 2020-04-14 14:00:52 UTC
Hi Alberto,

I attached the must-gather logs to case #02626129, as I don't want to disclose them publicly.
To be able to run oc adm must-gather I had to approve all pending CSRs because of this error:

[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e4b36f70fc62ad1d82e4c56d84922b21b0c15f96ab899b553f64588fd5048f3
[must-gather      ] OUT namespace/openshift-must-gather-5dwld created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-zxqv7 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e4b36f70fc62ad1d82e4c56d84922b21b0c15f96ab899b553f64588fd5048f3 created
[must-gather-nchnl] OUT gather logs unavailable: Get https://10.175.6.106:10250/containerLogs/openshift-must-gather-5dwld/must-gather-nchnl/gather?follow=true: remote error: tls: internal error
[must-gather-nchnl] OUT waiting for gather to complete
[must-gather-nchnl] OUT downloading gather output
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
[must-gather-nchnl] OUT gather output not downloaded: No available strategies to copy.
[must-gather-nchnl] OUT
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-zxqv7 deleted
[must-gather      ] OUT namespace/openshift-must-gather-5dwld deleted
error: unable to download output from pod must-gather-nchnl: No available strategies to copy.

Before approving the CSRs, this was the output:

$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-22vp6   22m     system:node:ip-10-175-5-159.eu-central-1.compute.internal                   Pending
csr-2t4zf   27m     system:node:ip-10-175-6-190.eu-central-1.compute.internal                   Pending
csr-5f75h   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5j2fx   17m     system:node:ip-10-175-5-35.eu-central-1.compute.internal                    Pending
csr-5zrn9   27m     system:node:ip-10-175-5-206.eu-central-1.compute.internal                   Pending
csr-8bvb9   37m     system:node:ip-10-175-5-35.eu-central-1.compute.internal                    Approved,Issued
csr-8tbpt   21m     system:node:ip-10-175-6-106.eu-central-1.compute.internal                   Pending
csr-9llb2   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-b5csl   21m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bj6d8   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bxtdj   22m     system:node:ip-10-175-4-46.eu-central-1.compute.internal                    Pending
csr-dbd9s   22m     system:node:ip-10-175-4-58.eu-central-1.compute.internal                    Pending
csr-dfbvb   37m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-f9bhn   7m1s    system:node:ip-10-175-5-159.eu-central-1.compute.internal                   Pending
csr-gdqc9   21m     system:node:ip-10-175-5-158.eu-central-1.compute.internal                   Pending
csr-h6bbb   7m5s    system:node:ip-10-175-4-46.eu-central-1.compute.internal                    Pending
csr-j4vxn   7m44s   system:node:ip-10-175-4-58.eu-central-1.compute.internal                    Pending
csr-j99p7   37m     system:node:ip-10-175-4-100.eu-central-1.compute.internal                   Approved,Issued
csr-jmg98   6m48s   system:node:ip-10-175-6-106.eu-central-1.compute.internal                   Pending
csr-n2hmt   26m     system:node:ip-10-175-4-8.eu-central-1.compute.internal                     Pending
csr-qbkz5   37m     system:node:ip-10-175-6-16.eu-central-1.compute.internal                    Approved,Issued
csr-qh985   37m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qhm2w   21m     system:node:ip-10-175-6-62.eu-central-1.compute.internal                    Pending
csr-r4pgx   17m     system:node:ip-10-175-6-16.eu-central-1.compute.internal                    Pending
csr-rncp2   2m32s   system:node:ip-10-175-6-16.eu-central-1.compute.internal                    Pending
csr-scbdn   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-swjwn   6m23s   system:node:ip-10-175-6-62.eu-central-1.compute.internal                    Pending
csr-t2l4j   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-t7pvz   37m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vgg7n   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vvk2v   3m22s   system:node:ip-10-175-4-100.eu-central-1.compute.internal                   Pending
csr-wztrj   2m35s   system:node:ip-10-175-5-35.eu-central-1.compute.internal                    Pending
csr-xf5lr   18m     system:node:ip-10-175-4-100.eu-central-1.compute.internal                   Pending
csr-xnnwd   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-z68lc   6m56s   system:node:ip-10-175-5-158.eu-central-1.compute.internal                   Pending
csr-zmxkk   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

If you need anything else, please ping me.

Comment 7 Alberto 2020-04-14 14:57:02 UTC
Thanks a bunch Florin. Could you please also share machine approver logs? I can't see those in the file attached, that might be an issue on machine approver/must gather side.

Comment 8 Florin Peter 2020-04-14 15:09:44 UTC
Hi Alberto,

I added machine-approver-6d9d8b8dd7-5tr78.log to case #02626129.

Comment 11 Alberto 2020-04-15 10:49:14 UTC
> We have a private link to AWS and a VPC that is shared internally within the company.
> There is a custom domain like aws.example.com shared internally and attached to the VPC to be able to resolve the EC2 names.

So they use this VPC not only for running the OCP cluster, correct?

Thanks for the logs. They confirm the scenario I saw in https://bugzilla.redhat.com/show_bug.cgi?id=1822200#c4

We could reuse the logic in https://github.com/kubernetes/kubernetes/blob/6b13befdfb3e30862dc86fd1c7739f58901f0bae/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L1400-L1479 for the aws machine controller but that would force us to run the pod in the host network so it can reach the metadata endpoint.

To avoid the host network, I'd rather infer the domain information for the instance using the regular API and construct the addresses in the known format. I pushed some code showing how to get the domain info from the DHCP config: https://github.com/openshift/cluster-api-provider-aws/compare/master...enxebre:dev?expand=1

Ryan Phillips, does this sound OK to you from the node point of view?
Alex Demicev, can you take it from here?

Comment 12 Florin Peter 2020-04-15 11:24:03 UTC
> So they use this vpc not only for running the OCP cluster, correct?
yes

I checked the cluster-machine-approver code and I don't really understand why the DNS names[1] of the CSR are checked, because the CommonName already matches the node name.
Another possibility would be to extend the config[2] with a switch that would ignore/bypass the alternative DNS name check.
But maybe I overlooked something...

[1] https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L194
[2] https://github.com/openshift/cluster-machine-approver/blob/master/config.go

Comment 13 Alberto 2020-04-15 12:28:24 UTC
Relaxing the DNS check as in [1] might be an option.
We don't want to degrade UX and put a burden on the user, so I don't think we should consider extending the config as per [2].

Comment 15 Michael Gugino 2020-04-21 00:30:12 UTC
(In reply to Alberto from comment #11)
> To avoid the host network, I'd rather infer the domain information for the
> instance using the regular API and construct the addresses in the known
> format. I pushed some code showing how to get the domain info from the
> DHCP config:
> https://github.com/openshift/cluster-api-provider-aws/compare/master...enxebre:dev?expand=1

This seems sensible to me.

When adding network addresses to the machine's status object, it should be trivial to split the private DNS name and join its host part to the domain discovered by this code.
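
As a rough illustration of the split-and-join idea, here is a minimal Python sketch. It is not the code from the linked branch or the eventual pull request (those are Go); the function name `custom_domain_addresses` is hypothetical, and it assumes the domain names from the VPC's DHCP options have already been fetched and passed in as a list. The sample values are the ones that appear in the verification later in this bug.

```python
def custom_domain_addresses(private_dns_name, domain_names):
    """Build extra InternalDNS addresses from the host label and custom domains.

    private_dns_name: the instance's private DNS name, for example
        "ip-10-0-137-178.us-east-2.compute.internal".
    domain_names: domain-name values from the VPC DHCP options
        (assumed to be fetched elsewhere).
    """
    if not private_dns_name:
        return []
    # Split: keep only the first label, e.g. "ip-10-0-137-178".
    host_label = private_dns_name.split(".", 1)[0]
    extra = []
    for domain in domain_names:
        # Join: host label plus the custom domain.
        candidate = f"{host_label}.{domain}"
        # Skip names that would only duplicate the default private DNS name.
        if candidate != private_dns_name:
            extra.append({"type": "InternalDNS", "address": candidate})
    return extra


addresses = custom_domain_addresses(
    "ip-10-0-137-178.us-east-2.compute.internal", ["example.com"]
)
print(addresses)
```

Adding these entries alongside the default InternalDNS and Hostname addresses gives the approver a machine name that matches the DNS SAN the kubelet requests.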

Comment 16 Michael Gugino 2020-04-24 12:10:13 UTC
https://github.com/openshift/cluster-api-provider-aws/pull/318

Hitting a permission issue on AWS: "error describing dhcp: UnauthorizedOperation: You are not authorized to perform this operation."

Comment 20 Michael Gugino 2020-04-29 14:45:45 UTC
We are actively working on this BZ. In the meantime, anyone impacted by this can approve the CSRs manually to get their environment online. Scaling up new machines will also be affected and will require manual CSR approval for each additional machine. If they are trying to validate the overall environment, I would approve the CSRs and continue other validations until the code ships.
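
For the manual workaround, the pending requests first have to be picked out of the `oc get csr` listing. The following Python sketch does only that selection; it assumes the tabular layout shown in comment 6 (a header row, the name in the first column, the condition in the last), and the function name `pending_csr_names` is hypothetical. Review each request before approving it.

```python
def pending_csr_names(oc_get_csr_output):
    """Return the names of CSRs whose last column (CONDITION) is exactly 'Pending'."""
    names = []
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        if fields[-1] == "Pending":
            names.append(fields[0])
    return names


# Sample rows copied from the listing in comment 6.
sample = """\
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-22vp6   22m     system:node:ip-10-175-5-159.eu-central-1.compute.internal                   Pending
csr-2t4zf   27m     system:node:ip-10-175-6-190.eu-central-1.compute.internal                   Pending
csr-5f75h   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
"""

pending = pending_csr_names(sample)
print(pending)
```

Each returned name can then be passed to `oc adm certificate approve <name>`.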

Comment 26 Jianwei Hou 2020-05-12 00:08:04 UTC
Thank you very much @mgugino, this is very helpful! Verified in 4.5.0-0.nightly-2020-05-11-202959

$ oc describe csr csr-vkw66
Name:               csr-vkw66
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 12 May 2020 07:53:10 +0800
Requesting User:    system:node:ip-10-0-137-178.us-east-2.compute.internal
Status:             Approved,Issued
Subject:
  Common Name:    system:node:ip-10-0-137-178.us-east-2.compute.internal
  Serial Number:
  Organization:   system:nodes
Subject Alternative Names:
         DNS Names:     ip-10-0-137-178.example.com
         IP Addresses:  10.0.137.178
Events:  <none>
$ oc describe csr csr-wtjhq
Name:               csr-wtjhq
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 12 May 2020 07:52:56 +0800
Requesting User:    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper
Status:             Approved,Issued
Subject:
         Common Name:    system:node:ip-10-0-137-178.us-east-2.compute.internal
         Serial Number:
         Organization:   system:nodes
Events:  <none>


$ oc get machine jhou-mztdx-worker-us-east-2a-tsz95 -n openshift-machine-api -o yaml

status:
  addresses:
  - address: 10.0.137.178
    type: InternalIP
  - address: ip-10-0-137-178.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-137-178.us-east-2.compute.internal
    type: Hostname
  - address: ip-10-0-137-178.example.com
    type: InternalDNS
  lastUpdated: "2020-05-11T23:54:30Z"
  nodeRef:
    kind: Node
    name: ip-10-0-137-178.us-east-2.compute.internal
    uid: 43d5bb66-c8db-4e8e-bf3b-a42ed80ba01a
  phase: Running

Comment 27 Michael Gugino 2020-05-13 13:54:18 UTC
This code has merged into master and will be available in the 4.5 GA release. Due to large changes in the 4.5 code base since the 4.4 release, we are not targeting a 4.4 backport at this time.

Comment 28 Michael Gugino 2020-05-27 20:33:12 UTC
We've re-opened the backport bug and PR.

The PR to watch for 4.4 is https://github.com/openshift/cluster-api-provider-aws/pull/326

Comment 31 errata-xmlrpc 2020-07-13 17:26:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

