Description of problem:
We deployed a cluster into an existing AWS VPC (eu-central-1). The VPC has enableDnsSupport and enableDnsHostnames enabled, and its DHCP options are set to domain-name = aws.example.com; domain-name-servers = AmazonProvidedDNS. After the deployment is ready, the CSRs are not approved by the machine-approver.

Version-Release number of selected component (if applicable):
4.4 rc.6

How reproducible:
After the deployment is ready, check the CSRs.

Steps to Reproduce:
1. Create a VPC with all requirements: https://docs.openshift.com/container-platform/4.3/installing/installing_aws/installing-aws-vpc.html#installation-custom-aws-vpc-requirements_installing-aws-vpc
2. Enable the enableDnsSupport and enableDnsHostnames options for the VPC
3. Set the DHCP options to domain-name = aws.example.com; domain-name-servers = AmazonProvidedDNS
4. Create a Route53 private zone aws.example.com and attach it to the VPC
5. Deploy the cluster into the existing VPC

Actual results:
CSRs are pending

Expected results:
CSRs are approved

Additional info:
We tracked down the issue to https://github.com/openshift/cluster-api-provider-aws/blob/release-4.4/pkg/actuators/machine/utils.go#L404-L408
The EC2 instance PrivateDNS points to ip-xx-xx-xx-xx.eu-central-1.compute.internal, but the kubelet reads the hostname from the metadata service (http://169.254.169.254/latest/meta-data/hostname), which results in ip-xx-xx-xx-xx.eu-central-1.aws.example.com. The problem is that the Machine object has different addresses than the Node object, and this causes the machine-approver to reject the CSR.
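For illustration, here is a minimal Go sketch of the divergence described above. It only runs from inside an EC2 instance (the 169.254.169.254 metadata endpoint is instance-local); the printed hostname is what the kubelet effectively picks up, while the machine controller records the PrivateDnsName reported by the EC2 API instead.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// The kubelet derives its hostname from the instance metadata service.
	// With a custom domain-name in the VPC DHCP options this returns e.g.
	// ip-xx-xx-xx-xx.eu-central-1.aws.example.com ...
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/hostname")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	hostname, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("metadata hostname: %s\n", hostname)
	// ... whereas the Machine's addresses are built from the PrivateDnsName
	// that the EC2 DescribeInstances API returns, i.e.
	// ip-xx-xx-xx-xx.eu-central-1.compute.internal. Hence the mismatch.
}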
Hi, could we please get must-gather logs?
Setting target release to our active development branch (4.5) for debugging. For any fixes that are requested/required to be backported, additional bugs (clones) will be created targeting those versions.
Also, can we please get more context on the particular scenario and the need for using a custom domain and DHCP config?

This is the machine-approver check that is failing for the serving CSR: https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L194-L216, e.g. "Error syncing csr csr-6wqxx: DNS name 'ip-10-0-131-9.agl.com' not in machine names: ip-10-0-131-9.ec2.internal ip-10-0-131-9.ec2.internal". This happens because the kubelet serving CSR (https://github.com/kubernetes/kubernetes/blob/03b7f272c8287fdaafa67b82f1c577a96c5a238a/pkg/kubelet/certificate/kubelet.go#L90-L100) includes DNS/IP SANs for whatever values the host knows for itself, i.e. what the AWS cloud provider sets here: https://github.com/kubernetes/kubernetes/blob/6b13befdfb3e30862dc86fd1c7739f58901f0bae/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L1400-L1479

We could reuse the logic from that cloud-provider code in the AWS machine controller to infer the custom domain from the metadata.

With my setup I'm not seeing the node coming up with the custom domain name. If that were the case, this check for the client CSR would fail as well: https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L269, since the node name is used for the subject common name in the CSR: https://github.com/kubernetes/kubernetes/blob/03b7f272c8287fdaafa67b82f1c577a96c5a238a/pkg/kubelet/certificate/bootstrap/bootstrap.go#L318

We'll need must-gather logs to validate all of the above and see how best to proceed.
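For illustration, a rough Go sketch (not the actual cluster-machine-approver code) of what the failing serving-CSR check boils down to: every DNS SAN in the CSR must match one of the addresses recorded on the corresponding Machine, otherwise the CSR stays pending. The error format mirrors the message quoted above.

package main

import "fmt"

// dnsNamesMatchMachine approximates the serving-CSR check linked above:
// each DNS SAN in the CSR must be among the machine's recorded addresses.
func dnsNamesMatchMachine(csrDNSNames, machineAddresses []string) error {
	known := make(map[string]bool)
	for _, addr := range machineAddresses {
		known[addr] = true
	}
	for _, name := range csrDNSNames {
		if !known[name] {
			return fmt.Errorf("DNS name '%s' not in machine names: %v", name, machineAddresses)
		}
	}
	return nil
}

func main() {
	// Reproduces the failure mode from the logs: the kubelet requests a SAN
	// carrying the custom domain, but the Machine only lists the
	// compute.internal addresses.
	err := dnsNamesMatchMachine(
		[]string{"ip-10-0-131-9.agl.com"},
		[]string{"ip-10-0-131-9.ec2.internal", "ip-10-0-131-9.ec2.internal"},
	)
	fmt.Println(err)
}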
Hi Alberto,

We have a private link to AWS and a VPC that is shared company-internally. There is a custom domain like aws.example.com shared internally and attached to the VPC to be able to resolve the EC2 names. Our AWS team followed the best practice from https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-support

---snip---
If you use custom DNS domain names defined in a private hosted zone in Amazon Route 53, or use private DNS with interface VPC endpoints (AWS PrivateLink), you must set the enableDnsHostnames and enableDnsSupport attributes to true.
---snip---

I will deploy the cluster again and get the must-gather logs for you.
Hi Alberto,

I attached the must-gather logs to case #02626129 as I don't want to disclose them publicly. To be able to run oc adm must-gather I had to approve all pending CSRs because of this error:

[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e4b36f70fc62ad1d82e4c56d84922b21b0c15f96ab899b553f64588fd5048f3
[must-gather      ] OUT namespace/openshift-must-gather-5dwld created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-zxqv7 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e4b36f70fc62ad1d82e4c56d84922b21b0c15f96ab899b553f64588fd5048f3 created
[must-gather-nchnl] OUT gather logs unavailable: Get https://10.175.6.106:10250/containerLogs/openshift-must-gather-5dwld/must-gather-nchnl/gather?follow=true: remote error: tls: internal error
[must-gather-nchnl] OUT waiting for gather to complete
[must-gather-nchnl] OUT downloading gather output
WARNING: cannot use rsync: rsync not available in container
WARNING: cannot use tar: tar not available in container
[must-gather-nchnl] OUT gather output not downloaded: No available strategies to copy.
[must-gather-nchnl] OUT
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-zxqv7 deleted
[must-gather      ] OUT namespace/openshift-must-gather-5dwld deleted
error: unable to download output from pod must-gather-nchnl: No available strategies to copy.

Before approving the CSRs this was the output:

$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-22vp6   22m     system:node:ip-10-175-5-159.eu-central-1.compute.internal                   Pending
csr-2t4zf   27m     system:node:ip-10-175-6-190.eu-central-1.compute.internal                   Pending
csr-5f75h   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5j2fx   17m     system:node:ip-10-175-5-35.eu-central-1.compute.internal                    Pending
csr-5zrn9   27m     system:node:ip-10-175-5-206.eu-central-1.compute.internal                   Pending
csr-8bvb9   37m     system:node:ip-10-175-5-35.eu-central-1.compute.internal                    Approved,Issued
csr-8tbpt   21m     system:node:ip-10-175-6-106.eu-central-1.compute.internal                   Pending
csr-9llb2   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-b5csl   21m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bj6d8   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bxtdj   22m     system:node:ip-10-175-4-46.eu-central-1.compute.internal                    Pending
csr-dbd9s   22m     system:node:ip-10-175-4-58.eu-central-1.compute.internal                    Pending
csr-dfbvb   37m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-f9bhn   7m1s    system:node:ip-10-175-5-159.eu-central-1.compute.internal                   Pending
csr-gdqc9   21m     system:node:ip-10-175-5-158.eu-central-1.compute.internal                   Pending
csr-h6bbb   7m5s    system:node:ip-10-175-4-46.eu-central-1.compute.internal                    Pending
csr-j4vxn   7m44s   system:node:ip-10-175-4-58.eu-central-1.compute.internal                    Pending
csr-j99p7   37m     system:node:ip-10-175-4-100.eu-central-1.compute.internal                   Approved,Issued
csr-jmg98   6m48s   system:node:ip-10-175-6-106.eu-central-1.compute.internal                   Pending
csr-n2hmt   26m     system:node:ip-10-175-4-8.eu-central-1.compute.internal                     Pending
csr-qbkz5   37m     system:node:ip-10-175-6-16.eu-central-1.compute.internal                    Approved,Issued
csr-qh985   37m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qhm2w   21m     system:node:ip-10-175-6-62.eu-central-1.compute.internal                    Pending
csr-r4pgx   17m     system:node:ip-10-175-6-16.eu-central-1.compute.internal                    Pending
csr-rncp2   2m32s   system:node:ip-10-175-6-16.eu-central-1.compute.internal                    Pending
csr-scbdn   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-swjwn   6m23s   system:node:ip-10-175-6-62.eu-central-1.compute.internal                    Pending
csr-t2l4j   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-t7pvz   37m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vgg7n   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vvk2v   3m22s   system:node:ip-10-175-4-100.eu-central-1.compute.internal                   Pending
csr-wztrj   2m35s   system:node:ip-10-175-5-35.eu-central-1.compute.internal                    Pending
csr-xf5lr   18m     system:node:ip-10-175-4-100.eu-central-1.compute.internal                   Pending
csr-xnnwd   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-z68lc   6m56s   system:node:ip-10-175-5-158.eu-central-1.compute.internal                   Pending
csr-zmxkk   27m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

If you need anything else, please ping me.
Thanks a bunch, Florin. Could you please also share the machine-approver logs? I can't see those in the attached file; that might be an issue on the machine-approver/must-gather side.
Hi Alberto, I added machine-approver-6d9d8b8dd7-5tr78.log to the case #02626129
> We have a private link to AWS and a VPC that is shared company-internally. There is a custom domain like aws.example.com shared internally and attached to the VPC to be able to resolve the EC2 names.

So they use this VPC not only for running the OCP cluster, correct?

Thanks for the logs. They confirm the scenario I saw in https://bugzilla.redhat.com/show_bug.cgi?id=1822200#c4

We could reuse the logic in https://github.com/kubernetes/kubernetes/blob/6b13befdfb3e30862dc86fd1c7739f58901f0bae/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L1400-L1479 for the AWS machine controller, but that would force us to run the pod in the host network so it can reach the metadata endpoint. To avoid the host network I'd rather suggest inferring the domain information for the instance via the regular API and constructing the DNS names with the known format. I dropped some code showing how to get the domain info from the DHCP config: https://github.com/openshift/cluster-api-provider-aws/compare/master...enxebre:dev?expand=1

Ryan Phillips, does this sound ok to you from the node point of view? Alex Demicev, can you take it from here?
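For illustration, a minimal Go sketch of that approach, assuming aws-sdk-go v1; lookupVPCDomainName and the VPC ID are made up for the example, this is not the code in the compare branch above. It resolves the VPC's DHCP options set via the regular EC2 API and reads the configured domain-name, so no host network is needed.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// lookupVPCDomainName is a hypothetical helper: it follows the VPC to its
// DHCP options set and returns the custom domain-name, if one is configured.
func lookupVPCDomainName(client *ec2.EC2, vpcID string) (string, error) {
	vpcOut, err := client.DescribeVpcs(&ec2.DescribeVpcsInput{
		VpcIds: []*string{aws.String(vpcID)},
	})
	if err != nil || len(vpcOut.Vpcs) == 0 {
		return "", fmt.Errorf("error describing vpc %q: %v", vpcID, err)
	}
	dhcpOut, err := client.DescribeDhcpOptions(&ec2.DescribeDhcpOptionsInput{
		DhcpOptionsIds: []*string{vpcOut.Vpcs[0].DhcpOptionsId},
	})
	if err != nil || len(dhcpOut.DhcpOptions) == 0 {
		return "", fmt.Errorf("error describing dhcp: %v", err)
	}
	for _, cfg := range dhcpOut.DhcpOptions[0].DhcpConfigurations {
		if aws.StringValue(cfg.Key) == "domain-name" && len(cfg.Values) > 0 {
			// e.g. "aws.example.com" for the VPC described in this report.
			return aws.StringValue(cfg.Values[0].Value), nil
		}
	}
	return "", nil // no custom domain-name configured
}

func main() {
	sess := session.Must(session.NewSession())
	domain, err := lookupVPCDomainName(ec2.New(sess), "vpc-0123456789abcdef0") // placeholder VPC ID
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("custom domain:", domain)
}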
> So they use this VPC not only for running the OCP cluster, correct?

Yes.

I checked the cluster-machine-approver code and I don't really understand why the DNS names [1] of the CSR are checked at all, because the CommonName already matches the node name. Another possibility would be to extend the config [2] with a switch that would ignore/bypass the alternative DNS name check. But maybe I overlooked something...

[1] https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L194
[2] https://github.com/openshift/cluster-machine-approver/blob/master/config.go
Relaxing the DNS check as in [1] might be an option. We don't want to degrade UX or put a burden on the user, so I don't think we should consider extending the config as per [2].
(In reply to Alberto from comment #11)
> To avoid the host network I'd rather suggest inferring the domain
> information for the instance via the regular API and constructing the DNS
> names with the known format. I dropped some code showing how to get the
> domain info from the DHCP config:
> https://github.com/openshift/cluster-api-provider-aws/compare/master...enxebre:dev?expand=1

This seems sensible to me. When adding network addresses to the machine's status object, it should be trivial to split and join to the domain discovered by this code.
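A hypothetical illustration of that split-and-join (the helper name is made up): keep the host label from the PrivateDnsName that the EC2 API returns, and re-join it with the custom domain discovered from the DHCP options set to get the additional InternalDNS address.

package main

import (
	"fmt"
	"strings"
)

// customDNSName is a hypothetical helper: it swaps the default search domain
// of the API-reported private DNS name for the custom domain discovered from
// the VPC's DHCP options.
func customDNSName(privateDNSName, customDomain string) string {
	host := strings.SplitN(privateDNSName, ".", 2)[0] // e.g. "ip-10-0-137-178"
	return host + "." + customDomain
}

func main() {
	fmt.Println(customDNSName("ip-10-0-137-178.us-east-2.compute.internal", "example.com"))
	// Output: ip-10-0-137-178.example.com
}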
https://github.com/openshift/cluster-api-provider-aws/pull/318

Hitting a permission issue on AWS: "error describing dhcp: UnauthorizedOperation: You are not authorized to perform this operation." Presumably the credentials the machine controller runs with will also need to allow the ec2:DescribeDhcpOptions action.
We are actively working on this BZ. In the meantime, anyone impacted by this can always approve the CSRs manually to get their environment online. Obviously, scaling up new machines will be impacted and will require manual CSR approval for any additional machines. If they are trying to validate the overall environment, I would approve the CSRs and continue other validations until the code ships.
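For reference, a pending CSR can be approved by hand with "oc adm certificate approve <csr-name>", or everything currently listed can be approved in one go with something like "oc get csr -o name | xargs oc adm certificate approve".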
Thank you very much @mgugino, this is very helpful!

Verified in 4.5.0-0.nightly-2020-05-11-202959

$ oc describe csr csr-vkw66
Name:               csr-vkw66
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 12 May 2020 07:53:10 +0800
Requesting User:    system:node:ip-10-0-137-178.us-east-2.compute.internal
Status:             Approved,Issued
Subject:
  Common Name:    system:node:ip-10-0-137-178.us-east-2.compute.internal
  Serial Number:
  Organization:   system:nodes
Subject Alternative Names:
  DNS Names:     ip-10-0-137-178.example.com
  IP Addresses:  10.0.137.178
Events:  <none>

$ oc describe csr csr-wtjhq
Name:               csr-wtjhq
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 12 May 2020 07:52:56 +0800
Requesting User:    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper
Status:             Approved,Issued
Subject:
  Common Name:    system:node:ip-10-0-137-178.us-east-2.compute.internal
  Serial Number:
  Organization:   system:nodes
Events:  <none>

$ oc get machine jhou-mztdx-worker-us-east-2a-tsz95 -n openshift-machine-api -o yaml
status:
  addresses:
  - address: 10.0.137.178
    type: InternalIP
  - address: ip-10-0-137-178.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-137-178.us-east-2.compute.internal
    type: Hostname
  - address: ip-10-0-137-178.example.com
    type: InternalDNS
  lastUpdated: "2020-05-11T23:54:30Z"
  nodeRef:
    kind: Node
    name: ip-10-0-137-178.us-east-2.compute.internal
    uid: 43d5bb66-c8db-4e8e-bf3b-a42ed80ba01a
  phase: Running
This code has been merged into master and will be available in the 4.5 GA release. Due to large changes in the 4.5 code base relative to the 4.4 release, we are not targeting a 4.4 backport at this time.
We've re-opened the backport bug and PR. The PR to watch for 4.4 is https://github.com/openshift/cluster-api-provider-aws/pull/326
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409