Description of problem:
In a pre-provisioned AWS VPC configuration with a specific dhcp-options-set, nodes will provision into the cluster but will not be able to get a valid kubernetes.io/kubelet-serving certificate issued, due to a hostname mismatch.

Version-Release number of selected component (if applicable):
4.9.25

How reproducible:
Consistent

Steps to Reproduce:
1. Create dhcp-options-set with the following command:
   aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
2. Create VPC/Subnets for OCP cluster, and associate dhcp-options-set to the VPC
3. Create OCP IPI cluster with specified subnets

It's also possible to reproduce this after cluster installation:
1. Create dhcp-options-set with above command
2. Create default OCP IPI cluster, allowing the installer to create its own VPC
3. After installation, swap the dhcp-options-set for the VPC with the one above
4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine

Actual results:
Machine is able to join cluster, but the cluster-machine-approver doesn't approve the kubernetes.io/kubelet-serving certificate due to hostname mismatch.

I0405 18:49:19.415329 1 controller.go:114] Reconciling CSR: csr-pdk9f
I0405 18:49:19.429358 1 csr_check.go:156] csr-pdk9f: CSR does not appear to be client csr
I0405 18:49:19.433102 1 csr_check.go:542] retrieving serving cert from ip-10-0-169-161.ec2.internal (10.0.169.161:10250)
I0405 18:49:19.434310 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0405 18:49:19.434336 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-169-161.ec2.internal
E0405 18:49:19.434345 1 csr_check.go:392] csr-pdk9f: DNS name 'ip-10-0-169-161' not in machine names: ip-10-0-169-161.ec2.internal ip-10-0-169-161.ec2.internal
I0405 18:49:19.434350 1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-10-0-169-161' not in machine names: ip-10-0-169-161.ec2.internal ip-10-0-169-161.ec2.internal
I0405 18:49:19.437877 1 controller.go:199] csr-pdk9f: CSR not authorized

Expected results:
Machine joins cluster and gets a kubernetes.io/kubelet-serving certificate.

Additional info:
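To make the mismatch in the log above concrete, here is a minimal Python sketch of the kind of comparison that fails. It is not the cluster-machine-approver's actual code (that project is written in Go); the function name `dns_names_not_on_machine` is hypothetical, and the only behaviour assumed is what the log itself states: a DNS name in the CSR that is not among the Machine's names blocks authorization.

```python
def dns_names_not_on_machine(csr_dns_names, machine_names):
    """Return the CSR DNS names that are absent from the Machine's names.

    Hypothetical helper for illustration only. An empty result means
    every requested name matched one of the Machine's names.
    """
    known = set(machine_names)
    return [name for name in csr_dns_names if name not in known]


# Values taken from the log above: the CSR carries the short hostname,
# while the Machine only lists the fully qualified name (twice).
csr_dns_names = ["ip-10-0-169-161"]
machine_names = [
    "ip-10-0-169-161.ec2.internal",
    "ip-10-0-169-161.ec2.internal",
]

missing = dns_names_not_on_machine(csr_dns_names, machine_names)
print(missing)
```

With no domain-name in the dhcp-options-set, the name in the CSR has no suffix, so an exact comparison against the fully qualified names cannot succeed.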
i looked at the must-gather briefly, did the machine that corresponds to "10.0.169.161" get created with a different IP address? i do see 2 outstanding CSRs in the must-gather, but those are for a "10.0.160.184", which does appear to have a machine and node object. is this instance related? also, i'm curious if you are able to workaround the current issue by manually approving the cert?
The machine and node objects come up fine, and the node bootstrap certificate gets approved. However the kubernetes.io/kubelet-serving doesn't, and without that one you can't do things like oc debug or oc logs. Manually approving the cert seems to work for a period of time, but then the CSRs keep accumulating when they try to renew, or the machine is replaced.
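For anyone who needs to find the CSRs that keep accumulating, here is a rough Python sketch that filters the output of `oc get csr -o json` down to undecided kubelet-serving requests. The helper name `pending_serving_csrs` and the sample data are hypothetical; the field names (`spec.signerName`, `status.conditions`) are those of the Kubernetes CertificateSigningRequest API.

```python
import json

KUBELET_SERVING = "kubernetes.io/kubelet-serving"


def pending_serving_csrs(csr_list):
    """Return names of kubelet-serving CSRs with no Approved/Denied condition."""
    names = []
    for item in csr_list.get("items", []):
        if item.get("spec", {}).get("signerName") != KUBELET_SERVING:
            continue
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical, heavily trimmed sample of `oc get csr -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "csr-aaaaa"},
     "spec": {"signerName": "kubernetes.io/kubelet-serving"},
     "status": {}},
    {"metadata": {"name": "csr-bbbbb"},
     "spec": {"signerName": "kubernetes.io/kubelet-serving"},
     "status": {"conditions": [{"type": "Approved"}]}},
    {"metadata": {"name": "csr-ccccc"},
     "spec": {"signerName": "kubernetes.io/kube-apiserver-client-kubelet"},
     "status": {}}
  ]
}
""")

pending = pending_serving_csrs(sample)
print(pending)
```

Each returned name could then be passed to `oc adm certificate approve`, but as noted above that only helps until the next renewal or machine replacement.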
ack, thanks for the extra details it helps. i'm still curious about the name, "ip-10-0-169-161.ec2.internal", from the logs. i would guess this comes from the hostname, but i don't see any manifests that match this ip address (or similar naming). any extra context around this machine?
Sorry, I've already destroyed the test cluster I used to get this must gather, but the reproducer steps are consistent (I've done this twice, in addition to the production cluster I originally discovered this in).
ok, no worries, we'll try to reproduce.
I was able to reproduce the behaviour by following these steps:

> It's also possible to reproduce this after cluster installation:
> 1. Create dhcp-options-set with above command
>    (creating the DHCP option set with `aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'`)
> 2. Create default OCP IPI cluster, allowing the installer to create its own VPC
> 3. After installation, swap the dhcp-options-set for the VPC with the one above
> 4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine

And I got a similar error in the `machine-approver-controller`:

```
$ oc -n openshift-cluster-machine-approver logs -f machine-approver-6f4f5f79bc-7tsn4 -c machine-approver-controller
I0425 16:07:47.887513 1 controller.go:114] Reconciling CSR: csr-9b4xj
I0425 16:07:47.902107 1 csr_check.go:156] csr-9b4xj: CSR does not appear to be client csr
I0425 16:07:47.905825 1 csr_check.go:542] retrieving serving cert from ip-10-0-214-139.eu-central-1.compute.internal (10.0.214.139:10250)
I0425 16:07:47.907864 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0425 16:07:47.907954 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-214-139.eu-central-1.compute.internal
E0425 16:07:47.907992 1 csr_check.go:392] csr-9b4xj: DNS name 'ip-10-0-214-139' not in machine names: ip-10-0-214-139.eu-central-1.compute.internal ip-10-0-214-139.eu-central-1.compute.internal
I0425 16:07:47.957793 1 controller.go:199] csr-9b4xj: CSR not authorized
```

I then tried the same reproduction steps again but slightly changed step 1, where I instead used `aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]},{"Key":"domain-name","Values":["eu-central-1.compute.internal"]}]'`.

This change allowed the machine-approver-controller to find a matching hostname and validate the CSR:

```
I0425 16:10:32.667447 1 controller.go:114] Reconciling CSR: csr-4thph
I0425 16:10:32.714966 1 csr_check.go:156] csr-4thph: CSR does not appear to be client csr
I0425 16:10:32.747043 1 csr_check.go:542] retrieving serving cert from ip-10-0-163-134.eu-central-1.compute.internal (10.0.163.134:10250)
I0425 16:10:32.776643 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0425 16:10:32.777685 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-163-134.eu-central-1.compute.internal
I0425 16:10:32.820556 1 controller.go:206] CSR csr-4thph approved
```

@cblecker Is there a particular reason for keeping the domain name empty? Thanks.
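The `--dhcp-configurations` value used in the modified step 1 can be built per region. The sketch below is Python and only assembles the JSON string; the helper names `internal_domain` and `dhcp_configurations` are hypothetical. It assumes the AWS default private DNS domains: `ec2.internal` in us-east-1 and `<region>.compute.internal` elsewhere, which is consistent with the hostnames seen in the logs in this report.

```python
import json


def internal_domain(region):
    """Default EC2 private DNS domain for a region (assumed AWS defaults)."""
    if region == "us-east-1":
        return "ec2.internal"
    return f"{region}.compute.internal"


def dhcp_configurations(region):
    """DHCP options that keep AmazonProvidedDNS and also set a domain-name."""
    return [
        {"Key": "domain-name-servers", "Values": ["AmazonProvidedDNS"]},
        {"Key": "domain-name", "Values": [internal_domain(region)]},
    ]


# Compact separators reproduce the exact string used in the comment above.
payload = json.dumps(dhcp_configurations("eu-central-1"), separators=(",", ":"))
print(payload)
```

The printed string is what would be passed, single-quoted, to `aws ec2 create-dhcp-options --dhcp-configurations`.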
@ddonati No reason other than that not having a domain name suffix is a valid configuration (for example, a host of "ip-10-0-214-139").
This is being investigated by the Cluster Infrastructure and ShiftStack teams. We hope to update you once we have made a bit more progress.
We think we have a working fix for this and we are testing it. We’ll update on the progress of this as soon as the fix is confirmed.
Reproduced the issue on 4.9.25

Steps:
1. Create dhcp-options-set

liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS dopt-0c9dfcde919f49105 301721915996
DHCPCONFIGURATIONS domain-name-servers
VALUES AmazonProvidedDNS
liuhuali@Lius-MacBook-Pro huali-test %

2. Create default OCP IPI cluster, allowing the installer to create its own VPC
3. After installation, swap the dhcp-options-set for the VPC with the one above
4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine

liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws49a-l5tqf-worker-us-east-2c-rnrwg
machine.machine.openshift.io "huliu-aws49a-l5tqf-worker-us-east-2c-rnrwg" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
huliu-aws49a-l5tqf-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   3h54m
huliu-aws49a-l5tqf-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   3h54m
huliu-aws49a-l5tqf-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   3h54m
huliu-aws49a-l5tqf-worker-us-east-2a-rrm2l   Running   m5.large    us-east-2   us-east-2a   3h52m
huliu-aws49a-l5tqf-worker-us-east-2b-svst7   Running   m5.large    us-east-2   us-east-2b   3h52m
huliu-aws49a-l5tqf-worker-us-east-2c-8rzmk   Running   m5.large    us-east-2   us-east-2c   5m45s
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-148-98.us-east-2.compute.internal    Ready    worker   3h48m   v1.22.5+5c84e52
ip-10-0-154-23.us-east-2.compute.internal    Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-187-34.us-east-2.compute.internal    Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-190-51.us-east-2.compute.internal    Ready    worker   3h48m   v1.22.5+5c84e52
ip-10-0-208-148.us-east-2.compute.internal   Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-222-232.us-east-2.compute.internal   Ready    worker   3m1s    v1.22.5+5c84e52
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6f4f5f79bc-9qczn   2/2     Running   0          3h55m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-6f4f5f79bc-9qczn -c machine-approver-controller
...
I0511 06:35:20.259860 1 controller.go:114] Reconciling CSR: csr-466wh
I0511 06:35:20.270565 1 csr_check.go:156] csr-466wh: CSR does not appear to be client csr
I0511 06:35:20.274916 1 csr_check.go:542] retrieving serving cert from ip-10-0-222-232.us-east-2.compute.internal (10.0.222.232:10250)
I0511 06:35:20.277286 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0511 06:35:20.277303 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-222-232.us-east-2.compute.internal
E0511 06:35:20.277312 1 csr_check.go:392] csr-466wh: DNS name 'ip-10-0-222-232' not in machine names: ip-10-0-222-232.us-east-2.compute.internal ip-10-0-222-232.us-east-2.compute.internal
I0511 06:35:20.277319 1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-10-0-222-232' not in machine names: ip-10-0-222-232.us-east-2.compute.internal ip-10-0-222-232.us-east-2.compute.internal
I0511 06:35:20.284132 1 controller.go:199] csr-466wh: CSR not authorized

Verified on 4.11.0-0.nightly-2022-05-10-045003

Steps:
1. Create dhcp-options-set

liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS dopt-0c9dfcde919f49105 301721915996
DHCPCONFIGURATIONS domain-name-servers
VALUES AmazonProvidedDNS
liuhuali@Lius-MacBook-Pro huali-test %

2. Create default OCP IPI cluster, allowing the installer to create its own VPC
3. After installation, swap the dhcp-options-set for the VPC with the one above
4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine

liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws411d-clw88-worker-us-east-2c-ggdqn
machine.machine.openshift.io "huliu-aws411d-clw88-worker-us-east-2c-ggdqn" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE         REGION      ZONE         AGE
huliu-aws411d-clw88-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   3h38m
huliu-aws411d-clw88-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   3h38m
huliu-aws411d-clw88-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   3h39m
huliu-aws411d-clw88-worker-us-east-2a-lsshh   Running   m6i.large    us-east-2   us-east-2a   3h36m
huliu-aws411d-clw88-worker-us-east-2b-zqjnh   Running   m6i.large    us-east-2   us-east-2b   3h36m
huliu-aws411d-clw88-worker-us-east-2c-vrtkz   Running   m6i.large    us-east-2   us-east-2c   8m28s
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-135-122.us-east-2.compute.internal   Ready    master   3h38m   v1.23.3+69213f8
ip-10-0-136-113.us-east-2.compute.internal   Ready    worker   3h34m   v1.23.3+69213f8
ip-10-0-171-251.us-east-2.compute.internal   Ready    master   3h38m   v1.23.3+69213f8
ip-10-0-177-8.us-east-2.compute.internal     Ready    worker   3h34m   v1.23.3+69213f8
ip-10-0-207-202.us-east-2.compute.internal   Ready    worker   5m13s   v1.23.3+69213f8
ip-10-0-216-30.us-east-2.compute.internal    Ready    master   3h39m   v1.23.3+69213f8
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6d975cff6f-nhc4t   2/2     Running   0          3h39m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-6d975cff6f-nhc4t -c machine-approver-controller
...
I0511 06:45:16.284945 1 controller.go:121] Reconciling CSR: csr-8jffv
I0511 06:45:16.306608 1 csr_check.go:157] csr-8jffv: CSR does not appear to be client csr
I0511 06:45:16.308250 1 csr_check.go:545] retrieving serving cert from ip-10-0-207-202.us-east-2.compute.internal (10.0.207.202:10250)
I0511 06:45:16.310061 1 csr_check.go:182] Failed to retrieve current serving cert: remote error: tls: internal error
I0511 06:45:16.310076 1 csr_check.go:202] Falling back to machine-api authorization for ip-10-0-207-202.us-east-2.compute.internal
I0511 06:45:16.314493 1 controller.go:240] CSR csr-8jffv approved
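When repeating this verification, the approver log can be reduced to a per-CSR verdict with a small script. This Python sketch is only an illustration: the helper name `csr_outcomes` is hypothetical, and the two message formats it matches are copied from the log lines in this report, so they may differ in other versions.

```python
import re

# Patterns copied from the messages seen in this report (assumed stable).
APPROVED = re.compile(r"CSR (csr-\w+) approved")
NOT_AUTHORIZED = re.compile(r"(csr-\w+): CSR not authorized")


def csr_outcomes(log_text):
    """Map each CSR name to 'approved' or 'not authorized' (last verdict wins)."""
    outcomes = {}
    for line in log_text.splitlines():
        match = APPROVED.search(line)
        if match:
            outcomes[match.group(1)] = "approved"
            continue
        match = NOT_AUTHORIZED.search(line)
        if match:
            outcomes[match.group(1)] = "not authorized"
    return outcomes


# Final lines of the 4.9.25 and 4.11 nightly runs quoted above.
log_text = """\
I0511 06:35:20.259860 1 controller.go:114] Reconciling CSR: csr-466wh
I0511 06:35:20.284132 1 controller.go:199] csr-466wh: CSR not authorized
I0511 06:45:16.284945 1 controller.go:121] Reconciling CSR: csr-8jffv
I0511 06:45:16.314493 1 controller.go:240] CSR csr-8jffv approved
"""

outcomes = csr_outcomes(log_text)
print(outcomes)
```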
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069