Bug 2072195
Summary: | machine api doesn't issue client cert when AWS DNS suffix missing | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Christoph Blecker <cblecker> |
Component: | Cloud Compute | Assignee: | Damiano Donati <ddonati> |
Cloud Compute sub component: | Other Providers | QA Contact: | Huali Liu <huliu> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | aarapov, ddonati, mimccune, zhsun |
Version: | 4.9 | Keywords: | ServiceDeliveryBlocker |
Target Milestone: | --- | ||
Target Release: | 4.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: The machine-api provider for AWS ignored an empty domain-name in the main VPC's custom DHCP Option Set and did not consider it while populating machine addresses.
Consequence: On AWS OCP clusters, setting an empty domain-name in the main VPC's custom DHCP Option Set broke CSR validation for nodes created after the change.
Fix: We now allow empty domain-names within the VPC's DHCP Option Set and consider them a valid setting.
Result: Users can now set empty domain-names within the VPC's DHCP Option Sets without experiencing issues in kubelet's CSR validation.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-08-10 11:03:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2083270 |
Description
Christoph Blecker
2022-04-05 18:51:31 UTC
i looked at the must-gather briefly, did the machine that corresponds to "10.0.169.161" get created with a different IP address? i do see 2 outstanding CSRs in the must-gather, but those are for a "10.0.160.184", which does appear to have a machine and node object. is this instance related? also, i'm curious if you are able to work around the current issue by manually approving the cert?

The machine and node objects come up fine, and the node bootstrap certificate gets approved. However the kubernetes.io/kubelet-serving doesn't, and without that one you can't do things like oc debug or oc logs. Manually approving the cert seems to work for a period of time, but then the CSRs keep accumulating when they try to renew, or the machine is replaced.

ack, thanks for the extra details, it helps. i'm still curious about the name, "ip-10-0-169-161.ec2.internal", from the logs. i would guess this comes from the hostname, but i don't see any manifests that match this ip address (or similar naming). any extra context around this machine?

Sorry, I've already destroyed the test cluster I used to get this must gather, but the reproducer steps are consistent (I've done this twice, in addition to the production cluster I originally discovered this in).

ok, no worries, we'll try to reproduce.

I was able to reproduce the behaviour by following these steps:
> It's also possible to reproduce this after cluster installation:
> 1. Create dhcp-options-set with above command
> (creating the DHCP option set with `aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'`)
> 2. Create default OCP IPI cluster, allowing the installer to create its own VPC
> 3. After installation, swap the dhcp-options-set for the VPC with the one above
> 4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
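The DHCP option sets used in this report differ only in whether a `domain-name` entry is present. A minimal sketch of assembling the JSON passed to `--dhcp-configurations`; the helper name `build_dhcp_configurations` is hypothetical and only mirrors the CLI arguments quoted in this report:

```python
import json


def build_dhcp_configurations(domain_name=None):
    """Return the list passed to `aws ec2 create-dhcp-options --dhcp-configurations`.

    Hypothetical helper for illustration; it only assembles the JSON payload.
    """
    configs = [{"Key": "domain-name-servers", "Values": ["AmazonProvidedDNS"]}]
    if domain_name:
        # Add the suffix only when one is wanted; leaving it out is the
        # configuration that triggers this bug.
        configs.append({"Key": "domain-name", "Values": [domain_name]})
    return configs


if __name__ == "__main__":
    # Option set without a domain-name (step 1 above).
    print(json.dumps(build_dhcp_configurations(), separators=(",", ":")))
    # Option set with an explicit domain-name.
    print(json.dumps(build_dhcp_configurations("eu-central-1.compute.internal"),
                     separators=(",", ":")))
```

The printed string can be pasted as the single-quoted argument to `--dhcp-configurations`.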
And I got a similar error in the `machine-approver-controller`:
```
$ oc -n openshift-cluster-machine-approver logs -f machine-approver-6f4f5f79bc-7tsn4 -c machine-approver-controller
I0425 16:07:47.887513 1 controller.go:114] Reconciling CSR: csr-9b4xj
I0425 16:07:47.902107 1 csr_check.go:156] csr-9b4xj: CSR does not appear to be client csr
I0425 16:07:47.905825 1 csr_check.go:542] retrieving serving cert from ip-10-0-214-139.eu-central-1.compute.internal (10.0.214.139:10250)
I0425 16:07:47.907864 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0425 16:07:47.907954 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-214-139.eu-central-1.compute.internal
E0425 16:07:47.907992 1 csr_check.go:392] csr-9b4xj: DNS name 'ip-10-0-214-139' not in machine names: ip-10-0-214-139.eu-central-1.compute.internal ip-10-0-214-139.eu-central-1.compute.internal
I0425 16:07:47.957793 1 controller.go:199] csr-9b4xj: CSR not authorized
```
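The `csr_check.go:392` error says that a DNS name requested in the CSR is missing from the names recorded on the Machine. A simplified sketch of that kind of membership check; it is an illustration built from the log line above, not the machine-approver's actual code, and the function name `unauthorized_dns_names` is hypothetical:

```python
def unauthorized_dns_names(csr_dns_names, machine_names):
    """Return the CSR DNS names that do not appear among the machine's names."""
    known = set(machine_names)
    return [name for name in csr_dns_names if name not in known]


# Values taken from the log above: the CSR asks for the short hostname,
# while the Machine only lists the fully qualified name (twice).
csr_names = ["ip-10-0-214-139"]
machine_names = [
    "ip-10-0-214-139.eu-central-1.compute.internal",
    "ip-10-0-214-139.eu-central-1.compute.internal",
]

missing = unauthorized_dns_names(csr_names, machine_names)
if missing:
    print(f"CSR not authorized, unknown DNS names: {missing}")
```

Because the short hostname never matches a fully qualified entry, the request is rejected even though both names refer to the same instance.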
I then tried the same reproduction steps again but slightly changed step 1., where I've instead used
`aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]},{"Key":"domain-name","Values":["eu-central-1.compute.internal"]}]'`.
This change allowed the machine-approver-controller to find a matching hostname and validate the CSR:
```
I0425 16:10:32.667447 1 controller.go:114] Reconciling CSR: csr-4thph
I0425 16:10:32.714966 1 csr_check.go:156] csr-4thph: CSR does not appear to be client csr
I0425 16:10:32.747043 1 csr_check.go:542] retrieving serving cert from ip-10-0-163-134.eu-central-1.compute.internal (10.0.163.134:10250)
I0425 16:10:32.776643 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0425 16:10:32.777685 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-163-134.eu-central-1.compute.internal
I0425 16:10:32.820556 1 controller.go:206] CSR csr-4thph approved
```
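Taken together, the two runs suggest that the hostname requested in the CSR depends on the DHCP option set's `domain-name`, so the names recorded on the Machine need to cover the no-suffix case as well. A sketch under that assumption; the function `candidate_internal_dns_names` and its logic are hypothetical and do not reproduce the provider's actual fix:

```python
def candidate_internal_dns_names(private_dns_name, dhcp_domain_name):
    """Derive the DNS names a node might present, given the VPC's DHCP domain-name.

    Hypothetical illustration: an empty or missing domain-name yields the
    bare hostname as an additional valid name.
    """
    hostname = private_dns_name.split(".", 1)[0]
    names = [private_dns_name]
    if dhcp_domain_name:
        candidate = f"{hostname}.{dhcp_domain_name}"
    else:
        candidate = hostname
    if candidate not in names:
        names.append(candidate)
    return names


if __name__ == "__main__":
    fqdn = "ip-10-0-214-139.eu-central-1.compute.internal"
    print(candidate_internal_dns_names(fqdn, ""))
    print(candidate_internal_dns_names(fqdn, "eu-central-1.compute.internal"))
```

With an empty domain-name the short hostname joins the list, which is the name the failing CSR asked for.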
@cblecker Is there a particular reason for keeping the domain name empty? Thanks.
@ddonati Not other than the fact that not having a domain name suffix is a valid configuration (for example, a host of "ip-10-0-214-139")

This is being investigated between the Cluster Infrastructure and ShiftStack teams, we hope to update you later once we have made a bit more progress

We think we have a working fix for this and we are testing it. We’ll update on the progress of this as soon as the fix is confirmed.

Reproduce the issue on 4.9.25

Steps:
1. Create dhcp-options-set
liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS dopt-0c9dfcde919f49105 301721915996
DHCPCONFIGURATIONS domain-name-servers
VALUES AmazonProvidedDNS
2. Create default OCP IPI cluster, allowing the installer to create its own VPC
3. After installation, swap the dhcp-options-set for the VPC with the one above
4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws49a-l5tqf-worker-us-east-2c-rnrwg
machine.machine.openshift.io "huliu-aws49a-l5tqf-worker-us-east-2c-rnrwg" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
huliu-aws49a-l5tqf-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   3h54m
huliu-aws49a-l5tqf-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   3h54m
huliu-aws49a-l5tqf-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   3h54m
huliu-aws49a-l5tqf-worker-us-east-2a-rrm2l   Running   m5.large    us-east-2   us-east-2a   3h52m
huliu-aws49a-l5tqf-worker-us-east-2b-svst7   Running   m5.large    us-east-2   us-east-2b   3h52m
huliu-aws49a-l5tqf-worker-us-east-2c-8rzmk   Running   m5.large    us-east-2   us-east-2c   5m45s
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-148-98.us-east-2.compute.internal    Ready    worker   3h48m   v1.22.5+5c84e52
ip-10-0-154-23.us-east-2.compute.internal    Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-187-34.us-east-2.compute.internal    Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-190-51.us-east-2.compute.internal    Ready    worker   3h48m   v1.22.5+5c84e52
ip-10-0-208-148.us-east-2.compute.internal   Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-222-232.us-east-2.compute.internal   Ready    worker   3m1s    v1.22.5+5c84e52
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6f4f5f79bc-9qczn   2/2     Running   0          3h55m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-6f4f5f79bc-9qczn -c machine-approver-controller
...
I0511 06:35:20.259860 1 controller.go:114] Reconciling CSR: csr-466wh
I0511 06:35:20.270565 1 csr_check.go:156] csr-466wh: CSR does not appear to be client csr
I0511 06:35:20.274916 1 csr_check.go:542] retrieving serving cert from ip-10-0-222-232.us-east-2.compute.internal (10.0.222.232:10250)
I0511 06:35:20.277286 1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0511 06:35:20.277303 1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-222-232.us-east-2.compute.internal
E0511 06:35:20.277312 1 csr_check.go:392] csr-466wh: DNS name 'ip-10-0-222-232' not in machine names: ip-10-0-222-232.us-east-2.compute.internal ip-10-0-222-232.us-east-2.compute.internal
I0511 06:35:20.277319 1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-10-0-222-232' not in machine names: ip-10-0-222-232.us-east-2.compute.internal ip-10-0-222-232.us-east-2.compute.internal
I0511 06:35:20.284132 1 controller.go:199] csr-466wh: CSR not authorized

Verified on 4.11.0-0.nightly-2022-05-10-045003

Steps:
1. Create dhcp-options-set
liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS dopt-0c9dfcde919f49105 301721915996
DHCPCONFIGURATIONS domain-name-servers
VALUES AmazonProvidedDNS
2. Create default OCP IPI cluster, allowing the installer to create its own VPC
3. After installation, swap the dhcp-options-set for the VPC with the one above
4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws411d-clw88-worker-us-east-2c-ggdqn
machine.machine.openshift.io "huliu-aws411d-clw88-worker-us-east-2c-ggdqn" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE         REGION      ZONE         AGE
huliu-aws411d-clw88-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   3h38m
huliu-aws411d-clw88-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   3h38m
huliu-aws411d-clw88-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   3h39m
huliu-aws411d-clw88-worker-us-east-2a-lsshh   Running   m6i.large    us-east-2   us-east-2a   3h36m
huliu-aws411d-clw88-worker-us-east-2b-zqjnh   Running   m6i.large    us-east-2   us-east-2b   3h36m
huliu-aws411d-clw88-worker-us-east-2c-vrtkz   Running   m6i.large    us-east-2   us-east-2c   8m28s
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-135-122.us-east-2.compute.internal   Ready    master   3h38m   v1.23.3+69213f8
ip-10-0-136-113.us-east-2.compute.internal   Ready    worker   3h34m   v1.23.3+69213f8
ip-10-0-171-251.us-east-2.compute.internal   Ready    master   3h38m   v1.23.3+69213f8
ip-10-0-177-8.us-east-2.compute.internal     Ready    worker   3h34m   v1.23.3+69213f8
ip-10-0-207-202.us-east-2.compute.internal   Ready    worker   5m13s   v1.23.3+69213f8
ip-10-0-216-30.us-east-2.compute.internal    Ready    master   3h39m   v1.23.3+69213f8
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6d975cff6f-nhc4t   2/2     Running   0          3h39m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-6d975cff6f-nhc4t -c machine-approver-controller
...
I0511 06:45:16.284945 1 controller.go:121] Reconciling CSR: csr-8jffv
I0511 06:45:16.306608 1 csr_check.go:157] csr-8jffv: CSR does not appear to be client csr
I0511 06:45:16.308250 1 csr_check.go:545] retrieving serving cert from ip-10-0-207-202.us-east-2.compute.internal (10.0.207.202:10250)
I0511 06:45:16.310061 1 csr_check.go:182] Failed to retrieve current serving cert: remote error: tls: internal error
I0511 06:45:16.310076 1 csr_check.go:202] Falling back to machine-api authorization for ip-10-0-207-202.us-east-2.compute.internal
I0511 06:45:16.314493 1 controller.go:240] CSR csr-8jffv approved

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069