2072195 – machine api doesn't issue client cert when AWS DNS suffix missing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Damiano Donati
QA Contact:	Huali Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2083270

Reported:	2022-04-05 18:51 UTC by Christoph Blecker
Modified:	2022-10-26 07:54 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The machine-api provider for AWS ignored an empty domain-name in the main VPC's custom DHCP Option Set and did not consider it while populating machine addresses. Consequence: On AWS OCP clusters, setting an empty domain-name in the main VPC's custom DHCP Option Set broke CSR validation for nodes created after the change. Fix: An empty domain-name within the VPC's DHCP Option Set is now allowed and treated as a valid setting. Result: Users can now set an empty domain-name within the VPC's DHCP Option Set without experiencing issues in the kubelet's CSR validation.
Clone Of:
Environment:
Last Closed:	2022-08-10 11:03:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-provider-aws pull 36	0	None	Draft	Bug 2072195: Custom DHCPOption Set: empty domain-name is a valid custom domain	2022-05-05 17:06:34 UTC

Description Christoph Blecker 2022-04-05 18:51:31 UTC

Description of problem:
In a pre-provisioned AWS VPC configuration with a specific dhcp-options-set, nodes will provision into the cluster, but will not be able to get a valid kubernetes.io/kubelet-serving certificate issued due to a hostname mismatch.


Version-Release number of selected component (if applicable):
4.9.25


How reproducible:
Consistent


Steps to Reproduce:
1. Create dhcp-options-set with the following command:
aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
2. Create VPC/Subnets for OCP cluster, and associate dhcp-options-set to the VPC
3. Create OCP IPI cluster with specified subnets

It's also possible to reproduce this after cluster installation:
1. Create dhcp-options-set with above command
2. Create default OCP IPI cluster, allowing the installer to create its own VPC
3. After installation, swap the dhcp-options-set for the VPC with the one above
4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine


Actual results:
The machine is able to join the cluster, but the cluster-machine-approver doesn't approve the kubernetes.io/kubelet-serving certificate due to a hostname mismatch.

I0405 18:49:19.415329       1 controller.go:114] Reconciling CSR: csr-pdk9f
I0405 18:49:19.429358       1 csr_check.go:156] csr-pdk9f: CSR does not appear to be client csr
I0405 18:49:19.433102       1 csr_check.go:542] retrieving serving cert from ip-10-0-169-161.ec2.internal (10.0.169.161:10250)
I0405 18:49:19.434310       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0405 18:49:19.434336       1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-169-161.ec2.internal
E0405 18:49:19.434345       1 csr_check.go:392] csr-pdk9f: DNS name 'ip-10-0-169-161' not in machine names: ip-10-0-169-161.ec2.internal ip-10-0-169-161.ec2.internal
I0405 18:49:19.434350       1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-10-0-169-161' not in machine names: ip-10-0-169-161.ec2.internal ip-10-0-169-161.ec2.internal
I0405 18:49:19.437877       1 controller.go:199] csr-pdk9f: CSR not authorized
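
The rejection in the log above reads as a membership test: every DNS name requested in the serving CSR has to appear among the names recorded on the Machine. The following is a minimal Python sketch of that comparison only, not the approver's actual Go code. The function name `authorize_serving_dns_names` and its return shape are hypothetical, and the real approver performs additional checks that are not modeled here.

```python
from typing import List, Tuple


def authorize_serving_dns_names(
    csr_dns_names: List[str], machine_dns_names: List[str]
) -> Tuple[bool, str]:
    """Hypothetical helper: accept only if every CSR DNS name is a machine name.

    Returns (authorized, reason). The reason string mimics the wording of the
    log line in this report purely for illustration.
    """
    allowed = set(machine_dns_names)
    for name in csr_dns_names:
        if name not in allowed:
            reason = "DNS name '{}' not in machine names: {}".format(
                name, " ".join(machine_dns_names)
            )
            return False, reason
    return True, ""


# Values taken from the log above: the kubelet asks for the bare hostname,
# while the Machine only lists the fully qualified name (twice).
ok, reason = authorize_serving_dns_names(
    ["ip-10-0-169-161"],
    ["ip-10-0-169-161.ec2.internal", "ip-10-0-169-161.ec2.internal"],
)
print(ok, reason)
```

With an exact-match comparison like this, a bare hostname is never accepted unless the Machine itself lists that bare hostname.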


Expected results:
Machine joins cluster and gets a kubernetes.io/kubelet-serving certificate.


Additional info:

Comment 2 Michael McCune 2022-04-05 19:46:51 UTC

i looked at the must-gather briefly, did the machine that corresponds to "10.0.169.161" get created with a different IP address?

i do see 2 outstanding CSRs in the must-gather, but those are for a "10.0.160.184", which does appear to have a machine and node object. is this instance related?

also, i'm curious if you are able to work around the current issue by manually approving the cert?

Comment 3 Christoph Blecker 2022-04-05 19:50:28 UTC

The machine and node objects come up fine, and the node bootstrap certificate gets approved. However, the kubernetes.io/kubelet-serving certificate doesn't, and without that one you can't do things like oc debug or oc logs.

Manually approving the cert seems to work for a period of time, but then the CSRs keep accumulating when they try to renew, or the machine is replaced.

Comment 4 Michael McCune 2022-04-05 20:29:47 UTC

ack, thanks for the extra details, it helps.

i'm still curious about the name, "ip-10-0-169-161.ec2.internal", from the logs. i would guess this comes from the hostname, but i don't see any manifests that match this ip address (or similar naming). any extra context around this machine?

Comment 5 Christoph Blecker 2022-04-05 20:41:05 UTC

Sorry, I've already destroyed the test cluster I used to get this must-gather, but the reproducer steps are consistent (I've done this twice, in addition to the production cluster I originally discovered this in).

Comment 6 Michael McCune 2022-04-05 20:46:03 UTC

ok, no worries, we'll try to reproduce.

Comment 7 Damiano Donati 2022-04-25 17:40:20 UTC

I was able to reproduce the behaviour by following these steps:

> It's also possible to reproduce this after cluster installation:
> 1. Create dhcp-options-set with above command
> (creating the DHCP option set with `aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'`)
> 2. Create default OCP IPI cluster, allowing the installer to create its own VPC
> 3. After installation, swap the dhcp-options-set for the VPC with the one above
> 4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine

And I got a similar error in the `machine-approver-controller`:
```
$ oc -n openshift-cluster-machine-approver logs -f machine-approver-6f4f5f79bc-7tsn4 -c machine-approver-controller
I0425 16:07:47.887513       1 controller.go:114] Reconciling CSR: csr-9b4xj
I0425 16:07:47.902107       1 csr_check.go:156] csr-9b4xj: CSR does not appear to be client csr
I0425 16:07:47.905825       1 csr_check.go:542] retrieving serving cert from ip-10-0-214-139.eu-central-1.compute.internal (10.0.214.139:10250)
I0425 16:07:47.907864       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0425 16:07:47.907954       1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-214-139.eu-central-1.compute.internal
E0425 16:07:47.907992       1 csr_check.go:392] csr-9b4xj: DNS name 'ip-10-0-214-139' not in machine names: ip-10-0-214-139.eu-central-1.compute.internal ip-10-0-214-139.eu-central-1.compute.internal
I0425 16:07:47.957793       1 controller.go:199] csr-9b4xj: CSR not authorized
```

I then tried the same reproduction steps again but slightly changed step 1, where I instead used
`aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]},{"Key":"domain-name","Values":["eu-central-1.compute.internal"]}]'`.
This change allowed the machine-approver-controller to find a matching hostname and validate the CSR:
```
I0425 16:10:32.667447       1 controller.go:114] Reconciling CSR: csr-4thph
I0425 16:10:32.714966       1 csr_check.go:156] csr-4thph: CSR does not appear to be client csr
I0425 16:10:32.747043       1 csr_check.go:542] retrieving serving cert from ip-10-0-163-134.eu-central-1.compute.internal (10.0.163.134:10250)
I0425 16:10:32.776643       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0425 16:10:32.777685       1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-163-134.eu-central-1.compute.internal
I0425 16:10:32.820556       1 controller.go:206] CSR csr-4thph approved
```

@cblecker Is there a particular reason for keeping the domain name empty? Thanks.

Comment 8 Christoph Blecker 2022-04-27 00:36:52 UTC

@ddonati None other than the fact that not having a domain name suffix is a valid configuration (for example, a host of "ip-10-0-214-139").
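
To make the expected behaviour concrete, here is a minimal Python sketch of how a provider could derive the internal DNS names for a machine when an empty domain-name is treated as valid. It is an illustration under stated assumptions, not the machine-api-provider-aws implementation: the function name `internal_dns_names` is hypothetical, and an empty string is assumed to stand for a DHCP option set that defines no domain-name.

```python
from typing import List


def internal_dns_names(private_dns_name: str, dhcp_domain_name: str) -> List[str]:
    """Hypothetical helper: list the DNS names a machine may be known by.

    private_dns_name: the instance's private DNS name as reported by EC2.
    dhcp_domain_name: the domain-name value from the VPC's DHCP option set;
        "" is assumed to mean that no domain-name is configured.
    """
    names = [private_dns_name]
    # The short hostname is everything before the first dot.
    hostname = private_dns_name.split(".", 1)[0]
    domain = dhcp_domain_name.strip()
    # With an empty domain the host is known by its bare hostname, so that
    # name is listed too instead of being ignored.
    candidate = hostname if domain == "" else hostname + "." + domain
    if candidate not in names:
        names.append(candidate)
    return names


print(internal_dns_names("ip-10-0-214-139.eu-central-1.compute.internal", ""))
print(
    internal_dns_names(
        "ip-10-0-214-139.eu-central-1.compute.internal",
        "eu-central-1.compute.internal",
    )
)
```

Under these assumptions the first call lists the bare hostname "ip-10-0-214-139" alongside the fully qualified name, which is the name the kubelet requested in the rejected CSR in comment 7.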

Comment 9 Joel Speed 2022-05-05 13:04:50 UTC

This is being investigated between the Cluster Infrastructure and ShiftStack teams. We hope to update you once we have made a bit more progress.

Comment 10 Damiano Donati 2022-05-05 17:27:39 UTC

We think we have a working fix for this and we are testing it. We’ll update on the progress of this as soon as the fix is confirmed.

Comment 13 Huali Liu 2022-05-11 07:06:11 UTC

Reproduced the issue on 4.9.25
Steps:
1. Create dhcp-options-set
liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS	dopt-0c9dfcde919f49105	301721915996
DHCPCONFIGURATIONS	domain-name-servers
VALUES	AmazonProvidedDNS
liuhuali@Lius-MacBook-Pro huali-test % 

2. Create default OCP IPI cluster, allowing the installer to create its own VPC

3. After installation, swap the dhcp-options-set for the VPC with the one above

4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws49a-l5tqf-worker-us-east-2c-rnrwg
machine.machine.openshift.io "huliu-aws49a-l5tqf-worker-us-east-2c-rnrwg" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
huliu-aws49a-l5tqf-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   3h54m
huliu-aws49a-l5tqf-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   3h54m
huliu-aws49a-l5tqf-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   3h54m
huliu-aws49a-l5tqf-worker-us-east-2a-rrm2l   Running   m5.large    us-east-2   us-east-2a   3h52m
huliu-aws49a-l5tqf-worker-us-east-2b-svst7   Running   m5.large    us-east-2   us-east-2b   3h52m
huliu-aws49a-l5tqf-worker-us-east-2c-8rzmk   Running   m5.large    us-east-2   us-east-2c   5m45s
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-148-98.us-east-2.compute.internal    Ready    worker   3h48m   v1.22.5+5c84e52
ip-10-0-154-23.us-east-2.compute.internal    Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-187-34.us-east-2.compute.internal    Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-190-51.us-east-2.compute.internal    Ready    worker   3h48m   v1.22.5+5c84e52
ip-10-0-208-148.us-east-2.compute.internal   Ready    master   3h54m   v1.22.5+5c84e52
ip-10-0-222-232.us-east-2.compute.internal   Ready    worker   3m1s    v1.22.5+5c84e52
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6f4f5f79bc-9qczn   2/2     Running   0          3h55m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f  machine-approver-6f4f5f79bc-9qczn -c machine-approver-controller 
...
I0511 06:35:20.259860       1 controller.go:114] Reconciling CSR: csr-466wh
I0511 06:35:20.270565       1 csr_check.go:156] csr-466wh: CSR does not appear to be client csr
I0511 06:35:20.274916       1 csr_check.go:542] retrieving serving cert from ip-10-0-222-232.us-east-2.compute.internal (10.0.222.232:10250)
I0511 06:35:20.277286       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I0511 06:35:20.277303       1 csr_check.go:201] Falling back to machine-api authorization for ip-10-0-222-232.us-east-2.compute.internal
E0511 06:35:20.277312       1 csr_check.go:392] csr-466wh: DNS name 'ip-10-0-222-232' not in machine names: ip-10-0-222-232.us-east-2.compute.internal ip-10-0-222-232.us-east-2.compute.internal
I0511 06:35:20.277319       1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-10-0-222-232' not in machine names: ip-10-0-222-232.us-east-2.compute.internal ip-10-0-222-232.us-east-2.compute.internal
I0511 06:35:20.284132       1 controller.go:199] csr-466wh: CSR not authorized
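
When triaging logs like the ones above, it can help to pull the rejected name and the machine names out of the error line and check whether the only difference is a missing domain suffix. The following Python sketch does that for the message format shown in this report. The helper name `parse_mismatch` and the returned keys are hypothetical, and the regular expression is an assumption based only on the lines quoted here.

```python
import re
from typing import Any, Dict, Optional

# Matches lines such as:
#   E0511 ... csr_check.go:392] csr-466wh: DNS name 'x' not in machine names: a b
_MISMATCH = re.compile(
    r"\] (?P<csr>csr-\S+): DNS name '(?P<name>[^']+)' "
    r"not in machine names: (?P<machines>.+)$"
)


def parse_mismatch(line: str) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: extract details from a DNS-name mismatch log line."""
    match = _MISMATCH.search(line.strip())
    if match is None:
        return None
    requested = match.group("name")
    machines = sorted(set(match.group("machines").split()))
    # True when some machine name is the requested name plus a domain suffix.
    suffix_missing = any(
        m != requested and m.split(".", 1)[0] == requested for m in machines
    )
    return {
        "csr": match.group("csr"),
        "requested": requested,
        "machine_names": machines,
        "suffix_missing": suffix_missing,
    }


line = (
    "E0511 06:35:20.277312       1 csr_check.go:392] csr-466wh: DNS name "
    "'ip-10-0-222-232' not in machine names: "
    "ip-10-0-222-232.us-east-2.compute.internal "
    "ip-10-0-222-232.us-east-2.compute.internal"
)
print(parse_mismatch(line))
```

A result with `suffix_missing` set to true points at the DHCP option set's domain-name configuration rather than at an unrelated hostname.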


Verified on 4.11.0-0.nightly-2022-05-10-045003
Steps:
1. Create dhcp-options-set
liuhuali@Lius-MacBook-Pro huali-test % aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
DHCPOPTIONS	dopt-0c9dfcde919f49105	301721915996
DHCPCONFIGURATIONS	domain-name-servers
VALUES	AmazonProvidedDNS
liuhuali@Lius-MacBook-Pro huali-test % 

2. Create default OCP IPI cluster, allowing the installer to create its own VPC

3. After installation, swap the dhcp-options-set for the VPC with the one above

4. Delete a worker machine backed by a machineset, allowing MAPI to recreate the machine
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-aws411d-clw88-worker-us-east-2c-ggdqn
machine.machine.openshift.io "huliu-aws411d-clw88-worker-us-east-2c-ggdqn" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                          PHASE     TYPE         REGION      ZONE         AGE
huliu-aws411d-clw88-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   3h38m
huliu-aws411d-clw88-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   3h38m
huliu-aws411d-clw88-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   3h39m
huliu-aws411d-clw88-worker-us-east-2a-lsshh   Running   m6i.large    us-east-2   us-east-2a   3h36m
huliu-aws411d-clw88-worker-us-east-2b-zqjnh   Running   m6i.large    us-east-2   us-east-2b   3h36m
huliu-aws411d-clw88-worker-us-east-2c-vrtkz   Running   m6i.large    us-east-2   us-east-2c   8m28s
liuhuali@Lius-MacBook-Pro huali-test % oc get node   
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-135-122.us-east-2.compute.internal   Ready    master   3h38m   v1.23.3+69213f8
ip-10-0-136-113.us-east-2.compute.internal   Ready    worker   3h34m   v1.23.3+69213f8
ip-10-0-171-251.us-east-2.compute.internal   Ready    master   3h38m   v1.23.3+69213f8
ip-10-0-177-8.us-east-2.compute.internal     Ready    worker   3h34m   v1.23.3+69213f8
ip-10-0-207-202.us-east-2.compute.internal   Ready    worker   5m13s   v1.23.3+69213f8
ip-10-0-216-30.us-east-2.compute.internal    Ready    master   3h39m   v1.23.3+69213f8
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6d975cff6f-nhc4t   2/2     Running   0          3h39m
liuhuali@Lius-MacBook-Pro huali-test % oc -n openshift-cluster-machine-approver logs -f machine-approver-6d975cff6f-nhc4t -c machine-approver-controller
...
I0511 06:45:16.284945       1 controller.go:121] Reconciling CSR: csr-8jffv
I0511 06:45:16.306608       1 csr_check.go:157] csr-8jffv: CSR does not appear to be client csr
I0511 06:45:16.308250       1 csr_check.go:545] retrieving serving cert from ip-10-0-207-202.us-east-2.compute.internal (10.0.207.202:10250)
I0511 06:45:16.310061       1 csr_check.go:182] Failed to retrieve current serving cert: remote error: tls: internal error
I0511 06:45:16.310076       1 csr_check.go:202] Falling back to machine-api authorization for ip-10-0-207-202.us-east-2.compute.internal
I0511 06:45:16.314493       1 controller.go:240] CSR csr-8jffv approved

Comment 15 errata-xmlrpc 2022-08-10 11:03:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
