1779107 – AWS: csr failed to approve if node has multiple ip addresses

Summary: AWS: csr failed to approve if node has multiple ip addresses

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1780590

Reported:	2019-12-03 10:27 UTC by Florin Peter
Modified:	2023-03-24 16:17 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1780590
Environment:
Last Closed:	2020-05-15 15:13:32 UTC
Target Upstream Version:
Embargoed:

Attachments
screenshots (1.53 MB, application/zip) 2019-12-03 10:27 UTC, Florin Peter
t000-bz666-machineset.yaml (30.51 KB, text/plain) 2019-12-04 11:06 UTC, Florin Peter
machines (37.79 KB, text/plain) 2019-12-04 11:08 UTC, Florin Peter

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-aws pull 277	'None'	closed	Bug 1779107: Update Machine status with all instance IP addresses	2020-12-03 00:23:13 UTC
Github	openshift cluster-machine-approver pull 57	'None'	closed	Bug 1779107: Fall back to machine-api check if certificate check fails	2020-12-03 00:23:13 UTC
Red Hat Bugzilla	1734319	unspecified	CLOSED	AWS: secondary IP address order still wrong	2023-03-24 15:08:36 UTC

Description Florin Peter 2019-12-03 10:27:48 UTC

Created attachment 1641631 [details]
screenshots

Description of problem:
The cluster-machine-approver cannot automatically approve a node's CSR when the node has multiple IP addresses.


Version-Release number of selected component (if applicable):
4.2.8

How reproducible:
Add multiple NICs to a node.
VPC CIDR: 10.0.0.0/21 (see screenshot)
Subnets (see screenshot)

Steps to Reproduce:
1. Add multiple NICs from another subnet to a node.
2. Reboot the node.
3. Wait for the CSR to be created.

Actual results:
Log from cluster-machine-approver:
I1203 09:41:19.788840       1 csr_check.go:403] retrieving serving cert from ip-10-0-4-137.eu-central-1.compute.internal (10.0.4.137:10250)
E1203 09:41:19.790442       1 csr_check.go:163] failed to retrieve current serving cert: remote error: tls: internal error
I1203 09:41:19.790464       1 csr_check.go:168] No existing serving certificate found for ip-10-0-4-137.eu-central-1.compute.internal
I1203 09:41:19.790480       1 main.go:174] CSR csr-p5wzs not authorized: IP address '10.0.6.137' not in machine addresses: 10.0.4.137
I1203 09:41:19.790489       1 main.go:210] Error syncing csr csr-p5wzs: IP address '10.0.6.137' not in machine addresses: 10.0.4.137

$ oc get csr csr-p5wzs -o yaml
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2019-12-03T09:38:35Z"
  generateName: csr-
  name: csr-p5wzs
  resourceVersion: "572015"
  selfLink: /apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-p5wzs
  uid: ad3249b7-15b0-11ea-bf1e-0af318cc29ce
spec:
  groups:
  - system:nodes
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQmFqQ0NBUkFDQVFBd1dURVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVVBd1BnWURWUVFERXpkegplWE4wWlcwNmJtOWtaVHBwY0MweE1DMHdMVFF0TVRNM0xtVjFMV05sYm5SeVlXd3RNUzVqYjIxd2RYUmxMbWx1CmRHVnlibUZzTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFdlJ3emQyVjRGYXh5VjdIWmdBcHQKaGlhbjVkSjhrRVhyOFlOaW45d05YU2dJTzhLdnNQbzBBUEdMdzYzQSsrSnRUc21pUE5ySG15RU4yTkJsMVhJTgpPcUJWTUZNR0NTcUdTSWIzRFFFSkRqRkdNRVF3UWdZRFZSMFJCRHN3T1lJcmFYQXRNVEF0TUMwMExURXpOeTVsCmRTMWpaVzUwY21Gc0xURXVZMjl0Y0hWMFpTNXBiblJsY201aGJJY0VDZ0FFaVljRUNnQUdpVEFLQmdncWhrak8KUFFRREFnTklBREJGQWlFQWphZmhReVZ2eUl4NUlQeERRSmxSRmZkRm8rMXNPY1dmRldlT3VnZ1VDM0lDSUgvMwpMS28wRitMdkozKzl2R0h4bnZSQUtNNS9WZWJteXAxT1ZscTYyZVVDCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  usages:
  - digital signature
  - key encipherment
  - server auth
  username: system:node:ip-10-0-4-137.eu-central-1.compute.internal
status: {}
    
Certificate Request:
    Data:
        Version: 1 (0x0)
        Subject: O=system:nodes, CN=system:node:ip-10-0-4-137.eu-central-1.compute.internal
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:21:8b:72:c3:5c:dc:ec:d7:ec:f9:b1:03:0f:50:
                    68:d3:ea:39:ed:e2:7d:4e:e8:f2:7d:c7:7e:97:66:
                    29:3a:ca:e6:f6:e9:05:92:a9:e9:c9:27:5d:d0:d3:
                    7b:66:bf:5b:4e:53:ff:68:4e:9a:9f:e8:59:9d:fa:
                    f5:80:16:6a:ca
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        Attributes:
        Requested Extensions:
            X509v3 Subject Alternative Name: 
                DNS:ip-10-0-4-137.eu-central-1.compute.internal, IP Address:10.0.4.137, IP Address:10.0.6.137
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:20:33:08:6f:3e:39:93:7e:c9:e6:f9:15:e9:55:c9:
         fd:73:8a:a3:1d:c6:cb:a6:7f:11:21:30:12:30:af:7b:62:da:
         02:21:00:b7:c6:27:32:27:22:4d:d3:81:46:6a:cd:07:13:96:
         fe:83:1d:8a:5b:ca:e3:9a:61:7a:ef:f9:c8:af:fe:7a:a7

$ oc get nodes ip-10-0-4-137.eu-central-1.compute.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/t000-bz666-master-1
    machineconfiguration.openshift.io/currentConfig: rendered-master-df35b22dc36804707e9de2d041773105
    machineconfiguration.openshift.io/desiredConfig: rendered-master-df35b22dc36804707e9de2d041773105
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-12-02T09:38:28Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-4-137
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    node.openshift.io/os_id: rhcos
  name: ip-10-0-4-137.eu-central-1.compute.internal
  resourceVersion: "585050"
  selfLink: /api/v1/nodes/ip-10-0-4-137.eu-central-1.compute.internal
  uid: 7ee3e96b-14e7-11ea-ab7c-02a858a0ee64
spec:
  providerID: aws:///eu-central-1b/i-088df72e83ae33999
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.0.4.137
    type: InternalIP
  - address: 10.0.6.137
    type: InternalIP
  - address: ip-10-0-4-137.eu-central-1.compute.internal
    type: Hostname
  - address: ip-10-0-4-137.eu-central-1.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "25"
    cpu: 3500m
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15331908Ki
    pods: "250"
  capacity:
    attachable-volumes-aws-ebs: "25"
    cpu: "4"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15946308Ki
    pods: "250"

$ oc get nodes -o wide
NAME                                          STATUS   ROLES            AGE   VERSION             INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-4-137.eu-central-1.compute.internal   Ready    master           24h   v1.14.6+6ac6aa4b0   10.0.4.137    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-153.eu-central-1.compute.internal   Ready    infra,worker     24h   v1.14.6+6ac6aa4b0   10.0.4.153    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-17.eu-central-1.compute.internal    Ready    primary,worker   24h   v1.14.6+6ac6aa4b0   10.0.4.17     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-189.eu-central-1.compute.internal   Ready    logging,worker   17h   v1.14.6+6ac6aa4b0   10.0.4.189    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-223.eu-central-1.compute.internal   Ready    primary,worker   24h   v1.14.6+6ac6aa4b0   10.0.4.223    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-34.eu-central-1.compute.internal    Ready    infra,worker     24h   v1.14.6+6ac6aa4b0   10.0.4.34     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-57.eu-central-1.compute.internal    Ready    logging,worker   17h   v1.14.6+6ac6aa4b0   10.0.4.57     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-9.eu-central-1.compute.internal     Ready    master           24h   v1.14.6+6ac6aa4b0   10.0.4.9      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-53.eu-central-1.compute.internal    Ready    primary,worker   24h   v1.14.6+6ac6aa4b0   10.0.5.53     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-60.eu-central-1.compute.internal    Ready    logging,worker   17h   v1.14.6+6ac6aa4b0   10.0.5.60     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-87.eu-central-1.compute.internal    Ready    infra,worker     24h   v1.14.6+6ac6aa4b0   10.0.5.87     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-9.eu-central-1.compute.internal     Ready    master           24h   v1.14.6+6ac6aa4b0   10.0.5.9      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8

Expected results:
The CSR gets approved automatically.

Additional info:

Comment 6 Alberto 2019-12-04 08:59:17 UTC

>Steps to Reproduce:
>1. Add multiple NICs from another subnet to a node.
>2. Reboot the node.
>3. Wait for the CSR to be created.

Were these NICs added outside the machine API, e.g. via the AWS console?
If so, you might need to approve the CSRs manually. The Machine object won't refresh its status with the new IPs for up to 10 minutes, unless a reconcile loop is forced somehow.
Can you please share the Machine objects from oc get machine?

Comment 7 Florin Peter 2019-12-04 11:05:14 UTC

Hi Alberto,

Yes, we added the NICs outside the machine API, using the AWS API.

$ oc get machineset -n openshift-machine-api
NAME                                         DESIRED   CURRENT   READY   AVAILABLE   AGE
t000-bz666-infra-m5large-eu-central-1a       1         1         1       1           2d1h
t000-bz666-infra-m5large-eu-central-1b       1         1         1       1           2d1h
t000-bz666-infra-m5large-eu-central-1c       1         1         1       1           2d1h
t000-bz666-logging-r5axlarge-eu-central-1a   1         1         1       1           2d1h
t000-bz666-logging-r5axlarge-eu-central-1b   1         1         1       1           2d1h
t000-bz666-logging-r5axlarge-eu-central-1c   1         1         1       1           2d1h
t000-bz666-storage-m5large-eu-central-1a     0         0                             2d1h
t000-bz666-storage-m5large-eu-central-1b     0         0                             2d1h
t000-bz666-storage-m5large-eu-central-1c     0         0                             2d1h
t000-bz666-worker-m5large-eu-central-1a      1         1         1       1           2d1h
t000-bz666-worker-m5large-eu-central-1b      1         1         1       1           2d1h
t000-bz666-worker-m5large-eu-central-1c      1         1         1       1           2d1h


$ oc get machineset -n openshift-machine-api -o yaml > /tmp/t000-bz666-machineset.yaml
t000-bz666-machineset.yaml will be attached

Comment 8 Florin Peter 2019-12-04 11:06:24 UTC

Created attachment 1642022 [details]
t000-bz666-machineset.yaml

Comment 9 Florin Peter 2019-12-04 11:08:29 UTC

Created attachment 1642023 [details]
machines

Comment 10 Brad Ison 2019-12-05 15:11:11 UTC

This is likely because, unlike the Kubernetes cloud provider, which adds all of the instance's addresses to the Node object, our AWS provider for the machine-api only takes the primary private and public addresses into account. The address lists need to match for automatic CSR approval.
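
To make the mismatch concrete, here is a minimal Python sketch of the comparison implied by the log line in the description. It is not the approver's actual code (which is written in Go), and the function name `unmatched_san_ips` is hypothetical. It only shows that every IP in the CSR's SAN list has to appear in the Machine's address list:

```python
import ipaddress


def unmatched_san_ips(csr_san_ips, machine_addresses):
    """Return the CSR SAN IPs that are missing from the Machine's addresses.

    Hypothetical helper for illustration; it models only the IP comparison.
    """
    known = set()
    for addr in machine_addresses:
        try:
            known.add(ipaddress.ip_address(addr))
        except ValueError:
            # DNS names and hostnames in the address list are not IPs; skip them.
            continue
    return [ip for ip in csr_san_ips if ipaddress.ip_address(ip) not in known]


# Values taken from the report: the CSR lists both IPs,
# but the Machine status only knows the primary one.
missing = unmatched_san_ips(
    ["10.0.4.137", "10.0.6.137"],
    ["10.0.4.137", "ip-10-0-4-137.eu-central-1.compute.internal"],
)
print(missing)  # ['10.0.6.137']
```

An empty result would mean the IP part of the check passes; here the secondary address 10.0.6.137 is what blocks approval.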

I'm testing a PR against master / v4.4 that uses the same method as the cloud provider to extract the addresses from the EC2 instance. Once that's done, we can look at backporting to the 4.3 and 4.2 releases.
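
As a rough illustration of what "all instance addresses" means, the Python sketch below walks a dict shaped like an EC2 `DescribeInstances` instance record and collects every private IP from every attached interface. The function name `instance_addresses` is hypothetical, and skipping interfaces that are not `in-use` as well as dropping empty values are assumptions made for this sketch, not a description of the PR:

```python
def instance_addresses(instance):
    """Collect (type, address) pairs from an EC2-style instance record.

    Hypothetical helper; `instance` mimics the DescribeInstances field names.
    """
    addresses = []
    seen = set()

    def add(kind, value):
        # Drop empty values and duplicates so the list stays comparable.
        if value and (kind, value) not in seen:
            seen.add((kind, value))
            addresses.append((kind, value))

    for nic in instance.get("NetworkInterfaces", []):
        if nic.get("Status") != "in-use":  # assumption: ignore detached NICs
            continue
        for entry in nic.get("PrivateIpAddresses", []):
            add("InternalIP", entry.get("PrivateIpAddress"))

    add("ExternalIP", instance.get("PublicIpAddress"))
    add("InternalDNS", instance.get("PrivateDnsName"))
    add("ExternalDNS", instance.get("PublicDnsName"))
    return addresses


# Example shaped after the node in this report: two NICs, no public address.
example = {
    "PrivateDnsName": "ip-10-0-4-137.eu-central-1.compute.internal",
    "PublicDnsName": "",
    "NetworkInterfaces": [
        {"Status": "in-use", "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.4.137"}]},
        {"Status": "in-use", "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.6.137"}]},
    ],
}
for kind, value in instance_addresses(example):
    print(kind, value)
```

With both interfaces included, the resulting list lines up with the `InternalIP` entries the Node object already reports.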

Comment 11 Brad Ison 2019-12-05 17:36:39 UTC

Working on verifying the fix on 4.4 now, but I wanted to mention, further to Alberto's earlier comment, that you should be able to work around this for the time being by manually approving any pending CSRs:

```
$ oc get csr

NAME        AGE     REQUESTOR                                                                   CONDITION
csr-8b2br   15m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending 
csr-8vnps   15m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bfd72   5m26s   system:node:ip-10-0-50-126.us-east-2.compute.internal                       Pending 
csr-c57lv   5m26s   system:node:ip-10-0-95-157.us-east-2.compute.internal                       Pending
...

$ oc adm certificate approve <csr_name>

```
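
If there are many CSRs, picking out the pending ones by eye gets tedious. The Python sketch below, with the hypothetical helper name `pending_csr_names`, filters the parsed output of `oc get csr -o json` for requests that have neither an `Approved` nor a `Denied` condition. It only lists names; each CSR should still be reviewed before running `oc adm certificate approve` on it:

```python
def pending_csr_names(csr_list):
    """Return names of CSRs without an Approved or Denied condition.

    Hypothetical helper; `csr_list` is the parsed JSON of `oc get csr -o json`.
    """
    names = []
    for item in csr_list.get("items", []):
        conditions = (item.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


# Trimmed-down sample; "csr-example-approved" is a made-up name for contrast.
sample = {
    "items": [
        {"metadata": {"name": "csr-bfd72"}, "status": {}},
        {"metadata": {"name": "csr-c57lv"}, "status": {}},
        {
            "metadata": {"name": "csr-example-approved"},
            "status": {"conditions": [{"type": "Approved"}]},
        },
    ]
}
print(pending_csr_names(sample))  # ['csr-bfd72', 'csr-c57lv']
```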

Comment 12 Brad Ison 2019-12-05 19:03:36 UTC

Mirroring what I've said in the GitHub PR: this seems to basically work, but there are some outstanding issues complicating things.

Here's what I've done to test this:
- Create a new OpenShift 4.4 cluster.
- Change the kubelet certificate rotation to be more frequent -- every 15 minutes.
- Add a new Machine.
- Add an IP address to that machine in the EC2 console.
- Verify the next cert renewal fails.
- Delete the new machine and any pending CSRs.

Next:
- Switch the machine-controller image to one containing this fix.
- Create new machine.
- Add an IP in the EC2 console.
- The address shows up in the Machine status after the next sync.
- The CSR renewal eventually succeeds.

However, there are a few outstanding issues:
- It looks like the current code is always setting an `ExternalDNS` address, even if it's an empty string. That doesn't seem like the correct behavior, but not continuing to do it causes problems with the CSR renewal.

- The SANs on new CSRs will never match the existing certs after an IP address has been added. This is how the serving cert renewal validates CSRs, and it won't work in this case. I guess the easiest fix here is to always fall back to the machine-api based flow. Otherwise renewal fails until the cert actually expires, then succeeds.

- This probably isn't a problem with longer lived production certs, but the machine-controller doesn't actually seem to be doing a full sync every 10 minutes. It sometimes takes 15 minutes or more for the new addresses to be added to the Machine object.

Comment 13 Brad Ison 2019-12-06 13:38:55 UTC

Here's the other half of this:

  https://github.com/openshift/cluster-machine-approver/pull/57

This causes the approver to always fall back to the machine-api check, which means that along with the machine-controller tracking all addresses, we can approve the new certificates faster.

So, now if a new address is added to an instance, the node should pick it up and generate a new CSR immediately. That CSR will remain pending for a while, but the current certificate will remain valid. Within 10 to 15 minutes the machine-controller will pick up the new address, and the machine-api based approval flow will succeed for the new CSR.

The empty values for `ExternalDNS` the machine-controller was previously adding don't actually seem to be causing any problems here, and they will be removed by this fix.
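
The resulting order of checks can be sketched as follows. This is a simplified Python model, not the approver's Go code: the function name `authorize_renewal` and the returned strings are hypothetical, and only the SAN comparison is modeled (a real approval involves more than this one check):

```python
def authorize_renewal(csr_sans, current_cert_sans, machine_addresses):
    """Model the two-stage SAN check for a serving-cert renewal CSR.

    Hypothetical helper; SANs and addresses are compared as plain strings.
    """
    # Stage 1: the renewal CSR asks for exactly what the current cert has.
    if current_cert_sans is not None and set(csr_sans) == set(current_cert_sans):
        return "approved: matches current serving certificate"
    # Stage 2: fall back to the machine-api check, where every requested
    # SAN has to be one of the Machine's known addresses.
    if set(csr_sans) <= set(machine_addresses):
        return "approved: machine-api fallback"
    return "pending: SANs not in machine addresses yet"


host = "ip-10-0-155-254.us-east-2.compute.internal"
csr = [host, "10.0.144.162", "10.0.155.254"]
current = [host, "10.0.155.254"]

# Before the machine-controller has synced the new address:
print(authorize_renewal(csr, current, [host, "10.0.155.254"]))
# After the sync has added 10.0.144.162 to the Machine status:
print(authorize_renewal(csr, current, [host, "10.0.155.254", "10.0.144.162"]))
```

The first call stays pending and the second is approved through the fallback, which matches the delay of roughly 10 to 15 minutes described above.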

Comment 17 Jianwei Hou 2019-12-20 04:18:46 UTC

Verified in 4.4.0-0.nightly-2019-12-19-223334

When the new interface is attached, a CSR is generated and remains pending. The CSR is approved once the new IP is added to the Machine.


W1220 04:15:14.072758       1 csr_check.go:173] Current SAN Values: [ip-10-0-155-254.us-east-2.compute.internal 10.0.155.254], CSR SAN Values: [ip-10-0-155-254.us-east-2.compute.internal 10.0.144.162 10.0.155.254]
I1220 04:15:14.072785       1 csr_check.go:183] Falling back to machine-api authorization for ip-10-0-155-254.us-east-2.compute.internal
I1220 04:15:14.078397       1 main.go:196] CSR csr-lprct approved
