Created attachment 1641631 [details]
screenshots

Description of problem:
The cluster-machine-approver is not able to automatically approve the CSR for a node if the node has multiple IP addresses.

Version-Release number of selected component (if applicable):
4.2.8

How reproducible:
Always, after adding multiple NICs to a node.

VPC CIDR: 10.0.0.0/21 (see screenshot)
Subnets: see screenshot

Steps to Reproduce:
1. Add one or more NICs from another subnet to a node.
2. Reboot the node.
3. Wait for the CSR to be created.

Actual results:

Log from cluster-machine-approver:

I1203 09:41:19.788840       1 csr_check.go:403] retrieving serving cert from ip-10-0-4-137.eu-central-1.compute.internal (10.0.4.137:10250)
E1203 09:41:19.790442       1 csr_check.go:163] failed to retrieve current serving cert: remote error: tls: internal error
I1203 09:41:19.790464       1 csr_check.go:168] No existing serving certificate found for ip-10-0-4-137.eu-central-1.compute.internal
I1203 09:41:19.790480       1 main.go:174] CSR csr-p5wzs not authorized: IP address '10.0.6.137' not in machine addresses: 10.0.4.137
I1203 09:41:19.790489       1 main.go:210] Error syncing csr csr-p5wzs: IP address '10.0.6.137' not in machine addresses: 10.0.4.137

$ oc get csr csr-p5wzs -o yaml
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2019-12-03T09:38:35Z"
  generateName: csr-
  name: csr-p5wzs
  resourceVersion: "572015"
  selfLink: /apis/certificates.k8s.io/v1beta1/certificatesigningrequests/csr-p5wzs
  uid: ad3249b7-15b0-11ea-bf1e-0af318cc29ce
spec:
  groups:
  - system:nodes
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQmFqQ0NBUkFDQVFBd1dURVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVVBd1BnWURWUVFERXpkegplWE4wWlcwNmJtOWtaVHBwY0MweE1DMHdMVFF0TVRNM0xtVjFMV05sYm5SeVlXd3RNUzVqYjIxd2RYUmxMbWx1CmRHVnlibUZzTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFdlJ3emQyVjRGYXh5VjdIWmdBcHQKaGlhbjVkSjhrRVhyOFlOaW45d05YU2dJTzhLdnNQbzBBUEdMdzYzQSsrSnRUc21pUE5ySG15RU4yTkJsMVhJTgpPcUJWTUZNR0NTcUdTSWIzRFFFSkRqRkdNRVF3UWdZRFZSMFJCRHN3T1lJcmFYQXRNVEF0TUMwMExURXpOeTVsCmRTMWpaVzUwY21Gc0xURXVZMjl0Y0hWMFpTNXBiblJsY201aGJJY0VDZ0FFaVljRUNnQUdpVEFLQmdncWhrak8KUFFRREFnTklBREJGQWlFQWphZmhReVZ2eUl4NUlQeERRSmxSRmZkRm8rMXNPY1dmRldlT3VnZ1VDM0lDSUgvMwpMS28wRitMdkozKzl2R0h4bnZSQUtNNS9WZWJteXAxT1ZscTYyZVVDCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  usages:
  - digital signature
  - key encipherment
  - server auth
  username: system:node:ip-10-0-4-137.eu-central-1.compute.internal
status: {}

Decoded certificate request:

Certificate Request:
    Data:
        Version: 1 (0x0)
        Subject: O=system:nodes, CN=system:node:ip-10-0-4-137.eu-central-1.compute.internal
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:21:8b:72:c3:5c:dc:ec:d7:ec:f9:b1:03:0f:50:
                    68:d3:ea:39:ed:e2:7d:4e:e8:f2:7d:c7:7e:97:66:
                    29:3a:ca:e6:f6:e9:05:92:a9:e9:c9:27:5d:d0:d3:
                    7b:66:bf:5b:4e:53:ff:68:4e:9a:9f:e8:59:9d:fa:
                    f5:80:16:6a:ca
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        Attributes:
        Requested Extensions:
            X509v3 Subject Alternative Name:
                DNS:ip-10-0-4-137.eu-central-1.compute.internal, IP Address:10.0.4.137, IP Address:10.0.6.137
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:20:33:08:6f:3e:39:93:7e:c9:e6:f9:15:e9:55:c9:
         fd:73:8a:a3:1d:c6:cb:a6:7f:11:21:30:12:30:af:7b:62:da:
         02:21:00:b7:c6:27:32:27:22:4d:d3:81:46:6a:cd:07:13:96:
         fe:83:1d:8a:5b:ca:e3:9a:61:7a:ef:f9:c8:af:fe:7a:a7

$ oc get nodes ip-10-0-4-137.eu-central-1.compute.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/t000-bz666-master-1
    machineconfiguration.openshift.io/currentConfig: rendered-master-df35b22dc36804707e9de2d041773105
    machineconfiguration.openshift.io/desiredConfig: rendered-master-df35b22dc36804707e9de2d041773105
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-12-02T09:38:28Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-4-137
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    node.openshift.io/os_id: rhcos
  name: ip-10-0-4-137.eu-central-1.compute.internal
  resourceVersion: "585050"
  selfLink: /api/v1/nodes/ip-10-0-4-137.eu-central-1.compute.internal
  uid: 7ee3e96b-14e7-11ea-ab7c-02a858a0ee64
spec:
  providerID: aws:///eu-central-1b/i-088df72e83ae33999
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.0.4.137
    type: InternalIP
  - address: 10.0.6.137
    type: InternalIP
  - address: ip-10-0-4-137.eu-central-1.compute.internal
    type: Hostname
  - address: ip-10-0-4-137.eu-central-1.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "25"
    cpu: 3500m
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15331908Ki
    pods: "250"
  capacity:
    attachable-volumes-aws-ebs: "25"
    cpu: "4"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15946308Ki
    pods: "250"

$ oc get nodes -o wide
NAME                                          STATUS   ROLES            AGE   VERSION             INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                    KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-4-137.eu-central-1.compute.internal   Ready    master           24h   v1.14.6+6ac6aa4b0   10.0.4.137    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-153.eu-central-1.compute.internal   Ready    infra,worker     24h   v1.14.6+6ac6aa4b0   10.0.4.153    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-17.eu-central-1.compute.internal    Ready    primary,worker   24h   v1.14.6+6ac6aa4b0   10.0.4.17     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-189.eu-central-1.compute.internal   Ready    logging,worker   17h   v1.14.6+6ac6aa4b0   10.0.4.189    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-223.eu-central-1.compute.internal   Ready    primary,worker   24h   v1.14.6+6ac6aa4b0   10.0.4.223    <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-34.eu-central-1.compute.internal    Ready    infra,worker     24h   v1.14.6+6ac6aa4b0   10.0.4.34     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-57.eu-central-1.compute.internal    Ready    logging,worker   17h   v1.14.6+6ac6aa4b0   10.0.4.57     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-4-9.eu-central-1.compute.internal     Ready    master           24h   v1.14.6+6ac6aa4b0   10.0.4.9      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-53.eu-central-1.compute.internal    Ready    primary,worker   24h   v1.14.6+6ac6aa4b0   10.0.5.53     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-60.eu-central-1.compute.internal    Ready    logging,worker   17h   v1.14.6+6ac6aa4b0   10.0.5.60     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-87.eu-central-1.compute.internal    Ready    infra,worker     24h   v1.14.6+6ac6aa4b0   10.0.5.87     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
ip-10-0-5-9.eu-central-1.compute.internal     Ready    master           24h   v1.14.6+6ac6aa4b0   10.0.5.9      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8

Expected results:
The CSR is approved automatically.

Additional info:
>Steps to Reproduce:
>1. Add multiple nics from another subnet to a node.
>2. reboot the node
>3. wait for csr to be created

Were these NICs added outside the machine API, e.g. via the AWS console? If so, you might need to approve the CSRs manually. The Machine object won't refresh its status with the new IPs until the next resync (roughly 10 minutes) or until a reconcile loop is forced somehow.

Can you please share the Machine objects (oc get machine)?
Hi Alberto,

yes, we added the NICs from outside the machine API, using the AWS API.

$ oc get machineset -n openshift-machine-api
NAME                                         DESIRED   CURRENT   READY   AVAILABLE   AGE
t000-bz666-infra-m5large-eu-central-1a       1         1         1       1           2d1h
t000-bz666-infra-m5large-eu-central-1b       1         1         1       1           2d1h
t000-bz666-infra-m5large-eu-central-1c       1         1         1       1           2d1h
t000-bz666-logging-r5axlarge-eu-central-1a   1         1         1       1           2d1h
t000-bz666-logging-r5axlarge-eu-central-1b   1         1         1       1           2d1h
t000-bz666-logging-r5axlarge-eu-central-1c   1         1         1       1           2d1h
t000-bz666-storage-m5large-eu-central-1a     0         0                             2d1h
t000-bz666-storage-m5large-eu-central-1b     0         0                             2d1h
t000-bz666-storage-m5large-eu-central-1c     0         0                             2d1h
t000-bz666-worker-m5large-eu-central-1a      1         1         1       1           2d1h
t000-bz666-worker-m5large-eu-central-1b      1         1         1       1           2d1h
t000-bz666-worker-m5large-eu-central-1c      1         1         1       1           2d1h

$ oc get machineset -n openshift-machine-api -o yaml > /tmp/t000-bz666-machineset.yaml

t000-bz666-machineset.yaml will be attached.
Created attachment 1642022 [details]
t000-bz666-machineset.yaml
Created attachment 1642023 [details]
machines
This is most likely because, unlike the Kubernetes cloud provider, which adds all of the instance's addresses to the Node object, our AWS provider for the machine-api only records the primary private and public addresses. The two address lists need to match for automatic CSR approval.

I'm testing a PR against master / 4.4 that uses the same method as the cloud provider to extract the addresses from the EC2 instance. Once that's merged, we can look at backporting it to the 4.3 and 4.2 releases.
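For reference, here is a minimal sketch of that approach (not the actual PR), assuming the aws-sdk-go v1 EC2 types and k8s.io/api core/v1; the helper name extractNodeAddresses is illustrative:

```go
package machine

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	corev1 "k8s.io/api/core/v1"
)

// extractNodeAddresses walks every attached network interface on the EC2
// instance instead of only the instance-level primary private address,
// which is roughly what the Kubernetes AWS cloud provider does when it
// populates Node.Status.Addresses.
func extractNodeAddresses(instance *ec2.Instance) []corev1.NodeAddress {
	var addresses []corev1.NodeAddress

	for _, iface := range instance.NetworkInterfaces {
		// Only consider interfaces that are currently in use.
		if aws.StringValue(iface.Status) != ec2.NetworkInterfaceStatusInUse {
			continue
		}
		for _, ip := range iface.PrivateIpAddresses {
			if addr := aws.StringValue(ip.PrivateIpAddress); addr != "" {
				addresses = append(addresses, corev1.NodeAddress{
					Type:    corev1.NodeInternalIP,
					Address: addr,
				})
			}
		}
	}

	// Public address and DNS names are only appended when non-empty.
	if addr := aws.StringValue(instance.PublicIpAddress); addr != "" {
		addresses = append(addresses, corev1.NodeAddress{Type: corev1.NodeExternalIP, Address: addr})
	}
	if name := aws.StringValue(instance.PrivateDnsName); name != "" {
		addresses = append(addresses, corev1.NodeAddress{Type: corev1.NodeInternalDNS, Address: name})
	}
	if name := aws.StringValue(instance.PublicDnsName); name != "" {
		addresses = append(addresses, corev1.NodeAddress{Type: corev1.NodeExternalDNS, Address: name})
	}
	return addresses
}
```

With that in place, every private IP attached to the instance ends up in the Machine status, so the approver's address comparison can succeed.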
Working on verifying the fix on 4.4 now, but further to Alberto's earlier comment, I wanted to mention that you should be able to work around this for the time being by manually approving any pending CSRs:

```
$ oc get csr
NAME        AGE     REQUESTOR                                                                    CONDITION
csr-8b2br   15m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Pending
csr-8vnps   15m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Pending
csr-bfd72   5m26s   system:node:ip-10-0-50-126.us-east-2.compute.internal                        Pending
csr-c57lv   5m26s   system:node:ip-10-0-95-157.us-east-2.compute.internal                        Pending
...

$ oc adm certificate approve <csr_name>
```
Mirroring what I've said in the GitHub PR: this basically works, but there are some outstanding issues complicating things.

Here's what I've done to test this:

- Create a new OpenShift 4.4 cluster.
- Change the kubelet certificate rotation to be more frequent (every 15 minutes).
- Add a new Machine.
- Add an IP address to that machine in the EC2 console.
- Verify the next cert renewal fails.
- Delete the new machine and any pending CSRs.

Next:

- Switch the machine-controller image to one containing this fix.
- Create a new machine.
- Add an IP in the EC2 console.
- The address shows up in the Machine status after the next sync.
- The CSR renewal eventually succeeds.

However, there are a few outstanding issues:

- The current code always sets an `ExternalDNS` address, even if it's an empty string. That doesn't seem like the correct behavior, but dropping it causes problems with the CSR renewal.
- The SANs on new CSRs will never match the existing certs after an IP address has been added (see the sketch below). This SAN comparison is how serving cert renewal validates CSRs, and it won't work in this case. The easiest fix is probably to always fall back to the machine-api based flow. Otherwise renewal fails until the cert actually expires, then succeeds.
- This probably isn't a problem with longer-lived production certs, but the machine-controller doesn't actually seem to be doing a full sync every 10 minutes. It sometimes takes 15 minutes or more for the new addresses to be added to the Machine object.
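To make the SAN issue concrete, here is a simplified sketch (illustrative names, not the approver's actual code) of a renewal check based on SAN equality. As soon as the node requests an extra IP in the CSR, this comparison can never succeed against the old serving cert, which is why the machine-api fallback is needed:

```go
package approver

import (
	"crypto/x509"
	"net"
	"sort"
)

// sanValues flattens DNS and IP SANs into one sorted list so that a
// certificate and a CSR can be compared order-independently.
func sanValues(dnsNames []string, ips []net.IP) []string {
	vals := append([]string{}, dnsNames...)
	for _, ip := range ips {
		vals = append(vals, ip.String())
	}
	sort.Strings(vals)
	return vals
}

func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// sansMatch reports whether the CSR requests exactly the SANs that the
// node's current serving certificate already carries. Once a new IP is
// attached to the instance, the CSR contains the extra address and this
// check fails until the old cert expires.
func sansMatch(current *x509.Certificate, csr *x509.CertificateRequest) bool {
	return equalStrings(
		sanValues(current.DNSNames, current.IPAddresses),
		sanValues(csr.DNSNames, csr.IPAddresses),
	)
}
```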
Here's the other half of this: https://github.com/openshift/cluster-machine-approver/pull/57

This makes the approver always fall back to the machine-api check, which means that, combined with the machine-controller tracking all addresses, we can approve the new certificates faster.

So now, if a new address is added to an instance, the node picks it up and generates a new CSR immediately. That CSR will remain pending for a while, but the current certificate remains valid. Within 10-15 minutes the machine-controller picks up the new address and the machine-api based approval flow succeeds for the new CSR.

The empty `ExternalDNS` values the machine-controller was previously adding don't actually seem to be causing any problems here, and they will be removed by this fix.
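Roughly, the machine-api based check that the approver falls back to amounts to something like the sketch below (illustrative names, not the actual cluster-machine-approver code): every IP SAN requested by the CSR must already appear in the Machine's status addresses, which is why approval only succeeds after the machine-controller's next sync records the new IP.

```go
package approver

import (
	"crypto/x509"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// authorizeAgainstMachine allows a serving CSR only if every IP SAN it
// requests is already listed in the Machine's status addresses. Until the
// machine-controller syncs the newly attached IP, this returns an error
// and the CSR stays pending.
func authorizeAgainstMachine(csr *x509.CertificateRequest, machineAddrs []corev1.NodeAddress) error {
	known := map[string]bool{}
	for _, addr := range machineAddrs {
		known[addr.Address] = true
	}
	for _, ip := range csr.IPAddresses {
		if !known[ip.String()] {
			return fmt.Errorf("IP address '%s' not in machine addresses", ip)
		}
	}
	return nil
}
```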
Verified in 4.4.0-0.nightly-2019-12-19-223334.

When the new interface is attached, a CSR is generated and stays pending; it is approved after the new IP is added to the Machine.

W1220 04:15:14.072758       1 csr_check.go:173] Current SAN Values: [ip-10-0-155-254.us-east-2.compute.internal 10.0.155.254], CSR SAN Values: [ip-10-0-155-254.us-east-2.compute.internal 10.0.144.162 10.0.155.254]
I1220 04:15:14.072785       1 csr_check.go:183] Falling back to machine-api authorization for ip-10-0-155-254.us-east-2.compute.internal
I1220 04:15:14.078397       1 main.go:196] CSR csr-lprct approved