Bug 1779107
| Summary: | AWS: csr failed to approve if node has multiple ip addresses |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Cloud Compute |
| Cloud Compute sub component: | Other Providers |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | medium |
| Version: | 4.4 |
| Target Milestone: | --- |
| Target Release: | 4.4.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Reporter: | Florin Peter <florin-alexandru.peter> |
| Assignee: | Alberto <agarcial> |
| QA Contact: | Jianwei Hou <jhou> |
| CC: | aos-bugs, brad.ison, joboyer, jokerman, openshift-bugs-escalate, pjakobs, rludva, vlaad |
| Doc Type: | No Doc Update |
| Story Points: | --- |
| : | 1780590 |
| Last Closed: | 2020-05-15 15:13:32 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Blocks: | 1780590 |

Description
Florin Peter
2019-12-03 10:27:48 UTC
>Steps to Reproduce:
>1. Add multiple nics from another subnet to a node.
>2. reboot the node
>3. wait for csr to be created
Were these NICs added outside the machine API, e.g. via the AWS console?
If so, you might need to approve the CSRs manually. The Machine object won't refresh its status with the new IPs for up to 10 minutes, unless a reconciling loop is forced somehow.
Can you please share the Machine object (`oc get machine`)?
Hi Alberto, yes, we added the NICs outside the machine API, using the AWS API.

    $ oc get machineset -n openshift-machine-api
    NAME                                         DESIRED   CURRENT   READY   AVAILABLE   AGE
    t000-bz666-infra-m5large-eu-central-1a       1         1         1       1           2d1h
    t000-bz666-infra-m5large-eu-central-1b       1         1         1       1           2d1h
    t000-bz666-infra-m5large-eu-central-1c       1         1         1       1           2d1h
    t000-bz666-logging-r5axlarge-eu-central-1a   1         1         1       1           2d1h
    t000-bz666-logging-r5axlarge-eu-central-1b   1         1         1       1           2d1h
    t000-bz666-logging-r5axlarge-eu-central-1c   1         1         1       1           2d1h
    t000-bz666-storage-m5large-eu-central-1a     0         0                             2d1h
    t000-bz666-storage-m5large-eu-central-1b     0         0                             2d1h
    t000-bz666-storage-m5large-eu-central-1c     0         0                             2d1h
    t000-bz666-worker-m5large-eu-central-1a      1         1         1       1           2d1h
    t000-bz666-worker-m5large-eu-central-1b      1         1         1       1           2d1h
    t000-bz666-worker-m5large-eu-central-1c      1         1         1       1           2d1h

    $ oc get machineset -n openshift-machine-api -o yaml > /tmp/t000-bz666-machineset.yaml

t000-bz666-machineset.yaml will be attached.

Created attachment 1642022: t000-bz666-machineset.yaml
Created attachment 1642023: machines

This is likely because, unlike the Kubernetes cloud provider, which adds all of the addresses to the Node object, our AWS provider for the machine-api only takes into account the primary private and public addresses. We need the address lists to match for automatic CSR approval.

I'm testing a PR against master / v4.4 that uses the same method as the cloud provider to extract the addresses from the EC2 instance. Once that's done, we can look at backporting to the 4.3 and 4.2 releases.

Working on verifying the fix on 4.4 now, but I wanted to mention, further to Alberto's earlier comment, that you should be able to work around this for the time being by manually approving any pending CSRs:

```
$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-8b2br   15m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-8vnps   15m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bfd72   5m26s   system:node:ip-10-0-50-126.us-east-2.compute.internal                       Pending
csr-c57lv   5m26s   system:node:ip-10-0-95-157.us-east-2.compute.internal                       Pending
...

$ oc adm certificate approve <csr_name>
```

Mirroring what I've said in the GitHub PR: this seems to basically work, but there are some outstanding issues complicating things.

Here's what I've done to test this:

- Create a new OpenShift 4.4 cluster.
- Change the kubelet certificate rotation to be more frequent: every 15 minutes.
- Add a new Machine.
- Add an IP address to that machine in the EC2 console.
- Verify the next cert renewal fails.
- Delete the new machine and any pending CSRs.

Next:

- Switch the machine-controller image to one containing this fix.
- Create a new machine.
- Add an IP in the EC2 console.
- The address shows up in the Machine status after the next sync.
- The CSR renewal eventually succeeds.

However, there are a few outstanding issues:

- It looks like the current code always sets an `ExternalDNS` address, even if it's an empty string. That doesn't seem like the correct behavior, but no longer doing it causes problems with the CSR renewal.
- The SANs on new CSRs will never match the existing certs after an IP address has been added. Matching SANs is how the serving cert renewal validates CSRs, so it won't work in this case. I guess the easiest fix here is to always fall back to the machine-api based flow. Otherwise renewal fails until the cert actually expires, then succeeds.
- This probably isn't a problem with longer-lived production certs, but the machine-controller doesn't actually seem to be doing a full sync every 10 minutes. It sometimes takes 15 minutes or more for the new addresses to be added to the Machine object.

Here's the other half of this: https://github.com/openshift/cluster-machine-approver/pull/57

This causes the approver to always fall back to the machine-api check, which means that, along with the machine-controller tracking all addresses, we can approve the new certificates faster.

So now, if a new address is added to an instance, the node should pick it up and generate a new CSR immediately. That CSR will remain pending for a while, but the current certificate will remain valid. Within 10 - 15 minutes the machine-controller will pick up the new address, and the machine-api based approval flow will succeed for the new CSR.

The empty values for `ExternalDNS` that the machine-controller was previously adding don't actually seem to be causing any problems here, and they will be removed by this fix.

Verified in 4.4.0-0.nightly-2019-12-19-223334.

When the new interface is attached, a CSR is generated and remains pending approval. The CSR is approved after the new IP is added to the machine.

    W1220 04:15:14.072758 1 csr_check.go:173] Current SAN Values: [ip-10-0-155-254.us-east-2.compute.internal 10.0.155.254], CSR SAN Values: [ip-10-0-155-254.us-east-2.compute.internal 10.0.144.162 10.0.155.254]
    I1220 04:15:14.072785 1 csr_check.go:183] Falling back to machine-api authorization for ip-10-0-155-254.us-east-2.compute.internal
    I1220 04:15:14.078397 1 main.go:196] CSR csr-lprct approved
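
The approval sequence in the log above can be illustrated with a small model. This is only a sketch of the decision flow described in this bug, not the cluster-machine-approver code (which is written in Go). The function names and return values are hypothetical, and the model assumes the renewal check requires the SAN sets to be equal while the machine-api fallback requires every requested SAN to appear among the Machine's addresses.

```python
def sans_match(current_sans, csr_sans):
    """Renewal check (assumed): the CSR requests exactly the SANs on the current cert."""
    return set(current_sans) == set(csr_sans)


def machine_api_allows(csr_sans, machine_addresses):
    """Fallback check (assumed): every requested SAN is an address the Machine reports."""
    return set(csr_sans) <= set(machine_addresses)


def authorize(current_sans, csr_sans, machine_addresses):
    """Return which path approves the CSR, or 'pending' if neither does (hypothetical labels)."""
    if sans_match(current_sans, csr_sans):
        return "renewal"
    if machine_api_allows(csr_sans, machine_addresses):
        return "machine-api"
    return "pending"


# Values taken from the log lines in this bug.
node = "ip-10-0-155-254.us-east-2.compute.internal"
current = [node, "10.0.155.254"]
requested = [node, "10.0.144.162", "10.0.155.254"]

# Before the machine-controller has synced the new address into the Machine status.
before_sync = authorize(current, requested, [node, "10.0.155.254"])
# After the sync, the Machine status lists the new address as well.
after_sync = authorize(current, requested, [node, "10.0.155.254", "10.0.144.162"])

print(before_sync)  # pending
print(after_sync)   # machine-api
```

Under these assumptions, the CSR stays pending until the machine-controller has recorded the new address, which corresponds to the 10 - 15 minute delay reported in the comments.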