Bug 1874328
| Summary: | Stale openshift-machine-api CredentialsRequest causes AWS provisioning errors for born-in-4.1 clusters | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | bbrownin, dgrigore, dofinn, haowang |
| Version: | 4.2.0 | Keywords: | ServiceDeliveryImpact, Upgrades |
| Target Milestone: | --- | ||
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-29 13:05:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2020-09-01 04:32:16 UTC
Once this issue was identified, SREP performed the following to resolve failing machines in openshift-machine-api:

1. Deleted the stale `openshift-machine-api` CredentialsRequest:

   `oc -n openshift-cloud-credential-operator delete credentialsrequests openshift-machine-api`

   This caused the `openshift-machine-api-aws` CredentialsRequest to be synced by the CCO. However, this only removed the `aws-cloud-credentials` secret from the openshift-machine-api namespace. The secret was not replaced until:

2. The CCO pod was deleted:

   `oc delete pod cloud-credential-operator-645df5d75c-b28wp -n openshift-cloud-credential-operator`

3. The machine-api controllers pod was deleted:

   `oc delete pod machine-api-controllers-6c9d6d94c7-8vp86 -n openshift-machine-api`

The new secret was then synced to the openshift-machine-api namespace, and the stuck `machines` recovered.

> The machine-API operator seems to still prefer the original name though

The machines have no preference. The referenced secret name (`aws-cloud-credentials`) [1] has never changed across releases [2][3]. This must be the CCO preferring the old `openshift-machine-api` resource and stepping on the `openshift-machine-api-aws` one; they both target the same `aws-cloud-credentials` secret.

[1] https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/machines.go#L107
[2] https://github.com/openshift/machine-api-operator/commit/17395fd62accddcebcde3fab6fa04e582dd992d1#diff-2f6d565bd12bda7da4bd512364f31fa5R10
[3] https://github.com/openshift/machine-api-operator/blob/master/install/0000_30_machine-api-operator_00_credentials-request.yaml#L12

Once https://github.com/openshift/enhancements/pull/419 lands we can take advantage of it.
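The conflict described above — two CredentialsRequests claiming the same target secret — could be detected mechanically. A minimal Python sketch of that check (illustrative only: the dict shapes mirror the `spec.secretRef` fields of the CRD, but this is not CCO code, and `find_conflicting_requests` is a hypothetical helper):

```python
from collections import defaultdict

def find_conflicting_requests(cred_requests):
    """Group CredentialsRequest objects by the secret they target and
    return only the groups where more than one request claims the same
    (namespace, name) pair."""
    by_secret = defaultdict(list)
    for cr in cred_requests:
        ref = cr["spec"]["secretRef"]
        by_secret[(ref["namespace"], ref["name"])].append(cr["metadata"]["name"])
    return {secret: names for secret, names in by_secret.items() if len(names) > 1}

# Minimal stand-ins for the two requests described above:
stale = {"metadata": {"name": "openshift-machine-api"},
         "spec": {"secretRef": {"namespace": "openshift-machine-api",
                                "name": "aws-cloud-credentials"}}}
renamed = {"metadata": {"name": "openshift-machine-api-aws"},
           "spec": {"secretRef": {"namespace": "openshift-machine-api",
                                  "name": "aws-cloud-credentials"}}}

# Both requests target openshift-machine-api/aws-cloud-credentials,
# so they come back as one conflicting group.
conflicts = find_conflicting_requests([stale, renamed])
```

A check along these lines is what the alerting idea discussed later in this bug would need: a single conflicting group here corresponds to a cluster still carrying the pre-rename resource.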
I'm moving this to the CCO to validate my reasoning above and to see whether the CCO can give preference to the most recent resource when multiple resources target the same `spec.secretRef.name`.

Via deads in https://github.com/openshift/enhancements/pull/419:

> "No matter what we do, we expect to have operators delete or orphan the resources they are managing before this deletion happens since the operators create more resources than are covered here. AdmissionWebhooks, APIServices, and CRDs would be canonical examples. Since we already require that, it's not really clear to me how much better we make the situation by trying to have the CVO delete resources too. Since the knowledge is fully contained in the operator, why not simply have the operator be responsible for deleting itself when finished?"

This is what I would expect: this is operational knowledge that should be encoded in the machine-api operator. It knows it renamed its cred request, so if it sees the old name it should clean it up. It seems reasonable to have the CCO refuse to do anything if it detects two cred requests targeting the same secret, but that should be a secondary fix, and I would prefer we not try to make the CCO guess who is authoritative, since there's no real guarantee that the newest should win; that just happens to be true in this specific case. Cred-operator discovery of these conflicts is covered in bug 1874549.

I don't think this is a CVO bug today, although the in-flight enhancement#419 referenced from comment 0 would teach the CVO how to delete things going forward. That's unlikely to land in 4.6, though. Currently the CVO only provides an API for creating resources. Tracking resources that we should remove without explicit requests would be possible, but not trivial, because the CVO generally does not have access to the manifests which were part of previous releases by the time it is applying the incoming release's manifests.
We could, and probably should, be setting ownerReference [1] on CVO-created resources. Although maybe not on create-only resources? I'm not sure what we do in that space today. But I think the simplest fix today would be to have the machine-API operator remove the old CredentialsRequest on its own. Or we could stuff this into the config operator or some such. I'm happy to submit a PR in that space to get the code location I prefer, if the push to get this into the CVO before enhancement#419 is actually more about dev capacity than about code location. Thoughts?

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents

Unsetting severity and priority so that the assigned team re-evaluates.

Setting to low, as this affects only clusters born in 4.1 and there's a known workaround (i.e., deleting the old resource). Once https://github.com/openshift/enhancements/pull/419 is available, we should try to infer how many clusters have both resources, and based on that number we can decide whether it's worth adding a removal manifest to the MAO. To infer the number of clusters that have both resources, we probably need the CCO to trigger an alert so this shows up in telemetry.

I just hit this issue with my AWS IPI cluster that started life as 4.1.x when upgrading from 4.5.9 to 4.5.11. After manually applying the workaround in comment 2 (deleting the old openshift-machine-api creds, bouncing pods), things seem to be progressing again.

I believe we are still waiting for the CVO to be able to remove objects before we can progress with this BZ; setting to the upcoming sprint.

Comment 7 is me pushing back against this being a CVO issue in 4.6 or earlier. There are so few of these born-in-4.1 AWS clusters left around which have updated, but not as far as the impacted 4.4.8+ or 4.5+, that we may just want to WONTFIX this if the alternatives I floated in comment 7 don't sound appealing.
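For context on the ownerReference idea floated above: a minimal sketch of what stamping an owner onto a CVO-created resource could look like (illustrative Python over plain dicts; the field names follow the Kubernetes `metadata.ownerReferences` schema, but the `set_owner_reference` helper and the `uid` value are hypothetical):

```python
def set_owner_reference(resource, owner):
    """Add `owner` to resource's metadata.ownerReferences so the
    garbage collector deletes the resource when the owner goes away."""
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
    }
    refs = resource.setdefault("metadata", {}).setdefault("ownerReferences", [])
    # Idempotent: don't add the same owner twice.
    if not any(r["uid"] == ref["uid"] for r in refs):
        refs.append(ref)
    return resource

cvo_owner = {"apiVersion": "config.openshift.io/v1", "kind": "ClusterVersion",
             "metadata": {"name": "version", "uid": "1234-abcd"}}  # hypothetical uid
resource = {"metadata": {"name": "openshift-machine-api-aws"}}
set_owner_reference(resource, cvo_owner)
```

One reason this is not a drop-in fix: Kubernetes garbage collection disallows cross-namespace owner references, and a namespaced dependent referencing a cluster-scoped owner is treated as unresolvable, so the owner/dependent scoping would need care.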
Sorry Trevor, I didn't quite follow the whole conversation back while responding. I see your analysis of the number of clusters that are actually vulnerable, and that makes me lean toward not fixing this. We have a known workaround, and the number of clusters is so low that any effort for a fix as you've described would not really be worth it.

Closing this. This BZ has a workaround: delete the dangling CredentialsRequest and restart the CCO. Handling such situations correctly requires CVO feature work, which has not yet been done. The enhancement allowing resource removal was merged, but the code was not implemented even in 4.6, and people were hitting this on 4.5.9-4.5.11 upgrades. We can't backport the feature to previous releases, and we are probably not going to automate this task in the MAO, as that would be a temporary fix for a single release, and the number of customers impacted by this issue is low.