[1] renamed openshift-machine-api to openshift-machine-api-aws and openshift-machine-api-openstack and landed in 4.2:

  $ git log --oneline origin/release-4.1 | grep 'Append platform'
  ...no hits...
  $ git log --oneline origin/release-4.2 | grep 'Append platform'
  641c5592 Append platform for all credentialRequest's naming for consistency

However, the cluster-version operator does not remove cluster objects when their manifests are removed ([2] is open about teaching the CVO about explicit cluster-object removal requests), and the machine-API operator seems to still prefer the original name [3]. So born-in-4.1 clusters on AWS and OpenStack entering 4.2 and later had both the openshift-machine-api and openshift-machine-api-$PLATFORM CredentialsRequests in cluster.

That didn't really matter as long as the CredentialsRequest had the same content at both names, but [4] added ec2:DescribeDhcpOptions to the new name in bug 1822200, and was backported to 4.4.8 in [5] but not to 4.3 [6]. So born-in-4.1 clusters on AWS and OpenStack which have since been updated to 4.4.8+ or 4.5+ will have a machine-API operator using the no-longer-sufficient openshift-machine-api credential and will fail to provision machines with errors like:

  E0901 02:14:10.770110 1 actuator.go:507] ...-worker-us-east-1a-wn8n9: error describing dhcp: UnauthorizedOperation: You are not authorized to perform this operation. status code: 403, request id: ...
  E0901 02:14:10.770296 1 controller.go:273] ...-worker-us-east-1a-wn8n9: error updating machine: UnauthorizedOperation: You are not authorized to perform this operation. status code: 403, request id: ...

Removing the old openshift-machine-api credential (and possibly kicking some machine-API-controller pod?) should get the controller to pick up the openshift-machine-api-$PLATFORM credential and recover its ability to provision machines.

[1]: https://github.com/openshift/machine-api-operator/pull/355
[2]: https://github.com/openshift/enhancements/pull/419
[3]: citation needed
[4]: https://github.com/openshift/machine-api-operator/pull/572
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1833359#c8
[6]: https://bugzilla.redhat.com/show_bug.cgi?id=1833361#c2
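For anyone triaging a cluster against this, a quick check is to list the CredentialsRequests and look for ec2:DescribeDhcpOptions in each (a sketch only; it assumes an AWS cluster and the standard openshift-cloud-credential-operator namespace):

  $ oc -n openshift-cloud-credential-operator get credentialsrequests | grep machine-api
  $ oc -n openshift-cloud-credential-operator get credentialsrequest openshift-machine-api -o yaml | grep DescribeDhcpOptions
  $ oc -n openshift-cloud-credential-operator get credentialsrequest openshift-machine-api-aws -o yaml | grep DescribeDhcpOptions

If both names show up and only the -aws one lists ec2:DescribeDhcpOptions, the cluster is carrying the stale request described above.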
Once this issue was identified, SREP performed the following to resolve the failing machines in openshift-machine-api.

Deleted the openshift-machine-api CredentialsRequest:

  `oc -n openshift-cloud-credential-operator delete credentialsrequests openshift-machine-api`

This caused the openshift-machine-api-aws CredentialsRequest to be synced by the CCO. However, this only removed the aws-cloud-credentials secret from the openshift-machine-api namespace. The secret was not replaced until:

The CCO pod was deleted:

  `oc delete pod cloud-credential-operator-645df5d75c-b28wp -n openshift-cloud-credential-operator`

The machine-api controllers pod was deleted:

  `oc delete pod machine-api-controllers-6c9d6d94c7-8vp86 -n openshift-machine-api`

The new secret was then synced to the openshift-machine-api namespace, and the stuck `machines` resolved.
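Rolled together, the workaround looks roughly like the following. This is a sketch, not a supported procedure: the label selectors used to find the pods are assumptions, so verify them with --show-labels (or just delete the pods by name as above) before running anything.

  # Remove the stale CredentialsRequest left over from 4.1.
  $ oc -n openshift-cloud-credential-operator delete credentialsrequest openshift-machine-api

  # Bounce the cloud-credential-operator so it re-syncs the -aws request.
  # (Assumed label; check 'oc -n openshift-cloud-credential-operator get pods --show-labels'.)
  $ oc -n openshift-cloud-credential-operator delete pod -l app=cloud-credential-operator

  # Bounce the machine-API controllers so they pick up the refreshed secret.
  # (Assumed labels; check 'oc -n openshift-machine-api get pods --show-labels'.)
  $ oc -n openshift-machine-api delete pod -l api=clusterapi,k8s-app=controller

  # Confirm the aws-cloud-credentials secret is back.
  $ oc -n openshift-machine-api get secret aws-cloud-credentials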
> The machine-API operator seems to still prefer the original name though

The machine controllers have no preference. The referenced secret name (aws-cloud-credentials) [1] has never changed across releases [2], [3]. This must be the CCO letting the old openshift-machine-api resource step on the openshift-machine-api-aws one; they both target the same aws-cloud-credentials secret.

[1] https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/machines.go#L107
[2] https://github.com/openshift/machine-api-operator/commit/17395fd62accddcebcde3fab6fa04e582dd992d1#diff-2f6d565bd12bda7da4bd512364f31fa5R10
[3] https://github.com/openshift/machine-api-operator/blob/master/install/0000_30_machine-api-operator_00_credentials-request.yaml#L12

Once https://github.com/openshift/enhancements/pull/419 lands we can take advantage of it. I'm moving this to CCO to validate my reasoning above and to see if the CCO can give preference to the most recent resource when multiple resources target the same spec.secretRef.name.
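The collision can be seen directly by asking both CredentialsRequests for their spec.secretRef (a sketch; it assumes the AWS resource names used above):

  $ oc -n openshift-cloud-credential-operator get credentialsrequest openshift-machine-api -o jsonpath='{.spec.secretRef.namespace}/{.spec.secretRef.name}{"\n"}'
  $ oc -n openshift-cloud-credential-operator get credentialsrequest openshift-machine-api-aws -o jsonpath='{.spec.secretRef.namespace}/{.spec.secretRef.name}{"\n"}'

Both should print openshift-machine-api/aws-cloud-credentials, which is the overlap the CCO is currently not guarding against.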
Via deads in https://github.com/openshift/enhancements/pull/419:

"No matter what we do, we expect to have operators delete or orphan the resources they are managing before this deletion happens since the operators create more resources than are covered here. AdmissionWebhooks, APIServices, and CRDs would be canonical examples. Since we already require that, it's not really clear to me how much better we make the situation by trying to have the CVO delete resources too. Since the knowledge is fully contained in the operator, why not simply have the operator be responsible for deleting itself when finished?"

This is what I would expect: this is operational knowledge that should be encoded in the machine-API operator. It knows it renamed its cred request; if it sees the old name, it should clean it up. It seems reasonable to have the CCO refuse to do anything if it detects two cred requests targeting the same secret, but that should be a secondary fix, and I would prefer we don't try to make the CCO guess who is authoritative, as there's no real guarantee that the newest should win; it's just true in this specific case.
Cred operator discovery of these conflicts is covered in bug 1874549.
I don't think this is a CVO bug today, although the in-flight enhancement#419 referenced from comment 0 would teach the CVO how to delete things going forward. That's unlikely to land in 4.6 though. Currently the CVO only provides an API for creating resources. Tracking resources that we should remove without explicit requests would be... possible. But not trivial, because the CVO generally does not have access to the manifests which were part of previous releases by the time it is applying the incoming release's manifests.

We could, and probably should, be setting ownerReference [1] on CVO-created resources. Although maybe not on create-only resources? I'm not sure what we do in that space today.

But I think today the simplest fix would be to have the machine-API operator remove the old CredentialsRequest on its own. Or we could stuff this into the config operator or some such. I'm happy to submit a PR in that space to get the code location I prefer, if the push to get this into the CVO before enhancement#419 is actually more about dev capacity than it is about code location. Thoughts?

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents
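As a side note on the ownerReference idea, it's easy to check what is set on the stale object today (a sketch; assumes the AWS resource name):

  $ oc -n openshift-cloud-credential-operator get credentialsrequest openshift-machine-api -o jsonpath='{.metadata.ownerReferences}{"\n"}'

On the clusters hit by this bug the output is presumably empty, which is why nothing garbage-collects the old request once the CVO stops managing it.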
Unsetting severity and priority so that the assigned team re-evaluates.
Setting to low as this only affects clusters born in 4.1 and there's a known workaround (i.e. delete the old resource). Once https://github.com/openshift/enhancements/pull/419 is available we should try to infer how many clusters have both resources, and based on that number we can consider whether it's worth adding a removal manifest in the MAO. To infer the number of clusters that have both resources we probably need the CCO to raise an alert so this shows up through telemetry.
I just hit this issue with my AWS IPI cluster that started life as 4.1.x when upgrading from 4.5.9 to 4.5.11. After manually applying the workaround in Comment 2 (deleting the old openshift-machine-api creds, bouncing pods) things seem to be progressing again.
I believe we are still waiting for the CVO to be able to remove objects before we can progress with this BZ; setting to the upcoming sprint.
Comment 7 is me pushing back against this being a CVO issue in 4.6 or earlier. There are so few of these born-in-4.1 AWS clusters left around that have updated but not yet as far as the impacted 4.4.8+ or 4.5+ that we may just want to WONTFIX this if the alternatives I floated in comment 7 don't sound appealing.
Sorry Trevor, I didn't quite follow the whole conversation back while responding. I see your analysis of the number of clusters that are actually vulnerable, and that makes me lean towards not fixing this. We have a known workaround, and the number of clusters is so low that any effort for a fix as you've described would not really be worth it.
Closing this. This BZ has a workaround: delete the dangling CredentialsRequest and restart the CCO. The correct way to handle such situations requires feature work from the CVO, which has not been done yet. The enhancement allowing resource removal was merged, but the code was not implemented even in 4.6, and people were hitting this in 4.5.9-11 upgrades. We can't backport the feature to previous releases, and we are probably not going to automate this task in the MAO, as it would be a temporary fix for a single release and the number of customers impacted by this issue is low.