Bug 2104511
| Summary: | Ensure mapi-aws compatibility with deprecated AMIs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | jaharrin, mimccune, wking |
| Version: | 4.1.z | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-24 14:53:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Scott Dodson
2022-07-06 13:58:34 UTC
i do see several references to `DescribeImages` in github.com/openshift/machine-api-provider-aws. it looks like we have 2 usages in instances.go [0] that will need to be reviewed. i'm not sure that we can mitigate the issue from within code, but we could print some log messages when that call fails. my concern here is that, given the guidance from AWS, we could try to launch an instance using a deprecated AMI id but would no longer be able to check that AMI using DescribeImages in the provider code, which seems like it will be difficult for users to debug.

@Scott, would adding more logging around this be sufficient to mitigate the risk for long lived clusters?

[0] https://github.com/openshift/machine-api-provider-aws/blob/main/pkg/actuators/machine/instances.go

Looks like the only getAMI caller is [1], where it's pulling the ID (which will still work [2]) or a filter set (which will stop working [3]) right off the provider config. So how are our production Machine(Set)s distributed? The installer currently switches in [4] based on whether it has an osimage, falling back to by-tag filters when it does not. And osimage seems to come from the install-config's amiID [5,6].

Checking 4.11 CI [7]:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1544255021804163072/artifacts/e2e-aws-serial/gather-extra/artifacts/machinesets.json | jq -c '.items[].spec.template.spec.providerSpec.value.ami' | uniq -c
          2 {"id":"ami-0373a8d3b2a246ec5"}

so that's promising. And while our CI harness sometimes sets amiID [8], there is no 'patching rhcos ami' from that step in this run [9]. But we'd need to look at Insights or something to see what folks were doing beyond the installer's default (and I haven't looked back before 4.11 to confirm the installer default hasn't evolved).

[1]: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/instances.go#L287
[2]: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/instances.go#L155-L158
[3]: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/instances.go#L160-L165
[4]: https://github.com/openshift/installer/blob/b644753051f3d5d1ea9d52552c2943fab1d9954d/pkg/asset/machines/aws/machines.go#L125-L132
[5]: https://github.com/openshift/installer/blob/b644753051f3d5d1ea9d52552c2943fab1d9954d/pkg/asset/machines/aws/machinesets.go#L47
[6]: https://github.com/openshift/installer/blob/b644753051f3d5d1ea9d52552c2943fab1d9954d/docs/user/aws/customization.md#cluster-scoped-properties
[7]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1544255021804163072
[8]: https://github.com/openshift/release/blob/a4464056a7928138f363c474e801abe5ca258b2c/ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-commands.sh#L130-L136
[9]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1544255021804163072/artifacts/e2e-aws-serial/ipi-conf-aws/build-log.txt
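
For reference, the same check should work against a live cluster's MachineSets (a rough sketch, assuming cluster-admin access with `oc`; the jq path mirrors the CI query above). Entries of the form `{"id": ...}` resolve directly, while entries with `filters` depend on DescribeImages results:

    # summarize the ami stanza from each AWS MachineSet providerSpec
    oc get machinesets -n openshift-machine-api -o json \
      | jq -c '.items[].spec.template.spec.providerSpec.value.ami' \
      | sort | uniq -c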

> or a filter set (which will stop working [3]) right off the provider config

i'm kinda wondering if we should add a note about the deprecation to our error logs there?

> So how are our production Machine(Set)s distributed?

i'm not sure i follow, do you mean how do we distribute the initial AMIs in the release payload, or did you have something else in mind? in general, we don't use names to do these lookups.

talking with the team, we don't think this will be a problem, but it is worth investigating a little further. we are not closing this just yet, but setting the priority low.

Trevor and i talked a little about creating a telemetry query that could help determine if we have a large number of users who are using the filter method on their AMIs. this is probably a good next step to understand how big a problem this could be for users running older releases on AWS.

as follow up actions to this bug, i have created some jira cards which describe actions we can take to improve the reporting around this condition.

https://issues.redhat.com/browse/OCPCLOUD-1659
https://issues.redhat.com/browse/OCPCLOUD-1660
https://issues.redhat.com/browse/OCPCLOUD-1661

There are JIRAs tracking potential visibility improvements, but no indication that out of the box we're prone to failure in this space. So I'm closing this NOTABUG.
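
For anyone who lands here while debugging a filter-based lookup, a rough illustration of the AWS-side behavior discussed above (the AMI ID and tag values are placeholders, and this assumes a recent AWS CLI that supports `--include-deprecated`):

    # lookup by explicit ID: per the AWS guidance, a deprecated AMI is still returned
    aws ec2 describe-images --image-ids ami-0123456789abcdef0 \
      --query 'Images[].{Id:ImageId,Deprecated:DeprecationTime}'

    # lookup by filters: deprecated AMIs are omitted from the results for non-owners
    # unless --include-deprecated is passed
    aws ec2 describe-images --owners self \
      --filters 'Name=tag:Name,Values=example-rhcos-*' \
      --include-deprecated \
      --query 'Images[].[ImageId,DeprecationTime]'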