Bug 2104511
| Summary: | Ensure mapi-aws compatibility with deprecated AMIs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | jaharrin, mimccune, wking |
| Version: | 4.1.z | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-24 14:53:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Scott Dodson
2022-07-06 13:58:34 UTC
i do see several references to `DescribeImages` in github.com/openshift/machine-api-provider-aws. it looks like we have 2 usages in instances.go [0] that will need to be reviewed. i'm not sure that we can mitigate the issue from within code, but we could print some log messages when that call fails. my concern here is that, given the guidance from AWS, we could try to launch an instance using a deprecated AMI id but would no longer be able to check that AMI using DescribeImages in the provider code, which seems like it will be difficult for users to debug.

@Scott, would adding more logging around this be sufficient to mitigate the risk for long lived clusters?

[0] https://github.com/openshift/machine-api-provider-aws/blob/main/pkg/actuators/machine/instances.go

Looks like the only getAMI caller is [1], where it's pulling the ID (which will still work [2]) or a filter set (which will stop working [3]) right off the provider config. So how are our production Machine(Set)s distributed? The installer currently switches in [4] based on whether it has an osimage, falling back to by-tag filters when it does not. And osimage seems to come from the install-config's amiID [5,6].

Checking 4.11 CI [7]:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1544255021804163072/artifacts/e2e-aws-serial/gather-extra/artifacts/machinesets.json | jq -c '.items[].spec.template.spec.providerSpec.value.ami' | uniq -c
          2 {"id":"ami-0373a8d3b2a246ec5"}

so that's promising. And while our CI harness sometimes sets amiID [8], there is no 'patching rhcos ami' from that step in this run [9]. But we'd need to look at Insights or something to see what folks were doing beyond the installer's default (and I haven't looked back before 4.11 to confirm the installer default hasn't evolved).

[1]: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/instances.go#L287
[2]: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/instances.go#L155-L158
[3]: https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/instances.go#L160-L165
[4]: https://github.com/openshift/installer/blob/b644753051f3d5d1ea9d52552c2943fab1d9954d/pkg/asset/machines/aws/machines.go#L125-L132
[5]: https://github.com/openshift/installer/blob/b644753051f3d5d1ea9d52552c2943fab1d9954d/pkg/asset/machines/aws/machinesets.go#L47
[6]: https://github.com/openshift/installer/blob/b644753051f3d5d1ea9d52552c2943fab1d9954d/docs/user/aws/customization.md#cluster-scoped-properties
[7]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1544255021804163072
[8]: https://github.com/openshift/release/blob/a4464056a7928138f363c474e801abe5ca258b2c/ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-commands.sh#L130-L136
[9]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-serial/1544255021804163072/artifacts/e2e-aws-serial/ipi-conf-aws/build-log.txt
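
For reference, the same check should work against a live cluster's MachineSets (a rough sketch, assuming cluster-admin access with `oc`; the jq path mirrors the CI query above). Entries of the form `{"id": ...}` resolve directly, while entries with `filters` depend on DescribeImages results:

    # summarize the ami stanza from each AWS MachineSet providerSpec
    oc get machinesets -n openshift-machine-api -o json \
      | jq -c '.items[].spec.template.spec.providerSpec.value.ami' \
      | sort | uniq -c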

> or a filter set (which will stop working [3]) right off the provider config

i'm kinda wondering if we should add a note about the deprecation to our error logs there?

> So how are our production Machine(Set)s distributed?

i'm not sure i follow, do you mean how do we distribute the initial AMIs in the release payload, or did you have something else in mind? in general, we don't use names to do these lookups.

talking with the team, we don't think this will be a problem, but it is worth investigating a little further. we are not closing this just yet, but setting the priority low.

Trevor and i talked a little about creating a telemetry query that could help determine if we have a large number of users who are using the filter method on their AMIs. this is probably a good next step to understand how big a problem this could be for users running older releases on AWS.

as follow up actions to this bug, i have created some jira cards which describe actions we can take to improve the reporting around this condition.

https://issues.redhat.com/browse/OCPCLOUD-1659
https://issues.redhat.com/browse/OCPCLOUD-1660
https://issues.redhat.com/browse/OCPCLOUD-1661

There are JIRAs tracking potential visibility improvements, but no indication that out of the box we're prone to failure in this space. So I'm closing this NOTABUG.
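
For anyone who lands here while debugging a filter-based lookup, a rough illustration of the AWS-side behavior discussed above (the AMI ID and tag values are placeholders, and this assumes a recent AWS CLI that supports `--include-deprecated`):

    # lookup by explicit ID: per the AWS guidance, a deprecated AMI is still returned
    aws ec2 describe-images --image-ids ami-0123456789abcdef0 \
      --query 'Images[].{Id:ImageId,Deprecated:DeprecationTime}'

    # lookup by filters: deprecated AMIs are omitted from the results for non-owners
    # unless --include-deprecated is passed
    aws ec2 describe-images --owners self \
      --filters 'Name=tag:Name,Values=example-rhcos-*' \
      --include-deprecated \
      --query 'Images[].[ImageId,DeprecationTime]'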