Bug 2320081

Summary: osbuild tasks are consistently failing and breaking composes, but building ARM minimal with ImageFactory instead would compromise device support
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: distribution
Assignee: Aoife Moloney <amoloney>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 41
CC: cmdr, kevin, ngompa13, obudai, pbrobinson, robatino, samjain, sraymaek
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: AcceptedBlocker
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-10-22 18:31:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2247867

Description Adam Williamson 2024-10-20 20:34:29 UTC
I'm filing this because AFAIK we don't have any other ticket for it yet, and we kinda need to track it as a release blocker.

We're more or less lined up for a Fedora 41 candidate compose, but we have not been able to build one, because the ARM minimal image build - which is the only one we build with osbuild ATM - keeps failing.

One recent attempt failed with:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='sso.redhat.com', port=443): Max retries exceeded with url: /auth/realms/redhat-external/protocol/openid-connect/token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe19b582c60>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

which seems like an intermittent connectivity problem we've been having ever since we adopted osbuild (I noticed an instance of it 8 months back, when we were initially reviewing the Change that moved ARM minimal to osbuild). But when we don't hit that one, we keep hitting errors like this instead:

"Unable to start secure instance: Unable to create fleet: InvalidAMIID.NotFound: The image id '[ami-01160ab65b10a3873]' does not exist"

This seems to relate to a new feature in the osbuild service, "secure instances", introduced in the last month or two, which has had known issues since deployment. We've been talking to Simon about it on Matrix since at least October 11 and kind of expecting it to be resolved or ameliorated since then, but...it's just kept on being a problem.
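
Of the two failures above, only the first (the ConnectionError while fetching the SSO token) looks transient, which is the kind of error a bounded retry with backoff can sometimes ride out; the InvalidAMIID.NotFound error is not, since the AMI is simply gone. A minimal sketch of such a retry, assuming nothing about where it would live in the real tooling: `call_with_retries` and all of its parameters are hypothetical names, not part of koji, pungi or osbuild.

```python
import time


def call_with_retries(fn, attempts=5, base_delay=2.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper for illustration only. Note that
    requests.exceptions.ConnectionError is not a subclass of the builtin
    ConnectionError, so a requests-based caller would pass
    retry_on=(requests.exceptions.ConnectionError,) explicitly.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the backoff schedule can be checked without actually waiting.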

This is a risk to the F41 release because the first go/no-go date is Thursday. I really wanted to have a compose done over the weekend so folks could get initial testing done. But with this issue we can't really do one. We considered reverting this image build to using ImageFactory - see https://pagure.io/pungi-fedora/pull-request/1398 - but Peter says that would unacceptably compromise its device support (apparently osbuild-built images support several devices which IF-built images do not support).

So...at this point our options are:

1) keep firing composes with osbuild and hope we get lucky, though Simon says "From my understanding it means something went wrong in the deployment of the service in the Fedora tenant and retrying will just keep failing"
2) wait for the osbuild folks to address the issue more comprehensively
3) build the ARM minimal image with something else and eat the functionality loss
4) drop the ARM minimal image

4 seems like an obvious non-starter. 3 seems bad. 1 sounds futile. So I don't see much besides 2, unfortunately :(

Proposing as a Final blocker for obvious reasons: we can't release if we can't compose.

Comment 1 Simon de Vlieger 2024-10-20 20:42:38 UTC
I've filed the following upstream bug: https://issues.redhat.com/browse/COMPOSER-2376 for the missing AMI.

Comment 2 Simon de Vlieger 2024-10-20 20:45:20 UTC
As for a resolution timeline to at least let the composes progress, I can say tentatively tomorrow morning Europe TZ.

Comment 3 Adam Williamson 2024-10-20 20:50:49 UTC
That would be great, thanks. I can plan to line up the candidate request this evening so whenever you give us the word, we can fire another attempt. Adding Samyak, sorry, missed you in the initial CC.

Comment 4 Adam Williamson 2024-10-20 23:29:53 UTC
I think we can count this as an automatic blocker per "Complete failure of any release-blocking image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)". It's even worse than DOA: the image doesn't even A. :D

Comment 5 Simon de Vlieger 2024-10-21 08:03:49 UTC
@obudai and @sraymaek looked into this this morning and discovered that we had a nightly pipeline to rebuild our AMIs. This pipeline removed the previous AMIs, which led to the above problem. We have disabled the pipeline and redeployed the Fedora tenant. I've done a test compose, which passed: https://koji.fedoraproject.org/koji/taskinfo?taskID=125049061 We'll be changing the pipeline to not remove previous AMIs and fail instead (which would break *our* CI, but *not* Fedora).
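
One way to read "not remove previous AMIs and fail instead" is a guard in the cleanup step that refuses to delete any AMI a deployment still points at. The sketch below is only an illustration of that idea under that assumption, not the actual pipeline change: `plan_ami_cleanup`, `AmiStillReferenced` and the plain sets of AMI ids are all hypothetical, and no cloud API is called.

```python
from __future__ import annotations


class AmiStillReferenced(Exception):
    """Raised when cleanup would delete an AMI that is still in use."""


def plan_ami_cleanup(existing: set[str], newly_built: str,
                     referenced: set[str]) -> set[str]:
    """Return the AMI ids that are safe to delete after a rebuild.

    existing:    AMI ids currently registered (hypothetical input)
    newly_built: id of the AMI the nightly rebuild just produced
    referenced:  AMI ids that deployed tenants are still configured to use
    """
    stale = existing - {newly_built}
    blocked = stale & referenced
    if blocked:
        # Fail the pipeline run rather than pull an AMI out from under
        # a live deployment.
        raise AmiStillReferenced(
            f"refusing to remove AMIs still in use: {sorted(blocked)}"
        )
    return stale
```

With a guard like this, a rebuild that would orphan a deployed tenant stops the rebuild pipeline, and the old AMI stays available to composes.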

Please let me know if you hit any other issues.

Comment 6 Adam Williamson 2024-10-21 08:28:23 UTC
Thanks a lot! Let's mark it ON_QA for now till we run a compose and confirm it's good now.

Comment 7 Adam Williamson 2024-10-22 18:31:41 UTC
The candidate compose worked and no other recent composes seem to have hit this, so let's call it fixed. Thanks!

Comment 8 Red Hat Bugzilla 2025-02-20 04:25:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.