Bug 2260395 - Disk selector dropdown on anaconda webui welcome screen is greyed out (but actually still functional) with 40.18
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: rawhide
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: anaconda-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa
Depends On:
Blocks:
 
Reported: 2024-01-25 21:19 UTC by Adam Williamson
Modified: 2024-02-15 21:04 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-01-30 01:12:25 UTC
Type: Bug
Embargoed:


Attachments
screenshot of the issue (73.66 KB, image/png), 2024-01-25 21:20 UTC, Adam Williamson
UEFI failure screenshot (70.54 KB, image/png), 2024-01-25 22:00 UTC, Adam Williamson
storage.log from the UEFI failure (58.14 KB, text/plain), 2024-01-25 22:00 UTC, Adam Williamson
storage.log from the BIOS failure (57.79 KB, text/plain), 2024-01-25 22:01 UTC, Adam Williamson
The selector in its expected state (not grayed). (87.66 KB, image/png), 2024-01-26 12:34 UTC, Radek Vykydal

Description Adam Williamson 2024-01-25 21:19:21 UTC
The anaconda-40.18-1.fc40 update is failing openQA tests because it seems like almost every interactive element in the webUI welcome screen is greyed out. The only one that isn't is "Modify storage". See the attached screenshot. Specifically, openQA is expecting the "Select a disk" dropdown to appear interactive.

Proposing as a Beta blocker as this appears to prevent install from Workstation live (I say 'appears' because I need to play around with it a bit manually to see if there's a way to work around it).

Comment 1 Adam Williamson 2024-01-25 21:20:07 UTC
Created attachment 2010571 [details]
screenshot of the issue

Comment 2 Adam Williamson 2024-01-25 21:20:32 UTC
actually, not proposing as a blocker as the update is gated, but it *would* be a blocker if it got through.

Comment 3 Adam Williamson 2024-01-25 21:40:58 UTC
Hmm. I downloaded an affected image and booted it in a local VM and the problem doesn't happen there. It's definitely happening reproducibly in openQA, though. Not sure why the difference; maybe something to do with the attached disk.

Comment 4 Adam Williamson 2024-01-25 22:00:20 UTC
Created attachment 2010573 [details]
UEFI failure screenshot

The UEFI failure looks slightly different - it says "No usable disks detected".

Comment 5 Adam Williamson 2024-01-25 22:00:56 UTC
Created attachment 2010574 [details]
storage.log from the UEFI failure

Comment 6 Adam Williamson 2024-01-25 22:01:42 UTC
Created attachment 2010575 [details]
storage.log from the BIOS failure

Comment 7 Adam Williamson 2024-01-25 22:11:48 UTC
Huh. This seems mysterious from a few angles.

I wondered if you need an empty disk to reproduce it, but using a completely fresh disk image on my local test VM still doesn't reproduce it. Also, I cannot see where the "No usable disks detected" message comes from at all. It's not in anaconda or blivet AFAICS. There's a "No usable disks." error and a "No usable disks selected." error, but not "No usable disks detected". I looked through the changelog from 40.17 to 40.18 and didn't really see anything that looked very relevant. But the bug is definitely only happening on this update. I'm rather confused.

Comment 8 Adam Williamson 2024-01-25 22:36:45 UTC
Whew, okay, bit less mysterious, I can reproduce it now. You need to have two disks. (The openQA update tests always run with two disks, for...reasons). It's purely a visual issue: the "Select a disk" dropdown is grey, implying it's not usable, but actually it is usable. If you click on it it works as normal, and selecting a disk unblocks all the other things on the screen.

So the bug is really: when you have more than one disk, the disk selector is shown grey, but it shouldn't be.

Comment 9 Adam Williamson 2024-01-25 22:37:20 UTC
oh, and the UEFI "No usable disks detected" failure I saw seems to have been a blip, the previous run of the test failed the same way as the BIOS one. Still don't know where that message came from, though.

Comment 10 Radek Vykydal 2024-01-26 07:59:51 UTC
(In reply to Adam Williamson from comment #9)
> oh, and the UEFI "No usable disks detected" failure I saw seems to have been
> a blip, the previous run of the test failed the same way as the BIOS one.
> Still don't know where that message came from, though.

anaconda-webui was split out into a separate repository; the message comes from here: https://github.com/rhinstaller/anaconda-webui/blob/d89f70186fd087d7f6649689429f2bacf4f614fe/src/components/storage/InstallationDestination.jsx#L433

Comment 11 Radek Vykydal 2024-01-26 08:23:35 UTC
(In reply to Adam Williamson from comment #8)
 
> So the bug is really: when you have more than one disk, the disk selector is
> shown grey, but it shouldn't be.

Maybe we should set focus on the selector in case no disk is (pre)selected?

Comment 12 Radek Vykydal 2024-01-26 12:31:31 UTC
(In reply to Radek Vykydal from comment #11)
> (In reply to Adam Williamson from comment #8)
>  
> > So the bug is really: when you have more than one disk, the disk selector is
> > shown grey, but it shouldn't be.
> 
> Maybe we should set focus on the selector in case no disk is (pre)selected?

Ah, I see now the issue with the selector being gray. I'll attach a screenshot without the issue for comparison.
Seems like the issue is a flake?

Comment 13 Radek Vykydal 2024-01-26 12:34:34 UTC
Created attachment 2010712 [details]
The selector in its expected state (not grayed).

Comment 14 Radek Vykydal 2024-01-26 12:44:25 UTC
(In reply to Radek Vykydal from comment #12)

> Ah, I see now the issue with the selector being gray. I'll attach a screenshot
> without the issue for comparison.
> Seems like the issue is a flake?

Or maybe not; unfortunately I deleted the reproducer ISO :(.

Comment 15 Katerina Koukiou 2024-01-26 16:26:11 UTC
Hi Adam, is the disabled Selector persisting, or is it only like that for a few seconds?

There was a deliberate change a while back with:

commit d4150448c7d03029ca4ee3f0c6846da44b4aa8c7
Author: Katerina Koukiou <kkoukiou>
Date:   Mon Oct 9 13:49:21 2023 +0200

    webui: disable the whole form when disk re-scan is taking place

It disables the Disk Selector component and also the Re-scan button while a re-scan is in progress.

Comment 16 Katerina Koukiou 2024-01-26 16:36:58 UTC
Alright, I see from the openQA video that it's persistently disabled.
I see from the storage logs that these finish with the Re-scan tasks still running; see:
`INFO:anaconda.core.threads:Thread Done: AnaTaskThread-ScanDevicesTask-1 (140613865440960)`

This would explain why the buttons are disabled. But it looks like if the timeout were slightly longer, it would pass.

Where can I find the ISO for reproducing this?

Comment 18 Adam Williamson 2024-01-26 18:29:49 UTC
So it looks like this was caused by a mismatch between anaconda and anaconda-webui versions. I was not aware that anaconda-webui had been split out of anaconda. That caused a problem, because it seems like interdependent updates to anaconda and anaconda-webui were submitted separately - there was an update for anaconda-40.18-1.fc40 and a separate update for anaconda-webui-3-1.fc40, but the changes to these packages are interdependent, i.e. neither works correctly without the other.

I've been talking to kkoukiou about this, but to reiterate for the record: this is not allowed by policy. Inter-dependent changes *must* be submitted as a single update: https://docs.fedoraproject.org/en-US/fesco/Updates_Policy/#updating-inter-dependent-packages

This is because it is the only safe way to ensure the changes actually land together. If there are separate updates for each package, it is possible for one to land without the other, and then the distribution is broken. Correct dependencies between the packages don't solve the problem; they only make it somewhat more obvious what the problem is (instead of a weird test failure we'd get a dependency error).

I realize you want to use packit to handle updates, but packit does not support this yet: https://github.com/packit/packit/issues/1870

unfortunately there really isn't a way around this. As long as packit does not support multi-package updates, the only viable choices I can see are:

1. Handle the update creation part of the process manually until it does
2. Re-combine anaconda and anaconda-webui - at least in a single source package - for now

Unless you do that, we'll keep running into problems like this. This time, the anaconda update broke at first (this bug) because it was tested without the anaconda-webui update (openQA had no way to know they should be tested together; grouped updates are how openQA knows such things). The anaconda-webui update got through - because anaconda-webui is not currently in critical path, so it was not tested or gated - and then the Workstation live install test started failing for all *other* updates, because now anaconda-webui was updated but anaconda was not. To fix this, I had to manually re-run the tests on the anaconda update so they ran with the new anaconda-webui which had gone stable, and now that those tests have finished and anaconda has been submitted for stable, I will have to manually re-run all the other failed tests.

I have proposed adding anaconda-webui to critical path now: https://pagure.io/fedora-comps/pull-request/932

with that change, what would happen in future is that neither the anaconda nor anaconda-webui updates would pass gating, if each required the other and they were submitted separately. But at least they wouldn't break anything else.

Comment 19 Adam Williamson 2024-01-30 01:12:25 UTC
This is resolved now that both updates have gone stable.

Comment 20 Martin Pitt 2024-02-14 09:37:48 UTC
Adam: A side tag is a rather big hammer, and neither the Anaconda nor the Cockpit team has experience with it. What's wrong with setting proper versioned Requires:/Conflicts: instead, and uploading new releases at the same time?

Comment 21 Adam Williamson 2024-02-14 20:42:38 UTC
Well, basically, nothing in the workflow is built to handle that.

The number one thing we need to ensure is that, if package A requires package B and both are submitted to -testing at once, A is not pushed stable until B is pushed stable.

Bodhi has no mechanism for checking RPM dependencies. There is no check in Bodhi which says "before doing this update push, resolve all the dependencies of everything I'm about to push, and make sure they're all present in the same push". That would be a huge thing to engineer into Bodhi (and it would probably come with all sorts of weird corner cases). So we have no actual protection at that level.

Similarly, openQA does not have such a mechanism. It would be equally complex to code at the openQA level. If the current update policies are followed, all openQA has to do is say "OK, I'm testing update FEDORA-2024-AAAAAA, so I will download all the packages in that update and expect dependencies to be consistent". With your design, openQA would have to say "OK, I'm testing update FEDORA-2024-AAAAAA. I will download all the packages in that update, check all of their dependencies, see if anything is missing, if so, somehow go out and query every other package currently in the update system and see if any of those provides the required dependency, and if so, pull that update into the test too."

FCOS CI is also now doing automated testing at the update level, and it similarly does not do any such complex dependency resolution stuff.

Basically, by policy and also by implementation, the update is the intended level of granularity at which changes to Fedora must be dependency-consistent.
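
Purely as an illustration of what "dependency-consistent at the update level" means (this is not Bodhi or openQA code, and the objects here are made-up stand-ins; real RPM dependency resolution with versioned, rich and file deps is much hairier), the check the current policy makes trivial is roughly:

    # Hypothetical sketch only: "update" and "stable_provides" are invented
    # stand-ins, not Bodhi/openQA data structures.
    def update_is_self_consistent(update, stable_provides):
        """True if everything the update's packages require is provided
        either by the stable package set or by the update itself."""
        push_provides = set()
        for pkg in update.packages:
            push_provides.update(pkg.provides)
        available = stable_provides | push_provides
        for pkg in update.packages:
            missing = [req for req in pkg.requires if req not in available]
            if missing:
                print(f"{pkg.name} is missing: {', '.join(missing)}")
                return False
        return True

If interdependent changes are split across separate updates, a check like that would have to reach out across the whole pending-update pool instead, which is exactly the complexity described above.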

Side tags really aren't a "big hammer". They are self-service with a single command:

fedpkg request-side-tag

run that from a package repo checkout on a release branch and it will request a side tag for that release branch, and print the command you can use to wait for the side tag to be generated. Then you just modify your build command slightly to build on the side tag, and once all builds on the side tag are done, create an update from it. All of this could be automated in packit, there is no technical reason why not AFAIK.

Comment 22 Martin Pitt 2024-02-15 10:47:18 UTC
They are a *very* big hammer compared to what we do now: Push an upstream release tag and click "merge" on the Fedora PRs. That's it.

I know that Fedora isn't great with dependency handling, but I'm sorry -- I'm not going to do everything manually because bodhi doesn't know how to check satisfiability of dependencies (coming from Debian or Ubuntu, which would never allow a thing like bug 2262526). Rawhide breaks very often due to that, and usually we (the cockpit team) are on the "receiving the pain" side of it, as nobody else seems to care. We waste enough time cleaning up after that, and we already have too small a team. Katerina and I will just release in sync, and let composes sort it out eventually. That's what everybody else apparently does, and it's apparently the acceptable level of quality.

I'm sorry if that sounds angry -- I am, but not at you (you are just the messenger).

Comment 23 Adam Williamson 2024-02-15 16:45:54 UTC
That will not work. The updates will fail gating and not be pushed stable. It will create additional work for me, and additional work and delay for you.

It's not acceptable to just unilaterally declare that the update policies adopted by the distribution you are working on do not apply to you. They do apply to you. If you don't like them, you can lobby for them to be changed.

I have no idea how Debian or Ubuntu does reverse dependency checking (or for that matter, exactly how they approach the problem of grouping changes and landing them in the distribution), but it is a very difficult operation that multiple people have tried to do in Fedora over the years and not been able to get to a sufficient standard of accuracy to be used for gating purposes. We have to work with the tools we have, not the tools we wish we had.

The reason for this policy and for openQA's enforcement of it is precisely to prevent Rawhide from "breaking very often" due to this kind of issue (dependent updates not being pushed together). I disagree that this happens "very often", at least in the critical path package set. If you have examples of cases where it has happened recently I will look at them.

Comment 24 Martin Pitt 2024-02-15 17:29:20 UTC
Two very recent examples are bug 2262526 and https://download.copr.fedorainfracloud.org/results/packit/cockpit-project-cockpit-navigator-274/fedora-rawhide-x86_64/07020312-cockpit-navigator/builder-live.log.gz . I suppose neither tuned nor nodejs is in the critical path, but both are things that packages in the critical path (like cockpit) depend on. A few weeks before that, we had lots of uninstallability due to a new Python.

Just in case anyone cares, Debian/Ubuntu use https://release.debian.org/doc/britney/ . I do agree it is a difficult problem; I was just surprised that there's nothing equivalent in Fedora.

Anyway, I apologize for my grumpiness. Just about every piece of Fedora and CentOS infra failed today, which left me in a bad mood.

Comment 25 Adam Williamson 2024-02-15 18:01:20 UTC
Yeah, sorry, I apologize for being a bit grumpy too, and I realized my comment depends on a lot of assumed knowledge and history, so let me unpick it a bit.

The biggest problem isn't "are the dependencies of proposed update X satisfiable?", really. That one we can do. The hardest problem is "does proposed update X break anyone else's dependencies?"

So, historically we did more or less what would be most convenient for you here: we just trusted packagers to update everything together correctly. We expected that if your cockpit and anaconda-webui builds went together, you'd build them immediately in sequence and make sure they worked, so at any given time we could just run a compose and everything would be fine.

A decade or so of history indicates this is very definitely not what happens. :D For a start there's just an unavoidable timing issue - if you're doing something like an soname bump you *have* to do the new soname build first *then* rebuild everything else, and if a compose happens to fire in the middle of that process, there are going to be broken deps.

But beyond that, it turns out we just...can't rely on maintainers to do it. People would do soname bumps of libraries that large chunks of the distro depended on and just leave them there. If you were lucky they'd at least notify people about it, often even that didn't happen. It was not at all uncommon to have days or weeks at a time when Rawhide wouldn't compose due to multiple *different* chains like this.

So it's pretty clear we can't just trust maintainers, which is (partly, there were other reasons too) why we wound up sticking Bodhi into the Rawhide flow to make it *possible* to bundle interdependent updates, wrote a policy telling people they should do so (so we have something to point to in rejecting updates that aren't properly bundled), and why I subsequently made openQA do some degree of enforcing this for critpath packages (it is not comprehensive, but it *is* fairly good at not producing false failures, which is very important). 

To "solve" this problem entirely such that as a maintainer you could just go ahead and fire builds off and trust that The Tools Will Take Care Of The Rest, we'd need tools to handle some fairly complex scenarios. The classic 'big soname bump' is one. So you bump the soname on something that fifty other packages build against. How to handle that? We need to detect that "this new package breaks the dependencies of these 50 other packages". We then need to "keep an eye on things" such that once those other 50 packages are rebuilt, we push the soname bump and all those other packages together. But there probably needs to be some kind of configurability to this, because *sometimes* we do need to just go ahead and push the bump with only 48 packages rebuilt and two comparatively unimportant ones left broken. (Currently, you can usually do this because the only real 'enforcement' of the rules is openQA's incomplete check for critical path packages; basically openQA tests enough operations that it's *usually* the case that if openQA passes an update, it *probably* won't break the compose).

But it's not even "just" that simple, because there are lots of awkward complicating factors, like when multiple packages provide the same capability. Or say you want to do an soname bump *but also introduce a new compat library package*, which would provide the dependency that's otherwise "missing" - the system would need to be smart enough to recognize and smoothly handle that situation, pushing the soname bump and the new compat library out together.

There is actually a CI check that broadly aims at this area - rpmdeplint - but it's never been reliable or detailed enough to use as a gating check. And even if we have a comprehensive gating check, that's just a *check*, it doesn't solve the problem of "have the update tooling actually do the work of figuring out the appropriate groupings of builds so maintainers don't have to".

Thanks for the link to the Debian thing, I did google around for a bit to try and find out how this works there but did not manage to find that.

Comment 26 Adam Williamson 2024-02-15 18:13:19 UTC
Looking briefly at the britney doc, it seems interesting, but it also kinda reads to me like it's set up to work on 'batches'. Like, you need a workflow where you're semi-regularly pushing a fairly large group of changes from one place to another (it gives the example of 'unstable' to 'testing').

So, yeah, I can see how it works with that workflow, kinda. Say you're doing your big soname bump. You can just do the soname bump and then keep dropping rebuilds into the "pool". Each time britney 'tests the migration' while that's going on, it will fail, so it won't run - until the last dependent rebuild is done and shows up in the "pool", then the britney check will pass, and the "migration" will happen. (You still have to handle the buildroot problem, in this case - presumably new builds are part of "the buildroot" immediately, and if the migration checks fail for too long, you're going to wind up with an awkwardly large delta between your buildroot and your 'stable' package set, but that's not my problem. :>)

This is, AIUI, kinda how openSUSE Tumbleweed works too, with openQA testing candidate 'groups' of updates (I don't recall the details exactly of how the groups are defined). At least, how it was explained to me a few years ago.

That's not how Rawhide, especially, works, though. If we only pushed things into Rawhide periodically, in big batches, there'd probably be a packager revolt. The fundamental flow for Rawhide still kinda needs to be "your change goes in as soon as it passes any automated checks that apply to it" (and people get antsy if the automated checks take more than a couple of hours). That makes an approach that relies on ad-hoc "batches" of updates harder, I think.

We did actually try using batched updates for stable Fedora releases for a while (never Rawhide), not for this specific reason, but it didn't really work out very well. That was before openQA had gating update checks, too, IIRC. I suppose it might be kinda interesting to try it again and apply this kind of installability check to the batches, but that would be for stable, not Rawhide...

Is britney used for landing things *in unstable*? Or experimental? Does it work in that context?

Comment 27 Martin Pitt 2024-02-15 20:31:18 UTC
Thanks for the background!

> People would do soname bumps of libraries that large chunks of the distro depended on and just leave them there.

Heh, yes, been there, done that (I had been Ubuntu release manager for a few years, and it had been an utter PITA), until we put a stop to this and said "library transitions are going to land completely or not at all" (via britney). If we want to exchange field stories and ideas, I propose to do that over a beer in Brno (hello devconf) ;-)

> Is britney used for landing things *in unstable*?

in Debian, britney controls the migration from unstable to testing. I.e. it computes which sets of packages can move as a group such that they have satisfiable dependencies, don't break any other dependencies, and don't regress any tests. This is not ideal for users of unstable (which is an actual thing). It's better in Ubuntu where uploads land in a kind of "quarantine" (called $release-proposed), an archive which is by definition "broken" and only used by CI. It only propagates to $release via britney, i.e. all releases (including development) are always installable and have no unknown regressions. It of course means that in some cases, broken/incomplete uploads are stuck there for a long time, but that's fine.

Wrt. the concrete anaconda/cockpit issue at hand: Katerina made anaconda backwards compatible, so we can land them independently (i.e. cockpit first); we are doing that now. Easier all around than learning an entirely new process, and probably also easier for the tooling.

Thanks!

Comment 28 Martin Pitt 2024-02-15 20:37:50 UTC
For the record, I filed bug 2264473 about nodejs disappearing, I couldn't find an existing one.

Comment 29 Adam Williamson 2024-02-15 20:39:28 UTC
The critical path includes dependencies. That is, if package A is declared to be in the critical path, everything that package A requires is also in the critical path. We *implement* this through dnf's depsolving mechanism - we literally have dnf build a transaction to 'install' every package listed in the critpath definition, on each arch, and then say 'everything that's depsolved into that transaction is the critical path'. https://pagure.io/releng/blob/main/f/scripts/critpath.py is the script.
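
As a minimal sketch of that approach using dnf's Python API (this is not the actual critpath.py script; repo configuration and per-arch handling are omitted):

    import dnf

    # Sketch of "depsolve an install of the critpath definition and call
    # everything in the resulting transaction critical path". Assumes the
    # relevant repos are already configured for dnf.
    def expand_critpath(critpath_names):
        base = dnf.Base()
        base.read_all_repos()
        base.fill_sack(load_system_repo=False)
        for name in critpath_names:      # e.g. "cockpit"
            base.install(name)
        base.resolve()
        return sorted({pkg.name for pkg in base.transaction.install_set})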

The 'cockpit' package is declared to be in the critical path (critical-path-server). cockpit requires cockpit-bridge, cockpit-system, and cockpit-ws; anything required by any of those packages is therefore also in critical-path-server. Apparently tuned is not in that set, because the most recent tuned update I can find - https://bodhi.fedoraproject.org/updates/FEDORA-2024-eca7d3eec9 - is not marked as being in critical-path-server. Neither, apparently, is nodejs, because https://bodhi.fedoraproject.org/updates/FEDORA-2024-36311c673e (nodejs binary package is provided by nodejs20 source package) is not in critical-path-server either. From https://download.copr.fedorainfracloud.org/results/packit/cockpit-project-cockpit-navigator-274/fedora-rawhide-x86_64/07020312-cockpit-navigator/builder-live.log.gz it seems like nodejs is a *build time* dependency of cockpit-navigator? So, yeah, since the critical path definition considers run-time, not build-time, deps, and also cockpit-navigator is not a dependency of cockpit, it would be left out two ways.

If any of the three critpath cockpit packages had a dependency on tuned, we would have spotted 2262526 , but not immediately, because it wasn't caused by a tuned update, it was caused by the retirement of kernel-tools. What would have happened is that we'd have seen a failure of the cockpit tests for the first update tested after kernel-tools got retired, and I would've had to investigate that and work out what had happened. I've filed https://pagure.io/releng/issue/11957 for the issue that we have no checks at all around package retirements, which has been bugging me for a while but which definitely isn't easy to fix.

Comment 30 Adam Williamson 2024-02-15 20:40:45 UTC
"It of course means that in some cases, broken/incomplete uploads are stuck there for a long time, but that's fine."

Yeah, that's the part I think might not be "fine" for Rawhide. I'm not sure Fedora packagers would be happy if, for instance, their totally-internally-coherent update was blocked from landing for two days because someone was struggling to get a big soname rebuild or GNOME megaupdate lined up...

Comment 31 Adam Williamson 2024-02-15 20:42:28 UTC
oh, or is britney 'smart enough' to kinda drill down the 'proposed' pool and find the largest-possible set of things in it which are "fine", push *those*, and only hold up the broken bits?

Comment 32 Martin Pitt 2024-02-15 20:53:37 UTC
(In reply to Adam Williamson from comment #30)

> Yeah, that's the part I think might not be "fine" for Rawhide. I'm not sure
> Fedora packagers would be happy if, for instance, their
> totally-internally-coherent update was blocked from landing for two days
> because someone was struggling to get a big soname rebuild or GNOME
> megaupdate lined up...

If the soname rebuild isn't complete, then that is not a "totally coherent update".

But I suppose you meant something different: if maintainer A uploads app1 and lib1, and they as a group don't break anything, they land. Maintainer B can upload some app2 at any time before, in between or afterwards, and if that breaks anything, it'll be kept back. I.e. what you asked here:

> is britney 'smart enough' to kinda drill down the 'proposed' pool and find the largest-possible set of things in it which are "fine", push *those*, and only hold up the broken bits?

Yes, exactly, that's the whole point of it. It is computationally rather expensive (a run takes several minutes, as that is conceptually a very much superlinear problem; possibly even exponential), but it's worth the trouble. See https://ubuntu-archive-team.ubuntu.com/proposed-migration/update_excuses.html for what the output looks like. At pretty much any time, the "proposed" pool is rather full, with most of it just being there for a few hours and then landing, then some complicated bits like Python or big library transitions which may take days or weeks, and then some broken stuff that has been there for a long time which nobody is interested in (and it really wouldn't hurt to clean these up from time to time, but that's not my business any more now :-) )
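
Very roughly - and this is not britney's actual algorithm, just the shape of the idea, with is_consistent standing in for the expensive installability/regression checks - the selection loop looks like:

    # Rough sketch of the idea only, NOT britney's real algorithm.
    # "candidates" are proposed items (single or grouped builds) and
    # "is_consistent" is a hypothetical stand-in for the installability
    # and regression checks.
    def migrate(target, candidates, is_consistent):
        held = list(candidates)
        changed = True
        while changed:                   # loop until nothing new can land
            changed = False
            for item in list(held):
                if is_consistent(target | {item}):
                    target = target | {item}
                    held.remove(item)
                    changed = True
        return target, held              # held items stay in "proposed"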

Comment 33 Adam Williamson 2024-02-15 21:04:35 UTC
ah, okay. then yeah, at least theoretically we could apply something like that to Rawhide, if it could run fast enough that nobody got mad. It would, I guess, have to be an additional step that Bodhi would do when actually landing things in Rawhide, and we'd still have to figure out the timing of how often to run it...whenever a new update landed in the 'proposed' pool? And integrate it with the whole Koji signing dance, of course...now, who volunteers to write it? :D

