openQA has been seeing this for a while, but today kparal confirmed he's seen it in his own tests, so apparently it's not just a blip as I was hoping. Sometimes during live installs, while the rsync operation is running, it just...gets stuck. No progress is indicated in the UI or the system journal. If you leave it this way it will stay stuck indefinitely, and the install will never complete. However, Kamil says that if you do something like launching a gnome-terminal, it somehow 'unsticks' the situation and the install completes. In openQA, this seems to be happening to something like 2% of installs. I'll attach the logs we have (they're not obviously illuminating).
Created attachment 2084325: anaconda.log
Created attachment 2084326: dbus.log
Created attachment 2084327: df.log
Created attachment 2084328: free.log
Created attachment 2084329: packaging.log
Created attachment 2084330: program.log
Created attachment 2084331: storage.log
Created attachment 2084332: var/log tarball
Created attachment 2084334: var/tmp tarball
Going from my vague memory, just triggering the gnome overview (using the win key) and returning back to anaconda (another win key press) didn't unfreeze the installation. However, launching a gnome-terminal (I wanted to see cpu utilization, etc) immediately unfroze it. Also, in my virt-manager stats (it was running in a VM), I saw constant cpu utilization (around 50%, with 3 cpus) during the frozen period. The qemu process on my host system had high cpu usage. So it was clearly doing something (otoh, the spinner was spinning the whole time, so perhaps the cpu usage was just related to animating the spinner). After the installation unfroze, the cpu utilization went up a bit, I believe.
So far, I haven't seen this on bare metal (but since this is a race condition, and the likelihood is not high, it might not mean much).
The earliest occurrences of this I can see in openQA were on Feb 19 at 22:19 UTC and Feb 20 at 14:39 UTC, in F42.
rsync was last changed on Jan 30, so that doesn't fit. anaconda changed on Jan 28 and Mar 10, so that doesn't fit either. kernel went from kernel-6.14.0-0.rc1.15.fc42 to kernel-6.14.0-0.rc3.29.fc42 on Feb 17, so that's a possible suspect.
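For what it's worth, the date reasoning above can be written down as a quick filter. This is only a rough sketch: the year 2025 is inferred from the F42 cycle, and the 7-day window before the first occurrence is an arbitrary assumption of mine, not anything measured.

```python
from datetime import date, timedelta

# First occurrence seen in openQA (year 2025 is inferred from the F42 cycle).
FIRST_SEEN = date(2025, 2, 19)

# Hypothetical look-back window: how long before the first occurrence a
# change could have landed and still plausibly be the trigger.
WINDOW = timedelta(days=7)

# Change dates as listed in the comment above.
CHANGES = {
    "rsync": [date(2025, 1, 30)],
    "anaconda": [date(2025, 1, 28), date(2025, 3, 10)],
    "kernel": [date(2025, 2, 17)],
}


def suspects(changes, first_seen, window):
    """Return packages that changed within `window` before `first_seen`."""
    return sorted(
        name
        for name, dates in changes.items()
        if any(first_seen - window <= d <= first_seen for d in dates)
    )


print(suspects(CHANGES, FIRST_SEEN, WINDOW))  # ['kernel']
```

Widening the window far enough would pull rsync and the earlier anaconda change back in, so this only narrows the suspects, it doesn't prove anything.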
Proposing as a Final blocker as a conditional violation of "The installer must be able to complete an installation using any supported locally connected storage interface" (and any other 'install must complete' criterion), on live installs, some relatively small percentage of the time, possibly only on VM (we're looking into this).
I've managed to reproduce it again, in a VM. I made a snapshot of the broken state. Unfortunately, restoring it is racy as well: it stays in the broken (installation frozen) state only rarely, and mostly the installation continues as expected. In the few occurrences where I could explore the broken state, anything I did unfroze the installation. That included:
* running a different app (gnome-terminal)
* switching to a VT and back
* logging in over ssh

I suspect that anything that causes a disk read (or any I/O) unfreezes the installation. So my current suspects are the kernel in the VM guest, or the virtio/libvirt/qemu libraries (+ possibly the kernel) on the VM host.
Note, in the logs, you can see it gets stuck at 60%:

Apr 05 22:00:51 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 58%
Apr 05 22:00:57 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 59%
Apr 05 22:01:01 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 60%
Apr 05 22:08:55 localhost-live systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...

then about 40 minutes later, openQA gives up and switches to a VT to upload logs, and rsync picks up again:

Apr 05 22:49:23 localhost-live systemd[1]: Started getty - Getty on tty3.
...
Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 62%
Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 63%

so that matches with kparal's "doing stuff makes it unstick" experience.
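In case it helps anyone checking other journals for the same stall, here's a minimal sketch that scans for long gaps between consecutive "rsync progress" lines. The function name, the 5-minute threshold and the year are my own assumptions (the short journal timestamp format carries no year), and the sample lines are just the rsync lines from the excerpt above. Month-name parsing assumes an English/C locale.

```python
import re
from datetime import datetime, timedelta

# Matches short-format journal lines that carry an rsync progress percentage.
PROGRESS_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) .*rsync progress: (\d+)%"
)


def find_stalls(lines, threshold=timedelta(minutes=5), year=2025):
    """Return (from_pct, to_pct, gap) for each gap of at least `threshold`.

    `year` is an assumption, needed only because the timestamps omit it.
    """
    stalls = []
    prev = None  # (timestamp, percentage) of the previous progress line
    for line in lines:
        m = PROGRESS_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        pct = int(m.group(2))
        if prev is not None and ts - prev[0] >= threshold:
            stalls.append((prev[1], pct, ts - prev[0]))
        prev = (ts, pct)
    return stalls


PREFIX = (
    "localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: "
    "DEBUG:anaconda.modules.payloads.payload.live_image.installation:"
)
SAMPLE = [
    f"Apr 05 22:00:51 {PREFIX}rsync progress: 58%",
    f"Apr 05 22:00:57 {PREFIX}rsync progress: 59%",
    f"Apr 05 22:01:01 {PREFIX}rsync progress: 60%",
    "Apr 05 22:49:23 localhost-live systemd[1]: Started getty - Getty on tty3.",
    f"Apr 05 22:49:28 {PREFIX}rsync progress: 62%",
    f"Apr 05 22:49:28 {PREFIX}rsync progress: 63%",
]

for start, end, gap in find_stalls(SAMPLE):
    print(f"stalled between {start}% and {end}% for {gap}")
# stalled between 60% and 62% for 0:48:27
```

To run it against a real journal, something like `find_stalls(open("journal.txt"))` should work, where journal.txt is a hypothetical dump in the same short timestamp format.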
Maybe trying to upgrade to kernel 6.14.1 before running anaconda can fix this?
Rawhide has 6.15 kernels and is still affected by this, so I doubt it.
Discussed at 2025-04-10 F42 go/no-go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/meeting_matrix_fedoraproject-org/2025-04-10/fedora-linux-final-go-no-go-meeting.2025-04-10-17.01.html . This was rejected as a blocker on the basis that the prevalence is a bit too low to block on, especially since so far it seems to be VM-only, and there are obvious workarounds (fiddle around and it starts working again, or just reboot and try again).
This is still happening - https://openqa.fedoraproject.org/tests/3459809#step/_do_install_and_reboot/114 is a recent affected test. That test was just stuck at 63% till it failed.
This message is a reminder that Fedora Linux 42 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 42 on 2026-05-13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '42'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 42 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
My last comment links to an F43 test. However, looking at that test's history, it looks like it never happened again after https://openqa.fedoraproject.org/tests/3494684 . And I don't see any occurrences of it in F44 test history either. So...I think we can maybe call this closed? It's possible it just got hidden when we moved to more powerful test runners with the data centre move, but...