openQA has been seeing this for a while, but today kparal confirmed he's seen it in his own tests, so apparently it's not just a blip as I was hoping. Sometimes during live installs, while the rsync operation is running, it just...gets stuck. No progress is indicated in the UI or the system journal. If you leave it this way it stays stuck indefinitely and the install never completes. However, Kamil says that if you do something like launching a gnome-terminal, it somehow 'unsticks' the situation and the install completes. In openQA, this seems to be happening in something like 2% of installs. I'll attach the logs we have (they're not obviously illuminating).
Created attachment 2084325: anaconda.log
Created attachment 2084326: dbus.log
Created attachment 2084327: df.log
Created attachment 2084328: free.log
Created attachment 2084329: packaging.log
Created attachment 2084330: program.log
Created attachment 2084331: storage.log
Created attachment 2084332: var/log tarball
Created attachment 2084334: var/tmp tarball
Going from my vague memory, just triggering the gnome overview (using the win key) and returning back to anaconda (another win key press) didn't unfreeze the installation. However, launching a gnome-terminal (I wanted to see cpu utilization, etc) immediately unfroze it. Also, in my virt-manager stats (it was running in a VM), I saw constant cpu utilization (around 50%, with 3 cpus) during the frozen period. The qemu process on my host system had high cpu usage. So it was clearly doing something (otoh, the spinner was spinning the whole time, so perhaps the cpu usage was just related to animating the spinner). After the installation unfroze, the cpu utilization went up a bit, I believe.
So far, I haven't seen this on bare metal (but since this is a race condition, and the likelihood is not high, it might not mean much).
The earliest occurrences of this I can see in openQA were on Feb 19 at 22:19 UTC and Feb 20 at 14:39 UTC, in F42.
rsync was last changed on Jan 30, so that doesn't fit. anaconda changed on Jan 28 and Mar 10, so that doesn't fit either. The kernel went from kernel-6.14.0-0.rc1.15.fc42 to kernel-6.14.0-0.rc3.29.fc42 on Feb 17, so that's a possible suspect.
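To make the date reasoning above explicit, here is a minimal sketch that computes how many days before the first openQA occurrence (Feb 19) each package last changed. The year 2025 is an assumption (the comments only give month and day), and the names FIRST_SEEN, CHANGES and days_before_first_seen are made up for illustration.

```python
from datetime import date

# Assumption: all dates are in 2025; the comments only give month and day.
FIRST_SEEN = date(2025, 2, 19)

# Package change dates as quoted in the comment above.
CHANGES = {
    "rsync": [date(2025, 1, 30)],
    "anaconda": [date(2025, 1, 28), date(2025, 3, 10)],
    "kernel": [date(2025, 2, 17)],
}


def days_before_first_seen(changes, first_seen):
    """For each package, return the number of days between its latest change
    on or before first_seen and first_seen itself (None if there is no such change)."""
    result = {}
    for name, dates in changes.items():
        prior = [d for d in dates if d <= first_seen]
        result[name] = (first_seen - max(prior)).days if prior else None
    return result


LAGS = days_before_first_seen(CHANGES, FIRST_SEEN)
for name, lag in sorted(LAGS.items(), key=lambda item: item[1]):
    print(f"{name}: last change {lag} days before first failure")
```

A short lag is only a hint, not proof: the kernel update landed two days before the first failure, while the rsync and anaconda changes are roughly three weeks older, which is why the kernel looks like the more plausible suspect.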
Proposing as a Final blocker as a conditional violation of "The installer must be able to complete an installation using any supported locally connected storage interface" (and any other 'install must complete' criterion), on live installs, some relatively small percentage of the time, possibly only on VM (we're looking into this).
I've managed to reproduce it again, in a VM. I made a snapshot of the broken state. Unfortunately, restoring it is racy as well: it only rarely stays in the broken (installation frozen) state, and mostly the installation continues as expected. In the few occurrences where I could explore the broken state, anything I did unfroze the installation. That included:
* running a different app (gnome-terminal)
* switching to a VT and back
* logging in over ssh
I suspect that anything that causes a disk read (or any I/O) unfreezes the installation. So my current suspects are the kernel in the VM guest, or the virtio/libvirt/qemu libraries (+ possibly the kernel) on the VM host.
Note, in the logs, you can see it gets stuck at 60%:

Apr 05 22:00:51 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 58%
Apr 05 22:00:57 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 59%
Apr 05 22:01:01 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 60%
Apr 05 22:08:55 localhost-live systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...

then about 40 minutes later, openQA gives up and switches to a VT to upload logs, and rsync picks up again:

Apr 05 22:49:23 localhost-live systemd[1]: Started getty - Getty on tty3.
...
Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 62%
Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 63%

so that matches with kparal's "doing stuff makes it unstick" experience.
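For anyone checking other affected jobs, a rough sketch like the following could pick the stall out of a journal dump automatically. It only assumes the "rsync progress: N%" message format quoted above. The function name find_longest_stall and the year default are made up for illustration (the journal timestamps carry no year), and this is not part of openQA or anaconda.

```python
import re
from datetime import datetime

# Matches journal lines of the form quoted in this comment, e.g.
# "Apr 05 22:01:01 localhost-live ...:rsync progress: 60%"
PROGRESS_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*rsync progress: (?P<pct>\d+)%"
)

# Progress lines copied from the log excerpt in this comment.
SAMPLE = [
    "Apr 05 22:00:51 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 58%",
    "Apr 05 22:00:57 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 59%",
    "Apr 05 22:01:01 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 60%",
    "Apr 05 22:08:55 localhost-live systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...",
    "Apr 05 22:49:23 localhost-live systemd[1]: Started getty - Getty on tty3.",
    "Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 62%",
    "Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 63%",
]


def find_longest_stall(lines, year=2025):
    """Return (gap_seconds, pct_before, pct_after) for the longest gap between
    consecutive rsync progress messages, or None if fewer than two are found.

    The year is an assumption: short journal timestamps do not include one.
    """
    samples = []
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            ts = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
            samples.append((ts, int(match.group("pct"))))

    longest = None
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        gap = (t1 - t0).total_seconds()
        if longest is None or gap > longest[0]:
            longest = (gap, p0, p1)
    return longest


if __name__ == "__main__":
    stall = find_longest_stall(SAMPLE)
    if stall is not None:
        gap, before, after = stall
        print(f"longest gap: {gap / 60:.1f} min between {before}% and {after}%")
```

Lines that are not rsync progress messages (the systemd ones here) are simply ignored, so the same function could be pointed at a full journal export from an affected job.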
Maybe upgrading to kernel 6.14.1 before running anaconda would fix this?
Rawhide has 6.15 kernels and is still affected by this, so I doubt it.
Discussed at 2025-04-10 F42 go/no-go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/meeting_matrix_fedoraproject-org/2025-04-10/fedora-linux-final-go-no-go-meeting.2025-04-10-17.01.html . This was rejected as a blocker on the basis that the prevalence is a bit too low to block on, especially since so far it seems to be VM-only, and there are obvious workarounds (fiddle around and it starts working again, or just reboot and try again).
This is still happening - https://openqa.fedoraproject.org/tests/3459809#step/_do_install_and_reboot/114 is a recent affected test. That test was just stuck at 63% till it failed.