Bug 2358880 - Live install sometimes gets stuck during rsync, launching another application unsticks it
Summary: Live install sometimes gets stuck during rsync, launching another application...
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 42
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker https://discussion.fe...
Depends On:
Blocks:
Reported: 2025-04-10 15:07 UTC by Adam Williamson
Modified: 2025-05-27 21:13 UTC
CC List: 20 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
anaconda.log (6.92 KB, text/plain), 2025-04-10 15:15 UTC, Adam Williamson (no flags)
dbus.log (3.91 KB, text/plain), 2025-04-10 15:16 UTC, Adam Williamson (no flags)
df.log (1.17 KB, text/plain), 2025-04-10 15:16 UTC, Adam Williamson (no flags)
free.log (207 bytes, text/plain), 2025-04-10 15:17 UTC, Adam Williamson (no flags)
packaging.log (11.66 KB, text/plain), 2025-04-10 15:18 UTC, Adam Williamson (no flags)
program.log (1.77 KB, text/plain), 2025-04-10 15:18 UTC, Adam Williamson (no flags)
storage.log (165.45 KB, text/plain), 2025-04-10 15:18 UTC, Adam Williamson (no flags)
var/log tarball (1.14 MB, application/octet-stream), 2025-04-10 15:20 UTC, Adam Williamson (no flags)
var/tmp tarball (832 bytes, application/octet-stream), 2025-04-10 15:20 UTC, Adam Williamson (no flags)

Description Adam Williamson 2025-04-10 15:07:26 UTC
openQA has been hitting this for a while, but today kparal confirmed he has seen it in his own tests, so apparently it's not just the blip I was hoping it was.

Sometimes during live installs, while the rsync operation is running, it just...gets stuck. No progress is indicated in the UI or the system journal. If you leave it this way, it stays stuck indefinitely and the install never completes. However, Kamil says that if you do something like launching gnome-terminal, it somehow 'unsticks' the situation and the install completes.

In openQA, this seems to be happening to something like 2% of installs.

I'll attach the logs we have (they're not obviously illuminating).

Comment 1 Adam Williamson 2025-04-10 15:15:38 UTC
Created attachment 2084325
anaconda.log

Comment 2 Adam Williamson 2025-04-10 15:16:01 UTC
Created attachment 2084326
dbus.log

Comment 3 Adam Williamson 2025-04-10 15:16:23 UTC
Created attachment 2084327
df.log

Comment 4 Adam Williamson 2025-04-10 15:17:00 UTC
Created attachment 2084328
free.log

Comment 5 Adam Williamson 2025-04-10 15:18:00 UTC
Created attachment 2084329
packaging.log

Comment 6 Adam Williamson 2025-04-10 15:18:27 UTC
Created attachment 2084330
program.log

Comment 7 Adam Williamson 2025-04-10 15:18:55 UTC
Created attachment 2084331
storage.log

Comment 8 Adam Williamson 2025-04-10 15:20:02 UTC
Created attachment 2084332
var/log tarball

Comment 9 Adam Williamson 2025-04-10 15:20:33 UTC
Created attachment 2084334
var/tmp tarball

Comment 10 Kamil Páral 2025-04-10 15:30:33 UTC
Going from my vague memory: just triggering the GNOME overview (using the Win key) and returning to anaconda (another Win key press) didn't unfreeze the installation. However, launching gnome-terminal (I wanted to see CPU utilization, etc.) immediately unfroze it. Also, in my virt-manager stats (it was running in a VM), I saw constant CPU utilization (around 50%, with 3 CPUs) during the frozen period, and the qemu process on my host system had high CPU usage. So it was clearly doing something (on the other hand, the spinner was spinning the whole time, so perhaps the CPU usage was just from animating the spinner). After the installation unfroze, CPU utilization went up a bit, I believe.

Comment 11 Kamil Páral 2025-04-10 15:32:09 UTC
So far, I haven't seen this on bare metal (but since this is a race condition, and the likelihood is not high, it might not mean much).

Comment 12 Adam Williamson 2025-04-10 15:47:27 UTC
The earliest occurrences of this I can see in openQA were on Feb 19 at 22:19 UTC and Feb 20 at 14:39 UTC, in F42.

Comment 13 Adam Williamson 2025-04-10 15:49:26 UTC
rsync was last changed on Jan 30, so that doesn't fit.
anaconda changed on Jan 28 and Mar 10, so that doesn't fit either.
The kernel went from kernel-6.14.0-0.rc1.15.fc42 to kernel-6.14.0-0.rc3.29.fc42 on Feb 17, two days before the first occurrence, so that's a possible suspect.

Comment 14 Adam Williamson 2025-04-10 15:50:46 UTC
Proposing as a Final blocker as a conditional violation of "The installer must be able to complete an installation using any supported locally connected storage interface" (and any other 'install must complete' criterion): on live installs, some relatively small percentage of the time, possibly only in VMs (we're looking into this).

Comment 15 Kamil Páral 2025-04-10 15:56:00 UTC
I've managed to reproduce it again, in a VM. I made a snapshot of the broken state; unfortunately, restoring it is racy as well. Only rarely does it stay in the broken (installation frozen) state after a restore; mostly the installation continues as expected.

In the few occurrences where I could explore the broken state, anything I did unfroze the installation. That included:
* running a different app (gnome-terminal)
* switching to a VT and back
* logging in over ssh

I suspect that anything that causes a disk read (or any I/O) unfreezes the installation. So my current suspects are the kernel in the VM guest, or the virtio/libvirt/qemu libraries (plus possibly the kernel) on the VM host.

Comment 16 Adam Williamson 2025-04-10 16:20:16 UTC
Note that in the logs you can see it gets stuck at 60%:

Apr 05 22:00:51 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 58%
Apr 05 22:00:57 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 59%
Apr 05 22:01:01 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 60%
Apr 05 22:08:55 localhost-live systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...

then, about 48 minutes after the last progress update, openQA gives up and switches to a VT to upload logs, and rsync picks up again:

Apr 05 22:49:23 localhost-live systemd[1]: Started getty - Getty on tty3.
...
Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 62%
Apr 05 22:49:28 localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: DEBUG:anaconda.modules.payloads.payload.live_image.installation:rsync progress: 63%

so that matches kparal's "doing stuff makes it unstick" experience.
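The stall shows up in the journal as an unusually long gap between consecutive `rsync progress` messages. As a hedged sketch (not part of anaconda or openQA), the Python below scans journal text in the format quoted above and reports the longest gap between progress updates. The helper names `progress_points` and `longest_stall`, the regex, and the default `year` (the short journal timestamp omits the year) are all illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Matches journal lines such as
# "Apr 05 22:00:51 localhost-live ...installation:rsync progress: 58%"
PROGRESS_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) .*rsync progress: (?P<pct>\d+)%"
)


def progress_points(lines, year=2025):
    """Yield (timestamp, percent) for each rsync progress line.

    The short journal timestamp has no year, so `year` is an assumption.
    """
    for line in lines:
        m = PROGRESS_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
            yield ts, int(m["pct"])


def longest_stall(lines, year=2025):
    """Return (gap, pct_before, pct_after) for the longest interval between
    consecutive progress updates, or None if there are fewer than two."""
    points = list(progress_points(lines, year))
    if len(points) < 2:
        return None
    best = None
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, p0, p1)
    return best


if __name__ == "__main__":
    # Abbreviated copy of the journal excerpt quoted in Comment 16.
    prefix = ("localhost-live org.fedoraproject.Anaconda.Modules.Payloads[2928]: "
              "DEBUG:anaconda.modules.payloads.payload.live_image.installation:")
    sample = [
        f"Apr 05 22:00:51 {prefix}rsync progress: 58%",
        f"Apr 05 22:00:57 {prefix}rsync progress: 59%",
        f"Apr 05 22:01:01 {prefix}rsync progress: 60%",
        "Apr 05 22:08:55 localhost-live systemd[1]: Starting systemd-tmpfiles-clean.service ...",
        "Apr 05 22:49:23 localhost-live systemd[1]: Started getty - Getty on tty3.",
        f"Apr 05 22:49:28 {prefix}rsync progress: 62%",
        f"Apr 05 22:49:28 {prefix}rsync progress: 63%",
    ]
    gap, before, after = longest_stall(sample)
    print(f"longest stall: {gap} (from {before}% to {after}%)")
    # prints "longest stall: 0:48:27 (from 60% to 62%)"
```

Fed a saved journal (for example, the default short-format output of `journalctl -b` written to a file), this would flag a gap like the roughly 48-minute one above; what threshold counts as "stuck" is left open.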

Comment 17 Geraldo Simião 2025-04-10 16:32:22 UTC
Maybe upgrading to kernel 6.14.1 before running anaconda would fix this?

Comment 18 Adam Williamson 2025-04-10 18:52:59 UTC
Rawhide has 6.15 kernels and is still affected by this, so I doubt it.

Comment 19 Adam Williamson 2025-04-10 19:59:11 UTC
Discussed at 2025-04-10 F42 go/no-go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/meeting_matrix_fedoraproject-org/2025-04-10/fedora-linux-final-go-no-go-meeting.2025-04-10-17.01.html . This was rejected as a blocker on the basis that the prevalence is a bit too low to block on, especially since so far it seems to be VM-only, and there are obvious workarounds (fiddle around and it starts working again, or just reboot and try again).

Comment 20 Adam Williamson 2025-05-27 21:13:37 UTC
This is still happening - https://openqa.fedoraproject.org/tests/3459809#step/_do_install_and_reboot/114 is a recent affected test. That test was stuck at 63% until it failed.

