Bug 2063156 - Workstation Live is frozen in a VM with QXL video driver (Virtio works OK)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gnome-shell
Version: 36
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Florian Müllner
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://ask.fedoraproject.org/t/commo...
Depends On:
Blocks: F36FinalFreezeException
Reported: 2022-03-11 11:59 UTC by Kamil Páral
Modified: 2022-04-19 22:04 UTC (History)

Fixed In Version: gnome-shell-42.0-3.fc36
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-19 22:04:15 UTC
Type: Bug
Embargoed:


Attachments
VM xml from 'virsh dumpxml' (5.12 KB, text/xml)
2022-03-11 12:02 UTC, Kamil Páral
system journal while stuck (236.07 KB, text/plain)
2022-03-14 10:21 UTC, Kamil Páral
system journal (notice priority and above) while stuck (52.80 KB, text/plain)
2022-03-14 10:22 UTC, Kamil Páral
rpm diff between 0228 and 0307 (13.67 KB, text/plain)
2022-03-23 07:23 UTC, Kamil Páral


Links
System: GNOME Gitlab
ID: GNOME mutter issues 2201
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2022-03-29 19:22:30 UTC

Description Kamil Páral 2022-03-11 11:59:29 UTC
Description of problem:
When I boot F36 Workstation Live iso (Fedora-Workstation-Live-x86_64-36-20220310.n.0.iso) on a new default VM created in virt-manager, the Live session boots, but appears frozen. The system doesn't respond to any mouse events. It also doesn't respond to keyboard events, except Escape, which closes the initial 'Welcome to Fedora' dialog. After that, I'm unable to perform any further action.

This only happens when the VM has Video: QXL (which is the default value). When I change it to Video: Virtio, the Live system works as expected.

This also affects gnome-boxes, when you start the existing VM (created in virt-manager). However, if you create a new VM in gnome-boxes, it uses virtio by default, and therefore works OK.

When I try this with KDE, it works OK even with qxl, so this problem seems to be related to the GNOME stack (mutter?) rather than the qxl driver itself.

I also tested an older Workstation Live (Fedora-Workstation-Live-x86_64-36-20220228.n.0.iso) and that one works OK even with qxl. So the regression is quite recent.


Version-Release number of selected component (if applicable):
Fedora-Workstation-Live-x86_64-36-20220310.n.0.iso
Packages in the VM:
gnome-shell-42~beta-4.fc36.x86_64
mutter-42~beta-1.fc36.x86_64
Packages on the host:
virt-manager-3.2.0-4.fc35.noarch
qemu-kvm-6.1.0-14.fc35.x86_64
qemu-device-display-qxl-6.1.0-14.fc35.x86_64


How reproducible:
always

Steps to Reproduce:
1. create a new VM in virt-manager, confirm that it has QXL video driver (the default)
2. boot Fedora-Workstation-Live-x86_64-36-20220310.n.0.iso
3. see the welcome screen frozen - neither mouse nor keyboard works, except the Esc key

Comment 1 Kamil Páral 2022-03-11 12:02:19 UTC
Created attachment 1865437
VM xml from 'virsh dumpxml'

Comment 2 Fedora Blocker Bugs Application 2022-03-11 12:04:40 UTC
Proposed as a Blocker for 36-beta by Fedora user kparal using the blocker tracking app because:

 Proposing as a Beta blocker because:
"The release must install and boot successfully as a virtual guest in a situation where the virtual host is running the current stable Fedora release."
https://fedoraproject.org/wiki/Basic_Release_Criteria#Guest_on_current_stable_release

Comment 3 Kamil Páral 2022-03-11 13:59:40 UTC
Quite interestingly, if I install Fedora-Workstation-Live-x86_64-36-20220310.n.0.iso using virtio (to work around the problem) and then switch the VM to qxl, I can no longer reproduce the problem with the installed system (I even tried enabling autologin, to match the Live scenario). Which means I don't know how to gather system logs, when it happens only on Live (and the system then doesn't respond to anything).

Comment 4 Adam Williamson 2022-03-12 00:08:21 UTC
I can't reproduce this, on F36 at least. First of all, a newly-created virt-manager VM uses virtio for me, not qxl. If I create a new VM and change it to qxl before booting, I can boot fine and use the live system.

I tested with the live image openQA built from the GNOME 42-rc megaupdate. Could you test with the Beta candidate, which also includes that update, and see if that works for you?

Comment 5 Carl G. 2022-03-13 03:06:29 UTC
>I can't reproduce this, on F36 at least. First of all, a newly-created virt-manager VM uses virtio for me, not qxl.

qxl is the default display device prior to virt-manager 4.0.0 (source: https://listman.redhat.com/archives/virt-tools-list/2022-March/017511.html )

I managed to reproduce this issue three times out of maybe 10-15 attempts. The clock shows the current time, so it's not completely frozen, but the VM doesn't respond to keyboard and mouse events, except ESC.

host: Fedora 36
disk image: Fedora-Workstation-Live-x86_64-36-20220311.n.0.iso

maybe libguestfs-tools can be used to extract the journal but I'm not sure if it's compatible with snapshots.

Comment 6 Adam Williamson 2022-03-13 07:21:35 UTC
What I would maybe try is to first boot runlevel 3, enable sshd, ssh in, then do 'systemctl isolate graphical.target'. Then if the bug reproduces, you have a logged-in ssh session to grab the logs from. You could also try network logging, I guess.

Comment 7 Kamil Páral 2022-03-14 09:57:05 UTC
I spent more time debugging this and have some interesting findings. First of all, all the testing is done on F35 (virt-manager 3.2, qxl by default), so please note that F36 results might be different.

Originally I had a 100% failure rate. However, now that Carl mentioned race conditions, I can confirm this is indeed a race. In my case, I still get an almost 100% failure rate, but only when the VM is started in *user session mode*. In *system mode*, I get a failure rate similar to what Carl described, e.g. 1 failure in 5-10 attempts. All of this applies both to Fedora-Workstation-Live-x86_64-36-20220310.n.0.iso and Fedora-Workstation-Live-x86_64-36_Beta-1.1.iso. Fedora-Workstation-Live-x86_64-36-20220228.n.0.iso seems to always work fine (or maybe I'm just lucky because of the races).

Comment 8 Kamil Páral 2022-03-14 10:20:50 UTC
I managed to log in to the stuck system using `console=ttyS0` on the boot cmdline and then connecting through `virsh console`. I can confirm that the clock on top keeps updating, so the system is not actually stuck, just all/most input seems stuck. Also, I saw an "update available" popup appear and disappear (which shouldn't appear at all on a Live image), and the serial console works normally, so this is not a complete system freeze. I can try to debug something from the cmdline, if you tell me where to look.

Comment 9 Kamil Páral 2022-03-14 10:21:48 UTC
Created attachment 1865840
system journal while stuck

Comment 10 Kamil Páral 2022-03-14 10:22:17 UTC
Created attachment 1865841
system journal (notice priority and above) while stuck

Comment 11 Kamil Páral 2022-03-14 11:05:08 UTC
Another interesting find - if you wait 5 minutes without interacting with the VM, so that the screensaver kicks in (even if the screen looks frozen), the keyboard input starts working and you can interact with the OS. Mouse input is still broken, though.

> I still get almost 100% failure rate, but only when the VM is started in *user session mode*

I was wrong about this, it's still a race. I see it much more often in user session mode, but not nearly 100%. There are times when it works several times in a row, and times when the opposite is true.

Comment 12 Lukas Ruzicka 2022-03-14 11:13:18 UTC
tldr: I tried to reproduce this on Fedora 36 and I have not experienced any problems.

I have been running Fedora 36 for some time already and I did not notice exactly when the default driver switched from QXL to Virtio, but currently my default option is Virtio. I changed that to QXL in both sessions, system and user, and started to run the virtual machines based on F36 Workstation.

I ran 20 attempts on each session and I did not see a single freeze. 

The current versions are:

libvirt-gconfig-4.0.0-4.fc36.x86_64
python3-libvirt-8.0.0-2.fc36.x86_64
libvirt-client-8.1.0-2.fc36.x86_64
qemu-common-6.2.0-5.fc36.x86_64
virt-manager-4.0.0-1.fc36.noarch

The ISO used was the 20220313 nightly build.

Comment 13 Kamil Páral 2022-03-14 12:50:33 UTC
Folks, if possible, can you please test on F35 using the user session mode? Thanks.

Comment 14 Geoffrey Marr 2022-03-14 18:48:41 UTC
Discussed during the 2022-03-14 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker (Beta)" was made as this is pretty bad, but since it apparently doesn't happen consistently in the most common config (system virt session) and is easy to work around (use virtio), we think it's not bad enough to block Beta.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-03-14/f36-blocker-review.2022-03-14-16.01.txt

Comment 15 Geoffrey Marr 2022-03-14 18:56:24 UTC
Discussed during the 2022-03-14 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as we were not able to reach a clear decision at this time and with the information currently available. We'll aim to have more folks test and hopefully get feedback from the developers on what may be going on here.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-03-14/f36-blocker-review.2022-03-14-16.01.txt

Comment 16 Carl G. 2022-03-14 22:50:03 UTC
booting w/ multi-user.target and switching to graphical.target:

This might be a different bug because the VM is unresponsive for ~10s... every minute or so.

localhost-live kernel: qxl 0000:00:01.0: object_init failed for (3149824, 0x00000001)
localhost-live kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO

(fedora 36, qemu:///system session)

Comment 17 Brandon Nielsen 2022-03-20 17:35:41 UTC
(In reply to Kamil Páral from comment #13)
> Folks, if possible, can you please test on F35 using the user session mode?
> Thanks.

Using Boxes (which uses user mode) on a fully updated F35 install, the F36 candidate beta[0] demonstrates the described behavior.

[0] - https://kojipkgs.fedoraproject.org/compose/36/Fedora-36-20220319.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-36_Beta-1.2.iso

Comment 18 Kamil Páral 2022-03-21 07:39:00 UTC
(In reply to Brandon Nielsen from comment #17)
> Using Boxes (which uses user mode) on a fully updated F35 install, the F36
> candidate beta demonstrates the described behavior.

Brandon, Boxes creates new VMs with the virtio driver, and so it shouldn't be affected by this bug. Unless you created the VM in virt-manager (then it has qxl), and then started it in Boxes. Can you describe how you created that VM? Also, can you open virt-manager and display details of the affected Boxes VM and check whether it has qxl or virtio video driver? Thanks.

If you can reproduce this bug with a VM with the virtio driver, that would be quite an important finding.
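
For anyone who prefers to check this without the virt-manager UI, the video model can also be read from the domain XML (the same kind of output as the 'virsh dumpxml' attachment). The following is only a minimal Python sketch: the `video_models` helper and the trimmed sample XML are made up for illustration, and it assumes the usual libvirt layout where the model sits under devices/video/model.

```python
import xml.etree.ElementTree as ET


def video_models(domain_xml):
    """Return the 'type' attribute of every <video><model> in a domain XML string."""
    root = ET.fromstring(domain_xml)
    return [m.get("type", "") for m in root.findall("./devices/video/model")]


# Hypothetical, heavily trimmed domain XML; real 'virsh dumpxml' output
# contains many more elements and attributes.
SAMPLE = """
<domain type='kvm'>
  <name>f36-live</name>
  <devices>
    <video>
      <model type='qxl' heads='1' primary='yes'/>
    </video>
  </devices>
</domain>
"""

if __name__ == "__main__":
    # 'qxl' here means the VM is in the configuration this bug is about.
    print(video_models(SAMPLE))
```

On a real host, the output of 'virsh dumpxml' for the affected VM could be saved to a file and its contents passed to `video_models` instead of the sample string.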

Comment 19 Geoffrey Marr 2022-03-21 18:12:49 UTC
Discussed during the 2022-03-21 blocker review meeting: [0]

The decision to delay the classification of this as a Final Blocker bug was made as it's still not clear how common this bug is, so we can't really make a Final blocker determination yet. We will try to test it further after the meeting. We do accept it as a Beta freeze exception, just in case the fix is in the client side and shows up soon.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-03-21/f36-blocker-review.2022-03-21-16.01.txt

Comment 20 Brandon Nielsen 2022-03-21 19:35:54 UTC
(In reply to Kamil Páral from comment #18)
> (In reply to Brandon Nielsen from comment #17)
> > Using Boxes (which uses user mode) on a fully updated F35 install, the F36
> > candidate beta demonstrates the described behavior.
> 
> Brandon, Boxes creates new VMs with the virtio driver, and so it shouldn't be
> affected by this bug. Unless you created the VM in virt-manager (then it has
> qxl), and then started it in Boxes. Can you describe how you created that
> VM? Also, can you open virt-manager and display details of the affected
> Boxes VM and check whether it has qxl or virtio video driver? Thanks.
> 
> If you can reproduce this bug with a VM with the virtio driver, that would
> be quite an important finding.

Let's ignore that report. virtio graphics on that machine just seem all kinds of broken, no matter the guest.

I cannot reproduce this bug on a Fedora 35 host installing the beta 1.3 compose[0] as guest on either of the two machines I tested on. Using QXL and virt-manager, both system and user sessions work fine. The resulting installs work fine as well.

[0] - https://kojipkgs.fedoraproject.org/compose/36/Fedora-36-20220320.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-36_Beta-1.3.iso

Comment 21 Geoffrey Marr 2022-03-22 00:31:52 UTC
I figured I'd try what I currently had installed on my baremetal test machine before I attempted an install as Kamil did (F35 host). I tested this on a host system running Fedora-Workstation-Live-x86_64-36-20220321.n.0.iso, attempting to install Fedora-Workstation-36_Beta-1.3 using virt-manager. I could not reproduce this bug under these conditions. My virt-manager defaults to using Virtio as the video driver, and not QXL. In order to replicate this test as it was originally run, I had to manually change the video driver to QXL.

Installed (virtualized) system:
Fedora-Workstation-Live-x86_64-36_Beta-1.3.iso
gnome-shell-42~rc-2.fc36.x86_64
mutter-42~rc-5.fc36.x86_64

Host system:
Fedora-Workstation-Live-x86_64-36-20220321.n.0.iso
virt-manager-4.0.0-1.fc36.noarch
qemu-kvm-6.2.0-5.fc36.x86_64
qemu-device-display-qxl-6.2.0-5.fc36.x86_64

Comment 22 Adam Williamson 2022-03-22 00:45:02 UTC
Yeah, we know virt-manager on F36 defaults to virtio. On F35 it defaults to qxl (AIUI).

Comment 23 Kamil Páral 2022-03-22 12:32:05 UTC
As another datapoint, I tested this on 2 additional PCs. On the first PC with Fedora 36, I couldn't reproduce the issue, even though I performed ~10 VM boots using the user session mode and ~10 VM boots using the system mode. All of this while having qxl graphics, of course. On the second PC with Fedora 35, I reproduced the issue easily. It happened 6 times out of 10 boots, both in the user session mode and in the system mode.

So at least in my testing, it seems the issue is much more likely to occur on F35 than on F36.

Comment 24 Adam Williamson 2022-03-22 15:14:21 UTC
I actually saw something that looked rather like this in passing while working on https://bugzilla.redhat.com/show_bug.cgi?id=2066424 , with an *F35* guest image. Kamil, have you tried reproducing this using an F35 guest image, in the setups where you can easily reproduce it?

Comment 25 Kamil Páral 2022-03-22 16:06:17 UTC
I tested with Fedora-Workstation-Live-x86_64-35-1.2.iso and saw no issues in 10 boots. I also re-checked Fedora-Workstation-Live-x86_64-36-20220228.n.0.iso as mentioned in comment 0 and again saw no issues in 10 boots. So I'm quite certain the regression is at most 1 month old.
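
A rough way to judge whether 10 clean boots could just be luck: if each boot fails independently with probability p, the chance of 10 clean boots in a row is (1 - p)^10. The sketch below (Python, with a hypothetical helper name) evaluates that for the rates reported in this bug: about 6 in 10 (comment 23) and roughly 1 in 5 to 1 in 10 (comment 7). Independence between boots is an assumption; a race that depends on host load may not satisfy it.

```python
def p_all_clean(failure_rate, boots):
    """Probability that `boots` independent boots all succeed."""
    return (1.0 - failure_rate) ** boots


if __name__ == "__main__":
    # Failure rates taken from the reports in this bug; 10 boots per test run.
    for rate in (0.6, 0.2, 0.1):
        chance = p_all_clean(rate, 10)
        print(f"per-boot failure rate {rate:.0%}: P(10 clean boots) = {chance:.4f}")
```

At a 60% failure rate, 10 clean boots by chance come out at about 0.01%, so a clean run is meaningful on a host where the bug reproduces easily. At 10% the chance is about 35%, so 10 boots say little on hosts where the race is rare.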

Comment 26 Adam Williamson 2022-03-22 16:38:49 UTC
Thanks, I guess I must've hit something different. I'll try this again with F36 images later today.

Comment 27 Adam Williamson 2022-03-22 16:39:20 UTC
BTW, another thing that might be interesting - does the bug happen with current Rawhide images?

Comment 28 Brandon Nielsen 2022-03-22 19:18:34 UTC
Tested the Workstation beta 1.4 compose[0] on yet another machine with virt-manager and QXL. Could not reproduce the issue with either a user or system session.

[0] - https://kojipkgs.fedoraproject.org/compose/36/Fedora-36-20220322.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-36_Beta-1.4.iso

Comment 29 Kamil Páral 2022-03-23 07:23:57 UTC
Created attachment 1867693
rpm diff between 0228 and 0307

Fedora-Workstation-Live-x86_64-36-20220228.n.0.iso  <- works
Fedora-Workstation-Live-x86_64-36-20220307.n.0.iso  <- broken

Unfortunately images older than 0307 have already been cleaned up in Koji, so I can't test them.

See the attached diff for an overview of changed packages between 0228 and 0307. Most notably, GNOME packages were upgraded from 42 alpha to 42 beta.
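
For reference, a comparison like this can be sketched in a few lines of Python. This is not how the attached diff was produced; the helper names are hypothetical, and since the 0228 and 0307 package lists are not quoted in this report, the sample input reuses the package versions from the description and from comment 21 purely as an illustration.

```python
def split_nvr(pkg):
    """Split 'name-version-release.arch' into (name, 'version-release.arch')."""
    name, version, release = pkg.rsplit("-", 2)
    return name, f"{version}-{release}"


def changed_packages(old, new):
    """Map package name -> (old, new) version for packages that differ.

    A package missing from one side shows up with None on that side.
    """
    old_map = dict(split_nvr(p) for p in old)
    new_map = dict(split_nvr(p) for p in new)
    names = sorted(set(old_map) | set(new_map))
    return {n: (old_map.get(n), new_map.get(n))
            for n in names if old_map.get(n) != new_map.get(n)}


# Illustrative input only: versions quoted in the description (0310 image)
# and in comment 21 (Beta 1.3), not the actual 0228/0307 lists.
OLDER = ["gnome-shell-42~beta-4.fc36.x86_64", "mutter-42~beta-1.fc36.x86_64"]
NEWER = ["gnome-shell-42~rc-2.fc36.x86_64", "mutter-42~rc-5.fc36.x86_64"]

if __name__ == "__main__":
    for name, (before, after) in changed_packages(OLDER, NEWER).items():
        print(f"{name}: {before} -> {after}")
```

The name is split off from the right because package names such as gnome-shell contain hyphens themselves, while the version and release fields do not.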

I also found out that 'nomodeset' boot argument ("basic graphics mode") avoids this issue. So this is something related either to graphics acceleration or wayland (because nomodeset starts X11).

Comment 30 Kamil Páral 2022-03-23 07:35:59 UTC
(In reply to Adam Williamson from comment #27)
> BTW, another thing that might be interesting - does the bug happen with
> current Rawhide images?

Fedora-Workstation-Live-x86_64-Rawhide-20220308.n.0.iso  -> broken
Fedora-Workstation-Live-x86_64-Rawhide-20220322.n.0.iso  -> broken

Unfortunately older images than 0308 have already been cleaned up by Koji.

Comment 31 Adam Williamson 2022-03-24 19:37:43 UTC
Beta is signed off, so no point to the FE proposal any more.

Comment 32 Geoffrey Marr 2022-03-28 19:40:54 UTC
Discussed during the 2022-03-28 blocker review meeting: [0]

The decision to classify this bug as an "AcceptedBlocker (Final)" was made as it violates the following criterion:

"A system installed with a release-blocking desktop must boot to a log in screen where it is possible to log in to a working desktop...", when running on a default F35 or earlier virt-manager VM and hitting the bug.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-03-28/f36-blocker-review.2022-03-28-16.00.txt

Comment 33 Adam Williamson 2022-03-29 19:22:31 UTC
Upstream issue filed - https://gitlab.gnome.org/GNOME/mutter/-/issues/2201 . Kamil, please correct any errors and add anything I missed :)

Comment 34 Kamil Páral 2022-04-04 14:59:56 UTC
As a follow up, see bug 2071226 (in particular bug 2071226 comment 8) - I believe this still happens and other people are hitting it, but it's partially obscured by another issue - an autologin problem.

Comment 35 František Zatloukal 2022-04-05 14:29:59 UTC
virt-manager change to virtio was merged for f35: https://bodhi.fedoraproject.org/updates/FEDORA-2022-fec53b10e3

Comment 36 Carlos Garnacho 2022-04-05 15:36:34 UTC
I wonder if it'd be possible to inject WAYLAND_DEBUG=1 somewhere in the environment (e.g. /etc/environment) and try to reproduce the bug again (e.g. try to click and type in the client). I am not sure what the root of this issue might be yet, so it would be good to see what events are actually reaching the client.

Comment 37 Kamil Páral 2022-04-06 14:21:37 UTC
Carlos, for me this only happens on the Live image and not the installed system, so I'm not sure how to modify the environment. I guess I'd have to build my own Live image. I can try to do that, but since it seems we'll work around this a bit by changing the virt-manager default to virtio, I need to put out some other blocker-related fires first, before working on this one :-/

Comment 38 František Zatloukal 2022-04-06 22:03:23 UTC
Lifting AcceptedBlocker as workaround https://bodhi.fedoraproject.org/updates/FEDORA-2022-fec53b10e3 landed in F35.

Comment 39 Jonas Ådahl 2022-04-07 07:09:12 UTC
If you can get access to a console as described in https://bugzilla.redhat.com/show_bug.cgi?id=2063156#c8 (didn't work here, it fails with "error: operation failed: Active console session exists for this domain") you could maybe install a few packages (debug symbols, gdb) and do some digging by attaching gdb to the gnome-shell process.

Comment 40 František Zatloukal 2022-04-11 21:19:34 UTC
Discussed during the 2022-04-11 blocker review meeting: [1]

The decision to classify this bug as an AcceptedFreezeException was made:

"It is a noticeable issue that cannot be fixed with an update."

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2022-04-11/f36-blocker-review.2022-04-11-16.00.log.txt

Comment 41 Jonas Ådahl 2022-04-13 08:33:36 UTC
Seems to be related to the "cover pane" used during startup in gnome-shell; changing component.

Comment 42 Fedora Update System 2022-04-18 22:03:54 UTC
FEDORA-2022-d0c4cc0d54 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2022-d0c4cc0d54

Comment 43 Kamil Páral 2022-04-19 11:49:18 UTC
(In reply to Fedora Update System from comment #42)
> FEDORA-2022-d0c4cc0d54 has been submitted as an update to Fedora 36.
> https://bodhi.fedoraproject.org/updates/FEDORA-2022-d0c4cc0d54

I created a custom Workstation Live ISO containing this update, and started it 10 times on F35 and 10 times on F36 (using qxl). There was no issue on any of the systems, so I believe this is now fixed. If anyone else wants to test, here's the ISO:
https://fedorapeople.org/groups/qa/rhbz2063156.iso

Comment 44 Fedora Update System 2022-04-19 17:27:38 UTC
FEDORA-2022-d0c4cc0d54 has been pushed to the Fedora 36 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2022-d0c4cc0d54`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2022-d0c4cc0d54

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 45 Fedora Update System 2022-04-19 22:04:15 UTC
FEDORA-2022-d0c4cc0d54 has been pushed to the Fedora 36 stable repository.
If the problem still persists, please make note of it in this bug report.

