Bug 1403343 - Stops updating screen with a SPICE error while running a Fedora VM with `-vga qxl` and `-vnc`
Summary: Stops updating screen with a SPICE error while running a Fedora VM with `-vga...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: spice
Version: rawhide
Hardware: All
OS: Unspecified
unspecified
medium
Target Milestone: ---
Assignee: Christophe Fergeau
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-12-09 18:39 UTC by Adam Williamson
Modified: 2018-07-12 23:54 UTC (History)
18 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-07-12 23:54:32 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1523563 0 high CLOSED qemu crashes with "display-channel.c:2035:display_channel_update: condition `display_channel_validate_surface(display, s... 2021-02-22 00:41:40 UTC

Internal Links: 1523563

Description Adam Williamson 2016-12-09 18:39:15 UTC
QA uses an automated test system called openQA. In our deployment it tests by running VMs using qemu (directly, no libvirt involvement) and talking to them via VNC (not SPICE). We currently have it configured to use `-vga qxl`, as we found this worked most reliably on Fedora 23 and Fedora 24 (even with VNC not SPICE).

Since I upgraded the boxes that run the openQA VMs to Fedora 25, occasionally, a test will fail for no 'real' reason, and the logs will show the same set of errors from qemu:

    15:23:50.4843 32146 QEMU: (process:32151): Spice-WARNING **: display-channel.h:295:validate_surface: canvas address is 0x561113e0be10 for 0 (and is NULL)
    15:23:50.4844 32146 QEMU: 
    15:23:50.4844 32146 QEMU: 
    15:23:50.4844 32146 QEMU: (process:32151): Spice-WARNING **: display-channel.h:296:validate_surface: failed on 0
    15:23:50.4844 32146 QEMU: 
    15:23:50.4844 32146 QEMU: (process:32151): Spice-CRITICAL **: display-channel.c:1666:display_channel_update: condition `validate_surface(display, surface_id)' failed

It's slightly hard to tell as we're operating 'at a distance', but I think what happens after these errors are triggered is that qemu keeps running but stops sending screen updates over VNC, or at least screen updates stop reaching the openQA test runner. Usually what happens to the openQA test when these errors happen is it just sits around until whatever screen match it's currently looking for times out, sometimes complaining that it's stopped seeing the screen change.

I was a bit surprised that we hit SPICE errors at all, but teuf says that using the qxl adapter can trigger some 'SPICE' code paths even though we're using VNC not SPICE to actually control the VM.

The qemu command openQA runs looks like this:

/usr/bin/qemu-kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga qxl -m 2048 -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=raid/l1,cache=unsafe,if=none,id=hd1,format=qcow2 -drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/factory/iso/Fedora-Server-dvd-i386-Rawhide-20161209.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :99,share=force-shared -qmp unix:qmp_socket,server,nowait -monitor unix:hmp_socket,server,nowait -S -monitor telnet:127.0.0.1:20092,server,nowait

the attached drives can vary a bit between tests, but all the hardware setup stuff should always be the same. That's the command from the most recent failure of this kind, https://openqa.fedoraproject.org/tests/50953 . https://openqa.fedoraproject.org/tests/50953/file/autoinst-log.txt is the os-autoinst (test runner) log which shows the qemu errors. Note that in this case os-autoinst actually killed qemu after the errors occurred; it's a bit hard to tell why it did that as the logging from os-autoinst is insufficient, but I think it's probably because it happened to notice that one of its communication channels to qemu stopped responding. The qemu process didn't *crash*, it was intentionally killed by the test runner.

We've seen this happen with both i686 and x86_64 tests. The test runner systems are x86_64.

Comment 1 Adam Williamson 2016-12-09 18:40:58 UTC
Another case of this, where the test runner didn't kill qemu but just sat around until it timed out - and where the guest was running x86_64 Fedora not i686 - was:

https://openqa.fedoraproject.org/tests/50639
https://openqa.fedoraproject.org/tests/50639/file/autoinst-log.txt

Comment 2 Cole Robinson 2016-12-12 22:41:54 UTC
Those warnings _are_ coming from spice-server at least, I didn't realize qxl + vnc would still hit spice libs. So moving to spice

teuf et al, any thoughts?

Comment 3 Adam Williamson 2017-01-30 23:30:30 UTC
This is still happening. On a closer look at the logs, the qemu process does indeed crash, e.g. this one: https://openqa.fedoraproject.org/tests/55194/file/autoinst-log.txt

Process 4626 is definitely qemu:

13:45:53.7583 4626 QEMU emulator version 2.7.0(qemu-2.7.0-8.fc25), Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers

and waitpid for it returns right where the run fails:

14:03:57.6167 4618 waitpid for 4626 returned 4626

if I understand perl waitpid correctly, that only happens when the process ends. Looking at the worker system logs, I see:

Jan 19 14:03:57 qa14.qa.fedoraproject.org audit[4651]: ANOM_ABEND auid=4294967295 uid=991 gid=988 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=4651 comm="qemu-system-x86" exe="/usr/bin/qemu-system-x86_64" sig=6
Jan 19 14:03:57 qa14.qa.fedoraproject.org systemd[1]: Started Process Core Dump (PID 6566/UID 0).
Jan 19 14:03:57 qa14.qa.fedoraproject.org audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@53-6566-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 19 14:03:57 qa14.qa.fedoraproject.org systemd-coredump[6567]: Core Dumping has been disabled for process 4626 (qemu-system-x86).
Jan 19 14:03:57 qa14.qa.fedoraproject.org systemd-coredump[6567]: Process 4626 (qemu-system-x86) of user 991 dumped core.
Jan 19 14:03:57 qa14.qa.fedoraproject.org audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@53-6566-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

I can try enabling abrt or something on one of the staging workers to try and get a core dump...

Comment 4 Victor Toso 2017-01-31 07:54:21 UTC
There is not -spice on qemu command line, only -vnc. Very likely a qemu bug.

Comment 5 Adam Williamson 2017-01-31 07:58:46 UTC
Sure, we noticed that - but are you saying the bug is the code going through spice codepaths at all in this case?

Comment 6 Victor Toso 2017-01-31 08:44:03 UTC
(In reply to Adam Williamson from comment #5)
> Sure, we noticed that - but are you saying the bug is the code going through
> spice codepaths at all in this case?

You should be able to downgrade only qemu to f24 (which works based on comment #0) and see if works too.

Comment 7 Adam Williamson 2017-03-16 22:32:55 UTC
F24 qemu is not installable on F25 due to differing dependencies. I'll try rebuilding the current F24 src.rpm on F25.

This is still happening frequently and causing openQA tests to fail spuriously, it'd be really nice to have a fix...

Comment 8 Cole Robinson 2017-03-16 23:59:12 UTC
Maybe try -vga std, should be well supported with fedora, and avoid the spicelib interaction

Comment 9 Adam Williamson 2017-03-17 00:12:13 UTC
I did start playing with different adapters at some point and ran into another issue, but I can try std again, I guess...

Comment 10 Adam Williamson 2017-06-20 19:44:28 UTC
For the record I switched to 'std' for openQA and that's been working fine ever since, so this isn't causing us problems on openQA any more. It may well still be happening, though. If you need more data at any point I can switch staging openQA back to qxl temporarily.

Comment 11 Fedora End Of Life 2017-11-16 18:57:33 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 12 Fedora End Of Life 2017-12-12 10:46:40 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 13 Adam Williamson 2018-01-09 18:21:38 UTC
This is still happening at least as far as 27 (host on 27, guest is Rawhide). I've had to switch some openQA cases back to qxl due to https://bugzilla.redhat.com/show_bug.cgi?id=1516584 , and the ones that are now using qxl are hitting this again, e.g.:

https://openqa.stg.fedoraproject.org/tests/228514

[2018-01-09T01:21:14.0221 UTC] [debug] <<< testapi::send_key(key='ctrl-alt-f2', do_wait=0)
[2018-01-09T01:21:14.0333 UTC] [debug] QEMU: 
[2018-01-09T01:21:14.0333 UTC] [debug] QEMU: (process:11592): Spice-WARNING **: display-channel.c:2431:display_channel_validate_surface: canvas address is 0x5581113e3bc8 for 0 (and is NULL)
[2018-01-09T01:21:14.0333 UTC] [debug] QEMU: 
[2018-01-09T01:21:14.0334 UTC] [debug] QEMU: 
[2018-01-09T01:21:14.0334 UTC] [debug] QEMU: (process:11592): Spice-WARNING **: display-channel.c:2432:display_channel_validate_surface: failed on 0
[2018-01-09T01:21:14.0334 UTC] [debug] QEMU: 
[2018-01-09T01:21:14.0334 UTC] [debug] QEMU: (process:11592): Spice-CRITICAL **: display-channel.c:2035:display_channel_update: condition `display_channel_validate_surface(display, surface_id)' failed

note the crash happened right after the test switched to a console, which I guess is potentially interesting. Not sure if that's the case for all the other times this has happened.

Comment 14 Adam Williamson 2018-01-09 18:22:54 UTC
Sigh, mentioned the wrong bug there, the reason why I had to switch UEFI tests off of std is https://bugzilla.redhat.com/show_bug.cgi?id=1518464 .

Comment 15 Adam Williamson 2018-06-05 01:06:59 UTC
Still happens with 28 and Rawhide.

Comment 16 Adam Williamson 2018-07-10 18:08:08 UTC
Thanks to Bandan Das for pointing out an upstream commit that may help with this:

https://bugzilla.redhat.com/show_bug.cgi?id=1523563#c10

I'll test that out on openQA staging and see if it indeed does help with this issue.

Comment 17 Cole Robinson 2018-07-10 20:52:58 UTC
(In reply to Adam Williamson from comment #16)
> Thanks to Bandan Das for pointing out an upstream commit that may help with
> this:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1523563#c10
> 
> I'll test that out on openQA staging and see if it indeed does help with
> this issue.

FWIW that patch should already be in 2:2.12.0-2

Comment 18 Adam Williamson 2018-07-10 21:05:54 UTC
Yeah, I figured that out. I updated the x86_64 worker hosts to that version of qemu and I'm running them through some jobs with qxl as the adapter in the VM now; so far it's looking good, haven't seen this bug yet.

Comment 19 Adam Williamson 2018-07-12 23:54:32 UTC
I've now updated the production openQA workers to F28 with the fixed qemu and switched them back to qxl, so gonna call this fixed so far as I'm concerned - will reopen if it becomes a concern again. Thanks!


Note You need to log in before you can comment on or make changes to this bug.