QA uses an automated test system called openQA. In our deployment it tests by running VMs using qemu (directly, no libvirt involvement) and talking to them via VNC (not SPICE). We currently have it configured to use `-vga qxl`, as we found this worked most reliably on Fedora 23 and Fedora 24 (even with VNC not SPICE). Since I upgraded the boxes that run the openQA VMs to Fedora 25, occasionally, a test will fail for no 'real' reason, and the logs will show the same set of errors from qemu: 15:23:50.4843 32146 QEMU: (process:32151): Spice-WARNING **: display-channel.h:295:validate_surface: canvas address is 0x561113e0be10 for 0 (and is NULL) 15:23:50.4844 32146 QEMU: 15:23:50.4844 32146 QEMU: 15:23:50.4844 32146 QEMU: (process:32151): Spice-WARNING **: display-channel.h:296:validate_surface: failed on 0 15:23:50.4844 32146 QEMU: 15:23:50.4844 32146 QEMU: (process:32151): Spice-CRITICAL **: display-channel.c:1666:display_channel_update: condition `validate_surface(display, surface_id)' failed It's slightly hard to tell as we're operating 'at a distance', but I think what happens after these errors are triggered is that qemu keeps running but stops sending screen updates over VNC, or at least screen updates stop reaching the openQA test runner. Usually what happens to the openQA test when these errors happen is it just sits around until whatever screen match it's currently looking for times out, sometimes complaining that it's stopped seeing the screen change. I was a bit surprised that we hit SPICE errors at all, but teuf says that using the qxl adapter can trigger some 'SPICE' code paths even though we're using VNC not SPICE to actually control the VM. The qemu command openQA runs looks like this: /usr/bin/qemu-kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga qxl -m 2048 -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=raid/l1,cache=unsafe,if=none,id=hd1,format=qcow2 -drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/factory/iso/Fedora-Server-dvd-i386-Rawhide-20161209.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot order=c,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown -vnc :99,share=force-shared -qmp unix:qmp_socket,server,nowait -monitor unix:hmp_socket,server,nowait -S -monitor telnet:127.0.0.1:20092,server,nowait the attached drives can vary a bit between tests, but all the hardware setup stuff should always be the same. That's the command from the most recent failure of this kind, https://openqa.fedoraproject.org/tests/50953 . https://openqa.fedoraproject.org/tests/50953/file/autoinst-log.txt is the os-autoinst (test runner) log which shows the qemu errors. Note that in this case os-autoinst actually killed qemu after the errors occurred; it's a bit hard to tell why it did that as the logging from os-autoinst is insufficient, but I think it's probably because it happened to notice that one of its communication channels to qemu stopped responding. The qemu process didn't *crash*, it was intentionally killed by the test runner. We've seen this happen with both i686 and x86_64 tests. The test runner systems are x86_64.
Another case of this, where the test runner didn't kill qemu but just sat around until it timed out - and where the guest was running x86_64 Fedora not i686 - was: https://openqa.fedoraproject.org/tests/50639 https://openqa.fedoraproject.org/tests/50639/file/autoinst-log.txt
Those warnings _are_ coming from spice-server at least, I didn't realize qxl + vnc would still hit spice libs. So moving to spice teuf et al, any thoughts?
This is still happening. On a closer look at the logs, the qemu process does indeed crash, e.g. this one: https://openqa.fedoraproject.org/tests/55194/file/autoinst-log.txt Process 4626 is definitely qemu: 13:45:53.7583 4626 QEMU emulator version 2.7.0(qemu-2.7.0-8.fc25), Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers and waitpid for it returns right where the run fails: 14:03:57.6167 4618 waitpid for 4626 returned 4626 if I understand perl waitpid correctly, that only happens when the process ends. Looking at the worker system logs, I see: Jan 19 14:03:57 qa14.qa.fedoraproject.org audit[4651]: ANOM_ABEND auid=4294967295 uid=991 gid=988 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=4651 comm="qemu-system-x86" exe="/usr/bin/qemu-system-x86_64" sig=6 Jan 19 14:03:57 qa14.qa.fedoraproject.org systemd[1]: Started Process Core Dump (PID 6566/UID 0). Jan 19 14:03:57 qa14.qa.fedoraproject.org audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@53-6566-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' Jan 19 14:03:57 qa14.qa.fedoraproject.org systemd-coredump[6567]: Core Dumping has been disabled for process 4626 (qemu-system-x86). Jan 19 14:03:57 qa14.qa.fedoraproject.org systemd-coredump[6567]: Process 4626 (qemu-system-x86) of user 991 dumped core. Jan 19 14:03:57 qa14.qa.fedoraproject.org audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@53-6566-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' I can try enabling abrt or something on one of the staging workers to try and get a core dump...
There is not -spice on qemu command line, only -vnc. Very likely a qemu bug.
Sure, we noticed that - but are you saying the bug is the code going through spice codepaths at all in this case?
(In reply to Adam Williamson from comment #5) > Sure, we noticed that - but are you saying the bug is the code going through > spice codepaths at all in this case? You should be able to downgrade only qemu to f24 (which works based on comment #0) and see if works too.
F24 qemu is not installable on F25 due to differing dependencies. I'll try rebuilding the current F24 src.rpm on F25. This is still happening frequently and causing openQA tests to fail spuriously, it'd be really nice to have a fix...
Maybe try -vga std, should be well supported with fedora, and avoid the spicelib interaction
I did start playing with different adapters at some point and ran into another issue, but I can try std again, I guess...
For the record I switched to 'std' for openQA and that's been working fine ever since, so this isn't causing us problems on openQA any more. It may well still be happening, though. If you need more data at any point I can switch staging openQA back to qxl temporarily.
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
This is still happening at least as far as 27 (host on 27, guest is Rawhide). I've had to switch some openQA cases back to qxl due to https://bugzilla.redhat.com/show_bug.cgi?id=1516584 , and the ones that are now using qxl are hitting this again, e.g.: https://openqa.stg.fedoraproject.org/tests/228514 [2018-01-09T01:21:14.0221 UTC] [debug] <<< testapi::send_key(key='ctrl-alt-f2', do_wait=0) [2018-01-09T01:21:14.0333 UTC] [debug] QEMU: [2018-01-09T01:21:14.0333 UTC] [debug] QEMU: (process:11592): Spice-WARNING **: display-channel.c:2431:display_channel_validate_surface: canvas address is 0x5581113e3bc8 for 0 (and is NULL) [2018-01-09T01:21:14.0333 UTC] [debug] QEMU: [2018-01-09T01:21:14.0334 UTC] [debug] QEMU: [2018-01-09T01:21:14.0334 UTC] [debug] QEMU: (process:11592): Spice-WARNING **: display-channel.c:2432:display_channel_validate_surface: failed on 0 [2018-01-09T01:21:14.0334 UTC] [debug] QEMU: [2018-01-09T01:21:14.0334 UTC] [debug] QEMU: (process:11592): Spice-CRITICAL **: display-channel.c:2035:display_channel_update: condition `display_channel_validate_surface(display, surface_id)' failed note the crash happened right after the test switched to a console, which I guess is potentially interesting. Not sure if that's the case for all the other times this has happened.
Sigh, mentioned the wrong bug there, the reason why I had to switch UEFI tests off of std is https://bugzilla.redhat.com/show_bug.cgi?id=1518464 .
Still happens with 28 and Rawhide.
Thanks to Bandan Das for pointing out an upstream commit that may help with this: https://bugzilla.redhat.com/show_bug.cgi?id=1523563#c10 I'll test that out on openQA staging and see if it indeed does help with this issue.
(In reply to Adam Williamson from comment #16) > Thanks to Bandan Das for pointing out an upstream commit that may help with > this: > > https://bugzilla.redhat.com/show_bug.cgi?id=1523563#c10 > > I'll test that out on openQA staging and see if it indeed does help with > this issue. FWIW that patch should already be in 2:2.12.0-2
Yeah, I figured that out. I updated the x86_64 worker hosts to that version of qemu and I'm running them through some jobs with qxl as the adapter in the VM now; so far it's looking good, haven't seen this bug yet.
I've now updated the production openQA workers to F28 with the fixed qemu and switched them back to qxl, so gonna call this fixed so far as I'm concerned - will reopen if it becomes a concern again. Thanks!