Bug 1523563 - qemu crashes with "display-channel.c:2035:display_channel_update: condition `display_channel_validate_surface(display, surface_id)' failed"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: David Blechter
QA Contact: Yanan Fu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-08 10:08 UTC by Yanan Fu
Modified: 2019-04-11 08:59 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-11 08:59:13 UTC
Target Upstream Version:
Embargoed:


Attachments
Autotest debug.log, including the qemu command line and the call trace on abort (75.78 KB, text/plain)
2017-12-08 10:08 UTC, Yanan Fu


Links
Red Hat Bugzilla 1403343 (unspecified, CLOSED): Stops updating screen with a SPICE error while running a Fedora VM with `-vga qxl` and `-vnc`. Last updated 2021-02-22 00:41:40 UTC.

Internal Links: 1403343

Description Yanan Fu 2017-12-08 10:08:25 UTC
Created attachment 1364726
Autotest debug.log, including the qemu command line and the call trace on abort.

Description of problem:
Hit this problem during acceptance testing; it was found by automation.
Case name: "boot_vm_in_hugepage.2M".

Boot one VM with 2M hugepages. After the guest boots up, log in and execute "shutdown -r now". Repeat this several times and qemu aborts.


Version-Release number of selected component (if applicable):
qemu: qemu-kvm-rhev-2.10.0-11.el7.x86_64
kernel: kernel-3.10.0-805.el7.x86_64

How reproducible:
15/25

Steps to Reproduce:
1. Boot a *RHEL7.5* VM with 2M hugepages:
# cat /proc/meminfo | grep -i hugepage
AnonHugePages:     77824 kB
HugePages_Total:    4160
HugePages_Free:       64      ----> (4160-64) * 2048kB = 8388608kB = 8192M
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB   ----> 2M hugepage

# mount -l | grep hugetlbfs
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
none on /mnt/kvm_hugepage type hugetlbfs (rw,relatime,pagesize=2048K)

qemu command line:
-m 8192  \
-mem-path /mnt/kvm_hugepage \

2. Repeatedly run "shutdown -r now" in the guest after it boots up.
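The steps above can be sketched end to end as follows. This is a minimal sketch: the mount point, guest memory size, and qemu arguments mirror the report, but the page-count arithmetic and the commented-out root-only commands are illustrative assumptions, not part of the original test.

```shell
# Sketch of the reproduction setup from the steps above.
# An 8192M guest backed by 2M hugepages needs 8192*1024/2048 pages.
GUEST_MEM_MB=8192
HUGEPAGE_KB=2048
PAGES=$(( GUEST_MEM_MB * 1024 / HUGEPAGE_KB ))
echo "pages needed: $PAGES"   # prints "pages needed: 4096"

# The remaining steps need root and a RHEL 7 host with qemu-kvm-rhev,
# so they are shown as comments only:
#   echo "$PAGES" > /proc/sys/vm/nr_hugepages
#   mkdir -p /mnt/kvm_hugepage
#   mount -t hugetlbfs -o pagesize=2048K none /mnt/kvm_hugepage
#   /usr/libexec/qemu-kvm -m "$GUEST_MEM_MB" -mem-path /mnt/kvm_hugepage ...
# then repeat "shutdown -r now" inside the guest until qemu aborts.
```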

Actual results:
qemu aborts.

Expected results:
The VM works normally.


Additional info:
1. Tested on both an Intel and an AMD host; the Intel host showed no problem when I repeated the auto case 25 times.
Intel host name: ibm-x3650m4-04.lab.eng.pek2.redhat.com, CPU model "SandyBridge".
AMD host name: hp-dl385pg8-03.rhts.eng.pek2.redhat.com, CPU model "Opteron_G5".

2. The qemu-kvm-rhev-2.10.0-10.el7 acceptance test did not hit this problem. Whether it is a regression needs further checking, as it cannot be reproduced 100% of the time.

3. On the same host, tested with a win2016.x86_64 guest 20 times; no problem.

Comment 2 Igor Mammedov 2017-12-12 10:11:22 UTC
Yanan Fu,

Please provide full qemu command line and the name of host where it reproduces.

Comment 3 Yanan Fu 2017-12-12 11:42:46 UTC
(In reply to Igor Mammedov from comment #2)
> Yanan Fu,
> 
> Please provide full qemu command line and the name of host where it
> reproduces.

Hi Igor,
AMD host name: hp-dl385pg8-03.rhts.eng.pek2.redhat.com, CPU model "Opteron_G5", listed in the Description.
This host has already been taken back by Beaker and is being used by other people now.
If this bz is related to HW, we can wait and loan it later.

qemu command line taken from the debug.log in the attachment:
MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga qxl \
    -device pci-bridge,id=pci_bridge,bus=pci.0,addr=0x3,chassis_nr=1 \
    -device intel-hda,bus=pci.0,addr=0x4 \
    -device hda-duplex  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_Tr9fY9/monitor-qmpmonitor1-20171207-065334-9qZE8GWj,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_Tr9fY9/monitor-catch_monitor-20171207-065334-9qZE8GWj,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idsT5AjV  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_Tr9fY9/serial-serial0-20171207-065334-9qZE8GWj,server,nowait \
    -device isa-serial,chardev=serial_id_serial0 \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=0x5 \
    -chardev socket,path=/var/tmp/avocado_Tr9fY9/virtio_port-vs-20171207-065334-9qZE8GWj,nowait,id=idWrJ06X,server \
    -device virtserialport,id=idUScFpH,name=vs,bus=virtio_serial_pci0.0,chardev=idWrJ06X \
    -object rng-random,filename=/dev/random,id=passthrough-XAVGpw58 \
    -device virtio-rng-pci,id=virtio-rng-pci-gflMRhSb,rng=passthrough-XAVGpw58,bus=pci.0,addr=0x6  \
    -chardev socket,id=seabioslog_id_20171207-065334-9qZE8GWj,path=/var/tmp/avocado_Tr9fY9/seabios-20171207-065334-9qZE8GWj,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20171207-065334-9qZE8GWj,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
    -device nec-usb-xhci,id=usb2,bus=pci.0,addr=0x7 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x8 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:53:54:55:56:57,id=idHUSP5o,vectors=4,netdev=idmXt8FG,bus=pci.0,addr=0x9  \
    -netdev tap,id=idmXt8FG,vhost=on,vhostfd=19,fd=17 \
    -m 8192  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu 'Opteron_G5',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1  \
    -spice port=3000,password=123456,addr=0,tls-port=3200,x509-dir=/tmp/spice_x509d,tls-channel=main,tls-channel=inputs,image-compression=auto_glz,zlib-glz-wan-compression=auto,streaming-video=all,agent-mouse=on,playback-compression=on,ipv4  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,strict=off,order=cdn,once=c  \
    -mem-path /mnt/kvm_hugepage \
    -no-hpet \
    -enable-kvm  \
    -watchdog i6300esb \
    -watchdog-action reset \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xa

Comment 8 Igor Mammedov 2018-02-27 14:54:17 UTC
(In reply to Yanan Fu from comment #7)
Thanks for loaning it out. The host is no longer needed.

I tested it quite a bit but couldn't reproduce the issue on this host either, so I checked the attached logs once more and noticed the following lines:

07:00:33 INFO | [qemu output] (process:25880): Spice-WARNING **: display-channel.c:2431:display_channel_validate_surface: canvas address is 0x562304c45b08 for 0 (and is NULL)
07:00:33 INFO | [qemu output] 
07:00:33 INFO | [qemu output] 
07:00:33 INFO | [qemu output] (process:25880): Spice-WARNING **: display-channel.c:2432:display_channel_validate_surface: failed on 0
07:00:33 INFO | [qemu output] 
07:00:33 INFO | [qemu output] (process:25880): Spice-CRITICAL **: display-channel.c:2035:display_channel_update: condition `display_channel_validate_surface(display, surface_id)' failed

After looking for similar errors, it seems that others also experience qxl-related crashes, and this looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1441715

or similar to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=884613, and maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=1403343

I'd suspect the crash is SPICE-related.
Using a CLI similar to David's, I managed to get the following locally:

  (process:101527): Spice-WARNING **: display-channel.c:2431:display_channel_validate_surface: canvas address is   0x5654ed415b08 for 0 (and is NULL)
  (process:101527): Spice-WARNING **: display-channel.c:2432:display_channel_validate_surface: failed on 0

but it didn't get to the 'display_channel_update: condition ... failed' point and didn't crash.
The CLI was:

/usr/libexec/qemu-kvm -m 2048M -M pc,accel=kvm -smp 4 -drive if=none,id=cd,readonly=on,file=ubuntu-16.10-desktop-amd64.iso -device ide-cd,drive=cd -vga qxl -spice port=5900,password=12345

my env:
server:
qemu-kvm-rhev-2.10.0-11.el7.x86_64
spice-server-0.14.0-2.el7.x86_64
spice-protocol-0.12.13-2.el7.noarch
spice-glib-0.34-2.el7.x86_64
spice-gtk3-0.34-2.el7.x86_64

client:
virt-viewer-5.0-2.fc26.x86_64
spice-server-0.14.0-1.fc26.x86_64
spice-vdagent-0.17.0-2.fc26.x86_64
spice-gtk3-0.34-1.fc26.x86_64
spice-glib-0.34-1.fc26.x86_64

CCing Christophe and Gerd, perhaps they would have a better idea where to start with this.

Comment 9 Adam Williamson 2018-04-23 19:38:55 UTC
As the reporter of #1403343 , just for the record, I'll note that one is still happening in Fedora openQA testing. Typically out of every ~50 tests which run with qxl as the video driver, 1 or 2 will crash this way. All I can do is restart them. (I should actually tweak our openQA test restarter plugin to restart them automatically, I guess...)

Comment 10 Bandan Das 2018-07-10 17:25:16 UTC
Upstream Commit 5bd5c27c7d284d01477c5cc022ce22438c46bf9f might be relevant (as well as for bug 1403343)

commit 5bd5c27c7d284d01477c5cc022ce22438c46bf9f
Author: Gerd Hoffmann <kraxel>
Date:   Fri Apr 27 13:55:28 2018 +0200

    qxl: fix local renderer crash
    
    Make sure we only ask the spice local renderer for display updates in
    case we have a valid primary surface.  Without that spice is confused
    and throws errors in case a display update request (triggered by
    screendump for example) happens in parallel to a mode switch and hits
    the race window where the old primary surface is gone and the new isn't
    established yet.
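The commit message above describes a race between a display update (triggered by a screendump, for example) and a mode switch. A crude way to stress that window can be sketched as below. This is hypothetical, not from the bug: it assumes qemu was started with a QMP socket at the path shown, and that `nc` supports Unix sockets (`-U`); the `qmp_capabilities` handshake and the `screendump` command are standard QMP.

```shell
# Hypothetical stress loop: hammer QMP "screendump" while the guest is
# rebooting, to widen the primary-surface race window described above.
# Assumes qemu was started with: -qmp unix:/tmp/qmp.sock,server,nowait
QMP_SOCK=/tmp/qmp.sock
for i in $(seq 1 200); do
    printf '%s\n%s\n' \
        '{"execute":"qmp_capabilities"}' \
        '{"execute":"screendump","arguments":{"filename":"/tmp/scr.ppm"}}' \
        | nc -U "$QMP_SOCK" >/dev/null 2>&1
done
```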

Comment 11 Adam Williamson 2018-07-10 18:01:46 UTC
Thanks for the pointer, Bandan! I'll test that out for my case (1403343).

Comment 12 Yanan Fu 2018-07-26 13:00:43 UTC
Hi Adam,
Could you help check bz 1567733? It has been fixed recently; maybe they are the same problem. Thanks!

Comment 13 Adam Williamson 2018-07-26 16:09:16 UTC
Hi, Yanan! Sorry, I'm not sure what you mean. I didn't file this bug or bz 1567733 and I can't reproduce either of those two precisely, so I can't say whether either of those are fixed. The bug I filed is bz 1403343 , and I can tell you that one *is* fixed with recent qemu - I recently updated the bug to say so, and closed it. The fix for that issue was in qemu 2.11.2 and 2.12.0.

Thanks!

Comment 14 Victor Toso 2019-04-11 08:59:13 UTC
So, there is no reliable reproducer, but we can see that the logs, including the Spice-CRITICAL one from https://bugzilla.redhat.com/show_bug.cgi?id=1403343#c0, are the same as those pointed out in comment #8.
The commit mentioned in comment #10 seems to have fixed the issue for bug 1403343. The commit log also mentions a race, which would explain why it isn't easy to reproduce.

The full commit log actually includes 'Fixes: https://bugzilla.redhat.com//show_bug.cgi?id=1567733', which is in qemu-kvm-rhev-2.12.0-6.el7.

The fix should also be present in qemu-kvm-1.5.3-9.el7.

I'm closing this as CURRENT_RELEASE; please reopen if you can reproduce with any of those qemu versions or newer.
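Since the fix landed in qemu-kvm-rhev-2.12.0-6.el7 per the comment above, a quick check of whether an installed build already carries it can be sketched as follows. The `sort -V` comparison is generic shell; the `rpm -q` query is shown only as a comment since it requires an RPM-based host, and the hard-coded current version (the build from this report) is for illustration.

```shell
# Sketch: check whether an installed qemu-kvm-rhev build already carries the
# fix, by comparing its version-release against the fixed build with sort -V.
FIXED="2.12.0-6.el7"
# On a real RHEL host you would query rpm instead, e.g.:
#   CURRENT=$(rpm -q --qf '%{VERSION}-%{RELEASE}' qemu-kvm-rhev)
CURRENT="2.10.0-11.el7"   # the build from this report, for illustration

newest=$(printf '%s\n%s\n' "$CURRENT" "$FIXED" | sort -V | tail -n1)
if [ "$CURRENT" = "$FIXED" ]; then
    echo "fix present (exact fixed build)"
elif [ "$newest" = "$CURRENT" ]; then
    echo "fix present (newer than $FIXED)"
else
    echo "update needed: $CURRENT is older than $FIXED"
fi
# prints: update needed: 2.10.0-11.el7 is older than 2.12.0-6.el7
```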

