1565354 – Many crashes with "memslot_get_virt: slot_id 170 too big"-type errors with recent kernels

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1570046
Depends On:
Blocks:

Reported:	2018-04-09 21:38 UTC by Adam Williamson
Modified:	2021-05-03 15:49 UTC
CC List:	31 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-05-03 15:49:03 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
good config (179.52 KB, text/plain) 2018-04-13 17:16 UTC, Gerd Hoffmann
bad config (179.52 KB, text/plain) 2018-04-13 17:17 UTC, Gerd Hoffmann
bad config (for real this time) (130.65 KB, text/x-mpsub) 2018-04-13 17:18 UTC, Gerd Hoffmann
0001-qxl-fix-qxl_release_-map-unmap.patch (2.38 KB, patch) 2018-04-17 20:44 UTC, Gerd Hoffmann
0002-qxl-keep-separate-release_bo-pointer.patch (4.46 KB, patch) 2018-04-17 20:45 UTC, Gerd Hoffmann

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1762558	None	None	None	2018-04-09 21:39:01 UTC
Red Hat Bugzilla	1520729	medium	CLOSED	qemu aborted (core dumped) when reboot guest with spice	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1540919	medium	CLOSED	Spice-CRITICAL **: memslot.c:122:memslot_get_virt: address generation is not valid, group_id 1, slot_id 0, gen 1, slot_g...	2023-09-15 00:06:22 UTC

Internal Links: 1520729 1540919

Description Adam Williamson 2018-04-09 21:38:32 UTC

Since qemu 2.12.0 rc2 - qemu-2.12.0-0.6.rc2.fc29 - landed in Fedora Rawhide, just about all of our openQA-automated tests of Rawhide guests which run with qxl / SPICE graphics in the guest have died partway in, always shortly after the test switches from the installer (an X environment) to a console on a tty. qemu is, I think, hanging. There are always some errors like this right around the time of the hang:

[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: id 1, group 1, virt start 7f42dbc00000, virt end 7f42dfbfe000, generation 0, delta 7f42dbc00000
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: id 2, group 1, virt start 7f42d7a00000, virt end 7f42dba00000, generation 0, delta 7f42d7a00000
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU:
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: (process:45812): Spice-CRITICAL **: memslot.c:111:memslot_get_virt: slot_id 218 too big, addr=da8e21fbda8e21fb

or occasionally like this:

[2018-04-09T20:13:58.0717 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: id 1, group 1, virt start 7ff093c00000, virt end 7ff097bfe000, generation 0, delta 7ff093c00000
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: id 2, group 1, virt start 7ff08fa00000, virt end 7ff093a00000, generation 0, delta 7ff08fa00000
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU:
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: (process:25622): Spice-WARNING **: memslot.c:68:memslot_validate_virt: virtual address out of range
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: virt=0x0+0x18 slot_id=0 group_id=1
[2018-04-09T20:13:58.0721 UTC] [debug] QEMU: slot=0x0-0x0 delta=0x0
[2018-04-09T20:13:58.0721 UTC] [debug] QEMU:
[2018-04-09T20:13:58.0721 UTC] [debug] QEMU: (process:25622): Spice-WARNING **: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 1048576
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: id 1, group 1, virt start 7ff093c00000, virt end 7ff097bfe000, generation 0, delta 7ff093c00000
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: id 2, group 1, virt start 7ff08fa00000, virt end 7ff093a00000, generation 0, delta 7ff08fa00000
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU:
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: (process:25622): Spice-CRITICAL **: memslot.c:122:memslot_get_virt: address generation is not valid, group_id 1, slot_id 0, gen 110, slot_gen 0
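
Editorial sketch (not part of the original report): the slot tables printed before each error can be parsed to see which slots actually exist. The helper name `parse_slots` and the regular expression are illustrative assumptions; only the line format is taken from the excerpts above. In these excerpts only ids 0 to 2 are listed, which is consistent with a slot_id of 218 being rejected as "too big".

```python
import re

# Matches slot-table lines in the form shown in the log excerpts above.
SLOT_RE = re.compile(
    r"id (\d+), group (\d+), virt start ([0-9a-f]+), virt end ([0-9a-f]+), "
    r"generation (\d+), delta ([0-9a-f]+)"
)

def parse_slots(log_text):
    """Return one dict per slot-table line found in log_text."""
    slots = []
    for m in SLOT_RE.finditer(log_text):
        start, end = int(m.group(3), 16), int(m.group(4), 16)
        slots.append({
            "id": int(m.group(1)),
            "group": int(m.group(2)),
            "start": start,
            "end": end,
            "generation": int(m.group(5)),
            "size": end - start,  # span of the slot in bytes
        })
    return slots

# Three lines copied from the first excerpt (timestamps dropped).
LOG = """\
QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
QEMU: id 1, group 1, virt start 7f42dbc00000, virt end 7f42dfbfe000, generation 0, delta 7f42dbc00000
QEMU: id 2, group 1, virt start 7f42d7a00000, virt end 7f42dba00000, generation 0, delta 7f42d7a00000
"""

for slot in parse_slots(LOG):
    print(slot["id"], slot["group"], slot["generation"], hex(slot["size"]))
```

Every listed slot has generation 0, so a command carrying a non-zero generation would not match any of them.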

The same tests running on Fedora 28 guests on the same hosts are not hanging, and the same tests were not hanging right before the qemu package got updated, so this seems very strongly tied to the new qemu.

This is a downstream copy of https://bugs.launchpad.net/qemu/+bug/1762558 ; I wanted to have the bug tracked in both places to make sure the RH virt folks see it, and for openQA linking and tracking of getting the fix landed in Rawhide.

Comment 1 Adam Williamson 2018-04-09 21:39:59 UTC

Before anyone asks, SPICE has not been changed since last year.

Comment 2 Gerd Hoffmann 2018-04-10 07:24:32 UTC

Bug 1540919 is unlikely to be a duplicate, it is a live migration issue and I don't think fedora qa does that.

Bug 1520729 looks similar, but is pretty hard to reproduce.  So maybe something changed in qemu to trigger this more easily now.  Any chance these crashes happen with older qemu too, just with much lower frequency?

Couldn't trigger it on my workstation though.  Tried wayland, xorg + qxl, xorg + modesetting.  All running fine.  Hmm.

Any chance you can run qemu with tracing enabled in autoqa (see bug 1520729 comment 18)?

Comment 3 Adam Williamson 2018-04-10 17:37:07 UTC

"Couldn't trigger it on my workstation though.  Tried wayland, xorg + qxl, xorg + modesetting.  All running fine.  Hmm."

Did you try booting an installer image to the graphical installer, then switching to a tty? That seems to be the consistent trigger for the failure.

"Any chance these crashes happen with older qemu too, just with much lower frequency?"

I mean, it's *possible*, but I definitely don't recall ever seeing it before, and I look at a lot of failures.

"Any chance you can run qemu with tracing enabled in autoqa (see bug 1520729 comment 18)?"

Yeah, I can tweak os-autoinst to do that, I think. I'll try it and get back to you.

Comment 4 Adam Williamson 2018-04-10 17:39:45 UTC

You know...it occurs to me that I'm *clearly* not thinking straight in assigning this to qemu based on the Rawhide qemu package, because we're not *using* that. Sigh. I'm an idiot.

It's definitely to do with something in Rawhide, but that something is very unlikely to be qemu - the qemu we run is from the worker host environment, not from the image tested, of course. So the fact that this is only happening in tests of Rawhide images basically rules out stuff that comes from the host environment. The thing that changed must be something in the guest environment, somehow. I'll have to look at what else changed between 2018-04-02 and 2018-04-07. Leaving this assigned to qemu for now just because it *does* involve qemu/spice somehow and I don't know what the real trigger is yet, but I was certainly wrong to suggest the Rawhide update to qemu was the trigger. D'oh.

Comment 5 Adam Williamson 2018-04-10 18:48:42 UTC

I'm actually able to reproduce this on my F28 test system by simply running:

/usr/bin/qemu-kvm -soundhw ac97 -global isa-fdc.driveA= -vga qxl -m 2048 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -drive media=cdrom,if=none,id=cd0,format=raw,file=/share/data/isos/29/nightlies/Fedora-AtomicWorkstation-ostree-x86_64-Rawhide-20180409.n.0.iso -device scsi-cd,drive=cd0 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown

Which results in this, among other errors:

(qemu-system-x86_64:9441): Spice-WARNING **: 11:44:39.330: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 67108864
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7feb1bc00000, virt end 7feb1fbfe000, generation 0, delta 7feb1bc00000
id 2, group 1, virt start 7feb17a00000, virt end 7feb1ba00000, generation 0, delta 7feb17a00000

(qemu-system-x86_64:9441): Spice-WARNING **: 11:44:41.197: memslot.c:68:memslot_validate_virt: virtual address out of range
    virt=0x0+0x18 slot_id=0 group_id=1
    slot=0x0-0x0 delta=0x0

(qemu-system-x86_64:9441): Spice-WARNING **: 11:44:41.197: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 524288
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7feb1bc00000, virt end 7feb1fbfe000, generation 0, delta 7feb1bc00000
id 2, group 1, virt start 7feb17a00000, virt end 7feb1ba00000, generation 0, delta 7feb17a00000

(qemu-system-x86_64:9441): Spice-CRITICAL **: 11:44:55.050: memslot.c:122:memslot_get_virt: address generation is not valid, group_id 1, slot_id 0, gen 7, slot_gen 0

and then a backtrace. I'll try again with tracing and attach full outputs.

The ISO used in the command can be found at https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180409.n.0/compose/AtomicWorkstation/x86_64/iso/Fedora-AtomicWorkstation-ostree-x86_64-Rawhide-20180409.n.0.iso - any installer image from that compose would likely trigger the bug, though.

Comment 6 Adam Williamson 2018-04-10 21:58:19 UTC

I'm about to reboot to try the tracing thing (since it requires my booted kernel and kernel-devel package to be in sync, sigh).

Meanwhile, one obvious candidate for what changed in Rawhide guests between 2018-04-02 and 2018-04-07 is the kernel - it went from kernel-4.16.0-0.rc7.git1.1.fc29 to kernel-4.17.0-0.rc0.git4.1.fc29 .

The other obvious suspect, given that the installer environment still runs on X not Wayland, is that the X server got a bump from xorg-x11-server-1.19.6-5.fc28 to xorg-x11-server-1.19.99.903-1.fc29 (1.20 RC3). xorg-x11-drv-qxl was rebuilt for the server bump.

Comment 7 Adam Williamson 2018-04-10 23:21:51 UTC

Can't get the tracing to work:

[root@adam adamw]# stap -e 'probe qemu.kvm.simpletrace.qxl* {}' -x 20177 >/tmp/trace
semantic error: while resolving probe point: identifier 'qemu' at <input>:1:7
        source: probe qemu.kvm.simpletrace.qxl* {}
                      ^

semantic error: probe point mismatch (similar: system, user): identifier 'kvm' at :1:12
        source: probe qemu.kvm.simpletrace.qxl* {}
                           ^

Pass 2: analysis failed.  [man error::pass2]

I'm having trouble getting it to *crash* again locally, but it reliably is broken - basically, it seems to render extremely slowly once you get into X, if using qxl graphics. The anaconda 'welcome' screen appears without the quit / continue buttons, then it takes several minutes before they appear. After clicking on 'Continue', it takes several minutes for the 'this is a pre-release' warning dialog to appear. After clicking that it takes ages for it to go away and the hub to show up, etc.

Booting with -vga virtio or -vga std, this doesn't happen at all, everything renders quite fast.

When using qxl, errors like this appear over and over:

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:07.686: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 64

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.409: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 232

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.730: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 232

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.731: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 255

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.740: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 232

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.742: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 255

and occasionally stuff like:

(qemu-system-x86_64:20938): Spice-WARNING **: 15:59:32.892: memslot.c:68:memslot_validate_virt: virtual address out of range
    virt=0x0+0x18 slot_id=0 group_id=1
    slot=0x0-0x0 delta=0x0

I'm now building test images based on the 20180402.n.0 compose but with various single packages updated, trying to isolate the cause of the problem.

Comment 8 Adam Williamson 2018-04-11 00:57:12 UTC

A test image based on 20180402.n.0 with the xorg-x11-server update included works fine. A test image based on 20180402.n.0 with the kernel update included displays the buggy behaviour. Based on this, I'm re-assigning this to the kernel. Will try a test with the latest kernel (there've been a couple of builds in koji lately) later.

Comment 9 Adam Williamson 2018-04-12 22:34:53 UTC

Also affects a test image based on 20180402.n.0 with kernel-4.17.0-0.rc0.git7.1.fc29 .

Comment 10 Adam Williamson 2018-04-13 01:44:12 UTC

So the only thing that appears to have changed in the qxl driver between kernel 4.16rc7 and kernel 4.17rc0 is this:

https://patchwork.freedesktop.org/patch/211552/

well, that makes things easy! Let's just revert that patch, and...

...crap. It's still broken. So this still looks like something in the kernel, but not the *obvious* thing. Something else that changed in the kernel broke this. But I don't know what.

My results are reproducible, I've checked. I still have all the test ISOs, and the x11 test ISO still works fine; all the ISOs with updated kernels (including the one with kernel-4.17.0-0.rc0.git7.1.fc29 but with the above change reverted) still don't work right.

Comment 11 Adam Williamson 2018-04-13 02:09:20 UTC

It kinda looks to me like basically SPICE is getting unexpected results when it deals with these here 'memslot' things:

https://github.com/SPICE/spice/blob/master/server/memslot.c

but I really don't have the expertise at this point to figure out what bit is putting stuff into said slots for SPICE to pull out and try and do stuff with (the stuff that's causing these errors), or why it's now possibly doing something spice isn't expecting...
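
Editorial sketch (not part of the original comment): memslot.c reads the slot id and generation out of the top bits of the 64-bit address the guest hands over. The snippet below assumes 8 bits for the slot id and 8 bits for the generation, which is what I believe QEMU's qxl device configures; treat both constants and the helper name `decode_qxl_address` as assumptions. With that layout, the address from the first CRITICAL message in the description decodes to exactly the slot id that was reported.

```python
SLOT_BITS = 8        # assumption: bits reserved for the slot id
GENERATION_BITS = 8  # assumption: bits reserved for the slot generation

def decode_qxl_address(addr, slot_bits=SLOT_BITS, gen_bits=GENERATION_BITS):
    """Split a 64-bit guest address into (slot_id, generation, offset)."""
    slot_shift = 64 - slot_bits
    gen_shift = 64 - (slot_bits + gen_bits)
    slot_id = addr >> slot_shift
    generation = (addr >> gen_shift) & ((1 << gen_bits) - 1)
    offset = addr & ((1 << gen_shift) - 1)
    return slot_id, generation, offset

# addr=da8e21fbda8e21fb comes from the "slot_id 218 too big" message.
slot_id, generation, offset = decode_qxl_address(0xDA8E21FBDA8E21FB)
print(slot_id, generation, hex(offset))
```

The top byte 0xda is 218, matching the log. The address is also the same 32-bit pattern repeated twice, which looks more like arbitrary data than a real pointer.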

Comment 12 Gerd Hoffmann 2018-04-13 06:47:59 UTC

Seems the host gets corrupted qxl commands from the guest.
Can you try without xorg-x11-drv-qxl and see what happens then?

Comment 13 Gerd Hoffmann 2018-04-13 08:10:33 UTC

(In reply to Gerd Hoffmann from comment #12)
> Seems the host gets corrupted qxl commands from the guest.
> Can you try without xorg-x11-drv-qxl and see what happens then?

Ok, scratch that, most likely it is something in the kernel.
Installed the rawhide kernel on my F27 guest, now I see all
kinds of qxl issues too.  That is with wayland, so xorg not
involved.

F27 kernel (4.15.15-300.fc27) is fine.
Self-compiled 4.16.2 is fine too.
Rawhide (4.17.0-0.rc0.git7.1.fc29) is broken.

Comment 14 Gerd Hoffmann 2018-04-13 08:33:19 UTC

> https://patchwork.freedesktop.org/patch/211552/

There is more ...

kraxel@sirius ~/projects/linux (master)# git log --oneline v4.16..master -- drivers/gpu/drm/qxl
1c7095d283 Merge airlied/drm-next into drm-misc-next
2793c1d77a drm/qxl: Replace drm_gem_object_reference/unreference() with _get/put()
dde5da2379 drm/ttm: add bo as parameter to the ttm_tt_create callback
724daa4fd6 drm/ttm: drop persistent_swap_storage from ttm_bo_init and co
231cdafc75 drm/ttm: drop ttm->dummy_read_page
3839263362 drm/ttm: drop bo->glob
2a7b464f84 drm/qxl: remove ttm_pool_* wrappers
74c0167f8b Merge drm-next into drm-intel-next-queued
54156da893 Merge airlied/drm-next into drm-misc-next

Probably one of the ttm patches.  I'll go for a stupid bisect though to figure it out, stay tuned.

Comment 15 Gerd Hoffmann 2018-04-13 15:51:03 UTC

bisect didn't have any useful results.

And now I ended up with two 4.16.2 kernels (different configs, different compilers) where one works and one doesn't, so it isn't a code difference ...

Comment 16 Adam Williamson 2018-04-13 16:17:32 UTC

Oh fun :(

So the smallest delta I currently have is the one I mentioned above:

kernel-4.16.0-0.rc7.git1.1.fc29 to kernel-4.17.0-0.rc0.git4.1.fc29

Here are the differences between the packages they were built with:

-DEBUG util.py:439:   fedora-release               noarch  29-0.1                       build   26 k
+DEBUG util.py:439:   fedora-release               noarch  29-0.2                       build   26 k
@@ -13 +13 @@
-DEBUG util.py:439:   info                         x86_64  6.5-3.fc28                   build  197 k
+DEBUG util.py:439:   info                         x86_64  6.5-4.fc29                   build  197 k
@@ -18 +18 @@
-DEBUG util.py:439:   sed                          x86_64  4.4-7.fc29                   build  290 k
+DEBUG util.py:439:   sed                          x86_64  4.5-1.fc29                   build  297 k
@@ -25 +25 @@
-DEBUG util.py:439:   annobin                      x86_64  5.1-1.fc29                   build   67 k
+DEBUG util.py:439:   annobin                      x86_64  5.2-1.fc29                   build   68 k
@@ -52 +52 @@
-DEBUG util.py:439:   gdb-headless                 x86_64  8.1-11.fc29                  build  3.6 M
+DEBUG util.py:439:   gdb-headless                 x86_64  8.1-14.fc29                  build  3.6 M
@@ -67 +67 @@
-DEBUG util.py:439:   kernel-headers               x86_64  4.16.0-0.rc7.git0.1.fc29     build  1.2 M
+DEBUG util.py:439:   kernel-headers               x86_64  4.17.0-0.rc0.git1.1.fc29     build  1.2 M
@@ -69 +69 @@
-DEBUG util.py:439:   krb5-libs                    x86_64  1.16-17.fc29                 build  874 k
+DEBUG util.py:439:   krb5-libs                    x86_64  1.16-20.fc29                 build  874 k
@@ -88 +88 @@
-DEBUG util.py:439:   libidn2                      x86_64  2.0.4-3.fc28                 build   99 k
+DEBUG util.py:439:   libidn2                      x86_64  2.0.4-7.fc29                 build   73 k
@@ -95 +95 @@
-DEBUG util.py:439:   libpkgconf                   x86_64  1.4.1-3.fc28                 build   33 k
+DEBUG util.py:439:   libpkgconf                   x86_64  1.4.2-1.fc29                 build   34 k
@@ -115 +115 @@
-DEBUG util.py:439:   libxml2                      x86_64  2.9.7-4.fc29                 build  694 k
+DEBUG util.py:439:   libxml2                      x86_64  2.9.8-1.fc29                 build  693 k
@@ -127 +127 @@
-DEBUG util.py:439:   openssl-libs                 x86_64  1:1.1.0g-6.fc29              build  1.3 M
+DEBUG util.py:439:   openssl-libs                 x86_64  1:1.1.0h-3.fc29              build  1.3 M
@@ -134,3 +134,3 @@
-DEBUG util.py:439:   pkgconf                      x86_64  1.4.1-3.fc28                 build   37 k
-DEBUG util.py:439:   pkgconf-m4                   noarch  1.4.1-3.fc28                 build   16 k
-DEBUG util.py:439:   pkgconf-pkg-config           x86_64  1.4.1-3.fc28                 build   14 k
+DEBUG util.py:439:   pkgconf                      x86_64  1.4.2-1.fc29                 build   37 k
+DEBUG util.py:439:   pkgconf-m4                   noarch  1.4.2-1.fc29                 build   16 k
+DEBUG util.py:439:   pkgconf-pkg-config           x86_64  1.4.2-1.fc29                 build   14 k
@@ -139,2 +139,2 @@
-DEBUG util.py:439:   python-srpm-macros           noarch  3-26.fc28                    build   10 k
-DEBUG util.py:439:   python3-libs                 x86_64  3.6.4-20.fc29                build  7.9 M
+DEBUG util.py:439:   python-srpm-macros           noarch  3-28.fc29                    build   11 k
+DEBUG util.py:439:   python3-libs                 x86_64  3.6.5-1.fc29                 build  7.8 M
@@ -142 +142 @@
-DEBUG util.py:439:   readline                     x86_64  7.0-9.fc29                   build  219 k
+DEBUG util.py:439:   readline                     x86_64  7.0-10.fc29                  build  198 k
@@ -160 +160 @@
-DEBUG util.py:439:   git                     x86_64 2.17.0-0.2.rc2.fc29                 build 219 k
+DEBUG util.py:439:   git                     x86_64 2.17.0-1.fc29                       build 219 k
@@ -166,2 +166,2 @@
-DEBUG util.py:439:   openssl                 x86_64 1:1.1.0g-6.fc29                     build 578 k
-DEBUG util.py:439:   openssl-devel           x86_64 1:1.1.0g-6.fc29                     build 1.9 M
+DEBUG util.py:439:   openssl                 x86_64 1:1.1.0h-3.fc29                     build 580 k
+DEBUG util.py:439:   openssl-devel           x86_64 1:1.1.0h-3.fc29                     build 1.9 M
@@ -177,2 +177,2 @@
-DEBUG util.py:439:   device-mapper           x86_64 1.02.146-4.fc28                     build 365 k
-DEBUG util.py:439:   device-mapper-libs      x86_64 1.02.146-4.fc28                     build 396 k
+DEBUG util.py:439:   device-mapper           x86_64 1.02.146-5.fc29                     build 365 k
+DEBUG util.py:439:   device-mapper-libs      x86_64 1.02.146-5.fc29                     build 396 k
@@ -184,2 +184,2 @@
-DEBUG util.py:439:   git-core                x86_64 2.17.0-0.2.rc2.fc29                 build 4.0 M
-DEBUG util.py:439:   git-core-doc            noarch 2.17.0-0.2.rc2.fc29                 build 2.3 M
+DEBUG util.py:439:   git-core                x86_64 2.17.0-1.fc29                       build 4.0 M
+DEBUG util.py:439:   git-core-doc            noarch 2.17.0-1.fc29                       build 2.3 M
@@ -191 +191 @@
-DEBUG util.py:439:   krb5-devel              x86_64 1.16-17.fc29                        build 542 k
+DEBUG util.py:439:   krb5-devel              x86_64 1.16-20.fc29                        build 542 k
@@ -196 +196 @@
-DEBUG util.py:439:   libkadm5                x86_64 1.16-17.fc29                        build 180 k
+DEBUG util.py:439:   libkadm5                x86_64 1.16-20.fc29                        build 180 k
@@ -199 +199 @@
-DEBUG util.py:439:   libsecret               x86_64 0.18.5-7.fc28                       build 159 k
+DEBUG util.py:439:   libsecret               x86_64 0.18.6-1.fc29                       build 161 k
@@ -209,2 +209,2 @@
-DEBUG util.py:439:   openssh                 x86_64 7.6p1-7.fc29                        build 505 k
-DEBUG util.py:439:   openssh-clients         x86_64 7.6p1-7.fc29                        build 679 k
+DEBUG util.py:439:   openssh                 x86_64 7.7p1-1.fc29                        build 483 k
+DEBUG util.py:439:   openssh-clients         x86_64 7.7p1-1.fc29                        build 683 k
@@ -228 +228 @@
-DEBUG util.py:439:   perl-Git                noarch 2.17.0-0.2.rc2.fc29                 build  74 k
+DEBUG util.py:439:   perl-Git                noarch 2.17.0-1.fc29                       build  74 k
@@ -256 +256 @@
-DEBUG util.py:439:   python3                 x86_64 3.6.4-20.fc29                       build  71 k
+DEBUG util.py:439:   python3                 x86_64 3.6.5-1.fc29                        build  71 k

Not at all sure which of those it's likely to be (if it's any of these and not something even more odd), but annobin and gdb seem at least plausible candidates?
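
Editorial sketch (not part of the original comment): a diff of two build-root logs like the one above can be condensed into a "package: old -> new" list so the candidates are easier to scan. The helper name `changed_packages` and the regular expression are illustrative assumptions; the sample lines are copied from the diff. Packages are keyed by name only, so the architecture column is ignored.

```python
import re

# Matches lines such as
# "-DEBUG util.py:439:   annobin   x86_64  5.1-1.fc29   build   67 k"
LINE_RE = re.compile(r"^([+-])DEBUG util\.py:\d+:\s+(\S+)\s+(\S+)\s+(\S+)\s+build\b")

def changed_packages(diff_text):
    """Return {package: (old_version, new_version)} for packages that changed."""
    old, new = {}, {}
    for line in diff_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skips hunk headers such as "@@ -13 +13 @@"
        sign, name, _arch, version = match.groups()
        target = old if sign == "-" else new
        target[name] = version
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }

SAMPLE = """\
-DEBUG util.py:439:   annobin                      x86_64  5.1-1.fc29                   build   67 k
+DEBUG util.py:439:   annobin                      x86_64  5.2-1.fc29                   build   68 k
@@ -52 +52 @@
-DEBUG util.py:439:   gdb-headless                 x86_64  8.1-11.fc29                  build  3.6 M
+DEBUG util.py:439:   gdb-headless                 x86_64  8.1-14.fc29                  build  3.6 M
"""

for name, (before, after) in changed_packages(SAMPLE).items():
    print(f"{name}: {before} -> {after}")
```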

Comment 17 Gerd Hoffmann 2018-04-13 17:15:15 UTC

compiler seems not to be the difference.

Have two kernel configs, one good, one bad.
4.16.2 kernel, f27 gcc.

Comment 18 Gerd Hoffmann 2018-04-13 17:16:28 UTC

Created attachment 1421455 [details]
good config

Comment 19 Gerd Hoffmann 2018-04-13 17:17:17 UTC

Created attachment 1421456 [details]
bad config

Comment 20 Gerd Hoffmann 2018-04-13 17:18:50 UTC

Created attachment 1421457 [details]
bad config (for real this time)

Comment 21 Adam Williamson 2018-04-13 18:29:06 UTC

I don't see the same differences in the kernel configs for the relevant official kernel builds, though?

There are definitely differences between the 4.16.0-0.rc7.git1.1 and 4.17.0-0.rc0.git4.1 configs, but they don't look like *those* differences...

Here's the diff between the x86_64 configs for those two kernel builds:

[adamw@adam kernel (master %)]$ diff -u0 kernel-4.15.fc29/linux-4.16.0-0.rc7.git1.1.fc29.x86_64/configs/kernel-4.16.0-x86_64.config kernel-4.16.fc29/linux-4.17.0-0.rc0.git4.1.fc29.x86_64/configs/kernel-4.17.0-x86_64.config 
--- kernel-4.15.fc29/linux-4.16.0-0.rc7.git1.1.fc29.x86_64/configs/kernel-4.16.0-x86_64.config	2018-04-13 10:43:54.924453231 -0700
+++ kernel-4.16.fc29/linux-4.17.0-0.rc0.git4.1.fc29.x86_64/configs/kernel-4.17.0-x86_64.config	2018-04-13 10:49:21.473930901 -0700
@@ -4 +4 @@
-# Linux/x86_64 4.16.0-rc7 Kernel Configuration
+# Linux/x86_64 4.16.0 Kernel Configuration
@@ -118 +117,0 @@
-# CONFIG_NO_HZ_FULL_ALL is not set
@@ -214 +212,0 @@
-# CONFIG_SYSCTL_SYSCALL is not set
@@ -245 +242,0 @@
-# CONFIG_PC104 is not set
@@ -254 +250,0 @@
-# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
@@ -276 +271,0 @@
-# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
@@ -322 +316,0 @@
-CONFIG_THIN_ARCHIVES=y
@@ -343,2 +336,0 @@
-# CONFIG_HAVE_ARCH_HASH is not set
-# CONFIG_ISA_BUS_API is not set
@@ -347 +338,0 @@
-# CONFIG_CPU_NO_EFFICIENT_FFS is not set
@@ -350,2 +340,0 @@
-# CONFIG_ARCH_OPTIONAL_KERNEL_RWX is not set
-# CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT is not set
@@ -356 +344,0 @@
-CONFIG_ARCH_HAS_PHYS_TO_DMA=y
@@ -365 +352,0 @@
-# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
@@ -465 +451,0 @@
-CONFIG_X86_FAST_FEATURE_TESTS=y
@@ -555 +540,0 @@
-# CONFIG_VM86 is not set
@@ -695,0 +681 @@
+CONFIG_DYNAMIC_MEMORY_LAYOUT=y
@@ -752,0 +739 @@
+CONFIG_ACPI_TAD=m
@@ -764 +750,0 @@
-# CONFIG_ACPI_CUSTOM_DSDT is not set
@@ -776 +761,0 @@
-# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
@@ -842 +826,0 @@
-# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
@@ -852,0 +837 @@
+CONFIG_MMCONF_FAM10H=y
@@ -941 +925,0 @@
-# CONFIG_HAVE_AOUT is not set
@@ -989,0 +974 @@
+CONFIG_IP_MROUTE_COMMON=y
@@ -1126,2 +1111,2 @@
-CONFIG_NF_TABLES_INET=m
-CONFIG_NF_TABLES_NETDEV=m
+CONFIG_NF_TABLES_INET=y
+CONFIG_NF_TABLES_NETDEV=y
@@ -1313 +1298 @@
-CONFIG_NF_TABLES_IPV4=m
+CONFIG_NF_TABLES_IPV4=y
@@ -1318 +1303 @@
-CONFIG_NF_TABLES_ARP=m
+CONFIG_NF_TABLES_ARP=y
@@ -1361 +1346 @@
-CONFIG_NF_TABLES_IPV6=m
+CONFIG_NF_TABLES_IPV6=y
@@ -1396 +1381 @@
-CONFIG_NF_TABLES_BRIDGE=m
+CONFIG_NF_TABLES_BRIDGE=y
@@ -1436,0 +1422 @@
+CONFIG_TIPC_DIAG=m
@@ -1559,0 +1546 @@
+CONFIG_NET_EMATCH_IPT=m
@@ -1748 +1734,0 @@
-CONFIG_BT_HCIBTUART=m
@@ -1752,0 +1739 @@
+CONFIG_BT_HCIRSI=m
@@ -1782 +1768,0 @@
-# CONFIG_MAC80211_RC_MINSTREL_VHT is not set
@@ -1874 +1859,0 @@
-# CONFIG_GENERIC_CPU_DEVICES is not set
@@ -1925,3 +1909,0 @@
-# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
-# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
-# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
@@ -1930,2 +1911,0 @@
-# CONFIG_MTD_CFI_I4 is not set
-# CONFIG_MTD_CFI_I8 is not set
@@ -1981 +1960,0 @@
-# CONFIG_PARPORT_GSC is not set
@@ -2003 +1981,0 @@
-# CONFIG_BLK_DEV_COW_COMMON is not set
@@ -2145,4 +2122,0 @@
-# CONFIG_CXL_BASE is not set
-# CONFIG_CXL_AFU_DRIVER_OPS is not set
-# CONFIG_CXL_LIB is not set
-# CONFIG_OCXL_BASE is not set
@@ -2253,2 +2226,0 @@
-# CONFIG_SCSI_EATA is not set
-# CONFIG_SCSI_FUTURE_DOMAIN is not set
@@ -2300 +2271,0 @@
-# CONFIG_ATA_NONSTANDARD is not set
@@ -2544,0 +2516 @@
+CONFIG_NET_DSA_MV88E6XXX_PTP=y
@@ -2591,2 +2562,0 @@
-CONFIG_B44_PCICORE_AUTOSELECT=y
-CONFIG_B44_PCI=y
@@ -2665,0 +2636 @@
+CONFIG_ICE=m
@@ -2716,0 +2688 @@
+# CONFIG_NET_VENDOR_NI is not set
@@ -2777 +2748,0 @@
-# CONFIG_SMSC911X_ARCH_HOOKS is not set
@@ -2980 +2950,0 @@
-CONFIG_B43_PCICORE_AUTOSELECT=y
@@ -2993 +2962,0 @@
-CONFIG_B43LEGACY_PCICORE_AUTOSELECT=y
@@ -3140,0 +3110 @@
+CONFIG_RSI_COEX=y
@@ -3179,0 +3150 @@
+CONFIG_IEEE802154_MCR20A=m
@@ -3287,2 +3257,0 @@
-# CONFIG_GIGASET_I4L is not set
-# CONFIG_GIGASET_DUMMYLL is not set
@@ -3424,0 +3394 @@
+CONFIG_JOYSTICK_PXRC=m
@@ -3635 +3604,0 @@
-# CONFIG_SERIAL_8250_FSL is not set
@@ -3795 +3763,0 @@
-# CONFIG_I2C_PXA_PCI is not set
@@ -3853 +3821 @@
-CONFIG_PPS=m
+CONFIG_PPS=y
@@ -3871 +3839 @@
-CONFIG_PTP_1588_CLOCK=m
+CONFIG_PTP_1588_CLOCK=y
@@ -3917,0 +3886,2 @@
+# CONFIG_GPIO_WINBOND is not set
+# CONFIG_GPIO_WS16C48 is not set
@@ -4251,0 +4222 @@
+# CONFIG_EBC_C384_WDT is not set
@@ -4313,2 +4283,0 @@
-CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
-CONFIG_SSB_DRIVER_PCICORE=y
@@ -4411 +4379,0 @@
-# CONFIG_MFD_TMIO is not set
@@ -4425,0 +4394 @@
+# CONFIG_REGULATOR_88PG86X is not set
@@ -4471,0 +4441 @@
+CONFIG_IR_IMON_DECODER=m
@@ -4475,0 +4446 @@
+CONFIG_IR_IMON_RAW=m
@@ -4929,0 +4901,5 @@
+
+#
+# Media SPI Adapters
+#
+CONFIG_CXD2880_SPI_DRV=m
@@ -5103 +5078,0 @@
-CONFIG_DVB_SP2=m
@@ -5114,0 +5090,6 @@
+# Common Interface (EN50221) controller drivers
+#
+CONFIG_DVB_CXD2099=m
+CONFIG_DVB_SP2=m
+
+#
@@ -5117 +5097,0 @@
-# CONFIG_DVB_DUMMY_FE is not set
@@ -5221 +5200,0 @@
-# CONFIG_DRM_LIB_RANDOM is not set
@@ -5230 +5208,0 @@
-# CONFIG_FB_DDC is not set
@@ -5235 +5212,0 @@
-# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
@@ -5239 +5215,0 @@
-# CONFIG_FB_PROVIDE_GET_FB_UNMAPPED_AREA is not set
@@ -5243,2 +5218,0 @@
-# CONFIG_FB_SVGALIB is not set
-# CONFIG_FB_MACMODES is not set
@@ -5393 +5366,0 @@
-# CONFIG_SND_OPL4_LIB_SEQ is not set
@@ -5546,0 +5520 @@
+CONFIG_SND_SOC_AMD_CZ_DA7219MX98357_MACH=m
@@ -5591,0 +5566 @@
+CONFIG_SND_SOC_INTEL_CHT_BSW_NAU8824_MACH=m
@@ -5601,0 +5577 @@
+CONFIG_SND_SOC_INTEL_KBL_DA7219_MAX98357A_MACH=m
@@ -5620 +5596 @@
-# CONFIG_SND_SOC_ADAU7002 is not set
+CONFIG_SND_SOC_ADAU7002=m
@@ -5621,0 +5598 @@
+CONFIG_SND_SOC_AK4458=m
@@ -5625,0 +5603 @@
+CONFIG_SND_SOC_AK5558=m
@@ -5626,0 +5605 @@
+CONFIG_SND_SOC_BD28623=m
@@ -5660,0 +5640 @@
+CONFIG_SND_SOC_MAX9867=m
@@ -5665,0 +5646,2 @@
+CONFIG_SND_SOC_PCM1789=m
+CONFIG_SND_SOC_PCM1789_I2C=m
@@ -5681 +5662,0 @@
-# CONFIG_SND_SOC_RT5514_SPI_BUILTIN is not set
@@ -5706,0 +5688 @@
+CONFIG_SND_SOC_TDA7419=m
@@ -5738,0 +5721 @@
+CONFIG_SND_SOC_MAX9759=m
@@ -5782,0 +5766 @@
+CONFIG_HID_ELAN=m
@@ -5789,0 +5774 @@
+# CONFIG_HID_GOOGLE_HAMMER is not set
@@ -5914 +5898,0 @@
-CONFIG_USB_ISP1362_HCD=m
@@ -6093,0 +6078,6 @@
+
+#
+# USB Type-C Multiplexer/DeMultiplexer Switch support
+#
+CONFIG_TYPEC_MUX_PI3USB30532=m
+CONFIG_USB_ROLES_INTEL_XHCI=m
@@ -6095,0 +6086 @@
+CONFIG_USB_ROLE_SWITCH=m
@@ -6186,0 +6178 @@
+CONFIG_LEDS_MLXREG=m
@@ -6507 +6498,0 @@
-# CONFIG_IRDA is not set
@@ -6597,4 +6587,0 @@
-
-#
-# Triggers - standalone
-#
@@ -6611 +6597,0 @@
-# CONFIG_DVB_CXD2099 is not set
@@ -6635,0 +6622 @@
+# CONFIG_MTK_MMC is not set
@@ -6731 +6717,0 @@
-# CONFIG_COMMON_CLK_NXP is not set
@@ -6733,2 +6718,0 @@
-# CONFIG_COMMON_CLK_PXA is not set
-# CONFIG_COMMON_CLK_PIC32 is not set
@@ -6743,5 +6726,0 @@
-# CONFIG_ATMEL_PIT is not set
-# CONFIG_SH_TIMER_CMT is not set
-# CONFIG_SH_TIMER_MTU2 is not set
-# CONFIG_SH_TIMER_TMU is not set
-# CONFIG_EM_TIMER_STI is not set
@@ -6807 +6785,0 @@
-# CONFIG_SUNXI_SRAM is not set
@@ -7088,0 +7067 @@
+CONFIG_LV0104CS=m
@@ -7144,0 +7124 @@
+CONFIG_AD5272=m
@@ -7147,0 +7128 @@
+CONFIG_MCP4018=m
@@ -7195,0 +7177 @@
+CONFIG_MLX90632=m
@@ -7222 +7203,0 @@
-# CONFIG_ARM_GIC_V3_ITS is not set
@@ -7225,10 +7205,0 @@
-# CONFIG_RESET_ATH79 is not set
-# CONFIG_RESET_AXS10X is not set
-# CONFIG_RESET_BERLIN is not set
-# CONFIG_RESET_IMX7 is not set
-# CONFIG_RESET_LANTIQ is not set
-# CONFIG_RESET_LPC18XX is not set
-# CONFIG_RESET_MESON is not set
-# CONFIG_RESET_PISTACHIO is not set
-# CONFIG_RESET_SIMPLE is not set
-# CONFIG_RESET_SUNXI is not set
@@ -7236,2 +7206,0 @@
-# CONFIG_RESET_ZYNQ is not set
-# CONFIG_RESET_TEGRA_BPMP is not set
@@ -7282 +7251,5 @@
-CONFIG_NVMEM=m
+CONFIG_NVMEM=y
+
+#
+# HW tracing support
+#
@@ -7286,4 +7258,0 @@
-CONFIG_FSI=m
-CONFIG_FSI_MASTER_GPIO=m
-CONFIG_FSI_MASTER_HUB=m
-CONFIG_FSI_SCOM=m
@@ -7764,0 +7734,3 @@
+CONFIG_LOCK_DEBUGGING_SUPPORT=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_LOCK_STAT=y
@@ -7768 +7740,2 @@
-# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
+CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
+CONFIG_DEBUG_RWSEMS=y
@@ -7770 +7742,0 @@
-CONFIG_PROVE_LOCKING=y
@@ -7772 +7743,0 @@
-CONFIG_LOCK_STAT=y
@@ -7910 +7880,0 @@
-# CONFIG_ARCH_WANTS_UBSAN_NO_NULL is not set
@@ -8056 +8025,0 @@
-CONFIG_CRYPTO_ABLK_HELPER=m
@@ -8073,0 +8043 @@
+CONFIG_CRYPTO_CFB=m
@@ -8157,0 +8128,2 @@
+CONFIG_CRYPTO_SM4=m
+CONFIG_CRYPTO_SPECK=m
@@ -8194 +8165,0 @@
-# CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_DESC is not set
@@ -8209,0 +8181 @@
+CONFIG_CRYPTO_DEV_CHELSIO_TLS=m
@@ -8261 +8232,0 @@
-# CONFIG_HAVE_ARCH_BITREVERSE is not set
@@ -8286 +8256,0 @@
-# CONFIG_AUDIT_ARCH_COMPAT_GENERIC is not set
@@ -8330 +8300 @@
-# CONFIG_DMA_DIRECT_OPS is not set
+CONFIG_DMA_DIRECT_OPS=y
@@ -8352 +8321,0 @@
-# CONFIG_SG_SPLIT is not set

Comment 22 Gerd Hoffmann 2018-04-16 13:14:15 UTC

One of the locking debug options seems to trigger it, I'll go figure which one.

Comment 23 Gerd Hoffmann 2018-04-16 15:20:09 UTC

Down to this:

--- dot.config.bad	2018-04-16 17:19:10.054098868 +0200
+++ dot.config.good	2018-04-16 16:57:31.774301769 +0200
@@ -4448,12 +4448,10 @@ CONFIG_SCHEDSTATS=y
 CONFIG_DEBUG_RT_MUTEXES=y
 CONFIG_DEBUG_SPINLOCK=y
 CONFIG_DEBUG_MUTEXES=y
-CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
-CONFIG_DEBUG_LOCK_ALLOC=y
+# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
+# CONFIG_DEBUG_LOCK_ALLOC is not set
 # CONFIG_PROVE_LOCKING is not set
-CONFIG_LOCKDEP=y
 # CONFIG_LOCK_STAT is not set
-# CONFIG_DEBUG_LOCKDEP is not set
 # CONFIG_DEBUG_ATOMIC_SLEEP is not set
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
 # CONFIG_LOCK_TORTURE_TEST is not set

Comment 24 Adam Williamson 2018-04-16 17:19:08 UTC

Gerd: the only one of those pairs that appears in the Fedora package diff I posted is CONFIG_DEBUG_WW_MUTEX_SLOWPATH, so that sounds like the most likely suspect?

Comment 25 Gerd Hoffmann 2018-04-17 06:54:37 UTC

/me looks at qxl_release_map + qxl_release_unmap.
Asking myself how did that ever work?

union qxl_release_info *qxl_release_map(struct qxl_device *qdev,
					struct qxl_release *release)
{
	void *ptr;
	union qxl_release_info *info;
	struct qxl_bo_list *entry = list_first_entry(&release->bos, struct qxl_bo_list, tv.head);
	struct qxl_bo *bo = to_qxl_bo(entry->tv.bo);

	ptr = qxl_bo_kmap_atomic_page(qdev, bo, release->release_offset & PAGE_SIZE);
	if (!ptr)
		return NULL;
	info = ptr + (release->release_offset & ~PAGE_SIZE);
	return info;
}

s/PAGE_SIZE/PAGE_MASK/ ...
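
To make the substitution concrete, here is a minimal userspace sketch of the two address splits. It assumes 4 KiB pages; DEMO_PAGE_SIZE, DEMO_PAGE_MASK, struct demo_split and the demo_* functions are hypothetical stand-ins for illustration, not kernel or qxl driver symbols.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's PAGE_SIZE and PAGE_MASK,
 * assuming 4 KiB pages. */
#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

struct demo_split {
	unsigned long page_base;	/* part used to pick the page to map */
	unsigned long in_page;		/* part added to the mapped pointer */
};

/* Split as written in the quoted function: masks with PAGE_SIZE itself,
 * which only tests (or clears) the single bit 12. */
struct demo_split demo_split_buggy(unsigned long offset)
{
	struct demo_split s;

	s.page_base = offset & DEMO_PAGE_SIZE;
	s.in_page = offset & ~DEMO_PAGE_SIZE;
	return s;
}

/* Split after s/PAGE_SIZE/PAGE_MASK/: the mask covers every bit above
 * the low 12, so the page base and in-page remainder come out right. */
struct demo_split demo_split_fixed(unsigned long offset)
{
	struct demo_split s;

	s.page_base = offset & DEMO_PAGE_MASK;
	s.in_page = offset & ~DEMO_PAGE_MASK;
	return s;
}

/* Worked example for an offset in the third page (0x2010). */
void demo_check_example(void)
{
	struct demo_split bad = demo_split_buggy(0x2010UL);
	struct demo_split good = demo_split_fixed(0x2010UL);

	assert(bad.page_base == 0x0UL);		/* wrong page */
	assert(bad.in_page == 0x2010UL);	/* runs past the mapped page */
	assert(good.page_base == 0x2000UL);
	assert(good.in_page == 0x10UL);
}
```

Calling demo_check_example() from a test program exercises both variants; the asserts record the values each split produces for offset 0x2010.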

Comment 26 Gerd Hoffmann 2018-04-17 10:36:11 UTC

> Asking myself how did that ever work?

The answer to that one seems to be "release_offset is always smaller than PAGE_SIZE".  And, of course, fixing that issue didn't fix this bug.
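
That explanation can be checked by brute force. The sketch below again assumes 4 KiB pages and uses hypothetical DEMO_* names rather than kernel symbols. It searches for the first offset at which masking with PAGE_SIZE and masking with PAGE_MASK give different results.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's PAGE_SIZE and PAGE_MASK,
 * assuming 4 KiB pages. */
#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Return the first offset below limit at which the PAGE_SIZE-based and
 * PAGE_MASK-based splits disagree, or limit if they agree everywhere. */
unsigned long demo_first_mismatch(unsigned long limit)
{
	unsigned long off;

	for (off = 0; off < limit; off++) {
		int same_base = (off & DEMO_PAGE_SIZE) == (off & DEMO_PAGE_MASK);
		int same_rest = (off & ~DEMO_PAGE_SIZE) == (off & ~DEMO_PAGE_MASK);

		if (!same_base || !same_rest)
			return off;
	}
	return limit;
}

/* Every offset smaller than one page is handled identically. */
void demo_check_small_offsets(void)
{
	assert(demo_first_mismatch(DEMO_PAGE_SIZE) == DEMO_PAGE_SIZE);
}
```

With these definitions the two variants only start to differ at offset 0x2000, so a release_offset that always stays below PAGE_SIZE would never expose the wrong mask.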

Comment 27 Gerd Hoffmann 2018-04-17 20:44:55 UTC

Created attachment 1423252 [details]
0001-qxl-fix-qxl_release_-map-unmap.patch

Comment 28 Gerd Hoffmann 2018-04-17 20:45:42 UTC

Created attachment 1423253 [details]
0002-qxl-keep-separate-release_bo-pointer.patch

Comment 29 Christophe Fergeau 2018-04-20 14:46:50 UTC

*** Bug 1570046 has been marked as a duplicate of this bug. ***

Comment 30 Justin M. Forbes 2018-04-20 19:02:19 UTC

These patches are included in kernel-4.17.0-0.rc1.git2.1.fc29 and newer kernels. Can someone verify that they fix the issue?

Comment 31 Adam Williamson 2018-04-20 23:02:10 UTC

Things look a lot better in today's openQA compose testing indeed. A couple of tests still hit this bug, but I think they're ones which got the old kernel from a network install. I think we can call this fixed, I'll re-open if it turns out it really is still happening in future tests. Thanks guys!

Comment 32 Adam Williamson 2018-04-23 15:47:29 UTC

Unfortunately another test ran into this yesterday:

https://openqa.fedoraproject.org/tests/228191

and it hit it during anaconda, which means it was definitely with a 'fixed' kernel. However, it really does seem to be happening a lot less often now. Is there perhaps some remaining corner case where this can still happen, it's just a lot less likely?

Comment 33 Adam Williamson 2018-04-23 15:48:24 UTC

From the logs of that test:

[2018-04-22T14:26:41.0664 UTC] [debug] QEMU: (process:42029): Spice-WARNING **: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 67108864
[2018-04-22T14:26:41.0713 UTC] [debug] MATCH(login_gdm:0.00)
[2018-04-22T14:26:41.0718 UTC] [debug] MATCH(text_console_login:0.00)
[2018-04-22T14:26:41.0736 UTC] [debug] MATCH(login_sddm-20171016:0.00)
[2018-04-22T14:26:41.0736 UTC] [debug] no match: 288.5s
[2018-04-22T14:26:42.0718 UTC] [debug] MATCH(login_gdm:0.00)
[2018-04-22T14:26:42.0723 UTC] [debug] MATCH(text_console_login:0.00)
[2018-04-22T14:26:42.0740 UTC] [debug] MATCH(login_sddm-20171016:0.00)
[2018-04-22T14:26:42.0740 UTC] [debug] no match: 287.5s
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: id 1, group 1, virt start 7f829fc00000, virt end 7f82a3bfe000, generation 0, delta 7f829fc00000
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: id 2, group 1, virt start 7f829ba00000, virt end 7f829fa00000, generation 0, delta 7f829ba00000
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: 
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: (process:42029): Spice-CRITICAL **: memslot.c:111:memslot_get_virt: slot_id 54 too big, addr=364d6cff354c6cff

The 'match' and 'no match' lines are from openQA, the 'QEMU' lines are passed through from qemu.
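
As a reading aid for the CRITICAL line: the reported slot_id matches the top byte of the logged address (0x36 = 54), while the slot dump lists only ids 0 to 2. The sketch below assumes the slot id occupies the top 8 bits of the 64-bit address, which is consistent with the addresses logged in this bug but is not taken from the spice-server source; all DEMO_* and demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the slot id is stored in the top 8 bits of the address. */
#define DEMO_SLOT_ID_BITS	8
#define DEMO_SLOT_ID_SHIFT	(64 - DEMO_SLOT_ID_BITS)
/* The slot dump in the log lists ids 0, 1 and 2. */
#define DEMO_NUM_SLOTS		3u

/* Extract the slot id from a guest-supplied 64-bit address. */
unsigned int demo_slot_id(uint64_t addr)
{
	return (unsigned int)(addr >> DEMO_SLOT_ID_SHIFT);
}

/* Non-zero if the address refers to one of the configured slots. */
int demo_slot_id_valid(uint64_t addr)
{
	return demo_slot_id(addr) < DEMO_NUM_SLOTS;
}

/* The address from the log decodes to slot 54, which does not exist. */
void demo_check_logged_addr(void)
{
	assert(demo_slot_id(UINT64_C(0x364d6cff354c6cff)) == 54);
	assert(!demo_slot_id_valid(UINT64_C(0x364d6cff354c6cff)));
}
```

Under that assumption, an address like this one looks like unrelated data being interpreted as a QXL address rather than a slot number that is merely off by a few.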

Comment 34 Gerd Hoffmann 2018-04-24 09:45:12 UTC

(In reply to Adam Williamson from comment #32)
> Unfortunately another test ran into this yesterday:
> 
> https://openqa.fedoraproject.org/tests/228191
> 
> and it hit it during anaconda, which means it was definitely with a 'fixed'
> kernel. However, it really does seem to be happening a lot less often now.
> Is there perhaps some remaining corner case where this can still happen,
> it's just a lot less likely?

Probably.

Is this a kernel regression, or does it happen with older kernels too?
Does it crash in the same place every time?
The linked test case seems to crash shortly after the qxl driver loads.

Comment 35 Gwyn Ciesla 2018-04-24 13:52:17 UTC

My VM has been stable since booting to kernel-4.17.0-0.rc1.git3.1.fc29.x86_64 on the guest. Is there anything I can provide to assist?

Comment 36 Adam Williamson 2018-04-24 15:04:28 UTC

Gwyn: well, if it happens to you again, I guess that would be useful info :P It's definitely happening a *lot* less in openQA now - before this was happening to every single test that used qxl as the driver, now it's like 2% of them.

Gerd: I don't recall ever seeing this exact crash before the time I filed this bug. However, there is a similar case which has been happening occasionally for *much* longer, that's this one:

https://bugzilla.redhat.com/show_bug.cgi?id=1403343

in that case the critical error message is:

15:23:50.4844 32146 QEMU: (process:32151): Spice-CRITICAL **: display-channel.c:1666:display_channel_update: condition `validate_surface(display, surface_id)' failed

so is it possible this is somehow just that same bug but the error messages have changed, or something? If not, then I think this is new between 2018-04-02 and 2018-04-07 Rawhide composes.

Comment 37 Gerd Hoffmann 2018-05-07 09:08:00 UTC

> It's definitely happening a *lot* less in openQA now - before this was
> happening to every single test that used qxl as the driver, now it's like 2%
> of them.

Might be a host-side (qemu) issue.
https://patchwork.ozlabs.org/patch/905667/

Comment 38 Oron Peled 2018-07-09 09:18:18 UTC

Just another data point:
 * I have a fully updated f28 host
 * It has 3 fully updated VMs: f27, f28, rawhide
 * AFAICS, the phenomenon only appears on the rawhide VM.
 * Most of the time these VMs are down; I boot them up 1-2 times a week for updates. These error messages appear every time on rawhide boot (and keep going endlessly)

It's not new (many months, I don't remember how many), but I didn't notice other problems, so I expected it would be fixed in one of the kernel updates...

Comment 39 Oron Peled 2018-07-24 21:37:04 UTC

In the last week or two I no longer see these messages in the rawhide VM.
The VM kernels are (the latest is current):
  [root@rawhide boot]# ls -l vmlinuz-4.* | cut -c33-
  Jul 13 19:17 vmlinuz-4.18.0-0.rc4.git4.1.fc29.x86_64
  Jul 17 20:59 vmlinuz-4.18.0-0.rc5.git1.1.fc29.x86_64
  Jul 20 20:18 vmlinuz-4.18.0-0.rc5.git4.1.fc29.x86_64

The host kernels are (latest is running):
  [root@argon boot]# ls -l vmlinuz-4.* | cut -c33-
  Jul 10 16:55 vmlinuz-4.17.5-200.fc28.x86_64
  Jul 11 23:55 vmlinuz-4.17.6-200.fc28.x86_64
  Jul 17 19:54 vmlinuz-4.17.7-200.fc28.x86_64

So one of the changes after 2018-07-09 fixed it.

Comment 41 Niccolò Belli 2018-11-15 13:02:54 UTC

Is it possible that it reappeared with 4.19.1?

Nov 15 13:53:29 host kernel: kauditd_printk_skb: 299 callbacks suppressed
Nov 15 13:53:29 host systemd-coredump[2891]: Process 2669 (qemu-system-x86) of user 65534 dumped core.
Nov 15 13:53:29 host libvirtd[2394]: Unable to read from monitor: Connection reset by peer
Nov 15 13:53:29 host libvirtd[2394]: internal error: qemu unexpectedly closed the monitor: red_qxl_loadvm_commands: 
                                                         id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
                                                         id 1, group 1, virt start 7fb0a3600000, virt end 7fb0a75fe000, generation 0, delta 7fb0a3600000
                                                         id 2, group 1, virt start 7fb09f200000, virt end 7fb0a3200000, generation 0, delta 7fb09f200000
                                                         
                                                         (process:2669): Spice-CRITICAL **: 13:53:29.361: memslot.c:111:memslot_get_virt: slot_id 172 too big, addr=ac50000000000000
Nov 15 13:53:30 host libvirtd[2394]: operation failed: domain is not running
Nov 15 13:53:30 host libvirtd[2394]: internal error: unexpected async job 6 type expected 0
Nov 15 13:53:30 host libvirtd[2394]: Unable to restore from managed state /var/lib/libvirt/qemu/save/win2k19.save. Maybe the file is corrupted?
Nov 15 13:53:30 host libvirtd[2394]: internal error: Failed to autostart VM 'win2k19': operation failed: domain is not running

Comment 42 Christophe Fergeau 2018-11-15 16:30:35 UTC

This looks like a migration crash, it's different from that bug. It would be good if you could get a backtrace of the crash though.

Comment 43 Adam Williamson 2021-05-03 15:49:03 UTC

Haven't seen this in some time AFAIR.
