Since qemu 2.12.0 rc2 - qemu-2.12.0-0.6.rc2.fc29 - landed in Fedora Rawhide, just about all of our openQA-automated tests of Rawhide guests which run with qxl / SPICE graphics in the guest have died partway in, always shortly after the test switches from the installer (an X environment) to a console on a tty. qemu is, I think, hanging. There are always some errors like this right around the time of the hang:

[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: id 1, group 1, virt start 7f42dbc00000, virt end 7f42dfbfe000, generation 0, delta 7f42dbc00000
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: id 2, group 1, virt start 7f42d7a00000, virt end 7f42dba00000, generation 0, delta 7f42d7a00000
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU:
[2018-04-09T20:13:42.0736 UTC] [debug] QEMU: (process:45812): Spice-CRITICAL **: memslot.c:111:memslot_get_virt: slot_id 218 too big, addr=da8e21fbda8e21fb

or occasionally like this:

[2018-04-09T20:13:58.0717 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: id 1, group 1, virt start 7ff093c00000, virt end 7ff097bfe000, generation 0, delta 7ff093c00000
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: id 2, group 1, virt start 7ff08fa00000, virt end 7ff093a00000, generation 0, delta 7ff08fa00000
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU:
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: (process:25622): Spice-WARNING **: memslot.c:68:memslot_validate_virt: virtual address out of range
[2018-04-09T20:13:58.0720 UTC] [debug] QEMU: virt=0x0+0x18 slot_id=0 group_id=1
[2018-04-09T20:13:58.0721 UTC] [debug] QEMU: slot=0x0-0x0 delta=0x0
[2018-04-09T20:13:58.0721 UTC] [debug] QEMU:
[2018-04-09T20:13:58.0721 UTC] [debug] QEMU: (process:25622): Spice-WARNING **: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 1048576
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: id 1, group 1, virt start 7ff093c00000, virt end 7ff097bfe000, generation 0, delta 7ff093c00000
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: id 2, group 1, virt start 7ff08fa00000, virt end 7ff093a00000, generation 0, delta 7ff08fa00000
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU:
[2018-04-09T20:14:14.0728 UTC] [debug] QEMU: (process:25622): Spice-CRITICAL **: memslot.c:122:memslot_get_virt: address generation is not valid, group_id 1, slot_id 0, gen 110, slot_gen 0

The same tests running on Fedora 28 guests on the same hosts are not hanging, and the same tests were not hanging right before the qemu package got updated, so this seems very strongly tied to the new qemu.

This is a downstream copy of https://bugs.launchpad.net/qemu/+bug/1762558 ; I wanted to have the bug tracked in both places to make sure the RH virt folks see it, and for openQA linking and tracking of getting the fix landed in Rawhide.
Before anyone asks, SPICE has not been changed since last year.
Bug 1540919 is unlikely to be a duplicate; it is a live migration issue and I don't think Fedora QA does that. Bug 1520729 looks similar, but is pretty hard to reproduce. So maybe something changed in qemu to trigger this more easily now. Any chance these crashes happen with older qemu too, just with much lower frequency? Couldn't trigger it on my workstation though. Tried wayland, xorg + qxl, xorg + modesetting. All running fine. Hmm. Any chance you can run qemu with tracing enabled in autoqa (see bug 1520729 comment 18)?
"Couldn't trigger it on my workstation though. Tried wayland, xorg + qxl, xorg + modesetting. All running fine. Hmm." Did you try booting an installer image to the graphical installer, then switching to a tty? That seems to be the consistent trigger for the failure. "Any chance these crashes happen with older qemu too, just with much lower frequency?" I mean, it's *possible*, but I definitely don't recall ever seeing it before, and I look at a lot of failures. "Any chance you can run qemu with tracing enabled in autoqa (see bug 1520729 comment 18)?" Yeah, I can tweak os-autoinst to do that, I think. I'll try it and get back to you.
You know...it occurs to me that I'm *clearly* not thinking straight in assigning this to qemu based on the Rawhide qemu package, because we're not *using* that. Sigh. I'm an idiot. It's definitely to do with something in Rawhide, but that something is very unlikely to be qemu - the qemu we run is from the worker host environment, not from the image tested, of course. So the fact that this is only happening in tests of Rawhide images basically rules out stuff that comes from the host environment. The thing that changed must be something in the guest environment, somehow. I'll have to look at what else changed between 2018-04-02 and 2018-04-07. Leaving this assigned to qemu for now just because it *does* involve qemu/spice somehow and I don't know what the real trigger is yet, but I was certainly wrong to suggest the Rawhide update to qemu was the trigger. D'oh.
I'm actually able to reproduce this on my F28 test system by simply running:

/usr/bin/qemu-kvm -soundhw ac97 -global isa-fdc.driveA= -vga qxl -m 2048 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -drive media=cdrom,if=none,id=cd0,format=raw,file=/share/data/isos/29/nightlies/Fedora-AtomicWorkstation-ostree-x86_64-Rawhide-20180409.n.0.iso -device scsi-cd,drive=cd0 -boot once=d,menu=on,splash-time=5000 -device usb-ehci -device usb-tablet -smp 2 -enable-kvm -no-shutdown

Which results in this, among other errors:

(qemu-system-x86_64:9441): Spice-WARNING **: 11:44:39.330: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 67108864
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7feb1bc00000, virt end 7feb1fbfe000, generation 0, delta 7feb1bc00000
id 2, group 1, virt start 7feb17a00000, virt end 7feb1ba00000, generation 0, delta 7feb17a00000
(qemu-system-x86_64:9441): Spice-WARNING **: 11:44:41.197: memslot.c:68:memslot_validate_virt: virtual address out of range
virt=0x0+0x18 slot_id=0 group_id=1
slot=0x0-0x0 delta=0x0
(qemu-system-x86_64:9441): Spice-WARNING **: 11:44:41.197: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 524288
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7feb1bc00000, virt end 7feb1fbfe000, generation 0, delta 7feb1bc00000
id 2, group 1, virt start 7feb17a00000, virt end 7feb1ba00000, generation 0, delta 7feb17a00000
(qemu-system-x86_64:9441): Spice-CRITICAL **: 11:44:55.050: memslot.c:122:memslot_get_virt: address generation is not valid, group_id 1, slot_id 0, gen 7, slot_gen 0

and then a backtrace. I'll try again with tracing and attach full outputs.
The ISO used in the command can be found at https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180409.n.0/compose/AtomicWorkstation/x86_64/iso/Fedora-AtomicWorkstation-ostree-x86_64-Rawhide-20180409.n.0.iso - any installer image from that compose would likely trigger the bug, though.
I'm about to reboot to try the tracing thing (since it requires my booted kernel and kernel-devel package to be in sync, sigh). Meanwhile, one obvious candidate for what changed in Rawhide guests between 2018-04-02 and 2018-04-07 is the kernel - it went from kernel-4.16.0-0.rc7.git1.1.fc29 to kernel-4.17.0-0.rc0.git4.1.fc29 . The other obvious suspect, given that the installer environment still runs on X not Wayland, is that the X server got a bump from xorg-x11-server-1.19.6-5.fc28 to xorg-x11-server-1.19.99.903-1.fc29 (1.20 RC3). xorg-x11-drv-qxl was rebuilt for the server bump.
Can't get the tracing to work:

[root@adam adamw]# stap -e 'probe qemu.kvm.simpletrace.qxl* {}' -x 20177 >/tmp/trace
semantic error: while resolving probe point: identifier 'qemu' at <input>:1:7
        source: probe qemu.kvm.simpletrace.qxl* {}
                      ^
semantic error: probe point mismatch (similar: system, user): identifier 'kvm' at :1:12
        source: probe qemu.kvm.simpletrace.qxl* {}
                           ^
Pass 2: analysis failed. [man error::pass2]

I'm having trouble getting it to *crash* again locally, but it is reliably broken - basically, it seems to render extremely slowly once you get into X, if using qxl graphics. The anaconda 'welcome' screen appears without the quit / continue buttons, then it takes several minutes before they appear. After clicking on 'Continue', it takes several minutes for the 'this is a pre-release' warning dialog to appear. After clicking that it takes ages for it to go away and the hub to show up, etc.

Booting with -vga virtio or -vga std, this doesn't happen at all, everything renders quite fast.
When using qxl, errors like this appear over and over:

(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:07.686: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 64
(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.409: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 232
(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.730: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 232
(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.731: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 255
(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.740: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 232
(qemu-system-x86_64:20938): Spice-WARNING **: 16:00:17.742: red-parse-qxl.c:1109:red_get_native_drawable: unknown type 255

and occasionally stuff like:

(qemu-system-x86_64:20938): Spice-WARNING **: 15:59:32.892: memslot.c:68:memslot_validate_virt: virtual address out of range
virt=0x0+0x18 slot_id=0 group_id=1
slot=0x0-0x0 delta=0x0

I'm now building test images based on the 20180402.n.0 compose but with various single packages updated, trying to isolate the cause of the problem.
A test image based on 20180402.n.0 with the xorg-x11-server update included works fine. A test image based on 20180402.n.0 with the kernel update included displays the buggy behaviour. Based on this, I'm re-assigning this to the kernel. Will try a test with the latest kernel (there've been a couple of builds in koji lately) later.
Also affects a test image based on 20180402.n.0 with kernel-4.17.0-0.rc0.git7.1.fc29 .
So the only thing that appears to have changed in the qxl driver between kernel 4.16rc7 and kernel 4.17rc4 is this:

https://patchwork.freedesktop.org/patch/211552/

Well, that makes things easy! Let's just revert that patch, and...

...crap. It's still broken.

So this still looks like something in the kernel, but not the *obvious* thing. Something else that changed in the kernel broke this. But I don't know what.

My results are reproducible, I've checked. I still have all the test ISOs: the x11 test ISO still works fine, and all the ISOs with updated kernels (including the one with kernel-4.17.0-0.rc0.git7.1.fc29 but with the above change reverted) still don't work right.
It kinda looks to me like basically SPICE is getting unexpected results when it deals with these here 'memslot' things: https://github.com/SPICE/spice/blob/master/server/memslot.c but I really don't have the expertise at this point to figure out what bit is putting stuff into said slots for SPICE to pull out and try and do stuff with (the stuff that's causing these errors), or why it's now possibly doing something spice isn't expecting...
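One thing the logged values do show, offered as a rough sketch and not as anything taken from the SPICE sources: in the "slot_id N too big" messages in this report, N matches the top byte of the logged address (0xda is 218 for addr=da8e21fbda8e21fb). The snippet below assumes the slot ID sits in the top 8 bits of a 64-bit guest address; `SLOT_ID_BITS` and `slot_id_from_addr` are illustrative names, not SPICE identifiers.

```python
# Assumption: the slot ID is the top 8 bits of a 64-bit address.
# This width is inferred from the values logged in this report.
SLOT_ID_BITS = 8


def slot_id_from_addr(addr, id_bits=SLOT_ID_BITS):
    """Return the value of the top id_bits of a 64-bit address."""
    return (addr & 0xFFFFFFFFFFFFFFFF) >> (64 - id_bits)


# (address, slot_id) pairs copied from the Spice-CRITICAL lines in this report.
logged = [
    (0xDA8E21FBDA8E21FB, 218),
    (0x364D6CFF354C6CFF, 54),
    (0xAC50000000000000, 172),
]

for addr, reported in logged:
    decoded = slot_id_from_addr(addr)
    print(f"addr={addr:016x} decoded slot_id={decoded} reported slot_id={reported}")
```

If that assumption holds, "slot_id too big" would just mean that the top byte of whatever value SPICE was handed is not a valid slot number, which would fit the value not being a real qxl address at all.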
Seems the host gets corrupted qxl commands from the guest. Can you try without xorg-x11-drv-qxl and see what happens then?
(In reply to Gerd Hoffmann from comment #12)
> Seems the host gets corrupted qxl commands from the guest.
> Can you try without xorg-x11-drv-qxl and see what happens then?

Ok, scratch that, most likely it is something in the kernel. Installed the rawhide kernel on my F27 guest, now I see all kinds of qxl issues too. That is with wayland, so xorg is not involved.

F27 kernel (4.15.15-300.fc27) is fine.
Self-compiled 4.16.2 is fine too.
Rawhide (4.17.0-0.rc0.git7.1.fc29) is broken.

> https://patchwork.freedesktop.org/patch/211552/

There is more ...

kraxel@sirius ~/projects/linux (master)# git log --oneline v4.16..master -- drivers/gpu/drm/qxl
1c7095d283 Merge airlied/drm-next into drm-misc-next
2793c1d77a drm/qxl: Replace drm_gem_object_reference/unreference() with _get/put()
dde5da2379 drm/ttm: add bo as parameter to the ttm_tt_create callback
724daa4fd6 drm/ttm: drop persistent_swap_storage from ttm_bo_init and co
231cdafc75 drm/ttm: drop ttm->dummy_read_page
3839263362 drm/ttm: drop bo->glob
2a7b464f84 drm/qxl: remove ttm_pool_* wrappers
74c0167f8b Merge drm-next into drm-intel-next-queued
54156da893 Merge airlied/drm-next into drm-misc-next

Probably one of the ttm patches. I'll go for a stupid bisect though to figure it out, stay tuned.
bisect didn't have any useful results. And now I ended up with two 4.16.2 kernels (different configs, different compilers) where one works and one doesn't, so it isn't a code difference ...
Oh fun :( So the smallest delta I currently have is the one I mentioned above: kernel-4.16.0-0.rc7.git1.1.fc29 to kernel-4.17.0-0.rc0.git4.1.fc29 Here are the differences between the packages they were built with: -DEBUG util.py:439: fedora-release noarch 29-0.1 build 26 k +DEBUG util.py:439: fedora-release noarch 29-0.2 build 26 k @@ -13 +13 @@ -DEBUG util.py:439: info x86_64 6.5-3.fc28 build 197 k +DEBUG util.py:439: info x86_64 6.5-4.fc29 build 197 k @@ -18 +18 @@ -DEBUG util.py:439: sed x86_64 4.4-7.fc29 build 290 k +DEBUG util.py:439: sed x86_64 4.5-1.fc29 build 297 k @@ -25 +25 @@ -DEBUG util.py:439: annobin x86_64 5.1-1.fc29 build 67 k +DEBUG util.py:439: annobin x86_64 5.2-1.fc29 build 68 k @@ -52 +52 @@ -DEBUG util.py:439: gdb-headless x86_64 8.1-11.fc29 build 3.6 M +DEBUG util.py:439: gdb-headless x86_64 8.1-14.fc29 build 3.6 M @@ -67 +67 @@ -DEBUG util.py:439: kernel-headers x86_64 4.16.0-0.rc7.git0.1.fc29 build 1.2 M +DEBUG util.py:439: kernel-headers x86_64 4.17.0-0.rc0.git1.1.fc29 build 1.2 M @@ -69 +69 @@ -DEBUG util.py:439: krb5-libs x86_64 1.16-17.fc29 build 874 k +DEBUG util.py:439: krb5-libs x86_64 1.16-20.fc29 build 874 k @@ -88 +88 @@ -DEBUG util.py:439: libidn2 x86_64 2.0.4-3.fc28 build 99 k +DEBUG util.py:439: libidn2 x86_64 2.0.4-7.fc29 build 73 k @@ -95 +95 @@ -DEBUG util.py:439: libpkgconf x86_64 1.4.1-3.fc28 build 33 k +DEBUG util.py:439: libpkgconf x86_64 1.4.2-1.fc29 build 34 k @@ -115 +115 @@ -DEBUG util.py:439: libxml2 x86_64 2.9.7-4.fc29 build 694 k +DEBUG util.py:439: libxml2 x86_64 2.9.8-1.fc29 build 693 k @@ -127 +127 @@ -DEBUG util.py:439: openssl-libs x86_64 1:1.1.0g-6.fc29 build 1.3 M +DEBUG util.py:439: openssl-libs x86_64 1:1.1.0h-3.fc29 build 1.3 M @@ -134,3 +134,3 @@ -DEBUG util.py:439: pkgconf x86_64 1.4.1-3.fc28 build 37 k -DEBUG util.py:439: pkgconf-m4 noarch 1.4.1-3.fc28 build 16 k -DEBUG util.py:439: pkgconf-pkg-config x86_64 1.4.1-3.fc28 build 14 k +DEBUG util.py:439: pkgconf x86_64 1.4.2-1.fc29 build 37 k +DEBUG 
util.py:439: pkgconf-m4 noarch 1.4.2-1.fc29 build 16 k +DEBUG util.py:439: pkgconf-pkg-config x86_64 1.4.2-1.fc29 build 14 k @@ -139,2 +139,2 @@ -DEBUG util.py:439: python-srpm-macros noarch 3-26.fc28 build 10 k -DEBUG util.py:439: python3-libs x86_64 3.6.4-20.fc29 build 7.9 M +DEBUG util.py:439: python-srpm-macros noarch 3-28.fc29 build 11 k +DEBUG util.py:439: python3-libs x86_64 3.6.5-1.fc29 build 7.8 M @@ -142 +142 @@ -DEBUG util.py:439: readline x86_64 7.0-9.fc29 build 219 k +DEBUG util.py:439: readline x86_64 7.0-10.fc29 build 198 k @@ -160 +160 @@ -DEBUG util.py:439: git x86_64 2.17.0-0.2.rc2.fc29 build 219 k +DEBUG util.py:439: git x86_64 2.17.0-1.fc29 build 219 k @@ -166,2 +166,2 @@ -DEBUG util.py:439: openssl x86_64 1:1.1.0g-6.fc29 build 578 k -DEBUG util.py:439: openssl-devel x86_64 1:1.1.0g-6.fc29 build 1.9 M +DEBUG util.py:439: openssl x86_64 1:1.1.0h-3.fc29 build 580 k +DEBUG util.py:439: openssl-devel x86_64 1:1.1.0h-3.fc29 build 1.9 M @@ -177,2 +177,2 @@ -DEBUG util.py:439: device-mapper x86_64 1.02.146-4.fc28 build 365 k -DEBUG util.py:439: device-mapper-libs x86_64 1.02.146-4.fc28 build 396 k +DEBUG util.py:439: device-mapper x86_64 1.02.146-5.fc29 build 365 k +DEBUG util.py:439: device-mapper-libs x86_64 1.02.146-5.fc29 build 396 k @@ -184,2 +184,2 @@ -DEBUG util.py:439: git-core x86_64 2.17.0-0.2.rc2.fc29 build 4.0 M -DEBUG util.py:439: git-core-doc noarch 2.17.0-0.2.rc2.fc29 build 2.3 M +DEBUG util.py:439: git-core x86_64 2.17.0-1.fc29 build 4.0 M +DEBUG util.py:439: git-core-doc noarch 2.17.0-1.fc29 build 2.3 M @@ -191 +191 @@ -DEBUG util.py:439: krb5-devel x86_64 1.16-17.fc29 build 542 k +DEBUG util.py:439: krb5-devel x86_64 1.16-20.fc29 build 542 k @@ -196 +196 @@ -DEBUG util.py:439: libkadm5 x86_64 1.16-17.fc29 build 180 k +DEBUG util.py:439: libkadm5 x86_64 1.16-20.fc29 build 180 k @@ -199 +199 @@ -DEBUG util.py:439: libsecret x86_64 0.18.5-7.fc28 build 159 k +DEBUG util.py:439: libsecret x86_64 0.18.6-1.fc29 build 161 k @@ -209,2 +209,2 
@@ -DEBUG util.py:439: openssh x86_64 7.6p1-7.fc29 build 505 k -DEBUG util.py:439: openssh-clients x86_64 7.6p1-7.fc29 build 679 k +DEBUG util.py:439: openssh x86_64 7.7p1-1.fc29 build 483 k +DEBUG util.py:439: openssh-clients x86_64 7.7p1-1.fc29 build 683 k @@ -228 +228 @@ -DEBUG util.py:439: perl-Git noarch 2.17.0-0.2.rc2.fc29 build 74 k +DEBUG util.py:439: perl-Git noarch 2.17.0-1.fc29 build 74 k @@ -256 +256 @@ -DEBUG util.py:439: python3 x86_64 3.6.4-20.fc29 build 71 k +DEBUG util.py:439: python3 x86_64 3.6.5-1.fc29 build 71 k Not at all sure which of those it's likely to be (if it's any of these and not something even more odd), but annobin and gdb seem at least plausible candidates?
compiler seems not to be the difference. Have two kernel configs, one good, one bad. 4.16.2 kernel, f27 gcc.
Created attachment 1421455 [details] good config
Created attachment 1421456 [details] bad config
Created attachment 1421457 [details] bad config (for real this time)
I don't see the same differences in the kernel configs for the relevant official kernel builds, though? There are definitely differences between the 4.16.0-0.rc7.git1.1 and 4.17.0-0.rc0.git4.1 configs, but they don't look like *those* differences... Here's the diff between the x86_64 configs for those two kernel builds: [adamw@adam kernel (master %)]$ diff -u0 kernel-4.15.fc29/linux-4.16.0-0.rc7.git1.1.fc29.x86_64/configs/kernel-4.16.0-x86_64.config kernel-4.16.fc29/linux-4.17.0-0.rc0.git4.1.fc29.x86_64/configs/kernel-4.17.0-x86_64.config --- kernel-4.15.fc29/linux-4.16.0-0.rc7.git1.1.fc29.x86_64/configs/kernel-4.16.0-x86_64.config 2018-04-13 10:43:54.924453231 -0700 +++ kernel-4.16.fc29/linux-4.17.0-0.rc0.git4.1.fc29.x86_64/configs/kernel-4.17.0-x86_64.config 2018-04-13 10:49:21.473930901 -0700 @@ -4 +4 @@ -# Linux/x86_64 4.16.0-rc7 Kernel Configuration +# Linux/x86_64 4.16.0 Kernel Configuration @@ -118 +117,0 @@ -# CONFIG_NO_HZ_FULL_ALL is not set @@ -214 +212,0 @@ -# CONFIG_SYSCTL_SYSCALL is not set @@ -245 +242,0 @@ -# CONFIG_PC104 is not set @@ -254 +250,0 @@ -# CONFIG_SLUB_MEMCG_SYSFS_ON is not set @@ -276 +271,0 @@ -# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set @@ -322 +316,0 @@ -CONFIG_THIN_ARCHIVES=y @@ -343,2 +336,0 @@ -# CONFIG_HAVE_ARCH_HASH is not set -# CONFIG_ISA_BUS_API is not set @@ -347 +338,0 @@ -# CONFIG_CPU_NO_EFFICIENT_FFS is not set @@ -350,2 +340,0 @@ -# CONFIG_ARCH_OPTIONAL_KERNEL_RWX is not set -# CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT is not set @@ -356 +344,0 @@ -CONFIG_ARCH_HAS_PHYS_TO_DMA=y @@ -365 +352,0 @@ -# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set @@ -465 +451,0 @@ -CONFIG_X86_FAST_FEATURE_TESTS=y @@ -555 +540,0 @@ -# CONFIG_VM86 is not set @@ -695,0 +681 @@ +CONFIG_DYNAMIC_MEMORY_LAYOUT=y @@ -752,0 +739 @@ +CONFIG_ACPI_TAD=m @@ -764 +750,0 @@ -# CONFIG_ACPI_CUSTOM_DSDT is not set @@ -776 +761,0 @@ -# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set @@ -842 +826,0 @@ -# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set @@ -852,0 +837 
@@ +CONFIG_MMCONF_FAM10H=y @@ -941 +925,0 @@ -# CONFIG_HAVE_AOUT is not set @@ -989,0 +974 @@ +CONFIG_IP_MROUTE_COMMON=y @@ -1126,2 +1111,2 @@ -CONFIG_NF_TABLES_INET=m -CONFIG_NF_TABLES_NETDEV=m +CONFIG_NF_TABLES_INET=y +CONFIG_NF_TABLES_NETDEV=y @@ -1313 +1298 @@ -CONFIG_NF_TABLES_IPV4=m +CONFIG_NF_TABLES_IPV4=y @@ -1318 +1303 @@ -CONFIG_NF_TABLES_ARP=m +CONFIG_NF_TABLES_ARP=y @@ -1361 +1346 @@ -CONFIG_NF_TABLES_IPV6=m +CONFIG_NF_TABLES_IPV6=y @@ -1396 +1381 @@ -CONFIG_NF_TABLES_BRIDGE=m +CONFIG_NF_TABLES_BRIDGE=y @@ -1436,0 +1422 @@ +CONFIG_TIPC_DIAG=m @@ -1559,0 +1546 @@ +CONFIG_NET_EMATCH_IPT=m @@ -1748 +1734,0 @@ -CONFIG_BT_HCIBTUART=m @@ -1752,0 +1739 @@ +CONFIG_BT_HCIRSI=m @@ -1782 +1768,0 @@ -# CONFIG_MAC80211_RC_MINSTREL_VHT is not set @@ -1874 +1859,0 @@ -# CONFIG_GENERIC_CPU_DEVICES is not set @@ -1925,3 +1909,0 @@ -# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set -# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set -# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set @@ -1930,2 +1911,0 @@ -# CONFIG_MTD_CFI_I4 is not set -# CONFIG_MTD_CFI_I8 is not set @@ -1981 +1960,0 @@ -# CONFIG_PARPORT_GSC is not set @@ -2003 +1981,0 @@ -# CONFIG_BLK_DEV_COW_COMMON is not set @@ -2145,4 +2122,0 @@ -# CONFIG_CXL_BASE is not set -# CONFIG_CXL_AFU_DRIVER_OPS is not set -# CONFIG_CXL_LIB is not set -# CONFIG_OCXL_BASE is not set @@ -2253,2 +2226,0 @@ -# CONFIG_SCSI_EATA is not set -# CONFIG_SCSI_FUTURE_DOMAIN is not set @@ -2300 +2271,0 @@ -# CONFIG_ATA_NONSTANDARD is not set @@ -2544,0 +2516 @@ +CONFIG_NET_DSA_MV88E6XXX_PTP=y @@ -2591,2 +2562,0 @@ -CONFIG_B44_PCICORE_AUTOSELECT=y -CONFIG_B44_PCI=y @@ -2665,0 +2636 @@ +CONFIG_ICE=m @@ -2716,0 +2688 @@ +# CONFIG_NET_VENDOR_NI is not set @@ -2777 +2748,0 @@ -# CONFIG_SMSC911X_ARCH_HOOKS is not set @@ -2980 +2950,0 @@ -CONFIG_B43_PCICORE_AUTOSELECT=y @@ -2993 +2962,0 @@ -CONFIG_B43LEGACY_PCICORE_AUTOSELECT=y @@ -3140,0 +3110 @@ +CONFIG_RSI_COEX=y @@ -3179,0 +3150 @@ +CONFIG_IEEE802154_MCR20A=m @@ -3287,2 +3257,0 @@ -# CONFIG_GIGASET_I4L is not set -# 
CONFIG_GIGASET_DUMMYLL is not set @@ -3424,0 +3394 @@ +CONFIG_JOYSTICK_PXRC=m @@ -3635 +3604,0 @@ -# CONFIG_SERIAL_8250_FSL is not set @@ -3795 +3763,0 @@ -# CONFIG_I2C_PXA_PCI is not set @@ -3853 +3821 @@ -CONFIG_PPS=m +CONFIG_PPS=y @@ -3871 +3839 @@ -CONFIG_PTP_1588_CLOCK=m +CONFIG_PTP_1588_CLOCK=y @@ -3917,0 +3886,2 @@ +# CONFIG_GPIO_WINBOND is not set +# CONFIG_GPIO_WS16C48 is not set @@ -4251,0 +4222 @@ +# CONFIG_EBC_C384_WDT is not set @@ -4313,2 +4283,0 @@ -CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y -CONFIG_SSB_DRIVER_PCICORE=y @@ -4411 +4379,0 @@ -# CONFIG_MFD_TMIO is not set @@ -4425,0 +4394 @@ +# CONFIG_REGULATOR_88PG86X is not set @@ -4471,0 +4441 @@ +CONFIG_IR_IMON_DECODER=m @@ -4475,0 +4446 @@ +CONFIG_IR_IMON_RAW=m @@ -4929,0 +4901,5 @@ + +# +# Media SPI Adapters +# +CONFIG_CXD2880_SPI_DRV=m @@ -5103 +5078,0 @@ -CONFIG_DVB_SP2=m @@ -5114,0 +5090,6 @@ +# Common Interface (EN50221) controller drivers +# +CONFIG_DVB_CXD2099=m +CONFIG_DVB_SP2=m + +# @@ -5117 +5097,0 @@ -# CONFIG_DVB_DUMMY_FE is not set @@ -5221 +5200,0 @@ -# CONFIG_DRM_LIB_RANDOM is not set @@ -5230 +5208,0 @@ -# CONFIG_FB_DDC is not set @@ -5235 +5212,0 @@ -# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set @@ -5239 +5215,0 @@ -# CONFIG_FB_PROVIDE_GET_FB_UNMAPPED_AREA is not set @@ -5243,2 +5218,0 @@ -# CONFIG_FB_SVGALIB is not set -# CONFIG_FB_MACMODES is not set @@ -5393 +5366,0 @@ -# CONFIG_SND_OPL4_LIB_SEQ is not set @@ -5546,0 +5520 @@ +CONFIG_SND_SOC_AMD_CZ_DA7219MX98357_MACH=m @@ -5591,0 +5566 @@ +CONFIG_SND_SOC_INTEL_CHT_BSW_NAU8824_MACH=m @@ -5601,0 +5577 @@ +CONFIG_SND_SOC_INTEL_KBL_DA7219_MAX98357A_MACH=m @@ -5620 +5596 @@ -# CONFIG_SND_SOC_ADAU7002 is not set +CONFIG_SND_SOC_ADAU7002=m @@ -5621,0 +5598 @@ +CONFIG_SND_SOC_AK4458=m @@ -5625,0 +5603 @@ +CONFIG_SND_SOC_AK5558=m @@ -5626,0 +5605 @@ +CONFIG_SND_SOC_BD28623=m @@ -5660,0 +5640 @@ +CONFIG_SND_SOC_MAX9867=m @@ -5665,0 +5646,2 @@ +CONFIG_SND_SOC_PCM1789=m +CONFIG_SND_SOC_PCM1789_I2C=m @@ -5681 +5662,0 @@ -# 
CONFIG_SND_SOC_RT5514_SPI_BUILTIN is not set @@ -5706,0 +5688 @@ +CONFIG_SND_SOC_TDA7419=m @@ -5738,0 +5721 @@ +CONFIG_SND_SOC_MAX9759=m @@ -5782,0 +5766 @@ +CONFIG_HID_ELAN=m @@ -5789,0 +5774 @@ +# CONFIG_HID_GOOGLE_HAMMER is not set @@ -5914 +5898,0 @@ -CONFIG_USB_ISP1362_HCD=m @@ -6093,0 +6078,6 @@ + +# +# USB Type-C Multiplexer/DeMultiplexer Switch support +# +CONFIG_TYPEC_MUX_PI3USB30532=m +CONFIG_USB_ROLES_INTEL_XHCI=m @@ -6095,0 +6086 @@ +CONFIG_USB_ROLE_SWITCH=m @@ -6186,0 +6178 @@ +CONFIG_LEDS_MLXREG=m @@ -6507 +6498,0 @@ -# CONFIG_IRDA is not set @@ -6597,4 +6587,0 @@ - -# -# Triggers - standalone -# @@ -6611 +6597,0 @@ -# CONFIG_DVB_CXD2099 is not set @@ -6635,0 +6622 @@ +# CONFIG_MTK_MMC is not set @@ -6731 +6717,0 @@ -# CONFIG_COMMON_CLK_NXP is not set @@ -6733,2 +6718,0 @@ -# CONFIG_COMMON_CLK_PXA is not set -# CONFIG_COMMON_CLK_PIC32 is not set @@ -6743,5 +6726,0 @@ -# CONFIG_ATMEL_PIT is not set -# CONFIG_SH_TIMER_CMT is not set -# CONFIG_SH_TIMER_MTU2 is not set -# CONFIG_SH_TIMER_TMU is not set -# CONFIG_EM_TIMER_STI is not set @@ -6807 +6785,0 @@ -# CONFIG_SUNXI_SRAM is not set @@ -7088,0 +7067 @@ +CONFIG_LV0104CS=m @@ -7144,0 +7124 @@ +CONFIG_AD5272=m @@ -7147,0 +7128 @@ +CONFIG_MCP4018=m @@ -7195,0 +7177 @@ +CONFIG_MLX90632=m @@ -7222 +7203,0 @@ -# CONFIG_ARM_GIC_V3_ITS is not set @@ -7225,10 +7205,0 @@ -# CONFIG_RESET_ATH79 is not set -# CONFIG_RESET_AXS10X is not set -# CONFIG_RESET_BERLIN is not set -# CONFIG_RESET_IMX7 is not set -# CONFIG_RESET_LANTIQ is not set -# CONFIG_RESET_LPC18XX is not set -# CONFIG_RESET_MESON is not set -# CONFIG_RESET_PISTACHIO is not set -# CONFIG_RESET_SIMPLE is not set -# CONFIG_RESET_SUNXI is not set @@ -7236,2 +7206,0 @@ -# CONFIG_RESET_ZYNQ is not set -# CONFIG_RESET_TEGRA_BPMP is not set @@ -7282 +7251,5 @@ -CONFIG_NVMEM=m +CONFIG_NVMEM=y + +# +# HW tracing support +# @@ -7286,4 +7258,0 @@ -CONFIG_FSI=m -CONFIG_FSI_MASTER_GPIO=m -CONFIG_FSI_MASTER_HUB=m -CONFIG_FSI_SCOM=m @@ -7764,0 +7734,3 @@ 
+CONFIG_LOCK_DEBUGGING_SUPPORT=y +CONFIG_PROVE_LOCKING=y +CONFIG_LOCK_STAT=y @@ -7768 +7740,2 @@ -# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set +CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y +CONFIG_DEBUG_RWSEMS=y @@ -7770 +7742,0 @@ -CONFIG_PROVE_LOCKING=y @@ -7772 +7743,0 @@ -CONFIG_LOCK_STAT=y @@ -7910 +7880,0 @@ -# CONFIG_ARCH_WANTS_UBSAN_NO_NULL is not set @@ -8056 +8025,0 @@ -CONFIG_CRYPTO_ABLK_HELPER=m @@ -8073,0 +8043 @@ +CONFIG_CRYPTO_CFB=m @@ -8157,0 +8128,2 @@ +CONFIG_CRYPTO_SM4=m +CONFIG_CRYPTO_SPECK=m @@ -8194 +8165,0 @@ -# CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_DESC is not set @@ -8209,0 +8181 @@ +CONFIG_CRYPTO_DEV_CHELSIO_TLS=m @@ -8261 +8232,0 @@ -# CONFIG_HAVE_ARCH_BITREVERSE is not set @@ -8286 +8256,0 @@ -# CONFIG_AUDIT_ARCH_COMPAT_GENERIC is not set @@ -8330 +8300 @@ -# CONFIG_DMA_DIRECT_OPS is not set +CONFIG_DMA_DIRECT_OPS=y @@ -8352 +8321,0 @@ -# CONFIG_SG_SPLIT is not set
One of the locking debug options seems to trigger it, I'll go figure which one.
Down to this:

--- dot.config.bad 2018-04-16 17:19:10.054098868 +0200
+++ dot.config.good 2018-04-16 16:57:31.774301769 +0200
@@ -4448,12 +4448,10 @@ CONFIG_SCHEDSTATS=y
 CONFIG_DEBUG_RT_MUTEXES=y
 CONFIG_DEBUG_SPINLOCK=y
 CONFIG_DEBUG_MUTEXES=y
-CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
-CONFIG_DEBUG_LOCK_ALLOC=y
+# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
+# CONFIG_DEBUG_LOCK_ALLOC is not set
 # CONFIG_PROVE_LOCKING is not set
-CONFIG_LOCKDEP=y
 # CONFIG_LOCK_STAT is not set
-# CONFIG_DEBUG_LOCKDEP is not set
 # CONFIG_DEBUG_ATOMIC_SLEEP is not set
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
 # CONFIG_LOCK_TORTURE_TEST is not set
Gerd: the only one of those pairs that appears in the Fedora package diff I posted is CONFIG_DEBUG_WW_MUTEX_SLOWPATH , so that sounds like the most likely suspect?
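The cross-check described here can be sketched as a set intersection: take the options whose value differs between the good and bad test configs, take the options whose value differs between the two Fedora kernel configs, and see what is in both. This is only an illustration; the config excerpts are cut down from the two diffs in this report (they are not full config files), `parse_config` and `changed_options` are hypothetical helper names, and treating an absent option the same as "is not set" is an assumption.

```python
def parse_config(text):
    """Map option name -> value; '# CONFIG_X is not set' is stored as 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def changed_options(old, new):
    """Option names whose value differs (absent counts as 'n', an assumption)."""
    return {name for name in old.keys() | new.keys()
            if old.get(name, "n") != new.get(name, "n")}


# Excerpts reconstructed from the diffs in this report, not full config files.
GOOD_TEST_CONFIG = """
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
"""
BAD_TEST_CONFIG = """
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
# CONFIG_PROVE_LOCKING is not set
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_LOCKDEP is not set
"""
FEDORA_OLD_CONFIG = """
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
"""
FEDORA_NEW_CONFIG = """
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
"""

test_delta = changed_options(parse_config(GOOD_TEST_CONFIG),
                             parse_config(BAD_TEST_CONFIG))
fedora_delta = changed_options(parse_config(FEDORA_OLD_CONFIG),
                               parse_config(FEDORA_NEW_CONFIG))
suspects = test_delta & fedora_delta
print(sorted(suspects))
```

With these excerpts the intersection is the single option named above. Run against full config files, the same comparison would also surface any option that changed in both places but was missed when reading the diffs by eye.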
/me looks at qxl_release_map + qxl_release_unmap. Asking myself how did that ever work?

union qxl_release_info *qxl_release_map(struct qxl_device *qdev,
                                        struct qxl_release *release)
{
        void *ptr;
        union qxl_release_info *info;
        struct qxl_bo_list *entry = list_first_entry(&release->bos, struct qxl_bo_list, tv.head);
        struct qxl_bo *bo = to_qxl_bo(entry->tv.bo);

        ptr = qxl_bo_kmap_atomic_page(qdev, bo, release->release_offset & PAGE_SIZE);
        if (!ptr)
                return NULL;
        info = ptr + (release->release_offset & ~PAGE_SIZE);
        return info;
}

s/PAGE_SIZE/PAGE_MASK/ ...
> Asking myself how did that ever work? The answer to that one seems to be "release_offset is always smaller than PAGE_SIZE". And, of course, fixing that issue didn't fix this bug.
Created attachment 1423252 [details] 0001-qxl-fix-qxl_release_-map-unmap.patch
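The s/PAGE_SIZE/PAGE_MASK/ substitution suggested for qxl_release_map can be checked with plain integer arithmetic. The sketch below models the two C expressions with Python integers; it assumes 4 KiB pages, and `split_buggy` and `split_fixed` are illustrative names that do not exist in the kernel. It shows why the original masks happened to give the right answer for small offsets and what goes wrong for a larger one.

```python
PAGE_SIZE = 4096              # assumption: 4 KiB pages, as on x86_64
PAGE_MASK = ~(PAGE_SIZE - 1)  # clears the in-page bits, like the kernel macro


def split_buggy(offset):
    """Split as the quoted code does: PAGE_SIZE is a single bit, not a mask."""
    return offset & PAGE_SIZE, offset & ~PAGE_SIZE


def split_fixed(offset):
    """Split after s/PAGE_SIZE/PAGE_MASK/: (page base, offset within page)."""
    return offset & PAGE_MASK, offset & ~PAGE_MASK


for offset in (0x18, 0xFF8, 0x2234):
    buggy_page, buggy_rest = split_buggy(offset)
    fixed_page, fixed_rest = split_fixed(offset)
    print(f"offset={offset:#x} buggy=({buggy_page:#x}, {buggy_rest:#x}) "
          f"fixed=({fixed_page:#x}, {fixed_rest:#x})")
```

For any offset below PAGE_SIZE both versions return a page base of 0 and leave the offset unchanged, which matches the explanation given above for why the code ever worked. For 0x2234 the original masks select page 0 and keep the full 0x2234 as the in-page offset, while the fixed masks give page 0x2000 and in-page offset 0x234.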
Created attachment 1423253 [details] 0002-qxl-keep-separate-release_bo-pointer.patch
*** Bug 1570046 has been marked as a duplicate of this bug. ***
These patches are included in kernel-4.17.0-0.rc1.git2.1.fc29 and newer kernels. Can someone verify that they fix the issue?
Things look a lot better in today's openQA compose testing indeed. A couple of tests still hit this bug, but I think they're ones which got the old kernel from a network install. I think we can call this fixed, I'll re-open if it turns out it really is still happening in future tests. Thanks guys!
Unfortunately another test ran into this yesterday: https://openqa.fedoraproject.org/tests/228191 and it hit it during anaconda, which means it was definitely with a 'fixed' kernel. However, it really does seem to be happening a lot less often now. Is there perhaps some remaining corner case where this can still happen, it's just a lot less likely?
From the logs of that test:

[2018-04-22T14:26:41.0664 UTC] [debug] QEMU: (process:42029): Spice-WARNING **: display-channel.c:2426:display_channel_validate_surface: invalid surface_id 67108864
[2018-04-22T14:26:41.0713 UTC] [debug] MATCH(login_gdm:0.00)
[2018-04-22T14:26:41.0718 UTC] [debug] MATCH(text_console_login:0.00)
[2018-04-22T14:26:41.0736 UTC] [debug] MATCH(login_sddm-20171016:0.00)
[2018-04-22T14:26:41.0736 UTC] [debug] no match: 288.5s
[2018-04-22T14:26:42.0718 UTC] [debug] MATCH(login_gdm:0.00)
[2018-04-22T14:26:42.0723 UTC] [debug] MATCH(text_console_login:0.00)
[2018-04-22T14:26:42.0740 UTC] [debug] MATCH(login_sddm-20171016:0.00)
[2018-04-22T14:26:42.0740 UTC] [debug] no match: 287.5s
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: id 1, group 1, virt start 7f829fc00000, virt end 7f82a3bfe000, generation 0, delta 7f829fc00000
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: id 2, group 1, virt start 7f829ba00000, virt end 7f829fa00000, generation 0, delta 7f829ba00000
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU:
[2018-04-22T14:26:43.0323 UTC] [debug] QEMU: (process:42029): Spice-CRITICAL **: memslot.c:111:memslot_get_virt: slot_id 54 too big, addr=364d6cff354c6cff

The 'match' and 'no match' lines are from openQA, the 'QEMU' lines are passed through from qemu.
(In reply to Adam Williamson from comment #32)
> Unfortunately another test ran into this yesterday:
>
> https://openqa.fedoraproject.org/tests/228191
>
> and it hit it during anaconda, which means it was definitely with a 'fixed'
> kernel. However, it really does seem to be happening a lot less often now.
> Is there perhaps some remaining corner case where this can still happen,
> it's just a lot less likely?

Probably.

Is this a kernel regression, or does it happen with older kernels too?
Does it crash in the same place every time?
The linked test case seems to crash shortly after qxl driver load.
My VM has been stable since booting to kernel-4.17.0-0.rc1.git3.1.fc29.x86_64 on the guest. Is there anything I can provide to assist?
Gwyn: well, if it happens to you again, I guess that would be useful info :P It's definitely happening a *lot* less in openQA now - before this was happening to every single test that used qxl as the driver, now it's like 2% of them.

Gerd: I don't recall ever seeing this exact crash before the time I filed this bug. However, there is a similar case which has been happening occasionally for *much* longer, that's this one:

https://bugzilla.redhat.com/show_bug.cgi?id=1403343

in that case the critical error message is:

15:23:50.4844 32146 QEMU: (process:32151): Spice-CRITICAL **: display-channel.c:1666:display_channel_update: condition `validate_surface(display, surface_id)' failed

so is it possible this is somehow just that same bug but the error messages have changed, or something? If not, then I think this is new between 2018-04-02 and 2018-04-07 Rawhide composes.
> It's definitely happening a *lot* less in openQA now - before this was
> happening to every single test that used qxl as the driver, now it's like 2%
> of them.

Might be a host-side (qemu) issue.

https://patchwork.ozlabs.org/patch/905667/
Just another data point:
* I have a fully updated f28 host
* It has 3 fully updated VMs: f27, f28, rawhide
* AFAICS, the phenomenon only appears on the rawhide VM.
* Most of the time these VMs are down; I boot them up 1-2 times a week for updates.

These error messages appear every time the rawhide VM boots (and keep going endlessly). It's not new (many months, I don't remember how many), but I didn't notice any other problems, so I expected it to be fixed in one of the kernel updates...
In the last week or two I no longer see these messages in the rawhide VM (no exact date).

The VM kernels are (the latest is current):

[root@rawhide boot]# ls -l vmlinuz-4.* | cut -c33-
Jul 13 19:17 vmlinuz-4.18.0-0.rc4.git4.1.fc29.x86_64
Jul 17 20:59 vmlinuz-4.18.0-0.rc5.git1.1.fc29.x86_64
Jul 20 20:18 vmlinuz-4.18.0-0.rc5.git4.1.fc29.x86_64

The host kernels are (latest is running):

[root@argon boot]# ls -l vmlinuz-4.* | cut -c33-
Jul 10 16:55 vmlinuz-4.17.5-200.fc28.x86_64
Jul 11 23:55 vmlinuz-4.17.6-200.fc28.x86_64
Jul 17 19:54 vmlinuz-4.17.7-200.fc28.x86_64

So one of the changes after 2018-07-09 fixed it.
Is it possible that it reappeared with 4.19.1?

Nov 15 13:53:29 host kernel: kauditd_printk_skb: 299 callbacks suppressed
Nov 15 13:53:29 host systemd-coredump[2891]: Process 2669 (qemu-system-x86) of user 65534 dumped core.
Nov 15 13:53:29 host libvirtd[2394]: Unable to read from monitor: Connection reset by peer
Nov 15 13:53:29 host libvirtd[2394]: internal error: qemu unexpectedly closed the monitor: red_qxl_loadvm_commands:
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fb0a3600000, virt end 7fb0a75fe000, generation 0, delta 7fb0a3600000
id 2, group 1, virt start 7fb09f200000, virt end 7fb0a3200000, generation 0, delta 7fb09f200000
(process:2669): Spice-CRITICAL **: 13:53:29.361: memslot.c:111:memslot_get_virt: slot_id 172 too big, addr=ac50000000000000
Nov 15 13:53:30 host libvirtd[2394]: operation failed: domain is not running
Nov 15 13:53:30 host libvirtd[2394]: internal error: unexpected async job 6 type expected 0
Nov 15 13:53:30 host libvirtd[2394]: Unable to restore from managed state /var/lib/libvirt/qemu/save/win2k19.save. Maybe the file is corrupted?
Nov 15 13:53:30 host libvirtd[2394]: internal error: Failed to autostart VM 'win2k19': operation failed: domain is not running
This looks like a migration crash; it's different from that bug. It would be good if you could get a backtrace of the crash, though.
Haven't seen this in some time AFAIR.