Bug 1421788
| Summary: | migration/spice: assert with slot_id 112 too big, addr=7000000000000000 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dan Kenigsberg <danken> |
| Component: | qemu-kvm-rhev | Assignee: | Dr. David Alan Gilbert <dgilbert> |
| Status: | CLOSED ERRATA | QA Contact: | huiqingding <huding> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | areis, chayang, dgilbert, dprezhev, huding, juzhang, knoel, kraxel, michal.skrivanek, michen, mrezanin, mzamazal, qzhang, virt-maint, xianwang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-rhev-2.9.0-1.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 23:44:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Kenigsberg
2017-02-13 16:40:01 UTC
Based on the log excerpt, the bug may be related to spice:

```
2017-02-12T09:51:06.170627Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-02-12T09:51:06.171213Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
red_dispatcher_loadvm_commands:
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fcf8a600000, virt end 7fcf8e5fe000, generation 0, delta 7fcf8a600000
id 2, group 1, virt start 7fcf88200000, virt end 7fcf8a200000, generation 0, delta 7fcf88200000
((null):12343): Spice-CRITICAL **: red_memslots.c:123:get_virt: slot_id 112 too big, addr=7000000000000000
2017-02-12 09:51:43.195+0000: shutting down
2017-02-12 09:52:25.417+
```

Dr. David Alan Gilbert:
Hi Dan,
Hmm, unfortunately the core dump is truncated, so I can't get a backtrace out of it. But yes, that spice message looks familiar; I'd only seen it in a postcopy case - were you using postcopy?
Dave

Dr. David Alan Gilbert:
Also, see the thread: https://lists.freedesktop.org/archives/spice-devel/2016-December/034295.html - that's the discussion from when I had the problem with postcopy, but the bug went away when I tried to test the suggested patch.

Dan Kenigsberg (comment 10):
Milan, can you look in bug 1421589 and tell if for some reason post-copy migration was used? (to answer comment 4)

Dr. David Alan Gilbert (comment 12):
(In reply to Dan Kenigsberg from comment #10)
> Milan, can you look in bug 1421589 and tell if for some reason post-copy
> migration was used? (to answer comment 4)

mzamazal already answered me on IRC - it wasn't postcopy.

Milan Zamazal (comment 14):
(In reply to Dr. David Alan Gilbert from comment #12)
> mzamazal already answered me on irc - it wasn't postcopy

To be precise, I only said that it was quite unlikely to be postcopy. Dan, we can be sure if you tell us what oVirt/RHV version you use (oVirt <= 4.0.* doesn't have postcopy), or provide us with vdsm.log (or simply grep it for "post-copy") in case you use 4.1.

Dan Kenigsberg:
(In reply to Milan Zamazal from comment #14)
Milan, bug 1421589, which I referred you to, should have all the logs, saying that it was rhev-4.1-beta, with a 4.0 source host and a 4.1 destination host. Unless we have a horrible bug, Engine should never request postcopy in this condition; I asked you to verify that.

Milan Zamazal:
I see. I can't see the logs there, but you are right, it can't be post-copy. Even if Engine requested postcopy, it wouldn't actually be triggered since there is no support for it on the 4.0 source host.

Dr. David Alan Gilbert:
I can't reproduce this bug - even though I know I've hit it myself and this is a separate case of it. We've tried using the same image template as the one that crashed. I've also tried with my own f24 image. Both in lots of different states of what was running.

Dr. David Alan Gilbert:
Hi Gerd,
I've been digging about a bit in the qxl code; can you explain to me why qxl_track_command only sets qxl->guest_cursor on a QXL_CURSOR_SET? My concern is: if a QXL_CURSOR_HIDE happens (after a SET), does that leave qxl->guest_cursor pointing at potentially garbage that could trigger this bug? I can see that guest_cursor is non-NULL during a migrate after a QXL_CURSOR_HIDE. (Not that I can trigger the failure.)

Gerd Hoffmann:
> My concern is if a QXL_CURSOR_HIDE happens (after a SET), does that leave
> qxl->guest_cursor pointing at potentially garbage that could trigger this
> bug?

Very plausible.
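The staleness concern discussed here can be sketched with a simplified model. This is not QEMU's actual code (the real logic lives in hw/display/qxl.c, and the type, field, and function names below are invented for illustration); it only shows the behavioural difference between updating guest_cursor on SET alone versus also clearing it on HIDE, which is the spirit of the fix referenced later in this report:

```c
#include <stdint.h>

/* Simplified stand-in for the qxl device state; the real QEMU structure
 * is different and carries much more than this one field. */
enum { QXL_CURSOR_SET, QXL_CURSOR_MOVE, QXL_CURSOR_HIDE };

typedef struct {
    uint64_t guest_cursor;  /* guest address of the last cursor SET command */
} QxlModel;

/* Suspected pre-fix behaviour: only a SET updates guest_cursor, so after
 * a HIDE the old address lingers and is replayed on migration, possibly
 * pointing at memory the guest has since reused. */
void track_cursor_buggy(QxlModel *q, int type, uint64_t addr)
{
    if (type == QXL_CURSOR_SET) {
        q->guest_cursor = addr;
    }
}

/* Behaviour in the spirit of the fix: a HIDE clears guest_cursor, so the
 * migration path has no stale address left to replay. */
void track_cursor_fixed(QxlModel *q, int type, uint64_t addr)
{
    if (type == QXL_CURSOR_SET) {
        q->guest_cursor = addr;
    } else if (type == QXL_CURSOR_HIDE) {
        q->guest_cursor = 0;
    }
}
```

With the buggy variant, a SET followed by a HIDE leaves guest_cursor non-zero across a migrate, matching what was observed above.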
Dr. David Alan Gilbert:
Thanks - and I see the patch you posted; looking, we also have https://bugzilla.redhat.com/show_bug.cgi?id=1290039 and https://bugzilla.redhat.com/show_bug.cgi?id=1210536, which look like similar backtraces. They refer to FAF https://retrace.fedoraproject.org/faf/reports/430337/ - although curiously that has no f24 or f25 hits, which makes you wonder if there's another thing that fixed it somehow.

Ah, no, here are the f25 versions: https://retrace.fedoraproject.org/faf/problems/bthash/?bth=3f4b726cc33210a2d48eb4597096a3527fe234ed&bth=bb9782ecb205ad23175b009565273c66a0661a96

Gerd's patch is now dbb5fb8d3519130559b10fa4e1395e4486c633f8 in upstream qemu.

Since we have no way of testing this, I'm going to mark this as fixed in 2.9 and we'll pick it up in a release. We could ask for a backport - my suspicion is that other customers are hitting it.
Dave

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
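As an aside, the numbers in the failure message fit together arithmetically: spice's get_virt() decodes a memory-slot id from the top bits of the 64-bit QXL address, and a garbage address such as 0x7000000000000000 decodes to slot 112, far beyond the two or three slots the guest actually registered in the log above, hence the abort. The exact bit layout is internal to spice-server's red_memslots.c; the 56-bit shift below is an assumption chosen purely to reproduce the logged values:

```c
#include <stdint.h>

/* ASSUMPTION: slot id in the top 8 bits of the address. This is only an
 * illustration of the arithmetic, not spice-server's real encoding. */
#define ASSUMED_MEMSLOT_ID_SHIFT 56

unsigned slot_id_from_addr(uint64_t addr)
{
    return (unsigned)(addr >> ASSUMED_MEMSLOT_ID_SHIFT);
}
```

Under that assumption, a stale cursor pointer whose high bits happen to be 0x70 decodes to slot id 112, which fails the bounds check against the handful of registered slots.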