Description of problem:
After upgrading the destination host from qemu-kvm-rhev-2.3.0-31.el7_2.3.x86_64 to qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64, migration failed with a core dump.
Unfortunately, I cannot tell whether migration succeeds with qemu-kvm-rhev-2.3.0-31.
From the logs
Version-Release number of selected component (if applicable):
Core dump and /var/log/libvirtd/qemu/vm.log attached privately.
Based on the log excerpt, the bug may be related to spice:
2017-02-12T09:51:06.170627Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-02-12T09:51:06.171213Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fcf8a600000, virt end 7fcf8e5fe000, generation 0, delta 7fcf8a600000
id 2, group 1, virt start 7fcf88200000, virt end 7fcf8a200000, generation 0, delta 7fcf88200000
((null):12343): Spice-CRITICAL **: red_memslots.c:123:get_virt: slot_id 112 too big, addr=7000000000000000
2017-02-12 09:51:43.195+0000: shutting down
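For reference, the slot_id 112 in the Spice-CRITICAL line equals 0x70, which is the top byte of the reported address 7000000000000000. A minimal sketch of that arithmetic follows. It assumes the memslot ID lives in the top 8 bits of a 64-bit QXL address (the bit width is an assumption, not taken from the logs), and `slot_id_from_addr` is a hypothetical helper, not a spice-server function:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extract a memslot ID stored in the top
 * `slot_bits` bits of a 64-bit guest address.
 * The 8-bit width used for the log line above is an assumption. */
static unsigned slot_id_from_addr(uint64_t addr, unsigned slot_bits)
{
    assert(slot_bits > 0 && slot_bits < 64);
    return (unsigned)(addr >> (64 - slot_bits));
}
```

With `slot_bits` set to 8, the address from the log decodes to slot 112. The slot table printed just before the error lists only ids 0 to 2, so if this decoding applies, the address does not look like one a valid slot would produce.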
Hmm, unfortunately the core dump is truncated so I can't get a backtrace out of it. But yes, that spice message looks familiar; I'd only seen it in a postcopy case - were you using postcopy?
Also, see the thread:
That's the discussion from when I had the problem with postcopy, but the bug went away when I tried to test the suggested patch.
Milan, can you look in bug 1421589 and tell if for some reason post-copy migration was used? (to answer comment 4)
(In reply to Dan Kenigsberg from comment #10)
> Milan, can you look in bug 1421589 and tell if for some reason post-copy
> migration was used? (to answer comment 4)
mzamazal already answered me on irc - it wasn't postcopy
(In reply to Dr. David Alan Gilbert from comment #12)
> (In reply to Dan Kenigsberg from comment #10)
> > Milan, can you look in bug 1421589 and tell if for some reason post-copy
> > migration was used? (to answer comment 4)
> mzamazal already answered me on irc - it wasn't postcopy
To be precise, I only said that it was quite unlikely to be postcopy. Dan, we can be sure if you tell us which oVirt/RHV version you use (oVirt <= 4.0.* doesn't have postcopy), or provide us with vdsm.log (or simply grep it for "post-copy") in case you use 4.1.
(In reply to Milan Zamazal from comment #14)
Milan, bug 1421589 which I referred you to should have all the logs, saying that it was rhev-4.1-beta, with a 4.0 source host and a 4.1 destination host. Unless we have a horrible bug, Engine should never request postcopy in this condition; I asked you to verify that.
I see, I can't see the logs there but you are right, it can't be post-copy. Even if Engine requested postcopy, it wouldn't be actually triggered since there is no support for it on the 4.0 source host.
I can't reproduce this bug - even though I know I've hit it myself and this is a separate case of it.
We've tried using the same image template as the one that crashed.
I've also tried with my own f24 image.
Both in lots of different states of what was running.
I've been digging about a bit in the qxl code; can you explain to me why qxl_track_command only sets qxl->guest_cursor on a QXL_CURSOR_SET?
My concern is if a QXL_CURSOR_HIDE happens (after a SET), does that leave qxl->guest_cursor pointing at potentially garbage that could trigger this bug?
I can see that the guest_cursor is non-NULL during a migrate after a QXL_CURSOR_HIDE.
(Not that I can trigger the failure)
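The concern above can be sketched in isolation. This is a simplified model, not the QEMU code: `struct tracker`, `enum cursor_cmd`, and both functions are hypothetical stand-ins, and the second variant is only one possible way to avoid a stale value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the QEMU/QXL definitions. */
enum cursor_cmd { CURSOR_SET, CURSOR_HIDE };

struct tracker {
    uint64_t guest_cursor;  /* guest address of the last SET cursor, 0 = none */
};

/* Records the address only on SET, as described in the question above.
 * A later HIDE leaves the old address in place. */
static void track_set_only(struct tracker *t, enum cursor_cmd cmd, uint64_t addr)
{
    assert(t);
    if (cmd == CURSOR_SET) {
        t->guest_cursor = addr;
    }
}

/* Variant that also forgets the address on HIDE, so nothing stale
 * is left to be carried across a migration. */
static void track_clear_on_hide(struct tracker *t, enum cursor_cmd cmd, uint64_t addr)
{
    assert(t);
    switch (cmd) {
    case CURSOR_SET:
        t->guest_cursor = addr;
        break;
    case CURSOR_HIDE:
        t->guest_cursor = 0;
        break;
    }
}
```

After a SET followed by a HIDE, the first variant still holds the old address while the second holds 0. Whether the guest can reuse or free that memory after the HIDE is the open question in this comment.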
> My concern is if a QXL_CURSOR_HIDE happens (after a SET), does that leave
> qxl->guest_cursor pointing at potentially garbage that could trigger this
> bug?
Thanks and I see the patch you posted;
looking we also have:
which look like similar backtraces
they refer to FAF https://retrace.fedoraproject.org/faf/reports/430337/
although curiously that has no f24 or f25 hits, which makes you wonder if there's another thing that fixed it somehow.
ah, no, here are the f25 versions:
Gerd's patch is now:
dbb5fb8d3519130559b10fa4e1395e4486c633f8 in upstream qemu
Since we have no way of testing this I'm going to mark this as fixed in 2.9 and we'll pick it up in a release.
We could ask for a backport - my suspicion is that other customers are hitting it.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.