Bug 1421788

Summary:	migration/spice: assert with slot_id 112 too big, addr=7000000000000000
Product:	Red Hat Enterprise Linux 7	Reporter:	Dan Kenigsberg <danken>
Component:	qemu-kvm-rhev	Assignee:	Dr. David Alan Gilbert <dgilbert>
Status:	CLOSED ERRATA	QA Contact:	huiqingding <huding>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	7.3	CC:	areis, chayang, dgilbert, dprezhev, huding, juzhang, knoel, kraxel, michal.skrivanek, michen, mrezanin, mzamazal, qzhang, virt-maint, xianwang
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	qemu-kvm-rhev-2.9.0-1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-01 23:44:45 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dan Kenigsberg 2017-02-13 16:40:01 UTC

Description of problem:
After upgrading from qemu-kvm-rhev-2.3.0-31.el7_2.3.x86_64 to qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64 on the destination host, migration failed with a core dump.

Unfortunately, I cannot tell whether migration succeeds with qemu-kvm-rhev-2.3.0-31.

From the logs

Version-Release number of selected component (if applicable):
libvirt-daemon-2.0.0-10.el7_3.4.x86_64
spice-server-0.12.4-20.el7_3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
kernel-3.10.0-514.6.1.el7.x86_64

How reproducible:
Once

Actual results:
Core dump and /var/log/libvirtd/qemu/vm.log attached privately.

Comment 2 Dan Kenigsberg 2017-02-13 16:42:26 UTC

Based on the log excerpt, the bug may be related to spice:

2017-02-12T09:51:06.170627Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-02-12T09:51:06.171213Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
red_dispatcher_loadvm_commands: 
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fcf8a600000, virt end 7fcf8e5fe000, generation 0, delta 7fcf8a600000
id 2, group 1, virt start 7fcf88200000, virt end 7fcf8a200000, generation 0, delta 7fcf88200000
((null):12343): Spice-CRITICAL **: red_memslots.c:123:get_virt: slot_id 112 too big, addr=7000000000000000
2017-02-12 09:51:43.195+0000: shutting down
2017-02-12 09:52:25.417+
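
The slot_id in the Spice-CRITICAL line looks like it is simply the top byte of the reported address. A minimal sketch, assuming the slot id occupies the top 8 bits of a 64-bit QXL address (the 8-bit width and the helper name slot_id_of are assumptions for illustration, not taken from the log):

```python
# Assumption: the slot id is stored in the top SLOT_ID_BITS bits of a
# 64-bit address. The width of 8 is a guess that matches the log line.
SLOT_ID_BITS = 8
ADDR_BITS = 64


def slot_id_of(addr: int) -> int:
    """Return the slot id encoded in the top bits of a 64-bit address."""
    masked = addr & ((1 << ADDR_BITS) - 1)
    return masked >> (ADDR_BITS - SLOT_ID_BITS)


logged_addr = 0x7000000000000000  # addr= value from the Spice-CRITICAL line
decoded_slot = slot_id_of(logged_addr)
print(decoded_slot)  # 112
```

Under that assumption 0x70 decodes to 112, while the excerpt only lists slots with ids 0, 1 and 2, which would fit an address that was never a valid QXL address in the first place.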

Comment 4 Dr. David Alan Gilbert 2017-02-14 10:17:47 UTC

Hi Dan,
  Hmm, unfortunately the core dump is truncated, so I can't get a backtrace out of it.
  I'd only seen it in a postcopy case - were you using postcopy?

Dave

Comment 5 Dr. David Alan Gilbert 2017-02-14 10:18:20 UTC

Hi Dan,
  Hmm, unfortunately the core dump is truncated, so I can't get a backtrace out of it. But yes, that spice message looks familiar; I'd only seen it in a postcopy case - were you using postcopy?

Dave

Comment 7 Dr. David Alan Gilbert 2017-02-14 10:58:05 UTC

Also, see the thread:
https://lists.freedesktop.org/archives/spice-devel/2016-December/034295.html

That's the discussion from when I hit the problem with postcopy, but the bug went away when I tried to test the suggested patch.

Comment 10 Dan Kenigsberg 2017-02-14 16:55:55 UTC

Milan, can you look in bug 1421589 and tell if for some reason post-copy migration was used? (to answer comment 4)

Comment 12 Dr. David Alan Gilbert 2017-02-14 17:22:25 UTC

(In reply to Dan Kenigsberg from comment #10)
> Milan, can you look in bug 1421589 and tell if for some reason post-copy
> migration was used? (to answer comment 4)

mzamazal already answered me on irc - it wasn't postcopy

Comment 14 Milan Zamazal 2017-02-15 08:14:39 UTC

(In reply to Dr. David Alan Gilbert from comment #12)
> (In reply to Dan Kenigsberg from comment #10)
> > Milan, can you look in bug 1421589 and tell if for some reason post-copy
> > migration was used? (to answer comment 4)
> 
> mzamazal already answered me on irc - it wasn't postcopy

To be precise, I only said that it was quite unlikely to be postcopy. Dan, we can be sure if you tell us which oVirt/RHV version you use (oVirt <= 4.0.* doesn't have postcopy), or provide us with vdsm.log (or simply grep it for "post-copy") if you use 4.1.

Comment 15 Dan Kenigsberg 2017-02-15 08:46:43 UTC

(In reply to Milan Zamazal from comment #14)

Milan, bug 1421589 which I referred you to should have all the logs, saying that it was rhev-4.1-beta, with a 4.0 source host and a 4.1 destination host. Unless we have a horrible bug, Engine should never request postcopy in this condition; I asked you to verify that.

Comment 16 Milan Zamazal 2017-02-15 09:18:18 UTC

I see, I can't see the logs there but you are right, it can't be post-copy. Even if Engine requested postcopy, it wouldn't be actually triggered since there is no support for it on the 4.0 source host.

Comment 23 Dr. David Alan Gilbert 2017-03-01 13:30:30 UTC

I can't reproduce this bug - even though I know I've hit it myself and this is a separate case of it.
We've tried using the same image template as the one that crashed.
I've also tried with my own f24 image.
Both in lots of different states of what was running.

Comment 24 Dr. David Alan Gilbert 2017-03-01 18:23:09 UTC

Hi Gerd,
  I've been digging about a bit in the qxl code; can you explain to me why qxl_track_command only sets qxl->guest_cursor on a QXL_CURSOR_SET?

  My concern is if a QXL_CURSOR_HIDE happens (after a SET), does that leave qxl->guest_cursor pointing at potentially garbage that could trigger this bug?

I can see that the guest_cursor is non-NULL during a migrate after a QXL_CURSOR_HIDE.

(Not that I can trigger the failure)
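
The concern about QXL_CURSOR_HIDE can be illustrated with a toy model. This is only a sketch of the tracking behaviour described in this comment, not the qemu code: ToyQxl, track_command and the clear_on_hide flag are hypothetical names, and clearing the pointer on HIDE is shown as one possible remedy, not as the confirmed fix.

```python
from dataclasses import dataclass
from typing import Optional

CURSOR_SET = "QXL_CURSOR_SET"
CURSOR_HIDE = "QXL_CURSOR_HIDE"


@dataclass
class ToyQxl:
    # Guest address of the last tracked cursor command, None if unset.
    guest_cursor: Optional[int] = None


def track_command(qxl: ToyQxl, cmd_type: str, cmd_addr: int,
                  clear_on_hide: bool = False) -> None:
    """Toy version of cursor tracking: only SET updates the pointer."""
    if cmd_type == CURSOR_SET:
        qxl.guest_cursor = cmd_addr
    elif cmd_type == CURSOR_HIDE and clear_on_hide:
        # Hypothetical remedy: forget the pointer once the cursor is hidden.
        qxl.guest_cursor = None


# Behaviour as described: SET followed by HIDE leaves the old pointer behind.
observed = ToyQxl()
track_command(observed, CURSOR_SET, 0x1000)
track_command(observed, CURSOR_HIDE, 0x2000)
stale = observed.guest_cursor  # still 0x1000 and may no longer be valid

# With the hypothetical clearing, nothing is left to replay after migration.
remedied = ToyQxl()
track_command(remedied, CURSOR_SET, 0x1000)
track_command(remedied, CURSOR_HIDE, 0x2000, clear_on_hide=True)
cleared = remedied.guest_cursor  # None
```

If the guest reuses the memory behind the old SET command after hiding the cursor, anything that later replays guest_cursor on the destination would be reading arbitrary data, which would match an out-of-range slot id.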

Comment 25 Gerd Hoffmann 2017-03-06 08:31:39 UTC

>   My concern is if a QXL_CURSOR_HIDE happens (after a SET), does that leave
> qxl->guest_cursor pointing at potentially garbage that could trigger this
> bug?

Very plausible.

Comment 26 Dr. David Alan Gilbert 2017-03-06 11:46:46 UTC

Thanks, and I see the patch you posted.
Looking around, we also have:

https://bugzilla.redhat.com/show_bug.cgi?id=1290039
and
https://bugzilla.redhat.com/show_bug.cgi?id=1210536
which look like similar backtraces

they refer to FAF https://retrace.fedoraproject.org/faf/reports/430337/
although curiously that has no f24 or f25 hits, which makes you wonder if there's another thing that fixed it somehow.

Comment 27 Dr. David Alan Gilbert 2017-03-06 11:52:05 UTC

ah, no, here are the f25 versions:
https://retrace.fedoraproject.org/faf/problems/bthash/?bth=3f4b726cc33210a2d48eb4597096a3527fe234ed&bth=bb9782ecb205ad23175b009565273c66a0661a96

Comment 28 Dr. David Alan Gilbert 2017-03-09 19:36:28 UTC

Gerd's patch is now:

dbb5fb8d3519130559b10fa4e1395e4486c633f8 in upstream qemu

Since we have no way of testing this I'm going to mark this as fixed in 2.9 and we'll pick it up in a release.

We could ask for a backport  - my suspicion is that other customers are hitting it.

Dave

Comment 45 errata-xmlrpc 2017-08-01 23:44:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
