Bug 1439147

Summary:	migration completed but dst host prompt "qemu-kvm: Unknown savevm section or instance 'pci@800000020000000:06.0/ehci' 0"
Product:	Red Hat Enterprise Linux 7	Reporter:	xianwang <xianwang>
Component:	qemu-kvm-rhev	Assignee:	Peter Xu <peterx>
Status:	CLOSED WONTFIX	QA Contact:	xianwang <xianwang>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	dgilbert, hhuang, michen, quintela, qzhang, thuth, virt-maint, xianwang
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-06-01 06:17:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description xianwang 2017-04-05 10:41:10 UTC

Description of problem:
The src host is RHEL 7.3.z and boots a guest with a usb-ehci controller; the dst host is RHEL 7.4 and starts the guest without a usb-ehci controller. After migration, the src host reports the migration status as completed, while on the dst host qemu prints "qemu-kvm: Unknown savevm section or instance 'pci@800000020000000:06.0/ehci' 0" and quits automatically. I think this behavior is not reasonable.

Version-Release number of selected component (if applicable):
src host
3.10.0-514.19.1.el7.ppc64le
qemu-kvm-rhev-2.6.0-28.el7_3.9.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch

dst host:
3.10.0-632.el7.ppc64le
qemu-img-rhev-2.9.0-0.el7.patchwork201703291116.ppc64le
SLOF-20170303-1.git66d250e.el7.noarch

How reproducible:
2/2

Steps to Reproduce:
1. Boot a guest on the src host with a usb-ehci controller
#/usr/libexec/qemu-kvm -device usb-ehci,id=usb1,bus=pci.0,addr=06 -M pseries-rhel7.3.0 -monitor stdio

2. Boot a guest on the dst host without a usb-ehci controller, but appending "-incoming tcp:0:5801"
#/usr/libexec/qemu-kvm -monitor stdio -M pseries-rhel7.3.0 -incoming tcp:0:5801

3. Migrate the guest from src to dst, then check the status of migration on both hosts

Actual results:
in src host:
(qemu) info status 
VM status: paused (postmigrate)
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off 
Migration status: completed
total time: 157 milliseconds
downtime: 40 milliseconds
setup: 1 milliseconds
transferred ram: 6020 kbytes
throughput: 499.33 mbps
remaining ram: 0 kbytes
total ram: 540736 kbytes
duplicate: 134088 pages
skipped: 0 pages
normal: 1208 pages
normal bytes: 4832 kbytes
dirty sync count: 3

in dst host:
(qemu) qemu-kvm: Unknown savevm section or instance 'pci@800000020000000:06.0/ehci' 0
qemu-kvm: load of migration failed: Invalid argument
[root@ibm-p8-rhevm-10 ~]#

Expected results:
Since the usb-ehci controller is not supported on RHEL 7.4, I am not sure what the expected result is for this situation.
Additional info:

Comment 5 Thomas Huth 2017-04-06 09:05:03 UTC

I just tried the same on x86, and I get the same behavior there: If I start the source qemu-kvm with "-device usb-ehci" and the destination qemu-kvm without that device, and then migrate from the source to the destination, the destination QEMU also aborts with "Unknown savevm section or instance '0000:00:04.0/ehci'", while the source says "Migration status: completed".

So this issue is not really specific to ppc64, i.e. if you make the mistake of not specifying exactly the same devices on the source and destination, QEMU should not report "Migration status: completed" here, and this should be fixed.

Concerning the original question of how to migrate a ppc64 guest with an EHCI controller to RHEL 7.4 (where we've removed EHCI, see BZ 1410674): you cannot directly migrate such a guest. You have to shut it down once, remove the EHCI controller from the configuration and replace it with OHCI or XHCI; then you can migrate it after starting it again. This sounds cumbersome, but it should not be an issue in real life: since EHCI has never been a default controller in QEMU and libvirt for ppc64, we do not expect that anybody is really using EHCI for their ppc64 guests in the wild.
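
The device mismatch described above could be caught before starting a migration by comparing the two command lines. The following is only a minimal sketch: the helper names device_args and device_mismatch are hypothetical, it does a naive string comparison of the "-device" values, and it ignores devices that a machine type creates by default.

```python
import shlex

def device_args(cmdline):
    """Return the sorted list of -device option values in a QEMU command line."""
    tokens = shlex.split(cmdline)
    return sorted(
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == "-device"
    )

def device_mismatch(src_cmdline, dst_cmdline):
    """Return (only_on_src, only_on_dst) as sorted lists of -device values."""
    src = set(device_args(src_cmdline))
    dst = set(device_args(dst_cmdline))
    return sorted(src - dst), sorted(dst - src)

# Command lines taken from the steps to reproduce in this report.
SRC = ("/usr/libexec/qemu-kvm -device usb-ehci,id=usb1,bus=pci.0,addr=06 "
       "-M pseries-rhel7.3.0 -monitor stdio")
DST = ("/usr/libexec/qemu-kvm -monitor stdio -M pseries-rhel7.3.0 "
       "-incoming tcp:0:5801")

if __name__ == "__main__":
    only_src, only_dst = device_mismatch(SRC, DST)
    print("only on source:", only_src)
    print("only on destination:", only_dst)
```

For the two command lines in this report, the check flags the usb-ehci device as present only on the source.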

Comment 6 Dr. David Alan Gilbert 2017-04-06 11:58:30 UTC

Hmm, so it's right that the destination fails to load it; hmm, as to whether the source should have an error: there's no explicit failure mechanism to pass an error back from the destination. In general the source normally fails because the destination detects the failure before the source has transmitted its last piece of data.
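
To illustrate the point about the missing failure mechanism, here is a toy model (not QEMU code) of a one-way stream. The section names, the known-section set and the function names are all assumptions made up for the illustration. The sender has already written everything and declared success by the time the receiver rejects a section.

```python
import socket

def send_stream(sock, sections):
    """Write every section name, close the write side, and report success."""
    for name in sections:
        sock.sendall((name + "\n").encode())
    sock.shutdown(socket.SHUT_WR)
    # Nothing ever flows back on this channel to contradict the status.
    return "completed"

def load_stream(sock, known):
    """Read the whole stream and fail on the first unknown section."""
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    for name in data.decode().split():
        if name not in known:
            return "failed: unknown section %r" % name
    return "loaded"

def run_one_way(sections, known):
    """Run sender then receiver over a socket pair; return both statuses."""
    src, dst = socket.socketpair()
    try:
        src_status = send_stream(src, sections)
        dst_status = load_stream(dst, known)
        return src_status, dst_status
    finally:
        src.close()
        dst.close()

if __name__ == "__main__":
    print(run_one_way(["ram", "cpu", "ehci"], {"ram", "cpu"}))
```

The stream here is small enough to fit in the socket buffer, which mirrors the situation in this bug: the source finishes sending before the destination has looked at the data.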

Comment 7 Peter Xu 2017-04-10 12:13:57 UTC

(In reply to Dr. David Alan Gilbert from comment #6)
> Hmm, so it's right that the destination fails to load it;  hmm as to whether
> the source should have an error.  There's no explicit failure mechanism to
> pass an error back from the destination - in general it normally fails
> because the destination detects failure before the source has transmitted
> its last piece of data.

Since we now have the return path code for postcopy, would it be a good idea to start leveraging it? E.g.,

  1. enable the return path even without postcopy, if "-M migration-ack=true"
     is set (we create that new bit, false by default for compatibility).

  2. add one more step right before migration finishes (the QEMU_VM_EOF tag),
     to let the destination notify the source that "it's good and ready to
     start the migrated VM". We can have this ack even for postcopy, as a mark
     that "the destination agrees that the source may drop the data it has".

  3. the source machine should not destroy the VM until it receives that ack.

Would the above make sense?
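
The three steps proposed in this comment can be sketched as a toy model over a duplex socket pair. This is not QEMU code: the ACK/NAK bytes, the section names and the function names are assumptions for illustration only, and the end of the stream is modelled by half-closing the socket rather than by a QEMU_VM_EOF tag.

```python
import socket

def source_send(sock, sections):
    """Send every section, then half-close to mark the end of the stream."""
    for name in sections:
        sock.sendall((name + "\n").encode())
    sock.shutdown(socket.SHUT_WR)

def destination_load(sock, known):
    """Load the stream, then report the result on the return path."""
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    ok = all(name in known for name in data.decode().split())
    sock.sendall(b"ACK" if ok else b"NAK")
    sock.shutdown(socket.SHUT_WR)
    return ok

def source_finish(sock):
    """Report success (and allow dropping the VM) only after an ACK."""
    reply = b""
    while True:
        chunk = sock.recv(16)
        if not chunk:
            break
        reply += chunk
    return "completed" if reply == b"ACK" else "failed"

def migrate(sections, known):
    """Run the handshake end to end and return the source-side status."""
    src, dst = socket.socketpair()
    try:
        source_send(src, sections)
        destination_load(dst, known)
        return source_finish(src)
    finally:
        src.close()
        dst.close()

if __name__ == "__main__":
    print(migrate(["ram", "cpu", "ehci"], {"ram", "cpu"}))
```

With the ack in place, the source-side status follows what the destination actually managed to load, at the cost of requiring a transport that can carry the reply.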

Comment 8 Peter Xu 2017-04-12 03:32:34 UTC

Since this bug should not occur in general use (libvirt makes sure the command line is the same on both source and destination, so libvirt users should not be affected), and the fix won't be very straightforward (I need to at least get an ack on the above idea to continue), I suggest postponing this bug to 7.5 if no one disagrees.

Comment 9 Juan Quintela 2017-04-17 09:13:22 UTC

Hi

We can do something like what Peter Xu proposed, but that is 7.5 material.  My understanding is that this error can never happen if you use libvirt (i.e. a device that we know is on the source being missing on the destination).  We have never allowed migration without both sides having exactly the same arguments.

I would close this bug for 7.4 and move it to 7.5 or so while we decide what to do upstream.

Later, Juan.

Comment 10 Peter Xu 2017-06-01 06:17:31 UTC

I posted an RFC series for the bug:

https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg04508.html

The whole idea is that, if we want the source to know whether the destination is good, we need the return channel. So let's see whether we can enable it unconditionally (even without postcopy).

However, we encountered a problem along the way: knowing whether the return path is valid. The patch assumed that socket-typed transports are the only ones that support duplex communication, but it seems that's not enough.

The problem is that some types of migration transport (currently we have tcp, unix, fd, exec, rdma, ...) may not really support a return path at all. One example is "exec: cat > out", which is actually migrating the source VM to a file. In this kind of migration we don't really have a bidirectional channel, so we cannot guarantee that we will have a return path. Without it, we can never know whether the destination has finished the migration successfully or not.

However, we also cannot simply drop support for that, since there are still some exec-typed transports that might have duplex channels. One example would be: "socat tcp-listen:XXXX tcp-connect:HOST:YYYY".
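
The difficulty can be seen in a small classification sketch. This is not the logic of the RFC series: the function name and the three result strings are hypothetical, "yes" only mirrors the patch's assumption about socket-typed transports, and transports not discussed in this comment (fd, rdma) are deliberately left unclassified.

```python
def return_path_support(uri):
    """Classify a migration URI by what its scheme says about a return path."""
    scheme = uri.split(":", 1)[0]
    if scheme in ("tcp", "unix"):
        # Socket-typed transports: assumed duplex by the patch.
        return "yes"
    if scheme == "exec":
        # Depends on the command: "cat > out" is one-way, socat may be duplex.
        return "unknown"
    # Not classified in this comment (e.g. fd, rdma).
    return "unclassified"

if __name__ == "__main__":
    for uri in ("tcp:0:5801",
                "exec: cat > out",
                "exec:socat tcp-listen:XXXX tcp-connect:HOST:YYYY"):
        print(uri, "->", return_path_support(uri))
```

Both exec examples land in the same "unknown" bucket, because the scheme alone cannot tell a file sink from a duplex relay.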

Maybe one day we can pick up this issue again (general enablement of the return path during migration), but for now I see no good solution. This specific issue will only happen without libvirt (as mentioned above), so let's close this as WONTFIX for now.

Peter