Bug 1439147
| Summary: | migration completed but dst host prompt "qemu-kvm: Unknown savevm section or instance 'pci@800000020000000:06.0/ehci' 0" | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | xianwang <xianwang> |
| Component: | qemu-kvm-rhev | Assignee: | Peter Xu <peterx> |
| Status: | CLOSED WONTFIX | QA Contact: | xianwang <xianwang> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | dgilbert, hhuang, michen, quintela, qzhang, thuth, virt-maint, xianwang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-06-01 06:17:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
xianwang
2017-04-05 10:41:10 UTC
I just tried the same on x86, and I get the same behavior there: if I start the source qemu-kvm with "-device usb-ehci" and the destination qemu-kvm without that device, and then migrate from the source to the destination, the destination QEMU also aborts with "Unknown savevm section or instance '0000:00:04.0/ehci'", while the source says "Migration status: completed". So this issue is not really specific to ppc64: if you make the mistake of not specifying exactly the same devices on the source and destination, QEMU should not report "Migration status: completed", and this should be fixed.

Concerning the original question of how to migrate a ppc64 guest with an EHCI controller to RHEL 7.4 (where we've removed EHCI, see BZ 1410674): you cannot directly migrate such a guest. You have to shut it down once, remove the EHCI controller from the configuration and replace it with OHCI or XHCI; after starting it again, you can migrate it. This sounds cumbersome, but it should not be an issue in real life: since EHCI has never been a default controller in QEMU and libvirt for ppc64, we do not expect that anybody is really using EHCI for their ppc64 guests in the wild.

Hmm, so it's right that the destination fails to load it; hmm, as to whether the source should have an error. There's no explicit failure mechanism to pass an error back from the destination. In general, the migration normally fails because the destination detects the failure before the source has transmitted its last piece of data.

(In reply to Dr. David Alan Gilbert from comment #6)
> Hmm, so it's right that the destination fails to load it; hmm, as to whether
> the source should have an error. There's no explicit failure mechanism to
> pass an error back from the destination. In general, the migration normally
> fails because the destination detects the failure before the source has
> transmitted its last piece of data.

Since we now have the return path code for postcopy, would it be a good idea to start leveraging it?
E.g.,

1. Enable the return path even without postcopy if "-M migration-ack=true" is set (we create that new bit, false by default for compatibility).
2. Add one more step right before migration finishes (the QEMU_VM_EOF tag) to let the destination notify the source that "it's good and ready to start the migrated VM". We can have this ack even for postcopy, as a mark that "the destination agrees that the source may drop the data it has".
3. The source machine should not destroy the VM until it receives that ack.

Would the above make any sense?

Since this bug should not exist in general use (libvirt should make sure the command line is the same on both source and destination, so this bug should not affect libvirt users), and the fix won't be very straightforward (I need to at least get an ack on the above idea to continue), I suggest postponing this bug to 7.5 if no one disagrees.

Hi

We can do something like Peter Xu proposed, but that is 7.5 material. My understanding is that this error can never happen if you use libvirt (i.e. a device missing on the destination that we know is on the source). We have never allowed migration without both sides having exactly the same arguments. I would close this bug for 7.4 and move it to 7.5 or so while we decide what to do upstream.

Later, Juan.

I posted an RFC series for the bug: https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg04508.html

The whole idea is that, if we want the source to know whether the destination is good, we need the return channel. So let's see whether we can enable it unconditionally (even without postcopy). However, along the way we ran into the problem of knowing "whether the return path is valid". The patch assumed that socket-typed transports are the only ones that support duplex communication, but it seems that's not enough. The problem is that there are some types of migration transport (currently, we have tcp, unix, fd, exec, rdma, ...) that may not really support a return path at all.
One example is "exec: cat > out", which actually migrates the source VM to a file. In this kind of migration we don't really have a bidirectional channel, so we cannot guarantee that we will have a return path. Without it, we can never know whether the destination has finished the migration successfully or not. However, we also cannot simply drop support for exec, since there are still some exec-typed transports that might have duplex channels. One example would be: "socat tcp-listen:XXXX tcp-connect:HOST:YYYY".

Maybe one day we can pick up this issue again (general enablement of the return path during migration), but for now I see no good solution. This specific issue only happens without libvirt (as mentioned above), so let's close this as WONTFIX for now.

Peter
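
For illustration only, here is a minimal sketch of the ack idea discussed in this thread, written in Python rather than QEMU's C. Nothing in it is QEMU code: the `ACK_OK`/`ACK_FAIL` bytes, the line-based "section name" stream, the `EOF` marker line and the `migrate`/`destination` helpers are all hypothetical stand-ins. It assumes a duplex transport (a `socket.socketpair()`), and shows the source reporting "completed" only after the destination has confirmed that it could load every section.

```python
import socket
import threading

ACK_OK = b"\x01"    # hypothetical "destination is ready" marker
ACK_FAIL = b"\x00"  # hypothetical "destination failed to load" marker


def destination(conn, known_devices):
    """Read section names up to the EOF marker, then ack over the return path."""
    ok = True
    with conn, conn.makefile("r") as stream:
        for line in stream:
            name = line.rstrip("\n")
            if name == "EOF":  # stand-in for the end-of-stream tag
                break
            if name not in known_devices:
                ok = False  # unknown section: remember the failure
        conn.sendall(ACK_OK if ok else ACK_FAIL)


def migrate(source_devices, dest_devices):
    """Return the status the source reports after waiting for the ack."""
    src, dst = socket.socketpair()  # duplex, so a return path exists
    worker = threading.Thread(target=destination, args=(dst, dest_devices))
    worker.start()
    payload = "".join(name + "\n" for name in source_devices) + "EOF\n"
    src.sendall(payload.encode())
    ack = src.recv(1)  # the source keeps its VM until this arrives
    worker.join()
    src.close()
    return "completed" if ack == ACK_OK else "failed"


if __name__ == "__main__":
    print(migrate(["ram", "0000:00:04.0/ehci"], {"ram"}))
    print(migrate(["ram"], {"ram"}))
```

With a one-way transport such as "exec: cat > out" there is no peer to send the ack byte, so the `recv` call in this sketch would have nothing to wait for. That is the obstacle described above.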