Bug 1425003

Summary: virsh save doesn't work after canceled postcopy migration
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.3
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Milan Zamazal <mzamazal>
Assignee: Jiri Denemark <jdenemar>
QA Contact: zhe peng <zpeng>
CC: dyuan, jdenemar, rbalakri, xuzhang, zpeng
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-3.2.0-4.el7
Last Closed: 2017-08-01 17:21:45 UTC
Type: Bug
Attachments: libvirtd.log (flags: none)

Description Milan Zamazal 2017-02-20 10:49:20 UTC
Description of problem:

When `virsh migrate' is called with --postcopy and the migration is then canceled, a subsequent `virsh save' of the same domain fails.

Version-Release number of selected component (if applicable):

libvirt-2.0.0-10.el7_3.4.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Start a VM.
2. Start to migrate it with a post-copy flag.
3. Cancel the migration before it completes.
4. Try to save the VM.
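The steps above could be scripted roughly as follows. This is only a sketch: the function names and the `cancel_after` delay are hypothetical, the virsh options are the ones used in the transcripts later in this report, and sending SIGINT to the virsh process is assumed to be close enough to pressing Ctrl+C in a terminal.

```python
import signal
import subprocess
import time


def migrate_cmd(domain, dest_uri):
    """Build the argv for a live post-copy migration."""
    return ["virsh", "migrate", domain, dest_uri,
            "--postcopy", "--live", "--verbose"]


def save_cmd(domain, path):
    """Build the argv for saving the domain to a file."""
    return ["virsh", "save", domain, path]


def reproduce(domain, dest_uri, save_path, cancel_after=5.0):
    """Start a migration, interrupt it, then try to save the domain.

    Returns the exit status of `virsh save`; a non-zero value means the
    save failed. cancel_after is a hypothetical delay that must be short
    enough for the migration to still be running when it is interrupted.
    """
    migration = subprocess.Popen(migrate_cmd(domain, dest_uri))
    time.sleep(cancel_after)
    migration.send_signal(signal.SIGINT)  # assumed equivalent of Ctrl+C
    migration.wait()
    return subprocess.run(save_cmd(domain, save_path)).returncode
```

Calling `reproduce("dummy", "qemu+ssh://<target>/system", "/tmp/xxx")` on an affected host would be expected to return a non-zero status; the target URI here is a placeholder.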

Actual results:

virsh reports an error like

  error: Failed to save domain dummy to /tmp/xxx
  error: operation failed: domain save job: unexpectedly failed

The libvirt logs contain an error like

  2017-02-20T10:36:38.761085Z qemu-kvm: socket_writev_buffer: Got err=32 for (32768/18446744073709551615)
  Unable to open return-path for postcopy
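The two numbers in that log line can be decoded with a few lines of Python. This is a reading aid rather than anything taken from the QEMU sources: errno 32 is EPIPE on Linux, and the second number is 2^64 - 1, which is how -1 appears when printed as an unsigned 64-bit value. Interpreting it as the return value of a failed write is an assumption.

```python
import errno
import os


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    return value - (1 << 64) if value >= (1 << 63) else value


# Symbolic name and message for err=32 (platform dependent; EPIPE on Linux).
print(errno.errorcode.get(32), os.strerror(32))

# The second number from "(32768/18446744073709551615)".
print(as_signed64(18446744073709551615))  # -1
```

A broken pipe would be consistent with the save stream being handled as if it were a post-copy migration channel, which matches the "Unable to open return-path for postcopy" message.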

Expected results:

The VM is saved.

Additional info:

The bug is similar to https://bugzilla.redhat.com/1374718; the only difference is that here the migration is canceled.

Comment 1 Jiri Denemark 2017-04-05 13:10:59 UTC
Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2017-April/msg00219.html

Comment 2 Jiri Denemark 2017-04-07 13:36:15 UTC
Fixed upstream by

commit 8be3ccd047e17c4998c669da2a63c3956e1f5225
Refs: v3.2.0-77-g8be3ccd04
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Apr 5 13:05:25 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Fri Apr 7 13:43:37 2017 +0200

    qemu: Properly reset all migration capabilities

    So far only QEMU_MONITOR_MIGRATION_CAPS_POSTCOPY was reset, but only in
    a single code path leaving post-copy enabled in quite a few cases.

    https://bugzilla.redhat.com/show_bug.cgi?id=1425003

    Signed-off-by: Jiri Denemark <jdenemar>

Comment 5 zhe peng 2017-04-25 03:21:26 UTC
I can still reproduce this with the following builds:
libvirt-3.2.0-3.el7.x86_64
qemu-kvm-rhev-2.9.0-1.el7.x86_64

Steps:
1. Cancel a post-copy migration from the client (Ctrl+C), then try to save the domain:
# virsh migrate rhel7 qemu+ssh://$targethost/system --postcopy --live --verbose
Migration: [ 67 %]^Cerror: operation aborted: migration job: canceled by client

# virsh save rhel7 /tmp/rhel7.save 
error: Failed to save domain rhel7 to /tmp/rhel7.save
error: operation failed: domain save job: unexpectedly failed

# cat /var/log/libvirt/qemu/rhel7.log
2017-04-25 08:17:37.082+0000: initiating migration
RP: Received invalid message 0x0000 length 0x0000
RP: Received invalid message 0x0000 length 0x0000

Comment 6 zhe peng 2017-04-25 03:22:22 UTC
Created attachment 1273799 [details]
libvirtd.log

Comment 7 Jiri Denemark 2017-04-25 08:17:41 UTC
Can you check if it works in the following scenarios?

1. start a fresh domain and run "virsh save"
2. start a fresh domain, start a migration (without --postcopy), cancel the migration, and run "virsh save"

And could you also test with older qemu-kvm-rhev packages (such as 2.8.0-*)?

Comment 8 Jiri Denemark 2017-04-25 08:58:51 UTC
I analyzed the logs and it seems libvirt does not properly reset the postcopy capability once the migration is canceled, which would mean there is a bug in the patches that were supposed to fix this issue.

Feel free to confirm it by responding to the questions in comment 7.

Comment 9 zhe peng 2017-04-25 09:01:34 UTC
Scenario 1: start a fresh domain and save it.
# virsh save rhel7 /tmp/rhel7.save

Domain rhel7 saved to /tmp/rhel7.save

Scenario 2: without --postcopy, the domain can be saved after the canceled migration.

I also tested with qemu-kvm-rhev-2.8.0-5.el7.x86_64:
Scenario 1: the guest can be saved without error.
Scenario 2: the behavior is the same as with qemu-kvm-rhev-2.9.

Comment 10 Jiri Denemark 2017-04-26 20:00:27 UTC
An additional patch was sent upstream for review: https://www.redhat.com/archives/libvir-list/2017-April/msg01323.html

BTW, it should work even without this patch for migrations started with the --p2p option.

Comment 11 Jiri Denemark 2017-04-27 12:04:19 UTC
Fixed upstream by

commit eeb2feb9fbb66ea9026edc6451018fb3b94ffa58
Refs: v3.2.0-273-geeb2feb9f
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Apr 26 21:46:28 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Apr 27 13:55:46 2017 +0200

    qemu: Properly reset non-p2p migration

    While peer-to-peer migration enters the Confirm phase even if the
    Perform phase fails, the client which initiated a non-p2p migration will
    never call virDomainMigrateConfirm* API if the Perform phase failed.
    Thus we need to explicitly reset migration before reporting a failure
    from the Perform phase API.

    https://bugzilla.redhat.com/show_bug.cgi?id=1425003

    Signed-off-by: Jiri Denemark <jdenemar>
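The control flow described in that commit message can be illustrated with a toy model. This is not libvirt code: the class, its attribute names and the `reset_on_perform_failure` switch are invented for illustration, and the rule that a save fails while the post-copy capability is still set is an assumption based on the analysis in comment 8.

```python
class ToyDomain:
    """Toy stand-in for a running domain; not libvirt code."""

    def __init__(self, reset_on_perform_failure):
        # True models the behaviour described by the commit message above.
        self.reset_on_perform_failure = reset_on_perform_failure
        self.postcopy_enabled = False

    def _reset_migration(self):
        self.postcopy_enabled = False

    def migrate(self, postcopy, p2p, cancel):
        """Return True if the migration completed, False if it was canceled."""
        # Perform phase: the capability is set before any data is sent.
        self.postcopy_enabled = postcopy
        perform_ok = not cancel
        if p2p or perform_ok:
            # Confirm phase: reached on success, and for p2p even on failure.
            self._reset_migration()
        elif self.reset_on_perform_failure:
            # Non-p2p failure: the client never calls Confirm, so reset here.
            self._reset_migration()
        return perform_ok

    def save(self):
        """Saving fails while a stale post-copy capability is still set."""
        return not self.postcopy_enabled


old = ToyDomain(reset_on_perform_failure=False)
old.migrate(postcopy=True, p2p=False, cancel=True)
print(old.save())  # False: the capability was never reset

new = ToyDomain(reset_on_perform_failure=True)
new.migrate(postcopy=True, p2p=False, cancel=True)
print(new.save())  # True
```

In this model a canceled p2p migration leaves the domain savable even without the explicit reset, which mirrors the remark in comment 10.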

Comment 13 zhe peng 2017-05-04 06:48:14 UTC
Verified with the following builds:
libvirt-3.2.0-4.el7.x86_64
qemu-kvm-rhev-2.8.0-5.el7.x86_64

Steps:
1. Cancel a post-copy migration from the client (Ctrl+C), then save and restore the domain:
# virsh migrate rhel7 qemu+ssh://$targethost/system --postcopy --live --verbose
Migration: [ 67 %]^Cerror: operation aborted: migration job: canceled by client

# virsh save rhel7 /tmp/rhel7.save 

Domain rhel7 saved to /tmp/rhel7.save

# virsh restore /tmp/rhel7.save
Domain restored from /tmp/rhel7.save

Migrate again:
# virsh migrate rhel7 qemu+ssh://$targethost/system --postcopy --live --verbose
Migration: [100 %]

2. Do a p2p migration with and without --postcopy; the guest can be saved in both cases.
# virsh migrate rhel7 qemu+ssh://$targethost/system --p2p --postcopy --live --verbose
Migration: [ 75 %]^Cerror: operation aborted: migration job: canceled by client

# virsh save rhel7 /tmp/rhel7.save

Domain rhel7 saved to /tmp/rhel7.save

3. Cancel a migration started with --postcopy-after-precopy, then save the domain:
# virsh migrate rhel7 qemu+ssh://$targethost/system --postcopy --postcopy-after-precopy --live --verbose
Migration: [ 80 %]^Cerror: operation aborted: migration job: canceled by client

[root@ibm-x3250m6-04 ~]# virsh save rhel7 /tmp/rhel7.save

Domain rhel7 saved to /tmp/rhel7.save

Moving to verified.

Comment 14 errata-xmlrpc 2017-08-01 17:21:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1846
