Bug 1090093

Summary: Improve error reporting when QEMU fails with incoming migration
Product: Red Hat Enterprise Linux 7 Reporter: Daniel Berrangé <berrange>
Component: libvirtAssignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0CC: dyuan, fjin, jdenemar, lcheng, mzhan, rbalakri, ydu, zhwang, zpeng
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-1.2.17-2.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-19 05:45:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Daniel Berrangé 2014-04-22 14:47:29 UTC
Description of problem:
Users frequently report problems like this

  "I am trying to migrate a VM from one host to another using '

    # virsh migrate --persistent --xml /tmp/rex.xml --copy-storage-all --verbose --live rex qemu+ssh://neptune/system

  but am getting this error:

   error: Unable to read from monitor: Connection reset by peer"

we then have to ask for their /var/log/libvirt/qemu/$VMNAME.log from the target host which contains the useful error message information


  "Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 10000 in != 20000
  qemu: warning: error while loading state for instance 0x0 of device 'ram'
  load of migration failed"

Ideally QEMU would have a nice "migrate-incoming' monitor command we could use to get nice error reports in the normal channel, but in the absence of that libvirt could still improve its error reporting here. In the case where we have a -incoming arg provided, we need to pull the error out of the log files and use that, discarding the monitor error message


NB, this user wasn't using RHEL-7, but I'm reporting this against RHEL-7 since I think fixing this would improve our supportability.

Comment 2 Jiri Denemark 2014-09-08 11:38:42 UTC
This is now fixed upstream by v1.2.8-52-g0389060:

commit 03890605dc981f46ac72d17e4ac7db0da1b88e97
Author: Jiri Denemark <jdenemar>
Date:   Sat Sep 6 01:16:20 2014 +0200

    qemu: Propagate QEMU errors during incoming migrations
    
    When QEMU fails during incoming migration after we successfully started
    it (i.e., during Perform or Finish phase), we report a rather unhelpful
    message
    
        Unable to read from monitor: Connection reset by peer
    
    We already have a code that takes error messages from QEMU's error
    output but we disable it once QEMU successfully starts. This patch
    postpones this until the end of Finish phase during incoming migration
    so that we can report a much better error message:
    
        internal error: early end of file from monitor: possible problem:
        Unknown savevm section or instance '0000:00:05.0/virtio-balloon' 0
        load of migration failed
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1090093
    
    Signed-off-by: Jiri Denemark <jdenemar>

Comment 6 zhenfeng wang 2014-12-15 07:49:11 UTC
Hi jiri
I just try to reproduce this bug with the comment1's steps, since there weren't details steps, so i do some attempts to reproduce this bug, can you help check that whether they are enough to reproduce this bug, if not, can you give me some advise? thanks a lot. 

reproduce pkg info
libvirt-1.1.1-29.el7.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64

steps
1.Prepare a running guest
# virsh list
 Id    Name                           State
----------------------------------------------------
 5     rhel7                          running

2.Generate the guest's xml
#virsh dumpxml rhel7 >rhel7.xml

3.Do storage migration from source to the target host
#virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7.xml 

4.if we kill the qemu proccess at the beginning of the migration on the target host, found it report the following error , also the guest will get crashed on the source host [Note: if didn't hit the issue, just repeat the step 3~4]

# time virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7.xml 
root@$target_ip's password: 
error: Unable to read from monitor: Connection reset by peer

# virsh list
 Id    Name                           State
----------------------------------------------------

5.check the qemu log on the source host
nbd.c:nbd_receive_reply():L746: read failed
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
nbd.c:nbd_receive_reply():L746: read failed
nbd.c:nbd_receive_reply():L746: read failed
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
nbd.c:nbd_receive_reply():L746: read failed
block job error in device 'drive-virtio-disk0': Input/output error (5)
nbd.c:nbd_receive_reply():L746: read failed
nbd.c:nbd_receive_reply():L746: read failed
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
block job error in device 'drive-virtio-disk0': Input/output error (5)
nbd.c:nbd_receive_reply():L746: read failed
block job error in device 'drive-virtio-disk0': Operation not permitted (1)
2014-12-15 06:35:46.480+0000: shutting down
2014-12-15 06:36:22.411+0000: starting up

6.if we kill the qemu process at the end of the migration on the target host,
  Found it report the following error on the source host
# virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml 
root@$target_ip's password: 
Migration: [ 81 %]error: operation failed: migration job: unexpectedly failed

# virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml 
root@target_ip's password: 
Migration: [ 83 %]error: operation failed: migration job: unexpectedly failed



Update the libvirt and qemu-kvm-rhev packet to the latest, didn't get that error info anymore , also the guest didn't crash

pkginfo
libvirt-1.2.8-10.el7.x86_64
qemu-kvm-rhev-2.1.2-16.el7.x86_64

steps
1. Re-excute the reproduce steps with the new libvirt and qemu-kvm-rhev packet, get the following error, not sure whether they are expect or not. jiri, can you help confirm it? thanks


2.if we kill the qemu process at the beginning of the migration on the target host, found it always report the following error on the source host, however, it
didn't get guest crashed anymore

# virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml 
root@$target_ip's password: 
error: unable to connect to server at 'rhel7f:49152': Connection refused

# virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml 
root@$target_ip's password: 
error: unable to connect to server at 'rhel7f:49152': Connection refused

# virsh list
 Id    Name                           State
----------------------------------------------------
 2     rhel7                          running

3.if we kill the qemu proccess at the end of the migration on the target host, found it always report the following error, was this error the expect one ?

# virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml 
root@$target_ip's password: 
Migration: [ 79 %]error: operation failed: migration job: unexpectedly failed

# virsh list
 Id    Name                           State
----------------------------------------------------
 2     rhel7                          running

# virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml 
root@$target_ip's password: 
Migration: [ 82 %]error: operation failed: migration job: unexpectedly failed

# virsh list
 Id    Name                           State
----------------------------------------------------
 2     rhel7                          running

Comment 7 zhenfeng wang 2015-01-09 09:31:52 UTC
hi jiri
can you help check the comment6? thanks

Comment 8 Jiri Denemark 2015-01-09 16:23:25 UTC
Killing QEMU does not cause error messages to be written to QEMU's stderr so there's nothing libvirt could consume and provide back to a user. It's better if QEMU fails by itself for some reason. It may be a bit hard to cause such failure without hacking around so I made a wrapper script for qemu-kvm. On a destination host, I moved /usr/libexec/qemu-kvm to /usr/libexec/qemu-kvm.orig and created a shell script /usr/libexec/qemu-kvm which changes something in the command line and executes /usr/libexec/qemu-kvm.orig with the amended command line arguments.

Comment 9 zhenfeng wang 2015-01-15 13:18:02 UTC
hi Jiri
Thanks for your advise, I try to research it today, however still couldn't reproduce the issue. In comment 8 i saw you said "create a shell script /usr/libexec/qemu-kvm which changes something in the command line and executes /usr/libexec/qemu-kvm.orig with the amended command line arguments". Can you show me what i need change in the command line or give me a reproducer? thanks.  The following steps were my reproduce steps

source 
1.Prepare a normal guest in the source host, the source host was 7.0 env
#virsh dumpxml rhel7
--
    <video>
      <model type='vga' vram='65536' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='none'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>



2.run the follwoing script on the target, the target host was 7.1 env
#mv  /usr/libexec/qemu-kvm  /usr/libexec/qemu-kvm.orig
write shell script for qemu-kvm
#cat /usr/libexec/qemu-kvm
#!/bin/bash
exec /usr/libexec/qemu-kvm.orig -M pc -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=1,cores=4,threads=1 -name rhel7 -rtc base=localtime,clock=host,driftfix=slew  -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0  -vga std  -monitor stdio  -boot menu=on   -drive file=/var/lib/libvirt/images/rhel7.0.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci  -incoming tcp:0:49152

#chmod 777 /usr/libexec/qemu-kvm
# /usr/libexec/qemu-kvm

3.migrate the guest to the target, check the error info
virsh migrate --live rhel7 qemu+ssh://$target_ip/system --verbose --persistent --copy-storage-all --xml rhel7mig.xml

Comment 10 Jiri Denemark 2015-01-15 14:01:03 UTC
It's hard to say what you changed because you didn't provide the original command line but in general, it's much better to change the arguments dynamically because libvirt also starts QEMU when probing its capabilities. So something like the following could work:

    #!/bin/bash

    args=()
    i=0
    while [[ $# -gt 0 ]]; do
        if [[ "$1" == 4096 ]]; then
            args[i]=1024
        elif [[ "$1" == virtio-balloon-pci ]]; then
            i=$[i-2]
        else
            args[i]="$1"
        fi
        i=$[i+1]
        shift
    done

    echo "${args[@]}"

The script should change the amount of memory to 1024 MB and it should
completely remove -device virtio-balloon-pci arguments.

Comment 11 zhenfeng wang 2015-01-16 09:33:17 UTC
Hi jiri
thanks for your patient reply, I get a lot from your comment and I have new cognition to this bug. 
Before the guest actualy migrate data, it will first copy the guest's migrable xml from the source to the target libvirtd, then the target libvirtd will start qemu process base the migrable guest's xml. so if i want to reproduce this issue, i have to change the tartget qemu commandline with your shell script before it actually start migration, right? if it is, I think what i've done in comment9 was incorrect, since the libvirt will also starts qemu when migrate the guest with libvirt command, so we can't specify the qemu comdline in the script directly, we should get the qemu cmdline which start by libvirtd, then change it with your shell scripts, however, I'm not clear how does it work while the libvirt start the qemu process in the target host, and how to change the qemu process with your shell scripts detailedly, can you show me that? thanks

Comment 12 Jiri Denemark 2015-01-16 09:55:27 UTC
Oops, sorry, the last line

    echo "${args[@]}"

should be replaced with

    exec /usr/libexec/qemu-kvm.orig "${args[@]}"

Then you can just use the script as /usr/libexec/qemu-kvm

Comment 13 zhenfeng wang 2015-01-19 09:05:06 UTC
Hi jiri
Thanks for your confirmation, currently, I could get an unexpected error while migrate the guest with your script, however the error was different with the comment0. On the other hand, the unexepcted error info still exsiting even update the libvirt version to the latest one libvirt-1.2.8-13.el7. can you help check it? thanks

Reproduce this bug with libvirt-1.2.8-1

1.Prepare the migration env
2.Prepare a running guest with 4G mem and virtio-ballon-pci enabled in the guest's xml

#virsh dumpxml rhel7
--
<memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
--
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>

#virsh dumpxml rhel7f >rhel7.xml

3.Prepare the script in the target host
#mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.orig

create a shell script named qemu-kvm in /usr/libexec directory
# cat /usr/libexec/qemu-kvm
#!/bin/bash

    args=()
    i=0
    while [[ $# -gt 0 ]]; do
        if [[ "$1" == 4096 ]]; then
            args[i]=1024
        elif [[ "$1" == virtio-balloon-pci ]]; then
            i=$[i-2]
        else
            args[i]="$1"
        fi
        i=$[i+1]
        shift
    done
exec /usr/libexec/qemu-kvm.orig "${args[@]}"
# ll -Z /usr/libexec/qemu-kvm
-rwxr-xr-x. root root system_u:object_r:qemu_exec_t:s0 /usr/libexec/qemu-kvm

4.Migrate the guest from the source to the target, will get an unexpected error here
# virsh migrate --live rhel7 --copy-storage-all --persistent --xml rhel7.xml qemu+ssh://$target_ip/system --verbose
root@$target_ip's password: 
error: operation failed: migration job: unexpectedly failed

5.check the qemu log in the target host, Found some detailed error info
#cat /var/log/libvirt/qemu/rhel7.log
/usr/libexec/qemu-kvm -name rhel7 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Westmere -m 4096  
--
-incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
char device redirected to /dev/pts/2 (label charserial0)
2015-01-19T08:55:06.489659Z qemu-kvm.orig: Length mismatch: pc.ram: 100000000 in != 40000000
2015-01-19T08:55:06.489895Z qemu-kvm.orig: Ack, bad migration stream!
2015-01-19T08:55:06.489911Z qemu-kvm.orig: Illegal RAM offset 87667612e767000
qemu: warning: error while loading state for instance 0x0 of device 'ram'
2015-01-19T08:55:06.489951Z qemu-kvm.orig: load of migration failed: Invalid argument
2015-01-19 08:55:06.522+0000: shutting down



Verify this bug with libvirt-1.2.8-13, still couldn't get the clear error in in source host

1.Excute step 1~3 in the reproduce steps
2.migrate the guest to the target , still get the unexpected error 
# virsh migrate --live rhel7 --copy-storage-all --persistent --xml rhel7.xml qemu+ssh://10.66.106.40/system --verbose
root.106.40's password: 
error: operation failed: migration job: unexpectedly failed

3.check the qemu log in the target host, Found some detailed error info

#cat /var/log/libvirt/qemu/rhel7.log

#cat /var/log/libvirt/qemu/rhel7.log
/usr/libexec/qemu-kvm -name rhel7 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Westmere -m 4096  
--
 -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
char device redirected to /dev/pts/2 (label charserial0)
2015-01-19T09:02:39.827950Z qemu-kvm.orig: -netdev tap,fd=23,id=hostnet0: TUNGETIFF ioctl() failed: Inappropriate ioctl for device
TUNSETOFFLOAD ioctl() failed: Inappropriate ioctl for device
2015-01-19T09:02:40.086578Z qemu-kvm.orig: Length mismatch: pc.ram: 100000000 in != 40000000
2015-01-19T09:02:40.086655Z qemu-kvm.orig: Ack, bad migration stream!
2015-01-19T09:02:40.086690Z qemu-kvm.orig: Illegal RAM offset 87667612e767000
qemu: warning: error while loading state for instance 0x0 of device 'ram'
2015-01-19T09:02:40.086736Z qemu-kvm.orig: load of migration failed: Invalid argument
2015-01-19 09:02:40.095+0000: shutting down

Comment 14 Jiri Denemark 2015-01-23 09:41:58 UTC
Hmm, so apparently I fixed one of the paths to provide a reasonable error message but there are more paths the source can see an error during migration and these are not covered. For example, the source reports

    error: operation failed: migration job: unexpectedly failed

while the destination knows more:

    2015-01-23 09:39:29.214+0000: 8871: error : qemuMonitorIO:662 : internal error: End of file from monitor
    2015-01-23 09:39:29.214+0000: 8871: error : qemuMonitorIO:697 : internal error: early end of file from monitor: possible problem:
    2015-01-23T09:39:29.201284Z qemu-system-x86_64.orig: Length mismatch: pc.ram: 20000000 in != 10000000
    2015-01-23T09:39:29.201603Z qemu-system-x86_64.orig: Ack, bad migration stream!
    2015-01-23T09:39:29.201629Z qemu-system-x86_64.orig: Illegal RAM offset 87667612e767000
    qemu: warning: error while loading state for instance 0x0 of device 'ram'
    2015-01-23T09:39:29.201754Z qemu-system-x86_64.orig: load of migration failed: Invalid argument

Or the source reports

    error: Requested operation is not valid: domain 'qemu' is not processing incoming migration

while the destination knows more again

    2015-01-23 09:31:10.827+0000: 8871: error : qemuMonitorIO:662 : internal error: End of file from monitor
    2015-01-23 09:31:10.827+0000: 8871: error : qemuMonitorIO:697 : internal error: early end of file from monitor: possible problem:
    Unknown savevm section or instance '0000:00:05.0/virtio-balloon' 0
    2015-01-23T09:31:10.818913Z qemu-system-x86_64.orig: load of migration failed: Invalid argument

Comment 17 Jiri Denemark 2015-06-30 14:25:43 UTC
*** Bug 1176771 has been marked as a duplicate of this bug. ***

Comment 18 Jiri Denemark 2015-07-10 11:11:54 UTC
Fixed upstream as of v1.2.17-87-g2e7cea2 by:

commit 04d5fb2e0a92f78eca61de59971eba7fb36c6c1d
Author: Jiri Denemark <jdenemar>
Date:   Fri Jul 3 19:35:06 2015 +0200

    qemu: Drop LFs at the end of error from QEMU log
    
    Libvirt's error messages do not end with a LF. However, when reading the
    error from QEMU log, we would read the LF from the log and keep it in
    the message.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit a7c22a1f2949e781aef4aa82afda407efd344f95
Author: Jiri Denemark <jdenemar>
Date:   Thu Jul 2 14:21:27 2015 +0200

    Introduce virHashAtomic
    
    This is a self-locking wrapper around virHashTable. Only a limited set
    of APIs are implemented now (the ones which are used in the following
    patch) as more can be added on demand.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 9d0a2af6c230ce3007b3504907ccdba09cdc4cf2
Author: Jiri Denemark <jdenemar>
Date:   Fri Jul 10 08:44:41 2015 +0200

    Introduce virErrorCopyNew
    
    A helper function for copying error objects.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit e68f395fcbae0267368f9974cc49f582cc83c752
Author: Jiri Denemark <jdenemar>
Date:   Thu Jul 2 08:26:48 2015 +0200

    qemu: Remember incoming migration errors
    
    If QEMU fails during incoming migration, the domain disappears including
    a possibly useful error message read from QEMU log file. Let's remember
    the error in virQEMUDriver so that Finish can report more than just "no
    such domain".
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 44c42b564dc61ddc8fdfc5b6ca4ca969bcffc7d8
Author: Jiri Denemark <jdenemar>
Date:   Thu Jul 2 22:32:54 2015 +0200

    qemu: Don't report false error from MigrateFinish
    
    virDomainMigrateFinish* APIs were unfortunately designed to return the
    pointer to the domain on destination and NULL on error. This looks OK in
    normal cases but the same API is also called when we know migration
    failed and thus we expect Finish to return NULL even if it actually did
    all it was supposed to do without any error. The call is defined to
    return nonnull domain pointer over RPC, which means returning NULL will
    always result in an error being send. If this was not in fact an error,
    the API itself wouldn't set anything to the thread local virError, which
    makes the RPC layer come up with it's own "Library function returned
    error but did not set virError" error.
    
    This is quite confusing and also hard to detect by the caller. This
    patch adds a special error code which can be used to check that Finish
    successfully aborted migration.
    
    Signed-off-by: Jiri Denemark <jdenemar>

commit 2e7cea24355328102c40dd127329ddf47d55a3e2
Author: Jiri Denemark <jdenemar>
Date:   Thu Jul 2 21:46:56 2015 +0200

    qemu: Use error from Finish instead of "unexpectedly failed"
    
    When QEMU exits on destination during migration, the source reports
    either success (if the failure happened at the very end) or unhelpful
    "unexpectedly failed" error message. However, the Finish API called on
    the destination may report a real error so let's use it instead of the
    generic one.
    
    Signed-off-by: Jiri Denemark <jdenemar>

Comment 21 Fangge Jin 2015-09-09 03:33:36 UTC
Reproduce this bug on build libvirt-1.2.17-1.el7.x86_64

Steps are same as comment #13:
# virsh migrate --live mig1 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
error: operation failed: migration job: unexpectedly failed


Verify pass on build libvirt-1.2.17-8.el7.x86_64

# virsh migrate --live mig1 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
error: internal error: early end of file from monitor: possible problem:
2015-09-09T03:21:08.220185Z qemu-kvm.orig: Length mismatch: pc.ram: 0x100000000 in != 0x40000000: Invalid argument
2015-09-09T03:21:08.220247Z qemu-kvm.orig: error while loading state for instance 0x0 of device 'ram'
2015-09-09T03:21:08.220323Z qemu-kvm.orig: load of migration failed: Invalid argument

Comment 23 errata-xmlrpc 2015-11-19 05:45:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html