Bug 1090093
| Summary: | Improve error reporting when QEMU fails with incoming migration | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Daniel Berrangé <berrange> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.0 | CC: | dyuan, fjin, jdenemar, lcheng, mzhan, rbalakri, ydu, zhwang, zpeng |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | libvirt-1.2.17-2.el7 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-11-19 05:45:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
This is now fixed upstream by v1.2.8-52-g0389060:
commit 03890605dc981f46ac72d17e4ac7db0da1b88e97
Author: Jiri Denemark <jdenemar>
Date: Sat Sep 6 01:16:20 2014 +0200
qemu: Propagate QEMU errors during incoming migrations
When QEMU fails during incoming migration after we successfully started
it (i.e., during Perform or Finish phase), we report a rather unhelpful
message
Unable to read from monitor: Connection reset by peer
We already have a code that takes error messages from QEMU's error
output but we disable it once QEMU successfully starts. This patch
postpones this until the end of Finish phase during incoming migration
so that we can report a much better error message:
internal error: early end of file from monitor: possible problem:
Unknown savevm section or instance '0000:00:05.0/virtio-balloon' 0
load of migration failed
https://bugzilla.redhat.com/show_bug.cgi?id=1090093
Signed-off-by: Jiri Denemark <jdenemar>
Hi jiri I just try to reproduce this bug with the comment1's steps, since there weren't details steps, so i do some attempts to reproduce this bug, can you help check that whether they are enough to reproduce this bug, if not, can you give me some advise? thanks a lot. reproduce pkg info libvirt-1.1.1-29.el7.x86_64 qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64 steps 1.Prepare a running guest # virsh list Id Name State ---------------------------------------------------- 5 rhel7 running 2.Generate the guest's xml #virsh dumpxml rhel7 >rhel7.xml 3.Do storage migration from source to the target host #virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7.xml 4.if we kill the qemu proccess at the beginning of the migration on the target host, found it report the following error , also the guest will get crashed on the source host [Note: if didn't hit the issue, just repeat the step 3~4] # time virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7.xml root@$target_ip's password: error: Unable to read from monitor: Connection reset by peer # virsh list Id Name State ---------------------------------------------------- 5.check the qemu log on the source host nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Operation not permitted (1) 2014-12-15 06:35:46.480+0000: shutting down 2014-12-15 06:36:22.411+0000: starting up 6.if we kill the qemu process at the end of the migration on the target host, Found it report the following error on the source host # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: Migration: [ 81 %]error: operation failed: migration job: unexpectedly failed # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@target_ip's password: Migration: [ 83 %]error: operation failed: migration job: unexpectedly failed Update the libvirt and qemu-kvm-rhev packet to the latest, didn't get that error info anymore , also the guest didn't crash pkginfo libvirt-1.2.8-10.el7.x86_64 qemu-kvm-rhev-2.1.2-16.el7.x86_64 steps 1. Re-excute the reproduce steps with the new libvirt and qemu-kvm-rhev packet, get the following error, not sure whether they are expect or not. jiri, can you help confirm it? thanks 2.if we kill the qemu process at the beginning of the migration on the target host, found it always report the following error on the source host, however, it didn't get guest crashed anymore # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: error: unable to connect to server at 'rhel7f:49152': Connection refused # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: error: unable to connect to server at 'rhel7f:49152': Connection refused # virsh list Id Name State ---------------------------------------------------- 2 rhel7 running 3.if we kill the qemu proccess at the end of the migration on the target host, found it always report the following error, was this error the expect one ? # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: Migration: [ 79 %]error: operation failed: migration job: unexpectedly failed # virsh list Id Name State ---------------------------------------------------- 2 rhel7 running # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: Migration: [ 82 %]error: operation failed: migration job: unexpectedly failed # virsh list Id Name State ---------------------------------------------------- 2 rhel7 running Killing QEMU does not cause error messages to be written to QEMU's stderr so there's nothing libvirt could consume and provide back to a user. It's better if QEMU fails by itself for some reason. It may be a bit hard to cause such failure without hacking around so I made a wrapper script for qemu-kvm. On a destination host, I moved /usr/libexec/qemu-kvm to /usr/libexec/qemu-kvm.orig and created a shell script /usr/libexec/qemu-kvm which changes something in the command line and executes /usr/libexec/qemu-kvm.orig with the amended command line arguments. hi Jiri Thanks for your advise, I try to research it today, however still couldn't reproduce the issue. In comment 8 i saw you said "create a shell script /usr/libexec/qemu-kvm which changes something in the command line and executes /usr/libexec/qemu-kvm.orig with the amended command line arguments". Can you show me what i need change in the command line or give me a reproducer? thanks. The following steps were my reproduce steps source 1.Prepare a normal guest in the source host, the source host was 7.0 env #virsh dumpxml rhel7 -- <video> <model type='vga' vram='65536' heads='1'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='none'> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </memballoon> 2.run the follwoing script on the target, the target host was 7.1 env #mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.orig write shell script for qemu-kvm #cat /usr/libexec/qemu-kvm #!/bin/bash exec /usr/libexec/qemu-kvm.orig -M pc -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=1,cores=4,threads=1 -name rhel7 -rtc base=localtime,clock=host,driftfix=slew -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -vga std -monitor stdio -boot menu=on -drive file=/var/lib/libvirt/images/rhel7.0.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci -incoming tcp:0:49152 #chmod 777 /usr/libexec/qemu-kvm # /usr/libexec/qemu-kvm 3.migrate the guest to the target, check the error info virsh migrate --live rhel7 qemu+ssh://$target_ip/system --verbose --persistent --copy-storage-all --xml rhel7mig.xml It's hard to say what you changed because you didn't provide the original command line but in general, it's much better to change the arguments dynamically because libvirt also starts QEMU when probing its capabilities. So something like the following could work:
#!/bin/bash
args=()
i=0
while [[ $# -gt 0 ]]; do
if [[ "$1" == 4096 ]]; then
args[i]=1024
elif [[ "$1" == virtio-balloon-pci ]]; then
i=$[i-2]
else
args[i]="$1"
fi
i=$[i+1]
shift
done
echo "${args[@]}"
The script should change the amount of memory to 1024 MB and it should
completely remove -device virtio-balloon-pci arguments.
Hi jiri thanks for your patient reply, I get a lot from your comment and I have new cognition to this bug. Before the guest actualy migrate data, it will first copy the guest's migrable xml from the source to the target libvirtd, then the target libvirtd will start qemu process base the migrable guest's xml. so if i want to reproduce this issue, i have to change the tartget qemu commandline with your shell script before it actually start migration, right? if it is, I think what i've done in comment9 was incorrect, since the libvirt will also starts qemu when migrate the guest with libvirt command, so we can't specify the qemu comdline in the script directly, we should get the qemu cmdline which start by libvirtd, then change it with your shell scripts, however, I'm not clear how does it work while the libvirt start the qemu process in the target host, and how to change the qemu process with your shell scripts detailedly, can you show me that? thanks Oops, sorry, the last line
echo "${args[@]}"
should be replaced with
exec /usr/libexec/qemu-kvm.orig "${args[@]}"
Then you can just use the script as /usr/libexec/qemu-kvm
Hi jiri Thanks for your confirmation, currently, I could get an unexpected error while migrate the guest with your script, however the error was different with the comment0. On the other hand, the unexepcted error info still exsiting even update the libvirt version to the latest one libvirt-1.2.8-13.el7. can you help check it? thanks Reproduce this bug with libvirt-1.2.8-1 1.Prepare the migration env 2.Prepare a running guest with 4G mem and virtio-ballon-pci enabled in the guest's xml #virsh dumpxml rhel7 -- <memory unit='KiB'>4194304</memory> <currentMemory unit='KiB'>4194304</currentMemory> -- <memballoon model='virtio'> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </memballoon> #virsh dumpxml rhel7f >rhel7.xml 3.Prepare the script in the target host #mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.orig create a shell script named qemu-kvm in /usr/libexec directory # cat /usr/libexec/qemu-kvm #!/bin/bash args=() i=0 while [[ $# -gt 0 ]]; do if [[ "$1" == 4096 ]]; then args[i]=1024 elif [[ "$1" == virtio-balloon-pci ]]; then i=$[i-2] else args[i]="$1" fi i=$[i+1] shift done exec /usr/libexec/qemu-kvm.orig "${args[@]}" # ll -Z /usr/libexec/qemu-kvm -rwxr-xr-x. root root system_u:object_r:qemu_exec_t:s0 /usr/libexec/qemu-kvm 4.Migrate the guest from the source to the target, will get an unexpected error here # virsh migrate --live rhel7 --copy-storage-all --persistent --xml rhel7.xml qemu+ssh://$target_ip/system --verbose root@$target_ip's password: error: operation failed: migration job: unexpectedly failed 5.check the qemu log in the target host, Found some detailed error info #cat /var/log/libvirt/qemu/rhel7.log /usr/libexec/qemu-kvm -name rhel7 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Westmere -m 4096 -- -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on char device redirected to /dev/pts/2 (label charserial0) 2015-01-19T08:55:06.489659Z qemu-kvm.orig: Length mismatch: pc.ram: 100000000 in != 40000000 2015-01-19T08:55:06.489895Z qemu-kvm.orig: Ack, bad migration stream! 2015-01-19T08:55:06.489911Z qemu-kvm.orig: Illegal RAM offset 87667612e767000 qemu: warning: error while loading state for instance 0x0 of device 'ram' 2015-01-19T08:55:06.489951Z qemu-kvm.orig: load of migration failed: Invalid argument 2015-01-19 08:55:06.522+0000: shutting down Verify this bug with libvirt-1.2.8-13, still couldn't get the clear error in in source host 1.Excute step 1~3 in the reproduce steps 2.migrate the guest to the target , still get the unexpected error # virsh migrate --live rhel7 --copy-storage-all --persistent --xml rhel7.xml qemu+ssh://10.66.106.40/system --verbose root.106.40's password: error: operation failed: migration job: unexpectedly failed 3.check the qemu log in the target host, Found some detailed error info #cat /var/log/libvirt/qemu/rhel7.log #cat /var/log/libvirt/qemu/rhel7.log /usr/libexec/qemu-kvm -name rhel7 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Westmere -m 4096 -- -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on char device redirected to /dev/pts/2 (label charserial0) 2015-01-19T09:02:39.827950Z qemu-kvm.orig: -netdev tap,fd=23,id=hostnet0: TUNGETIFF ioctl() failed: Inappropriate ioctl for device TUNSETOFFLOAD ioctl() failed: Inappropriate ioctl for device 2015-01-19T09:02:40.086578Z qemu-kvm.orig: Length mismatch: pc.ram: 100000000 in != 40000000 2015-01-19T09:02:40.086655Z qemu-kvm.orig: Ack, bad migration stream! 2015-01-19T09:02:40.086690Z qemu-kvm.orig: Illegal RAM offset 87667612e767000 qemu: warning: error while loading state for instance 0x0 of device 'ram' 2015-01-19T09:02:40.086736Z qemu-kvm.orig: load of migration failed: Invalid argument 2015-01-19 09:02:40.095+0000: shutting down Hmm, so apparently I fixed one of the paths to provide a reasonable error message but there are more paths the source can see an error during migration and these are not covered. For example, the source reports
error: operation failed: migration job: unexpectedly failed
while the destination knows more:
2015-01-23 09:39:29.214+0000: 8871: error : qemuMonitorIO:662 : internal error: End of file from monitor
2015-01-23 09:39:29.214+0000: 8871: error : qemuMonitorIO:697 : internal error: early end of file from monitor: possible problem:
2015-01-23T09:39:29.201284Z qemu-system-x86_64.orig: Length mismatch: pc.ram: 20000000 in != 10000000
2015-01-23T09:39:29.201603Z qemu-system-x86_64.orig: Ack, bad migration stream!
2015-01-23T09:39:29.201629Z qemu-system-x86_64.orig: Illegal RAM offset 87667612e767000
qemu: warning: error while loading state for instance 0x0 of device 'ram'
2015-01-23T09:39:29.201754Z qemu-system-x86_64.orig: load of migration failed: Invalid argument
Or the source reports
error: Requested operation is not valid: domain 'qemu' is not processing incoming migration
while the destination knows more again
2015-01-23 09:31:10.827+0000: 8871: error : qemuMonitorIO:662 : internal error: End of file from monitor
2015-01-23 09:31:10.827+0000: 8871: error : qemuMonitorIO:697 : internal error: early end of file from monitor: possible problem:
Unknown savevm section or instance '0000:00:05.0/virtio-balloon' 0
2015-01-23T09:31:10.818913Z qemu-system-x86_64.orig: load of migration failed: Invalid argument
*** Bug 1176771 has been marked as a duplicate of this bug. *** Fixed upstream as of v1.2.17-87-g2e7cea2 by:
commit 04d5fb2e0a92f78eca61de59971eba7fb36c6c1d
Author: Jiri Denemark <jdenemar>
Date: Fri Jul 3 19:35:06 2015 +0200
qemu: Drop LFs at the end of error from QEMU log
Libvirt's error messages do not end with a LF. However, when reading the
error from QEMU log, we would read the LF from the log and keep it in
the message.
Signed-off-by: Jiri Denemark <jdenemar>
commit a7c22a1f2949e781aef4aa82afda407efd344f95
Author: Jiri Denemark <jdenemar>
Date: Thu Jul 2 14:21:27 2015 +0200
Introduce virHashAtomic
This is a self-locking wrapper around virHashTable. Only a limited set
of APIs are implemented now (the ones which are used in the following
patch) as more can be added on demand.
Signed-off-by: Jiri Denemark <jdenemar>
commit 9d0a2af6c230ce3007b3504907ccdba09cdc4cf2
Author: Jiri Denemark <jdenemar>
Date: Fri Jul 10 08:44:41 2015 +0200
Introduce virErrorCopyNew
A helper function for copying error objects.
Signed-off-by: Jiri Denemark <jdenemar>
commit e68f395fcbae0267368f9974cc49f582cc83c752
Author: Jiri Denemark <jdenemar>
Date: Thu Jul 2 08:26:48 2015 +0200
qemu: Remember incoming migration errors
If QEMU fails during incoming migration, the domain disappears including
a possibly useful error message read from QEMU log file. Let's remember
the error in virQEMUDriver so that Finish can report more than just "no
such domain".
Signed-off-by: Jiri Denemark <jdenemar>
commit 44c42b564dc61ddc8fdfc5b6ca4ca969bcffc7d8
Author: Jiri Denemark <jdenemar>
Date: Thu Jul 2 22:32:54 2015 +0200
qemu: Don't report false error from MigrateFinish
virDomainMigrateFinish* APIs were unfortunately designed to return the
pointer to the domain on destination and NULL on error. This looks OK in
normal cases but the same API is also called when we know migration
failed and thus we expect Finish to return NULL even if it actually did
all it was supposed to do without any error. The call is defined to
return nonnull domain pointer over RPC, which means returning NULL will
always result in an error being send. If this was not in fact an error,
the API itself wouldn't set anything to the thread local virError, which
makes the RPC layer come up with it's own "Library function returned
error but did not set virError" error.
This is quite confusing and also hard to detect by the caller. This
patch adds a special error code which can be used to check that Finish
successfully aborted migration.
Signed-off-by: Jiri Denemark <jdenemar>
commit 2e7cea24355328102c40dd127329ddf47d55a3e2
Author: Jiri Denemark <jdenemar>
Date: Thu Jul 2 21:46:56 2015 +0200
qemu: Use error from Finish instead of "unexpectedly failed"
When QEMU exits on destination during migration, the source reports
either success (if the failure happened at the very end) or unhelpful
"unexpectedly failed" error message. However, the Finish API called on
the destination may report a real error so let's use it instead of the
generic one.
Signed-off-by: Jiri Denemark <jdenemar>
Reproduce this bug on build libvirt-1.2.17-1.el7.x86_64 Steps are same as comment #13: # virsh migrate --live mig1 qemu+ssh://10.66.4.208/system --verbose root.4.208's password: error: operation failed: migration job: unexpectedly failed Verify pass on build libvirt-1.2.17-8.el7.x86_64 # virsh migrate --live mig1 qemu+ssh://10.66.4.208/system --verbose root.4.208's password: error: internal error: early end of file from monitor: possible problem: 2015-09-09T03:21:08.220185Z qemu-kvm.orig: Length mismatch: pc.ram: 0x100000000 in != 0x40000000: Invalid argument 2015-09-09T03:21:08.220247Z qemu-kvm.orig: error while loading state for instance 0x0 of device 'ram' 2015-09-09T03:21:08.220323Z qemu-kvm.orig: load of migration failed: Invalid argument Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2202.html |
Description of problem: Users frequently report problems like this "I am trying to migrate a VM from one host to another using ' # virsh migrate --persistent --xml /tmp/rex.xml --copy-storage-all --verbose --live rex qemu+ssh://neptune/system but am getting this error: error: Unable to read from monitor: Connection reset by peer" we then have to ask for their /var/log/libvirt/qemu/$VMNAME.log from the target host which contains the useful error message information "Length mismatch: 0000:00:03.0/virtio-net-pci.rom: 10000 in != 20000 qemu: warning: error while loading state for instance 0x0 of device 'ram' load of migration failed" Ideally QEMU would have a nice "migrate-incoming' monitor command we could use to get nice error reports in the normal channel, but in the absence of that libvirt could still improve its error reporting here. In the case where we have a -incoming arg provided, we need to pull the error out of the log files and use that, discarding the monitor error message NB, this user wasn't using RHEL-7, but I'm reporting this against RHEL-7 since I think fixing this would improve our supportability.