Bug 1090093
Summary: | Improve error reporting when QEMU fails with incoming migration | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Daniel Berrangé <berrange> |
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.0 | CC: | dyuan, fjin, jdenemar, lcheng, mzhan, rbalakri, ydu, zhwang, zpeng |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-1.2.17-2.el7 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-11-19 05:45:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Daniel Berrangé
2014-04-22 14:47:29 UTC
This is now fixed upstream by v1.2.8-52-g0389060: commit 03890605dc981f46ac72d17e4ac7db0da1b88e97 Author: Jiri Denemark <jdenemar> Date: Sat Sep 6 01:16:20 2014 +0200 qemu: Propagate QEMU errors during incoming migrations When QEMU fails during incoming migration after we successfully started it (i.e., during Perform or Finish phase), we report a rather unhelpful message Unable to read from monitor: Connection reset by peer We already have a code that takes error messages from QEMU's error output but we disable it once QEMU successfully starts. This patch postpones this until the end of Finish phase during incoming migration so that we can report a much better error message: internal error: early end of file from monitor: possible problem: Unknown savevm section or instance '0000:00:05.0/virtio-balloon' 0 load of migration failed https://bugzilla.redhat.com/show_bug.cgi?id=1090093 Signed-off-by: Jiri Denemark <jdenemar> Hi jiri I just try to reproduce this bug with the comment1's steps, since there weren't details steps, so i do some attempts to reproduce this bug, can you help check that whether they are enough to reproduce this bug, if not, can you give me some advise? thanks a lot. reproduce pkg info libvirt-1.1.1-29.el7.x86_64 qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64 steps 1.Prepare a running guest # virsh list Id Name State ---------------------------------------------------- 5 rhel7 running 2.Generate the guest's xml #virsh dumpxml rhel7 >rhel7.xml 3.Do storage migration from source to the target host #virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7.xml 4.if we kill the qemu proccess at the beginning of the migration on the target host, found it report the following error , also the guest will get crashed on the source host [Note: if didn't hit the issue, just repeat the step 3~4] # time virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7.xml root@$target_ip's password: error: Unable to read from monitor: Connection reset by peer # virsh list Id Name State ---------------------------------------------------- 5.check the qemu log on the source host nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) block job error in device 'drive-virtio-disk0': Input/output error (5) nbd.c:nbd_receive_reply():L746: read failed block job error in device 'drive-virtio-disk0': Operation not permitted (1) 2014-12-15 06:35:46.480+0000: shutting down 2014-12-15 06:36:22.411+0000: starting up 6.if we kill the qemu process at the end of the migration on the target host, Found it report the following error on the source host # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: Migration: [ 81 %]error: operation failed: migration job: unexpectedly failed # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@target_ip's password: Migration: [ 83 %]error: operation failed: migration job: unexpectedly failed Update the libvirt and qemu-kvm-rhev packet to the latest, didn't get that error info anymore , also the guest didn't crash pkginfo libvirt-1.2.8-10.el7.x86_64 qemu-kvm-rhev-2.1.2-16.el7.x86_64 steps 1. Re-excute the reproduce steps with the new libvirt and qemu-kvm-rhev packet, get the following error, not sure whether they are expect or not. jiri, can you help confirm it? thanks 2.if we kill the qemu process at the beginning of the migration on the target host, found it always report the following error on the source host, however, it didn't get guest crashed anymore # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: error: unable to connect to server at 'rhel7f:49152': Connection refused # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: error: unable to connect to server at 'rhel7f:49152': Connection refused # virsh list Id Name State ---------------------------------------------------- 2 rhel7 running 3.if we kill the qemu proccess at the end of the migration on the target host, found it always report the following error, was this error the expect one ? # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: Migration: [ 79 %]error: operation failed: migration job: unexpectedly failed # virsh list Id Name State ---------------------------------------------------- 2 rhel7 running # virsh migrate --live --copy-storage-all rhel7 qemu+ssh://$target_ip/system --verbose --xml rhel7s.xml root@$target_ip's password: Migration: [ 82 %]error: operation failed: migration job: unexpectedly failed # virsh list Id Name State ---------------------------------------------------- 2 rhel7 running Killing QEMU does not cause error messages to be written to QEMU's stderr so there's nothing libvirt could consume and provide back to a user. It's better if QEMU fails by itself for some reason. It may be a bit hard to cause such failure without hacking around so I made a wrapper script for qemu-kvm. On a destination host, I moved /usr/libexec/qemu-kvm to /usr/libexec/qemu-kvm.orig and created a shell script /usr/libexec/qemu-kvm which changes something in the command line and executes /usr/libexec/qemu-kvm.orig with the amended command line arguments. hi Jiri Thanks for your advise, I try to research it today, however still couldn't reproduce the issue. In comment 8 i saw you said "create a shell script /usr/libexec/qemu-kvm which changes something in the command line and executes /usr/libexec/qemu-kvm.orig with the amended command line arguments". Can you show me what i need change in the command line or give me a reproducer? thanks. The following steps were my reproduce steps source 1.Prepare a normal guest in the source host, the source host was 7.0 env #virsh dumpxml rhel7 -- <video> <model type='vga' vram='65536' heads='1'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='none'> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </memballoon> 2.run the follwoing script on the target, the target host was 7.1 env #mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.orig write shell script for qemu-kvm #cat /usr/libexec/qemu-kvm #!/bin/bash exec /usr/libexec/qemu-kvm.orig -M pc -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=1,cores=4,threads=1 -name rhel7 -rtc base=localtime,clock=host,driftfix=slew -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -vga std -monitor stdio -boot menu=on -drive file=/var/lib/libvirt/images/rhel7.0.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci -incoming tcp:0:49152 #chmod 777 /usr/libexec/qemu-kvm # /usr/libexec/qemu-kvm 3.migrate the guest to the target, check the error info virsh migrate --live rhel7 qemu+ssh://$target_ip/system --verbose --persistent --copy-storage-all --xml rhel7mig.xml It's hard to say what you changed because you didn't provide the original command line but in general, it's much better to change the arguments dynamically because libvirt also starts QEMU when probing its capabilities. So something like the following could work: #!/bin/bash args=() i=0 while [[ $# -gt 0 ]]; do if [[ "$1" == 4096 ]]; then args[i]=1024 elif [[ "$1" == virtio-balloon-pci ]]; then i=$[i-2] else args[i]="$1" fi i=$[i+1] shift done echo "${args[@]}" The script should change the amount of memory to 1024 MB and it should completely remove -device virtio-balloon-pci arguments. Hi jiri thanks for your patient reply, I get a lot from your comment and I have new cognition to this bug. Before the guest actualy migrate data, it will first copy the guest's migrable xml from the source to the target libvirtd, then the target libvirtd will start qemu process base the migrable guest's xml. so if i want to reproduce this issue, i have to change the tartget qemu commandline with your shell script before it actually start migration, right? if it is, I think what i've done in comment9 was incorrect, since the libvirt will also starts qemu when migrate the guest with libvirt command, so we can't specify the qemu comdline in the script directly, we should get the qemu cmdline which start by libvirtd, then change it with your shell scripts, however, I'm not clear how does it work while the libvirt start the qemu process in the target host, and how to change the qemu process with your shell scripts detailedly, can you show me that? thanks Oops, sorry, the last line echo "${args[@]}" should be replaced with exec /usr/libexec/qemu-kvm.orig "${args[@]}" Then you can just use the script as /usr/libexec/qemu-kvm Hi jiri Thanks for your confirmation, currently, I could get an unexpected error while migrate the guest with your script, however the error was different with the comment0. On the other hand, the unexepcted error info still exsiting even update the libvirt version to the latest one libvirt-1.2.8-13.el7. can you help check it? thanks Reproduce this bug with libvirt-1.2.8-1 1.Prepare the migration env 2.Prepare a running guest with 4G mem and virtio-ballon-pci enabled in the guest's xml #virsh dumpxml rhel7 -- <memory unit='KiB'>4194304</memory> <currentMemory unit='KiB'>4194304</currentMemory> -- <memballoon model='virtio'> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </memballoon> #virsh dumpxml rhel7f >rhel7.xml 3.Prepare the script in the target host #mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.orig create a shell script named qemu-kvm in /usr/libexec directory # cat /usr/libexec/qemu-kvm #!/bin/bash args=() i=0 while [[ $# -gt 0 ]]; do if [[ "$1" == 4096 ]]; then args[i]=1024 elif [[ "$1" == virtio-balloon-pci ]]; then i=$[i-2] else args[i]="$1" fi i=$[i+1] shift done exec /usr/libexec/qemu-kvm.orig "${args[@]}" # ll -Z /usr/libexec/qemu-kvm -rwxr-xr-x. root root system_u:object_r:qemu_exec_t:s0 /usr/libexec/qemu-kvm 4.Migrate the guest from the source to the target, will get an unexpected error here # virsh migrate --live rhel7 --copy-storage-all --persistent --xml rhel7.xml qemu+ssh://$target_ip/system --verbose root@$target_ip's password: error: operation failed: migration job: unexpectedly failed 5.check the qemu log in the target host, Found some detailed error info #cat /var/log/libvirt/qemu/rhel7.log /usr/libexec/qemu-kvm -name rhel7 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Westmere -m 4096 -- -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on char device redirected to /dev/pts/2 (label charserial0) 2015-01-19T08:55:06.489659Z qemu-kvm.orig: Length mismatch: pc.ram: 100000000 in != 40000000 2015-01-19T08:55:06.489895Z qemu-kvm.orig: Ack, bad migration stream! 2015-01-19T08:55:06.489911Z qemu-kvm.orig: Illegal RAM offset 87667612e767000 qemu: warning: error while loading state for instance 0x0 of device 'ram' 2015-01-19T08:55:06.489951Z qemu-kvm.orig: load of migration failed: Invalid argument 2015-01-19 08:55:06.522+0000: shutting down Verify this bug with libvirt-1.2.8-13, still couldn't get the clear error in in source host 1.Excute step 1~3 in the reproduce steps 2.migrate the guest to the target , still get the unexpected error # virsh migrate --live rhel7 --copy-storage-all --persistent --xml rhel7.xml qemu+ssh://10.66.106.40/system --verbose root.106.40's password: error: operation failed: migration job: unexpectedly failed 3.check the qemu log in the target host, Found some detailed error info #cat /var/log/libvirt/qemu/rhel7.log #cat /var/log/libvirt/qemu/rhel7.log /usr/libexec/qemu-kvm -name rhel7 -S -machine pc-i440fx-rhel7.1.0,accel=kvm,usb=off -cpu Westmere -m 4096 -- -incoming tcp:[::]:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on char device redirected to /dev/pts/2 (label charserial0) 2015-01-19T09:02:39.827950Z qemu-kvm.orig: -netdev tap,fd=23,id=hostnet0: TUNGETIFF ioctl() failed: Inappropriate ioctl for device TUNSETOFFLOAD ioctl() failed: Inappropriate ioctl for device 2015-01-19T09:02:40.086578Z qemu-kvm.orig: Length mismatch: pc.ram: 100000000 in != 40000000 2015-01-19T09:02:40.086655Z qemu-kvm.orig: Ack, bad migration stream! 2015-01-19T09:02:40.086690Z qemu-kvm.orig: Illegal RAM offset 87667612e767000 qemu: warning: error while loading state for instance 0x0 of device 'ram' 2015-01-19T09:02:40.086736Z qemu-kvm.orig: load of migration failed: Invalid argument 2015-01-19 09:02:40.095+0000: shutting down Hmm, so apparently I fixed one of the paths to provide a reasonable error message but there are more paths the source can see an error during migration and these are not covered. For example, the source reports error: operation failed: migration job: unexpectedly failed while the destination knows more: 2015-01-23 09:39:29.214+0000: 8871: error : qemuMonitorIO:662 : internal error: End of file from monitor 2015-01-23 09:39:29.214+0000: 8871: error : qemuMonitorIO:697 : internal error: early end of file from monitor: possible problem: 2015-01-23T09:39:29.201284Z qemu-system-x86_64.orig: Length mismatch: pc.ram: 20000000 in != 10000000 2015-01-23T09:39:29.201603Z qemu-system-x86_64.orig: Ack, bad migration stream! 2015-01-23T09:39:29.201629Z qemu-system-x86_64.orig: Illegal RAM offset 87667612e767000 qemu: warning: error while loading state for instance 0x0 of device 'ram' 2015-01-23T09:39:29.201754Z qemu-system-x86_64.orig: load of migration failed: Invalid argument Or the source reports error: Requested operation is not valid: domain 'qemu' is not processing incoming migration while the destination knows more again 2015-01-23 09:31:10.827+0000: 8871: error : qemuMonitorIO:662 : internal error: End of file from monitor 2015-01-23 09:31:10.827+0000: 8871: error : qemuMonitorIO:697 : internal error: early end of file from monitor: possible problem: Unknown savevm section or instance '0000:00:05.0/virtio-balloon' 0 2015-01-23T09:31:10.818913Z qemu-system-x86_64.orig: load of migration failed: Invalid argument *** Bug 1176771 has been marked as a duplicate of this bug. *** Fixed upstream as of v1.2.17-87-g2e7cea2 by: commit 04d5fb2e0a92f78eca61de59971eba7fb36c6c1d Author: Jiri Denemark <jdenemar> Date: Fri Jul 3 19:35:06 2015 +0200 qemu: Drop LFs at the end of error from QEMU log Libvirt's error messages do not end with a LF. However, when reading the error from QEMU log, we would read the LF from the log and keep it in the message. Signed-off-by: Jiri Denemark <jdenemar> commit a7c22a1f2949e781aef4aa82afda407efd344f95 Author: Jiri Denemark <jdenemar> Date: Thu Jul 2 14:21:27 2015 +0200 Introduce virHashAtomic This is a self-locking wrapper around virHashTable. Only a limited set of APIs are implemented now (the ones which are used in the following patch) as more can be added on demand. Signed-off-by: Jiri Denemark <jdenemar> commit 9d0a2af6c230ce3007b3504907ccdba09cdc4cf2 Author: Jiri Denemark <jdenemar> Date: Fri Jul 10 08:44:41 2015 +0200 Introduce virErrorCopyNew A helper function for copying error objects. Signed-off-by: Jiri Denemark <jdenemar> commit e68f395fcbae0267368f9974cc49f582cc83c752 Author: Jiri Denemark <jdenemar> Date: Thu Jul 2 08:26:48 2015 +0200 qemu: Remember incoming migration errors If QEMU fails during incoming migration, the domain disappears including a possibly useful error message read from QEMU log file. Let's remember the error in virQEMUDriver so that Finish can report more than just "no such domain". Signed-off-by: Jiri Denemark <jdenemar> commit 44c42b564dc61ddc8fdfc5b6ca4ca969bcffc7d8 Author: Jiri Denemark <jdenemar> Date: Thu Jul 2 22:32:54 2015 +0200 qemu: Don't report false error from MigrateFinish virDomainMigrateFinish* APIs were unfortunately designed to return the pointer to the domain on destination and NULL on error. This looks OK in normal cases but the same API is also called when we know migration failed and thus we expect Finish to return NULL even if it actually did all it was supposed to do without any error. The call is defined to return nonnull domain pointer over RPC, which means returning NULL will always result in an error being send. If this was not in fact an error, the API itself wouldn't set anything to the thread local virError, which makes the RPC layer come up with it's own "Library function returned error but did not set virError" error. This is quite confusing and also hard to detect by the caller. This patch adds a special error code which can be used to check that Finish successfully aborted migration. Signed-off-by: Jiri Denemark <jdenemar> commit 2e7cea24355328102c40dd127329ddf47d55a3e2 Author: Jiri Denemark <jdenemar> Date: Thu Jul 2 21:46:56 2015 +0200 qemu: Use error from Finish instead of "unexpectedly failed" When QEMU exits on destination during migration, the source reports either success (if the failure happened at the very end) or unhelpful "unexpectedly failed" error message. However, the Finish API called on the destination may report a real error so let's use it instead of the generic one. Signed-off-by: Jiri Denemark <jdenemar> Reproduce this bug on build libvirt-1.2.17-1.el7.x86_64 Steps are same as comment #13: # virsh migrate --live mig1 qemu+ssh://10.66.4.208/system --verbose root.4.208's password: error: operation failed: migration job: unexpectedly failed Verify pass on build libvirt-1.2.17-8.el7.x86_64 # virsh migrate --live mig1 qemu+ssh://10.66.4.208/system --verbose root.4.208's password: error: internal error: early end of file from monitor: possible problem: 2015-09-09T03:21:08.220185Z qemu-kvm.orig: Length mismatch: pc.ram: 0x100000000 in != 0x40000000: Invalid argument 2015-09-09T03:21:08.220247Z qemu-kvm.orig: error while loading state for instance 0x0 of device 'ram' 2015-09-09T03:21:08.220323Z qemu-kvm.orig: load of migration failed: Invalid argument Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2202.html |