Bug 1083238
Summary: | Migration of a VM fails if another API allowed during migration takes a long time to finish | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Anitha Udgiri <audgiri>
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 6.0 | CC: | audgiri, bazulay, dyuan, ederevea, iheim, jdenemar, jentrena, lpeer, mzhan, ofrenkel, pdwyer, rbalakri, stefanha, tdosek, ukar, vfeenstr, ydu, yeylon, zhwang, zpeng
Target Milestone: | rc | Keywords: | Upstream, ZStream
Target Release: | 6.6 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | virt | |
Fixed In Version: | libvirt-0.10.2-37.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
Cloned As: | 1122908 (view as bug list) | Environment: |
Last Closed: | 2014-10-14 04:20:53 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1100671, 1122908 | |
Description (Anitha Udgiri, 2014-04-01 18:18:48 UTC)
Anitha, could you also attach the libvirtd log from the target host covering the same time frame as the log from the source? And also /var/log/libvirt/qemu/DOMAIN_NAME.log from both the source and the target hosts?

The "Thread-6067683::ERROR::2014-03-14 15:04:02,084::vm::285::vm.Vm::(run) vmId=`794d339d-f7b5-40f3-8b4f-e1fd29fd2832`::Failed to migrate" error seems to be a result of an overloaded host. While the domain was in the process of migrating to another host, vdsm called virDomainBlockStats:

```
2014-03-14 15:02:55.123+0000: 15425: debug : virDomainBlockStats:7200 : dom=0x7f49900a1e40, (VM: name=XXXX, uuid=794d339d-f7b5-40f3-8b4f-e1fd29fd2832), disk=hdc, stats=0x7f499cd8ead0, size=40
2014-03-14 15:03:14.634+0000: 15425: debug : qemuDomainObjBeginJobInternal:808 : Starting job: query (async=migration out)
2014-03-14 15:03:14.634+0000: 15425: debug : qemuMonitorGetBlockStatsInfo:1468 : mon=0x7f498415f420 dev=ide0-1-0
2014-03-14 15:03:14.634+0000: 15425: debug : qemuMonitorSend:903 : QEMU_MONITOR_SEND_MSG: mon=0x7f498415f420 msg={"execute":"query-blockstats","id":"libvirt-1093563"} fd=-1
```

While this thread was waiting for the result of the query-blockstats QMP command, the migration thread wanted to check the state of the migration. For that it needs exclusive access to the monitor, which was still in use by thread 15425, so waiting for the job timed out:

```
2014-03-14 15:03:44.000+0000: 15429: warning : qemuDomainObjBeginJobInternal:839 : Cannot start job (async nested, none) for domain XXXX; current job is (query, migration out) owned by (15425, 15429)
2014-03-14 15:03:44.000+0000: 15429: error : qemuDomainObjBeginJobInternal:843 : Timed out during operation: cannot acquire state change lock
```

Libvirt considered this a fatal error (such timeouts are signs of a non-responsive or hung qemu-kvm process) and cancelled the migration:

```
2014-03-14 15:03:44.000+0000: 15429: debug : doPeer2PeerMigrate3:2544 : Finish3 0x7f4984160710 ret=-1
2014-03-14 15:03:55.220+0000: 15429: error : virNetClientProgramDispatchError:174 : An error occurred, but the cause is unknown
2014-03-14 15:03:55.220+0000: 15429: debug : doPeer2PeerMigrate3:2577 : Confirm3 0x7f498c00dd00 cancelled=1 vm=0x7f498416a300
2014-03-14 15:03:55.220+0000: 15429: debug : qemuMigrationConfirm:3227 : driver=0x7f49900f5fa0, conn=0x7f498c00dd00, vm=0x7f498416a300, cookiein=(null), cookieinlen=0, flags=3, retcode=1
2014-03-14 15:03:55.220+0000: 15429: debug : qemuDomainObjSetJobPhase:689 : Setting 'migration out' phase to 'confirm3_cancelled'
2014-03-14 15:03:55.220+0000: 15429: debug : qemuMigrationEatCookie:755 : cookielen=0 cookie='(null)'
2014-03-14 15:03:55.220+0000: 15429: debug : qemuProcessStartCPUs:2825 : Using lock state '(null)'
2014-03-14 15:04:01.948+0000: 15429: debug : qemuDomainObjBeginJobInternal:808 : Starting job: async nested (async=migration out)
2014-03-14 15:04:01.948+0000: 15429: debug : qemuMonitorStartCPUs:1187 : mon=0x7f498415f420
2014-03-14 15:04:01.948+0000: 15429: debug : qemuMonitorSend:903 : QEMU_MONITOR_SEND_MSG: mon=0x7f498415f420 msg={"execute":"cont","id":"libvirt-1093564"}
```

Note the "An error occurred, but the cause is unknown" returned by the Finish method called on the target host. We need the logs from there to see what happened, but I guess the domain actually disappeared by itself on the target host before we attempted to kill it explicitly.
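To make the failing pattern concrete, here is a minimal sketch (not the actual vdsm code) of the two concurrent operations described above: one thread polling virDomainBlockStats() while the main thread live-migrates the same domain. The domain name, disk target, and destination URI are illustrative placeholders.

```c
/* Minimal sketch of the API pattern behind this bug: a stats thread
 * calls virDomainBlockStats() while the main thread live-migrates the
 * same domain.  "XXXX", "hdc" and the destination URI are placeholders.
 * Build with: gcc repro.c -o repro -lvirt -lpthread */
#include <libvirt/libvirt.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int migrating = 1;

static void *poll_block_stats(void *arg)
{
    virDomainPtr dom = arg;
    virDomainBlockStatsStruct stats;

    while (migrating) {
        /* On affected libvirt versions, if QEMU answers the underlying
         * query-blockstats slowly, this call holds the monitor job long
         * enough for the migration thread's status poll to hit the 30s
         * "cannot acquire state change lock" timeout. */
        if (virDomainBlockStats(dom, "hdc", &stats, sizeof(stats)) == 0)
            printf("rd_bytes=%lld wr_bytes=%lld\n",
                   (long long)stats.rd_bytes, (long long)stats.wr_bytes);
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "XXXX") : NULL;
    pthread_t tid;

    if (!dom) {
        fprintf(stderr, "cannot connect or find domain\n");
        return 1;
    }

    pthread_create(&tid, NULL, poll_block_stats, dom);

    /* Live peer-to-peer migration, the mode vdsm uses. */
    if (virDomainMigrateToURI(dom, "qemu+tcp://target-host/system",
                              VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER,
                              NULL, 0) < 0)
        fprintf(stderr, "migration failed\n");

    migrating = 0;
    pthread_join(tid, NULL);
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```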
The original query-blockstats request finished just before the end of the cancelled migration:

```
2014-03-14 15:04:01.946+0000: 15413: debug : qemuMonitorIOProcess:353 : QEMU_MONITOR_IO_PROCESS: mon=0x7f498415f420 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "wr_bytes": 0, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_bytes": 0, "rd_operations": 0}}, {"device": "drive-virtio-disk0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 46814906368, "wr_total_time_ns": 0, "wr_bytes": 0, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_bytes": 0, "rd_operations": 0}}, "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 46814906368, "wr_total_time_ns": 88839345819632, "wr_bytes": 361862714368, "rd_total_time_ns": 13164079237085, "flush_operations": 0, "wr_operations": 33862105, "rd_bytes": 249739890688, "rd_operations": 9968284}}], "id": "libvirt-1093563"} len=796
2014-03-14 15:04:01.947+0000: 15425: debug : qemuDomainObjEndJob:937 : Stopping job: query (async=migration out)
2014-03-14 15:04:01.948+0000: 15425: debug : virDomainFree:2281 : dom=0x7f49900a1e40, (VM: name=XXXX, uuid=794d339d-f7b5-40f3-8b4f-e1fd29fd2832)
```

(In reply to Jiri Denemark from comment #10)
> Anitha, could you also attach the libvirtd log from the target host covering
> the same time frame as the log from the source? And also
> /var/log/libvirt/qemu/DOMAIN_NAME.log from both the source and the target
> hosts?

Jiri, attached the same.

Thanks, so the "An error occurred, but the cause is unknown" error message is a harmless bug in qemuMigrationFinish. If the source aborts the migration for whatever reason and calls virDomainMigrateFinish3 with a non-zero cancelled argument, the API returns NULL (normally used to indicate an error) without setting any error (logically, because it correctly killed QEMU and cleaned up after the cancelled incoming migration).

So there's apparently a bug in our migration code that causes migration to be aborted if QEMU is slow to respond.

The "Domain not found: no domain with matching uuid" error only shows up on the destination and is a result of something (likely vdsm) calling virDomainLookupByUUID on the domain just after it was killed because the source cancelled the migration.

(In reply to Jiri Denemark from comment #17)
> So there's apparently a bug in our migration code that causes migration to
> be aborted if QEMU is slow to respond.
>
> The "Domain not found: no domain with matching uuid" error only shows up on
> the destination and is a result of something (likely vdsm) calling
> virDomainLookupByUUID on the domain just after it was killed because the
> source cancelled the migration.

Thanks Jiri. What are the possible circumstances where the QEMU response is slow? I ask because in one of the cases, one of the VMs fails to migrate, possibly due to heavy load. But the same VM, when migrated separately under a perceivably lighter load, continues to fail with the same error in one case but succeeds in another case.

Hard to say what causes QEMU to respond so slowly. We would need a QEMU expert to help answer this question. Stefan, do you have any idea why query-blockstats would be so slow (more than 30 seconds) to respond? In any case, I'm afraid we won't be able to fix this bug without replacing our query-migrate polling with a migration-completed event, which was apparently forgotten (see bug 580006).
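For reference, the query-migrate polling mentioned above is internal to libvirt's qemu driver, but a management client can watch the same migration progress from the outside with virDomainGetJobInfo(). A rough sketch, assuming a migration of a placeholder domain "XXXX" was already started elsewhere (e.g. by vdsm or virsh):

```c
/* Rough client-side analog of the query-migrate polling discussed
 * above: watch an in-progress migration via virDomainGetJobInfo().
 * The domain name is a placeholder; the migration itself is assumed
 * to have been started by another client. */
#include <libvirt/libvirt.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = conn ? virDomainLookupByName(conn, "XXXX") : NULL;
    virDomainJobInfo info;

    if (!dom)
        return 1;

    /* A migration shows up as an unbounded job; poll until it ends. */
    while (virDomainGetJobInfo(dom, &info) == 0 &&
           info.type == VIR_DOMAIN_JOB_UNBOUNDED) {
        if (info.dataTotal > 0)
            printf("migrated %llu of %llu bytes\n",
                   info.dataProcessed, info.dataTotal);
        sleep(1);
    }
    printf("job finished or no longer running\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```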
Actually, until the event is supported by qemu, the migration thread could possibly ignore lock timeouts when it asks for migration status. I'll think about it more and see if it's doable.

(In reply to Jiri Denemark from comment #16)
> Thanks, so the "An error occurred, but the cause is unknown" error message
> is a harmless bug in qemuMigrationFinish. If the source aborts the
> migration for whatever reason and calls virDomainMigrateFinish3 with a
> non-zero cancelled argument, the API returns NULL (normally used to
> indicate an error) without setting any error (logically, because it
> correctly killed QEMU and cleaned up after the cancelled incoming
> migration).

IMO this should be changed to reflect that the migration was cancelled. It's not correct that the cause is unknown.

(In reply to Jiri Denemark from comment #19)
> Hard to say what causes QEMU to respond so slowly. We would need a QEMU
> expert to help answer this question. Stefan, do you have any idea why
> query-blockstats would be so slow (more than 30 seconds) to respond?

Jiri, sorry for the late response. I looked at qemu-kvm-0.12.1.2-2.415.el6_5.7 query-blockstats: the query-blockstats command itself is lightweight and does not perform blocking operations. I suspect that something else (migration) is monopolizing the QEMU main loop. If you'd like to find the root cause, we'd need to look at the stack trace of the QEMU process (e.g. gstack $(pgrep qemu)).

(In reply to Jiri Denemark from comment #20)
> Actually, until the event is supported by qemu, the migration thread could
> possibly ignore lock timeouts when it asks for migration status. I'll think
> about it more and see if it's doable.

This solution should work and seems reasonable until we have the event implemented in qemu. I'll propose a patch upstream.

Patches implementing the change suggested in comment #20 were sent upstream for review: https://www.redhat.com/archives/libvir-list/2014-May/msg00403.html

This is fixed upstream by v1.2.4-60-g1643b15..v1.2.4-63-gcc4882f:

```
commit 1643b158f6de87a133a5d8a80f69786055f709b1
Author: Jiri Denemark <jdenemar>
Date:   Tue May 13 14:34:55 2014 +0200

    qemu: Make qemuDomainObjBeginNestedJob static

    It's only used within qemu_domain.c.

    Signed-off-by: Jiri Denemark <jdenemar>

commit dca377dbd63a514963a06f4f15aa61fb8cb7f19b
Author: Jiri Denemark <jdenemar>
Date:   Tue May 13 14:39:35 2014 +0200

    qemuDomainObjBeginNestedJob: Return -2 for temporary failures

    If job queue is full or waiting for a job times out, the function
    returns -2 so that it can be handled in a different way by callers.
    The change is safe since all existing callers of
    qemuDomainObjBeginNestedJob check the return value to be less than
    zero.

    Signed-off-by: Jiri Denemark <jdenemar>

commit b5e0532e0446c6255ecedc2d459f4d47172b9225
Author: Jiri Denemark <jdenemar>
Date:   Wed May 14 13:22:34 2014 +0200

    qemu: Make qemuProcess{Start,Stop}CPUs easier to follow

    As a side effect, the return value of qemuDomainObjEnterMonitorAsync
    is not directly used as the return value of
    qemuProcess{Start,Stop}CPUs.

    Signed-off-by: Jiri Denemark <jdenemar>

commit cc4882f4ab470334b37e6dd8e65cba60c16e868a
Author: Jiri Denemark <jdenemar>
Date:   Tue May 13 14:45:31 2014 +0200

    qemu: Ignore temporary job errors when checking migration status

    When qemu driver is polling for migration to finish (in
    qemuMigrationWaitForCompletion), it may happen that another job
    allowed during migration is running and if it does not finish within
    30 seconds, migration would be cancelled because of that. However,
    we can just ignore the timeout and let the waiting loop try again
    later.

    If an event fired at the end of migration is ever implemented in
    QEMU, we can just wait for the event instead of polling for
    migration status and libvirt will behave consistently, i.e.,
    migration won't be cancelled in case another job started during
    migration takes long time to finish.

    For bug https://bugzilla.redhat.com/show_bug.cgi?id=1083238

    Signed-off-by: Jiri Denemark <jdenemar>
```
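Reduced to a sketch, the control flow these commits introduce looks roughly like the following. This is a simplified illustration, not the actual libvirt source; the helper functions are stubs standing in for the qemu driver internals.

```c
/* Simplified sketch of the fix described in the commits above: the
 * migration wait loop treats -2 ("job queue full or waiting for the
 * job timed out") as a temporary condition and polls again, instead
 * of cancelling the migration.  Not the real libvirt code. */
#include <stdio.h>
#include <unistd.h>

/* Stands in for qemuDomainObjBeginNestedJob(): 0 on success,
 * -2 on a temporary failure, -1 on a fatal error. */
static int begin_nested_job(void)
{
    return 0;  /* stub: always succeeds */
}

/* Stands in for querying migration status over the monitor. */
static int fetch_migration_status(int *completed)
{
    *completed = 1;  /* stub: pretend migration just finished */
    return 0;
}

static int wait_for_completion(void)
{
    int completed = 0;

    while (!completed) {
        int rc = begin_nested_job();

        if (rc == -2) {
            /* Another job allowed during migration (such as the
             * query-blockstats call in this report) still holds the
             * monitor: skip this poll and retry later, rather than
             * treating the timeout as a hung QEMU and aborting. */
            sleep(1);
            continue;
        }
        if (rc < 0)
            return -1;  /* real error: abort the migration */

        if (fetch_migration_status(&completed) < 0)
            return -1;
        /* ...end the nested job, publish progress to callers... */
        if (!completed)
            sleep(1);
    }
    return 0;
}

int main(void)
{
    printf("migration %s\n",
           wait_for_completion() == 0 ? "completed" : "failed");
    return 0;
}
```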
Reproduction steps:

1. In the first terminal, set the breakpoint in libvirtd.

1.1. Find the libvirtd PID:

```
# ps aux | grep libvirtd
root 15352 0.5 0.1 831628 15368 ? Sl 16:16 0:00 libvirtd --daemon
root 15486 0.0 0.0 103252 836 pts/2 S+ 16:17 0:00 grep libvirtd
```

1.2. Attach gdb and set the breakpoint:

```
# gdb libvirtd 15352
(gdb) br qemuMonitorGetBlockStatsInfo
Breakpoint 1 at 0x477b90: file qemu/qemu_monitor.c, line 1483.
(gdb) i b
Num Type       Disp Enb Address            What
1   breakpoint keep y   0x0000000000477b90 in qemuMonitorGetBlockStatsInfo at qemu/qemu_monitor.c:1483
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
```

2. Open the second terminal and perform the migration.

2.1. Prepare the migration environment and start the guest:

```
# virsh start rhel7k
Domain rhel7k started
```

2.2. Set the migration speed to a low value:

```
# virsh migrate-setspeed rhel7k 1
# virsh migrate-getspeed rhel7k
1
```

2.3. Start the migration:

```
# virsh migrate --live rhel7k qemu+ssh://$target_ip/system --verbose
```

3. Before the migration finishes, execute the following command in a third terminal; gdb stops libvirtd at the breakpoint in the first terminal:

```
# virsh domblkstat rhel7k hda
```

4. Open a fourth terminal and stop the qemu process:

```
# ps aux | grep $guest_name
# kill -STOP $qemu_process_id
```

5. Let libvirtd continue running in the first terminal:

```
(gdb) c
Continuing.
```

6. Wait about 30 seconds; the migration fails in the second terminal:

```
# virsh migrate --live rhel7k qemu+ssh://$target_ip/system --verbose
root@$target_ip's password:
Migration: [ 67 %]error: Timed out during operation: cannot acquire state change lock
```

Verified this bug with libvirt-0.10.2-37.el6; the following were the verification steps:

1. Execute steps 1-5 of the reproduction steps above.
2. Wait for more than 30 seconds; the migration does not fail in the second terminal, it just waits for the other job to finish.
3. Run the following command in the fourth terminal; the guest continues to migrate and successfully migrates to the target:

```
# kill -CONT $qemu_process_id
```

According to the above, marking this bug verified.

(In reply to zhenfeng wang from comment #31)
> 1. In the first terminal, set the breakpoint in libvirtd.
>
> 1.1. Find the libvirtd PID:
>
> # ps aux | grep libvirtd
> root 15352 0.5 0.1 831628 15368 ? Sl 16:16 0:00 libvirtd --daemon
> root 15486 0.0 0.0 103252 836 pts/2 S+ 16:17 0:00 grep libvirtd
>
> 1.2. Attach gdb and set the breakpoint:
>
> # gdb libvirtd 15352
> (gdb) br qemuMonitorGetBlockStatsInfo
> Breakpoint 1 at 0x477b90: file qemu/qemu_monitor.c, line 1483.
> (gdb) r
> The program being debugged has been started already.
> Start it from the beginning? (y or n) y

Sorry, I forgot to give the package used for reproducing in comment 31; I reproduced this bug with libvirt-0.10.2-35.el6.x86_64.rpm.
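As an aside, steps 4-6 of the reproduction and step 3 of the verification (freezing QEMU past the 30-second job-wait timeout, then resuming it) can also be driven by a small program instead of manual kill commands. A sketch, with the QEMU PID supplied on the command line:

```c
/* Sketch of the "hung QEMU" simulation from the reproduction steps:
 * SIGSTOP freezes the qemu-kvm process so the pending query-blockstats
 * never returns, and SIGCONT after the 30s job-wait timeout lets the
 * migration (on fixed libvirt) pick up where it left off.
 * Usage: ./freeze-qemu <qemu_pid>   (PID found via ps, as in step 4) */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    pid_t pid;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <qemu_pid>\n", argv[0]);
        return 1;
    }
    pid = (pid_t)atoi(argv[1]);

    if (kill(pid, SIGSTOP) < 0) {   /* step 4: freeze QEMU */
        perror("kill(SIGSTOP)");
        return 1;
    }
    sleep(40);                      /* longer than the 30s timeout */
    if (kill(pid, SIGCONT) < 0) {   /* verification step 3: resume */
        perror("kill(SIGCONT)");
        return 1;
    }
    return 0;
}
```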
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html