Created attachment 1145007 [details]
qemu log

Description of problem:
Stress the guest, then start a migration; the migration fails with the error "operation failed: migration job: unexpectedly failed".

Version-Release number of selected component (if applicable):
libvirt-1.3.2-1.el7.x86_64
qemu-kvm-rhev-2.5.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stress the guest using stress:
# stress --vm 10 --vm-bytes 128M
2. Start the migration:
# virsh migrate rhel7.2 qemu+ssh://10.66.4.148/system --verbose --live
root.196.125's password:
Migration: [ 86 %]error: operation failed: migration job: unexpectedly failed
3. Check the qemu log:
# cat /var/log/libvirt/qemu/rhel7.2.log
...
2016-04-08T05:37:44.302478Z qemu-kvm: socket_writev_buffer: Got err=104 for (131328/18446744073709551615)
main_channel_migrate_src_complete: main_channel_migrate_src_complete: client 0x7f6825107480 MIGRATE_CANCEL
...

Actual results:
After about 2 minutes, the migration fails with the error "migration job: unexpectedly failed".

Expected results:
The migration completes successfully.

Additional info:
1. Migration also fails with the same error when playing an HD video in the guest.
2. When the source host has a high-performance CPU and memory, the migration waits a very long time and either never finishes or fails with the same error.
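For scale (a rough sketch added here, not part of the original report): the stress invocation above keeps 10 workers, each repeatedly touching 128 MiB, so on the order of 1.25 GiB of guest memory is being redirtied continuously, which is what makes it hard for the live-migration transfer to outrun the dirty-page rate:

```shell
# 10 --vm workers x 128 MiB each, dirtied in a loop by stress:
echo "$((10 * 128)) MiB redirtied continuously"
# 1280 MiB redirtied continuously
```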
Hi,

Can you repeat this test please and include all of the following:
a) The qemu log from the destination
b) The qemu log from the source
c) The libvirt log from the destination
d) The libvirt log from the source

Also, can you explain which hosts you were using for this test, and which different hosts you were using for the test where you say "When the source host with high performance cpu and memory"?

Dave
Created attachment 1146376 [details] libvirtd.log and qemu log both on source and target host
So the source qemu log shows:

2016-04-12T10:17:41.498594Z qemu-kvm: socket_writev_buffer: Got err=104 for (131328/18446744073709551615)

which means its connection was broken - but why?

Looking at the destination, we see just before that:

2016-04-12T10:17:41.475112Z qemu-kvm: terminating on signal 15 from pid 31366

so something killed the destination qemu. The destination libvirt log shows:

2016-04-12 10:17:41.474+0000: 31366: debug : qemuProcessAutoDestroy:6322 : vm=rhel7.1 has long-term job active, cancelling
2016-04-12 10:17:41.474+0000: 31366: info : virObjectRef:296 : OBJECT_REF: obj=0x7fb10c1b2590
2016-04-12 10:17:41.474+0000: 31366: debug : virFileMakePathHelper:2755 : path=/var/run/libvirt/qemu mode=0777
2016-04-12 10:17:41.475+0000: 31366: debug : virFileClose:102 : Closed fd 28
2016-04-12 10:17:41.475+0000: 31366: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7fb10c1b2590
2016-04-12 10:17:41.475+0000: 31366: debug : qemuProcessAutoDestroy:6326 : Killing domain
2016-04-12 10:17:41.475+0000: 31366: debug : qemuProcessKill:5671 : vm=0x7fb150004e10 name=rhel7.1 pid=31511 flags=1
2016-04-12 10:17:41.475+0000: 31366: debug : virProcessKillPainfully:352 : vpid=31511 force=1

so libvirt killed the destination qemu; but why?
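As a side note (my addition, not from the original logs): err=104 is the Linux errno value ECONNRESET, "Connection reset by peer", consistent with the destination qemu being killed while the source was still writing to it. It can be decoded with a quick one-liner (python3 used here only as a convenient strerror lookup):

```shell
# errno 104 on Linux is ECONNRESET:
python3 -c 'import errno, os; print(errno.errorcode[104], "-", os.strerror(104))'
# ECONNRESET - Connection reset by peer
```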
Chatting to jdenemar, it looks like libvirt is killing qemu because it has not heard from the source 'virsh' for 30 seconds, but we don't understand why; flipping to libvirt.

Yan, please rerun the test on the same machines, but:
a) Open some more terminal tabs logged into both the source and destination machines before you start.
b) In one terminal on each of the source and destination, run:
vmstat 5
(You should see one line of output every 5 seconds.)
c) On the source machine, run in one terminal:
ping the-destination-host-name
(You should see a line of output about once a second.)
d) Issue the virsh command in a separate tab; do not background it or do anything to the virsh process while it runs.
e) When it fails, please collect the logs again, but also include the output of both vmstats and the ping.

Thanks,
Dave
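The requested setup, as a shell sketch (my addition; the hostname is a placeholder, and each long-running command belongs in its own terminal exactly as described above - this is a procedure outline, not a script to run in one shot):

```shell
# Terminal 1 on EACH of source and destination: system load every 5 s,
# captured for later attachment
vmstat 5 | tee /tmp/vmstat-$(hostname).log

# Terminal 2 on the source: network reachability of the destination
ping dest.example.com | tee /tmp/ping.log

# Terminal 3 on the source: the migration itself, kept in the foreground
virsh migrate rhel7.2 qemu+ssh://dest.example.com/system --verbose --live
```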
Created attachment 1146696 [details] libvirtd.log, qemu log and vmstat results both on source and target host, ping results on source host
So the Prepare phase finishes at 03:12:55:

2016-04-13 03:12:55.121+0000: 17819: debug : virThreadJobClear:121 : Thread 17819 (virNetServerHandleJob) finished job remoteDispatchDomainMigratePrepare3Params with ret=0

and migration starts soon after this. From then on, the connection between virsh and libvirtd on the destination host is quite bad (remember, it's a process-to-process connection, so both process and I/O scheduling may influence the connection significantly):

03:12:59 Sending keepalive request
03:13:03 Got keepalive response
03:13:08 Sending keepalive request
03:13:11 Got keepalive response
03:13:16 Sending keepalive request
03:13:21 Sending keepalive request
03:13:21 Got keepalive response
03:13:26 Sending keepalive request
03:13:31 Sending keepalive request
03:13:36 Sending keepalive request
03:13:38 Got keepalive response
03:13:38 Got keepalive response
03:13:43 Sending keepalive request
03:13:48 Sending keepalive request
03:13:53 Sending keepalive request
03:13:58 Sending keepalive request
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:08 Sending keepalive request
03:14:16 Sending keepalive request
03:14:20 Got keepalive response
03:14:20 Got keepalive response
03:14:25 Sending keepalive request
03:14:30 Sending keepalive request
03:14:35 Sending keepalive request
03:14:36 Got keepalive response
03:14:36 Got keepalive response
03:14:36 Got keepalive response
03:14:41 Sending keepalive request
03:14:44 Got keepalive response
03:14:49 Sending keepalive request
03:14:58 Sending keepalive request
03:15:03 Sending keepalive request
03:15:08 Sending keepalive request
03:15:09 Got keepalive response
03:15:09 Got keepalive response
03:15:09 Got keepalive response
03:15:09 Got keepalive response
03:15:14 Sending keepalive request
03:15:18 Got keepalive response
03:15:23 Sending keepalive request
03:15:28 Sending keepalive request
03:15:33 Sending keepalive request
03:15:38 Sending keepalive request
03:15:43 Got keepalive response
03:15:43 Got keepalive response
03:15:43 Got keepalive response
03:15:43 Got keepalive response
03:15:48 Sending keepalive request
03:15:53 Sending keepalive request
03:15:58 Sending keepalive request
03:15:59 Got keepalive response
03:15:59 Got keepalive response
03:15:59 Got keepalive response
03:16:04 Sending keepalive request
03:16:09 Sending keepalive request
03:16:14 Sending keepalive request
03:16:21 Sending keepalive request
03:16:24 Got keepalive response
03:16:24 Got keepalive response
03:16:24 Got keepalive response
03:16:24 Got keepalive response
03:16:29 Sending keepalive request
03:16:32 Got keepalive response
03:16:37 Sending keepalive request
03:16:46 Sending keepalive request
03:16:49 Got keepalive response
03:16:49 Got keepalive response
03:16:54 Sending keepalive request
03:16:58 Got keepalive response
03:17:03 Sending keepalive request
03:17:08 Sending keepalive request
03:17:13 Sending keepalive request
03:17:19 Sending keepalive request
03:17:22 Got keepalive response
03:17:22 Got keepalive response
03:17:22 Got keepalive response
03:17:22 Got keepalive response
03:17:27 Sending keepalive request
03:17:30 Got keepalive response
03:17:35 Sending keepalive request
03:17:39 Got keepalive response
03:17:44 Sending keepalive request
03:17:49 Sending keepalive request
03:17:54 Sending keepalive request
03:17:59 Sending keepalive request
03:18:04 Sending keepalive request => timeout, migration is cancelled

I don't know why this is happening, but it's definitely not a libvirt bug. You can try to increase the default keepalive timeout.
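For reference (my addition, not from the original comment): the keepalive behaviour seen above is controlled by the keepalive_interval and keepalive_count options in /etc/libvirt/libvirtd.conf on the daemon side. A sketch of the relevant fragment, with stock defaults as shipped in libvirtd.conf (values may differ per build; restart libvirtd after changing them):

```
# /etc/libvirt/libvirtd.conf -- sketch, not taken from this bug report.
# A keepalive probe is sent every keepalive_interval seconds; after
# keepalive_count consecutive probes with no response, the connection
# is considered broken and closed.
keepalive_interval = 5      # stock default
keepalive_count = 5         # stock default

# Example: tolerate longer stalls before dropping the client,
# raising the effective timeout from roughly 30s to roughly 110s:
#keepalive_interval = 10
#keepalive_count = 10
```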