| Summary: | Migration failed when stressing the guest | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | yafu <yafu> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.3 | CC: | amit.shah, chayang, dgilbert, dyuan, dzheng, fjin, huding, juzhang, knoel, mzhan, qizhu, quintela, rbalakri, virt-maint, xfu, yafu, zhanghm.zhm, zhanghongming, zpeng |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-07-19 11:55:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | |||
Hi,
Can you repeat this test please and include all of the following:
a) The qemu log from the destination
b) The qemu log from the source
c) The libvirt log from the destination
d) The libvirt log from the source
Also, can you explain which hosts you were using for this test, and which different hosts you were using for the test where you say 'When the source host with high performance cpu and memory'?
Dave
Created attachment 1146376 [details]
libvirtd.log and qemu log both on source and target host
So the source qemu log shows:

2016-04-12T10:17:41.498594Z qemu-kvm: socket_writev_buffer: Got err=104 for (131328/18446744073709551615)

which means its connection was broken - but why? Looking at the destination, we see just before that:

2016-04-12T10:17:41.475112Z qemu-kvm: terminating on signal 15 from pid 31366

so something killed the destination qemu. The destination libvirt log shows:

2016-04-12 10:17:41.474+0000: 31366: debug : qemuProcessAutoDestroy:6322 : vm=rhel7.1 has long-term job active, cancelling
2016-04-12 10:17:41.474+0000: 31366: info : virObjectRef:296 : OBJECT_REF: obj=0x7fb10c1b2590
2016-04-12 10:17:41.474+0000: 31366: debug : virFileMakePathHelper:2755 : path=/var/run/libvirt/qemu mode=0777
2016-04-12 10:17:41.475+0000: 31366: debug : virFileClose:102 : Closed fd 28
2016-04-12 10:17:41.475+0000: 31366: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7fb10c1b2590
2016-04-12 10:17:41.475+0000: 31366: debug : qemuProcessAutoDestroy:6326 : Killing domain
2016-04-12 10:17:41.475+0000: 31366: debug : qemuProcessKill:5671 : vm=0x7fb150004e10 name=rhel7.1 pid=31511 flags=1
2016-04-12 10:17:41.475+0000: 31366: debug : virProcessKillPainfully:352 : vpid=31511 force=1

So libvirt killed the destination qemu - but why? Chatting to jdenemar, it looks like libvirt is killing qemu because it has not heard from the source 'virsh' for 30 seconds, but we don't understand why; flipping to libvirt.
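As a side note on the error code: on Linux, err=104 from a socket write is ECONNRESET, i.e. the peer reset the connection, which is consistent with the destination qemu having been killed out from under the source. A one-line check:

```python
import errno

# "Got err=104" in the source qemu log: on Linux, errno 104 is
# ECONNRESET - the remote end (here, the killed destination qemu)
# reset the connection while the source was still writing.
print(errno.errorcode[104])  # -> 'ECONNRESET' on Linux
```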
Yan, please rerun the test on the same machines but please:
a) Open some more terminal tabs logged into both the source and destination machines before you start
b) In one terminal on each of the source and destination run:
vmstat 5
(You should see one line of output every 5 seconds)
c) On the source machine run in one terminal:
ping the-destination-host-name
(You should see a line of output about once a second)
d) Issue the virsh command in a separate tab; do not background it or do anything else to the virsh session while it runs.
e) When it fails, please collect the logs again, but also include the output of both vmstat runs and the ping.
Thanks,
Dave
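The collection steps above can be sketched as a small shell outline. The hostname dest.example.com and the log directory name are placeholders, not values from this bug; the long-running monitors are shown as comments since each belongs in its own terminal:

```shell
#!/bin/sh
# Sketch of the requested diagnostics collection (hypothetical names).
DEST=dest.example.com          # substitute the real destination host
LOGDIR=migration-debug
mkdir -p "$LOGDIR"

# In one terminal on EACH of the source and destination hosts
# (one line of output every 5 seconds; Ctrl-C when migration ends):
#   vmstat 5 | tee migration-debug/vmstat-$(hostname -s).log

# On the source host only (about one line per second):
#   ping "$DEST" | tee migration-debug/ping.log

# Then run the migration foregrounded in its own tab, e.g.:
#   virsh migrate rhel7.1 qemu+ssh://$DEST/system --verbose --live

echo "$LOGDIR"
```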
Created attachment 1146696 [details]
libvirtd.log, qemu log and vmstat results both on source and target host, ping results on source host
So the Prepare phase finishes at 03:12:55:
2016-04-13 03:12:55.121+0000: 17819: debug : virThreadJobClear:121 :
Thread 17819 (virNetServerHandleJob) finished job
remoteDispatchDomainMigratePrepare3Params with ret=0
and migration starts soon after this. Since then the connection between virsh
and libvirtd on the destination host is quite bad (remember, it's the process
to process connection, so both process and I/O scheduling may influence the
connection significantly):
03:12:59 Sending keepalive request
03:13:03 Got keepalive response
03:13:08 Sending keepalive request
03:13:11 Got keepalive response
03:13:16 Sending keepalive request
03:13:21 Sending keepalive request
03:13:21 Got keepalive response
03:13:26 Sending keepalive request
03:13:31 Sending keepalive request
03:13:36 Sending keepalive request
03:13:38 Got keepalive response
03:13:38 Got keepalive response
03:13:43 Sending keepalive request
03:13:48 Sending keepalive request
03:13:53 Sending keepalive request
03:13:58 Sending keepalive request
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:03 Got keepalive response
03:14:08 Sending keepalive request
03:14:16 Sending keepalive request
03:14:20 Got keepalive response
03:14:20 Got keepalive response
03:14:25 Sending keepalive request
03:14:30 Sending keepalive request
03:14:35 Sending keepalive request
03:14:36 Got keepalive response
03:14:36 Got keepalive response
03:14:36 Got keepalive response
03:14:41 Sending keepalive request
03:14:44 Got keepalive response
03:14:49 Sending keepalive request
03:14:58 Sending keepalive request
03:15:03 Sending keepalive request
03:15:08 Sending keepalive request
03:15:09 Got keepalive response
03:15:09 Got keepalive response
03:15:09 Got keepalive response
03:15:09 Got keepalive response
03:15:14 Sending keepalive request
03:15:18 Got keepalive response
03:15:23 Sending keepalive request
03:15:28 Sending keepalive request
03:15:33 Sending keepalive request
03:15:38 Sending keepalive request
03:15:43 Got keepalive response
03:15:43 Got keepalive response
03:15:43 Got keepalive response
03:15:43 Got keepalive response
03:15:48 Sending keepalive request
03:15:53 Sending keepalive request
03:15:58 Sending keepalive request
03:15:59 Got keepalive response
03:15:59 Got keepalive response
03:15:59 Got keepalive response
03:16:04 Sending keepalive request
03:16:09 Sending keepalive request
03:16:14 Sending keepalive request
03:16:21 Sending keepalive request
03:16:24 Got keepalive response
03:16:24 Got keepalive response
03:16:24 Got keepalive response
03:16:24 Got keepalive response
03:16:29 Sending keepalive request
03:16:32 Got keepalive response
03:16:37 Sending keepalive request
03:16:46 Sending keepalive request
03:16:49 Got keepalive response
03:16:49 Got keepalive response
03:16:54 Sending keepalive request
03:16:58 Got keepalive response
03:17:03 Sending keepalive request
03:17:08 Sending keepalive request
03:17:13 Sending keepalive request
03:17:19 Sending keepalive request
03:17:22 Got keepalive response
03:17:22 Got keepalive response
03:17:22 Got keepalive response
03:17:22 Got keepalive response
03:17:27 Sending keepalive request
03:17:30 Got keepalive response
03:17:35 Sending keepalive request
03:17:39 Got keepalive response
03:17:44 Sending keepalive request
03:17:49 Sending keepalive request
03:17:54 Sending keepalive request
03:17:59 Sending keepalive request
03:18:04 Sending keepalive request
=> timeout, migration is cancelled
I don't know why this is happening, but it's definitely not a libvirt bug. You can try to increase the default keepalive timeout.
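For reference, the keepalive parameters on the libvirtd side live in /etc/libvirt/libvirtd.conf. The values below are the shipped defaults: a probe every 5 seconds and 5 unanswered probes before the peer is declared dead, i.e. about 30 seconds, matching the timeout seen in this log. Raising keepalive_count would tolerate longer scheduling stalls:

```
# /etc/libvirt/libvirtd.conf (defaults shown)
keepalive_interval = 5   # seconds between keepalive probes
keepalive_count = 5      # unanswered probes before the connection is
                         # dropped; 5 x 5s is the ~30s timeout seen here
```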
Created attachment 1145007 [details]
qemu log

Description of problem:
Stress the guest, then do migration; migration fails with error "operation failed: migration job: unexpectedly failed".

Version-Release number of selected component (if applicable):
libvirt-1.3.2-1.el7.x86_64
qemu-kvm-rhev-2.5.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stress the guest using stress:
#stress --vm 10 --vm-bytes 128M
2. Do migration:
#virsh migrate rhel7.2 qemu+ssh://10.66.4.148/system --verbose --live
root.196.125's password:
Migration: [ 86 %]error: operation failed: migration job: unexpectedly failed
3. Check the qemu log:
#cat /var/log/libvirt/qemu/rhel7.2.log
...
2016-04-08T05:37:44.302478Z qemu-kvm: socket_writev_buffer: Got err=104 for (131328/18446744073709551615)
main_channel_migrate_src_complete: main_channel_migrate_src_complete: client 0x7f6825107480 MIGRATE_CANCEL
...

Actual results:
After about 2 minutes, migration fails with error "migration job: unexpectedly failed".

Expected results:
Migration should complete successfully.

Additional info:
1. Migration also failed with the same error when playing an HD video in the guest.
2. When the source host has a high-performance cpu and memory, the migration waits a very long time, never finishing or failing with an error.