Bug 1325059 - Migration failed when stressing the guest
Summary: Migration failed when stressing the guest
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-08 06:09 UTC by yafu
Modified: 2018-03-15 02:09 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 11:55:39 UTC
Target Upstream Version:


Attachments
qemu log (3.13 KB, text/plain), 2016-04-08 06:09 UTC, yafu
libvirtd.log and qemu log both on source and target host (424.47 KB, application/x-gzip), 2016-04-12 10:57 UTC, yafu
libvirtd.log, qemu log and vmstat results both on source and target host, ping results on source host (1.79 MB, application/x-gzip), 2016-04-13 03:34 UTC, yafu


Links
Red Hat Bugzilla 1367620 (CLOSED): storage migration fails due to keepalive timeout (last updated 2021-06-10 11:27:45 UTC)

Internal Links: 1367620

Description yafu 2016-04-08 06:09:16 UTC
Created attachment 1145007 [details]
qemu log

Description of problem:
Stress the guest, then start a migration; the migration fails with the error "operation failed: migration job: unexpectedly failed".

Version-Release number of selected component (if applicable):
libvirt-1.3.2-1.el7.x86_64
qemu-kvm-rhev-2.5.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stress the guest using stress:
  #stress --vm 10 --vm-bytes 128M

2. Do the migration:
 #virsh migrate rhel7.2 qemu+ssh://10.66.4.148/system --verbose --live
 root.196.125's password: 
 Migration: [ 86 %]error: operation failed: migration job: unexpectedly failed 

3. Check the qemu log:
 #cat /var/log/libvirt/qemu/rhel7.2.log
  ...
2016-04-08T05:37:44.302478Z qemu-kvm: socket_writev_buffer: Got err=104 for (131328/18446744073709551615)
main_channel_migrate_src_complete:
main_channel_migrate_src_complete: client 0x7f6825107480 MIGRATE_CANCEL
  ...

Actual results:
After about 2 minutes, the migration fails with the error "migration job: unexpectedly failed".

Expected results:
Migration should complete successfully.

Additional info:
1. Migration also fails with the same error when playing an HD video in the guest.
2. When the source host has a high-performance CPU and memory, the migration runs for a very long time and either never finishes or fails with the same error.

Comment 1 Dr. David Alan Gilbert 2016-04-12 08:52:16 UTC
Hi,
  Can you repeat this test please and include all of the following:

    a) The qemu log from the destination
    b) The qemu log from the source
    c) The libvirt log from the destination
    d) The libvirt log from the source

Also, can you explain which hosts you were using for this test, and which different hosts you were using for the test where you say 'When the source host with high performance cpu and memory'?

Dave

Comment 3 yafu 2016-04-12 10:57:34 UTC
Created attachment 1146376 [details]
libvirtd.log and qemu log both on source and target host

Comment 4 Dr. David Alan Gilbert 2016-04-12 11:09:46 UTC
So the source qemu logs:
2016-04-12T10:17:41.498594Z qemu-kvm: socket_writev_buffer: Got err=104 for (131328/18446744073709551615)

that means its connection was broken - but why? Looking at the destination, we see just before that:

2016-04-12T10:17:41.475112Z qemu-kvm: terminating on signal 15 from pid 31366

so something killed the destination qemu;  the destination libvirt log shows:


2016-04-12 10:17:41.474+0000: 31366: debug : qemuProcessAutoDestroy:6322 : vm=rhel7.1 has long-term job active, cancelling
2016-04-12 10:17:41.474+0000: 31366: info : virObjectRef:296 : OBJECT_REF: obj=0x7fb10c1b2590
2016-04-12 10:17:41.474+0000: 31366: debug : virFileMakePathHelper:2755 : path=/var/run/libvirt/qemu mode=0777
2016-04-12 10:17:41.475+0000: 31366: debug : virFileClose:102 : Closed fd 28
2016-04-12 10:17:41.475+0000: 31366: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7fb10c1b2590
2016-04-12 10:17:41.475+0000: 31366: debug : qemuProcessAutoDestroy:6326 : Killing domain
2016-04-12 10:17:41.475+0000: 31366: debug : qemuProcessKill:5671 : vm=0x7fb150004e10 name=rhel7.1 pid=31511 flags=1
2016-04-12 10:17:41.475+0000: 31366: debug : virProcessKillPainfully:352 : vpid=31511 force=1

so libvirt killed the destination qemu; but why?
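For reference, err=104 in the qemu log above is the Linux errno value for ECONNRESET ("Connection reset by peer"), which is exactly what the source qemu would see when the destination qemu is killed mid-stream. A quick Python sanity check (illustrative only, not part of the original report):

```python
import errno
import os

# errno 104 on Linux is ECONNRESET ("Connection reset by peer"),
# i.e. the peer end of the migration socket went away.
assert errno.ECONNRESET == 104
print(errno.errorcode[104])   # ECONNRESET
print(os.strerror(104))       # Connection reset by peer
```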

Comment 5 Dr. David Alan Gilbert 2016-04-12 11:38:30 UTC
Chatting with jdenemar, it looks like libvirt is killing qemu because it has not heard from the source 'virsh' for 30 seconds, but we don't understand why; flipping to libvirt.

Yan, please rerun the test on the same machines but please:
   a) Open some more terminal tabs logged into both the source and destination machines before you start
   b) In one terminal on each of the source and destination run:
      vmstat 5
      (You should see one line of output every 5 seconds)
   c) On the source machine run in one terminal:
      ping the-destination-host-name
      (You should see a line of output about once a second)
   d) Issue the virsh command in a separate tab; do not background it or do anything to the virsh process while it runs.
   e) When it fails, please collect the logs again, and also include the output of both vmstat runs and the ping.

Thanks,

Dave

Comment 7 yafu 2016-04-13 03:34:57 UTC
Created attachment 1146696 [details]
libvirtd.log, qemu log and vmstat results both on source and target host, ping results on source host

Comment 10 Jiri Denemark 2016-07-19 11:55:39 UTC
So the Prepare phase finishes at 03:12:55:

2016-04-13 03:12:55.121+0000: 17819: debug : virThreadJobClear:121 :
    Thread 17819 (virNetServerHandleJob) finished job
    remoteDispatchDomainMigratePrepare3Params with ret=0

and migration starts soon after this. From then on, the connection between virsh
and libvirtd on the destination host is quite bad (remember, it is a
process-to-process connection, so both process and I/O scheduling may influence
it significantly):

03:12:59 Sending keepalive request
                                        03:13:03 Got keepalive response
03:13:08 Sending keepalive request
                                        03:13:11 Got keepalive response
03:13:16 Sending keepalive request
03:13:21 Sending keepalive request
                                        03:13:21 Got keepalive response
03:13:26 Sending keepalive request
03:13:31 Sending keepalive request
03:13:36 Sending keepalive request
                                        03:13:38 Got keepalive response
                                        03:13:38 Got keepalive response
03:13:43 Sending keepalive request
03:13:48 Sending keepalive request
03:13:53 Sending keepalive request
03:13:58 Sending keepalive request
                                        03:14:03 Got keepalive response
                                        03:14:03 Got keepalive response
                                        03:14:03 Got keepalive response
                                        03:14:03 Got keepalive response
                                        03:14:03 Got keepalive response
                                        03:14:03 Got keepalive response
03:14:08 Sending keepalive request
03:14:16 Sending keepalive request
                                        03:14:20 Got keepalive response
                                        03:14:20 Got keepalive response
03:14:25 Sending keepalive request
03:14:30 Sending keepalive request
03:14:35 Sending keepalive request
                                        03:14:36 Got keepalive response
                                        03:14:36 Got keepalive response
                                        03:14:36 Got keepalive response
03:14:41 Sending keepalive request
                                        03:14:44 Got keepalive response
03:14:49 Sending keepalive request
03:14:58 Sending keepalive request
03:15:03 Sending keepalive request
03:15:08 Sending keepalive request
                                        03:15:09 Got keepalive response
                                        03:15:09 Got keepalive response
                                        03:15:09 Got keepalive response
                                        03:15:09 Got keepalive response
03:15:14 Sending keepalive request
                                        03:15:18 Got keepalive response
03:15:23 Sending keepalive request
03:15:28 Sending keepalive request
03:15:33 Sending keepalive request
03:15:38 Sending keepalive request
                                        03:15:43 Got keepalive response
                                        03:15:43 Got keepalive response
                                        03:15:43 Got keepalive response
                                        03:15:43 Got keepalive response
03:15:48 Sending keepalive request
03:15:53 Sending keepalive request
03:15:58 Sending keepalive request
                                        03:15:59 Got keepalive response
                                        03:15:59 Got keepalive response
                                        03:15:59 Got keepalive response
03:16:04 Sending keepalive request
03:16:09 Sending keepalive request
03:16:14 Sending keepalive request
03:16:21 Sending keepalive request
                                        03:16:24 Got keepalive response
                                        03:16:24 Got keepalive response
                                        03:16:24 Got keepalive response
                                        03:16:24 Got keepalive response
03:16:29 Sending keepalive request
                                        03:16:32 Got keepalive response
03:16:37 Sending keepalive request
03:16:46 Sending keepalive request
                                        03:16:49 Got keepalive response
                                        03:16:49 Got keepalive response
03:16:54 Sending keepalive request
                                        03:16:58 Got keepalive response
03:17:03 Sending keepalive request
03:17:08 Sending keepalive request
03:17:13 Sending keepalive request
03:17:19 Sending keepalive request
                                        03:17:22 Got keepalive response
                                        03:17:22 Got keepalive response
                                        03:17:22 Got keepalive response
                                        03:17:22 Got keepalive response
03:17:27 Sending keepalive request
                                        03:17:30 Got keepalive response
03:17:35 Sending keepalive request
                                        03:17:39 Got keepalive response
03:17:44 Sending keepalive request
03:17:49 Sending keepalive request
03:17:54 Sending keepalive request
03:17:59 Sending keepalive request
03:18:04 Sending keepalive request

=> timeout, migration is cancelled

I don't know why this is happening, but it's definitely not a libvirt bug. You can try to increase the default keepalive timeout.
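As a sketch of that last suggestion (the parameter names below are real libvirtd.conf settings, but the values are only illustrative and this is not a verified fix for this particular bug), the keepalive timeout can be raised on the destination host:

```ini
# /etc/libvirt/libvirtd.conf
# By default libvirtd probes a client every keepalive_interval seconds
# and drops the connection after keepalive_count unanswered probes
# (roughly 5 s * 5 probes by default, matching the ~30 s timeout seen above).
# Larger values tolerate longer scheduling stalls under heavy load.
keepalive_interval = 10
keepalive_count = 10
```

After editing, restart the daemon (e.g. `systemctl restart libvirtd`). If the libvirt version supports them, virsh also accepts `--keepalive-interval` and `--keepalive-count` options to loosen the client side of the connection.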

