Bug 513765
| Summary: | Large guest (256G RAM + 16 vcpu) hang during live migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | lihuang <lihuang> |
| Component: | kvm | Assignee: | Juan Quintela <quintela> |
| Status: | CLOSED ERRATA | QA Contact: | ovirt-maint <ovirt-maint> |
| Severity: | medium | Priority: | low |
| Version: | 5.4 | Target Milestone: | rc |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | kvm-83-213.el5 | Doc Type: | Bug Fix |
| Clones: | 658823 | Last Closed: | 2011-01-13 23:11:18 UTC |
| CC: | bcao, dyasny, ehabkost, gyang, juzhang, llim, michen, tao, tburke, virt-maint, ykaul | | |
| Bug Blocks: | 545233, 565939, 568128, 580949, 643970, 645188, 658823 | | |
Description (lihuang, 2009-07-25 14:11:46 UTC)

Can you please retest with the latest kvm? Post migration, are the mouse/keyboard back? Can they not move at all, or only very slowly? Glauber, why do you say it's a dirty bit issue?

Agreed with Dor, the first thing is to test it with the latest software. I've come across many migration issues in the past that happened because we calculated the dirty bit mask incorrectly, so it is quite possible that this bug is a dupe that we forgot to close. If that is not the case, I'll be glad to dig in again.
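As an aside on why a dirty bit mask miscalculation would surface only on huge guests: any arithmetic that truncates a guest physical address to 32 bits is harmless below 4G and silently aliases pages above it. The toy sketch below is purely illustrative, not the kvm-83 code; it models a pre-copy dirty bitmap and shows how a hypothetical 32-bit truncation misses a page dirtied at 5 GiB.

```python
# Toy model of a pre-copy dirty bitmap (one bit per 4K page).
# The "bug" here is a hypothetical 32-bit truncation, shown only
# because it is invisible on guests smaller than 4G.

PAGE_SIZE = 4096

def set_dirty(bitmap, addr):
    """Mark the page containing guest-physical address addr as dirty."""
    page = addr // PAGE_SIZE
    bitmap[page // 8] |= 1 << (page % 8)

def is_dirty(bitmap, addr):
    """Correct lookup: full-width address arithmetic."""
    page = addr // PAGE_SIZE
    return (bitmap[page // 8] >> (page % 8)) & 1

def is_dirty_buggy(bitmap, addr):
    """Same lookup, but the address is (accidentally) kept in 32 bits,
    as could happen with an 'unsigned int' in C: addresses above 4G
    alias back onto the first 4G of guest RAM."""
    addr32 = addr & 0xFFFFFFFF
    page = addr32 // PAGE_SIZE
    return (bitmap[page // 8] >> (page % 8)) & 1

# 8 MiB bitmap covering a 256G guest.
bitmap = bytearray((256 * 2**30) // PAGE_SIZE // 8)

addr = 5 * 2**30  # a page at 5 GiB, above the 4G boundary
set_dirty(bitmap, addr)
print(is_dirty(bitmap, addr))        # 1: correct lookup sees the dirty page
print(is_dirty_buggy(bitmap, addr))  # 0: truncated lookup misses it
```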
Re-tested on kvm-83-164.el5:

1. During migration (before hitting the downtime phase), the mouse/kbd are not available.
2. After migration, the mouse/kbd sometimes do not come back. "Mar 15 05:24:28 x86 kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away" can be found in dmesg.
3. After quitting the dst VM (before migration is done):

```
(qemu) info migrate
Migration status: active
transferred ram: 3341932 kbytes
remaining ram: 265114932 kbytes
total ram: 268455948 kbytes
(qemu) info migrate
Migration status: active
transferred ram: 3341932 kbytes
remaining ram: 265116864 kbytes
total ram: 268455948 kbytes
(qemu) info migrate
Migration status: active
transferred ram: 3341932 kbytes
remaining ram: 265116880 kbytes
total ram: 268455948 kbytes
```

The issue still exists.
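The stuck counters above can be watched for from outside the guest by polling the monitor. Below is a minimal watchdog sketch in Python; it assumes (hypothetically) that the VM was started with `-monitor tcp:127.0.0.1:4444,server,nowait` instead of `-monitor stdio`, and the address, port, and timing values are all illustrative.

```python
import re
import socket
import time

MON_ADDR = ("127.0.0.1", 4444)  # assumes -monitor tcp:127.0.0.1:4444,server,nowait

def info_migrate(sock):
    """Send 'info migrate' to the human monitor and return the raw reply."""
    sock.sendall(b"info migrate\n")
    time.sleep(0.5)  # crude: give the monitor time to answer
    return sock.recv(65536).decode(errors="replace")

def transferred_kbytes(reply):
    """Extract the 'transferred ram' counter, or None if absent."""
    m = re.search(r"transferred ram:\s*(\d+)\s*kbytes", reply)
    return int(m.group(1)) if m else None

def watch(poll_secs=10):
    sock = socket.create_connection(MON_ADDR)
    time.sleep(0.5)
    sock.recv(65536)  # drain the monitor banner and prompt
    last = None
    while True:
        cur = transferred_kbytes(info_migrate(sock))
        if cur is not None and cur == last:
            print("warning: transferred ram stuck at %d kbytes" % cur)
        last = cur
        time.sleep(poll_secs)

if __name__ == "__main__":
    watch()
```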
- Does it respond to pings during the migration?
- Can you retest 256G but with fewer vcpus, say -smp 1 or -smp 2? I'm trying to see where the problem is. Similarly, testing a 1G guest with -smp 16 would be helpful too.

================
-smp 2 -m 256G:
================

```
[root@intel-XE7450-512-1 ~]# ping 10.66.82.192
PING 10.66.82.192 (10.66.82.192) 56(84) bytes of data.
From 10.66.83.79 icmp_seq=32 Destination Host Unreachable
From 10.66.83.79 icmp_seq=33 Destination Host Unreachable
From 10.66.83.79 icmp_seq=34 Destination Host Unreachable
From 10.66.83.79 icmp_seq=36 Destination Host Unreachable
From 10.66.83.79 icmp_seq=37 Destination Host Unreachable
```

No response even after migration. After restarting the network inside the guest, it comes back (similar to bug 524651, but the guest uses an e1000 NIC in this test). A soft lockup was found in dmesg: http://pastebin.test.redhat.com/21126 (harmless, as in bug 512656?).

================
-smp 16 -m 2G:
================

At first it responds to ping; after a while (~1 min) there is no response. Pinging again returns "Destination Host Unreachable". The network does not come back until it is restarted after migration.

Will try a smaller configuration (-smp 8 / -m 64) tomorrow.

Thanks, Lijun Huang.

================
-smp 2 -m 2G:
================

1. Mouse/kbd/network (ping) work well during and after migration.

================
-smp 2 -m 64G:
================

1. Mouse/kbd hang during migration but come back after migration.
2. Inside the guest, ping to the host is continuous (though latency is high during migration).
3. From the host, ping to the guest:

```
64 bytes from 10.66.82.192: icmp_seq=54 ttl=64 time=303 ms
64 bytes from 10.66.82.192: icmp_seq=55 ttl=64 time=343 ms
64 bytes from 10.66.82.192: icmp_seq=56 ttl=64 time=31.9 ms
64 bytes from 10.66.82.192: icmp_seq=57 ttl=64 time=3.00 ms
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available

--- 10.66.82.192 ping statistics ---
73 packets transmitted, 57 received, 21% packet loss, time 91029ms
rtt min/avg/max/mdev = 0.000/320.319/5304.193/1017.810 ms, pipe 6

[root@intel-XE7450-512-1 ~]# ping 10.66.82.192
PING 10.66.82.192 (10.66.82.192) 56(84) bytes of data.
From 10.66.83.79 icmp_seq=9 Destination Host Unreachable
From 10.66.83.79 icmp_seq=10 Destination Host Unreachable
From 10.66.83.79 icmp_seq=11 Destination Host Unreachable
```

After migration, it comes back:

```
[root@intel-XE7450-512-1 ~]# ping 10.66.82.192
PING 10.66.82.192 (10.66.82.192) 56(84) bytes of data.
64 bytes from 10.66.82.192: icmp_seq=1 ttl=64 time=2.07 ms
64 bytes from 10.66.82.192: icmp_seq=2 ttl=64 time=12.5 ms
64 bytes from 10.66.82.192: icmp_seq=3 ttl=64 time=0.391 ms
64 bytes from 10.66.82.192: icmp_seq=4 ttl=64 time=0.663 ms
```

Hi, I need to fix this bug. How can I reproduce it? I have a machine here with just 100G of RAM. Any chance it triggers at lower levels of memory, say a 50G guest? Does it always happen? Can I somehow get access to a machine with big memory? Is it still happening for you?

Thanks for the info! I just sent a patch to rhvirt-patches to fix it. Thanks.

Reproduced, and patches have been posted for it.

*** Bug 601045 has been marked as a duplicate of this bug. ***

For the issue "Large guest (256G RAM + 16 vcpu) hang during live migration":
Reproduced on kvm-83-164.el5, verified on kvm-83-217.el5.

Steps:

1. Start the VM on the src host:

   ```
   /usr/libexec/qemu-kvm -m 128G -smp 16 -name VM1 -uuid 438915f2-c0fc-8d6b-bb06-b8ddd28046fa -no-kvm-pit-reinjection -boot c -drive file=/home/tt.img,if=virtio,index=0,boot=on,cache=none -net nic,macaddr=54:52:00:3a:d4:4d,vlan=0,model=virtio -net tap,script=/etc/qemu-ifup,vlan=0 -serial pty -parallel none -usbdevice tablet -k en-us -vnc :2 -monitor stdio
   ```

2. Start the listening port on the dst host.
3. Do the live migration.

Actual results:
- On kvm-83-164.el5, the guest hangs during migration.
- On kvm-83-217.el5, the guest works well.

Referring to comment#0, another phenomenon: if the dst VM is quit while migration is still running, 'info migrate' on the src VM keeps showing the migration as active; the values of _transferred ram_ and _total ram_ are stuck, while _remaining ram_ increases (in this test, only 128G RAM was used). Re-tested on kvm-83-217.el5: this issue still exists; _transferred ram_ and _total ram_ are stuck, and _remaining ram_ increases a little, then gets stuck as well.

Expected results: '(qemu) info migrate' should show the migration as failed.

Based on the above, re-assigning this issue.

Eventually the state of the migration will be shown as failed; it needs to detect that the other side has died. This is not a regression; it is something that has always been there. Opening another bugzilla for this is OK.
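The point about detecting that "the other side has died" can be illustrated at the socket level: a silently dead receiver does not error out a blocking sender for a long time, so the migration state stays active. The sketch below shows the general mechanism (TCP keepalive plus a send timeout turns peer death into an error the sender can map to a failed state); it is an illustration of the technique, not the actual kvm-83 fix, and the function names are made up.

```python
import socket

def open_migration_channel(host, port):
    """Open the sender-side channel so a dead peer eventually surfaces
    as an error rather than an indefinite hang."""
    s = socket.create_connection((host, port), timeout=30)
    s.settimeout(30)  # any single send blocked > 30 s counts as failure
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific keepalive tuning: probe after 60 s idle, every 10 s,
    # and give up after 5 missed probes.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    return s

def send_ram_page(sock, payload):
    """Return True on success, False if the peer is gone; the caller can
    then flip the migration state to 'failed' instead of leaving it
    'active' with stuck counters."""
    try:
        sock.sendall(payload)
        return True
    except OSError:  # covers timeouts, resets, and broken pipes
        return False
```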
Moving back to ON_QA, based on comment #22.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0028.html