Description of problem:
The migration took 21 minutes to finish. During that period, the mouse and keyboard were completely unavailable.

Another phenomenon: if the destination VM is quit, 'info migrate' in the source VM's QEMU monitor still shows the migration as running. The values of "transferred ram" and "total ram" are stuck, while the value of "remaining ram" keeps increasing. E.g. (in this test, only 128G RAM was used):

(qemu) info migrate
Migration status: active
transferred ram: 4688132 kbytes
remaining ram: 129551184 kbytes
total ram: 134238220 kbytes
(qemu) info migrate
Migration status: active
transferred ram: 4688132 kbytes
remaining ram: 129551488 kbytes
total ram: 134238220 kbytes
(qemu) info migrate
Migration status: active
transferred ram: 4688132 kbytes
remaining ram: 129551496 kbytes
total ram: 134238220 kbytes

Asked glommer; he said both are problems with the *dirty bit calculation*.

CLI:
/usr/libexec/qemu-kvm -no-hpet -drive file=/data/images/images/RHEL5u4.64.qcow.virtio,if=virtio,cache=off,index=0,boot=on -cpu qemu64,+sse2 -net nic,macaddr=00:21:9B:58:51:D3,model=virtio -net tap,script=/etc/qemu-ifup -monitor stdio -vnc :15 -m 256G -smp 16

Version-Release number of selected component (if applicable):
kvm-83-94.el5
# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor release 5.4-2.0.99 (12.1)

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
1. 'info migrate' shows the total ram as 268455948 kbytes.

2. After migration, top on the src host shows the *shared mem* at 251g:

top - 13:41:02 up 9:36, 3 users, load average: 0.90, 0.91, 1.23
Tasks: 643 total, 2 running, 641 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 2.1%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 528140240k total, 18909080k used, 509231160k free, 147948k buffers
Swap: 4194296k total, 0k used, 4194296k free, 11160816k cached

  PID USER  PR  NI  VIRT  RES   SHR S %CPU %MEM    TIME+  COMMAND
21442 root  15   0  257g  257g  251g S  0.0 51.1  57:24.39 qemu-kvm

3. The guest is pingable during migration (so the guest is still alive :)

[root@lihuang ~]# ping 10.66.83.153
PING 10.66.83.153 (10.66.83.153) 56(84) bytes of data.
64 bytes from 10.66.83.153: icmp_seq=1 ttl=63 time=1072 ms
64 bytes from 10.66.83.153: icmp_seq=2 ttl=63 time=3125 ms
64 bytes from 10.66.83.153: icmp_seq=3 ttl=63 time=2124 ms
64 bytes from 10.66.83.153: icmp_seq=4 ttl=63 time=1124 ms
64 bytes from 10.66.83.153: icmp_seq=5 ttl=63 time=1333 ms
64 bytes from 10.66.83.153: icmp_seq=6 ttl=63 time=1874 ms

--- 10.66.83.153 ping statistics ---
8 packets transmitted, 6 received, 25% packet loss, time 7000ms
rtt min/avg/max/mdev = 1072.513/1775.829/3125.411/714.942 ms, pipe 4
4. Host cpu:

processor       : 47
vendor_id       : GenuineIntel
cpu family      : 6
model           : 29
model name      : Intel(R) Xeon(R) CPU E7450 @ 2.40GHz
stepping        : 1
cpu MHz         : 2398.834
cache size      : 12288 KB
physical id     : 27
siblings        : 2
core id         : 5
cpu cores       : 2
apicid          : 221
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4797.50
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

5. Host Mem:

# cat /proc/meminfo
MemTotal:     528140240 kB
MemFree:      398410784 kB
Buffers:         109452 kB
Cached:          577872 kB
SwapCached:           0 kB
Active:       123144808 kB
Inactive:        534388 kB
HighTotal:            0 kB
HighFree:             0 kB
LowTotal:     528140240 kB
LowFree:      398410784 kB
SwapTotal:      4194296 kB
SwapFree:       4194296 kB
Dirty:               24 kB
Writeback:            0 kB
AnonPages:    122993980 kB
Mapped:           18104 kB
Slab:           4734804 kB
PageTables:      530944 kB
NFS_Unstable:         0 kB
Bounce:               0 kB
CommitLimit:  268264416 kB
Committed_AS: 270677372 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     899784 kB
VmallocChunk: 34358838399 kB
HugePages_Total:      0
HugePages_Free:       0
HugePages_Rsvd:       0
Hugepagesize:    2048 kB

6. Guest and host are not under load.
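A note on reproduction, since the Steps to Reproduce above were left blank: the flow implied by the description is roughly the following sketch. The destination host name (dst-host) and port (5800) are hypothetical choices, and the option list is abbreviated to the CLI given above.

# On the destination host: same CLI as above, plus an incoming port
/usr/libexec/qemu-kvm <same options as the CLI above> -incoming tcp:0:5800

# In the source VM's monitor: start a background migration and watch it
(qemu) migrate -d tcp:dst-host:5800
(qemu) info migrate

# To hit the second phenomenon, quit the destination VM while the migration
# is still active, then keep running 'info migrate' on the source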
Can you please retest with the latest kvm? Post-migration, are the mouse/keyboard back? Can they not move at all, or just very slowly? Glauber, why do you say it's a dirty bit issue?
Agreed with Dor; the first thing is to test it with the latest software. I've come across many migration issues in the past that happened due to us calculating the dirty bit mask incorrectly. So it is quite possible that this bug is a dupe that we forgot to close. If that is not the case, I'll be glad to dig in again.
Re-tested with kvm-83-164.el5:

1. During migration (before the downtime is hit), the mouse/keyboard are not available.

2. After migration, sometimes the mouse/keyboard do not come back.
"Mar 15 05:24:28 x86 kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away"
can be found in dmesg.

3. After quitting the dst VM (before migration is done):

(qemu) info migrate
Migration status: active
transferred ram: 3341932 kbytes
remaining ram: 265114932 kbytes
total ram: 268455948 kbytes
(qemu)
(qemu) info migrate
Migration status: active
transferred ram: 3341932 kbytes
remaining ram: 265116864 kbytes
total ram: 268455948 kbytes
(qemu) info migrate
Migration status: active
transferred ram: 3341932 kbytes
remaining ram: 265116880 kbytes
total ram: 268455948 kbytes

The issue still exists.
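To capture these stuck counters over time, a polling loop like the following can be used. This is a minimal sketch, assuming the monitor is exposed on a unix socket instead of stdio (e.g. by starting qemu-kvm with "-monitor unix:/tmp/vm1-mon,server,nowait"; the socket path is hypothetical):

#!/bin/sh
# Log the migration counters every 5 seconds, with a timestamp.
# The short sleep after the monitor command gives qemu time to reply
# before socat closes the connection.
while true; do
    date '+%H:%M:%S'
    ( echo "info migrate"; sleep 1 ) | socat - UNIX-CONNECT:/tmp/vm1-mon \
        | grep -E 'Migration status|ram:'
    sleep 5
done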
- Does it respond to pings during the migration?
- Can you retest 256G but with fewer vcpus, say -smp 1 or -smp 2? I'm trying to see where the problem is. Similarly, testing a 1G guest with -smp 16 would be helpful too.
================
-smp 2 -m 256G :
================
[root@intel-XE7450-512-1 ~]# ping 10.66.82.192
PING 10.66.82.192 (10.66.82.192) 56(84) bytes of data.
From 10.66.83.79 icmp_seq=32 Destination Host Unreachable
From 10.66.83.79 icmp_seq=33 Destination Host Unreachable
From 10.66.83.79 icmp_seq=34 Destination Host Unreachable
From 10.66.83.79 icmp_seq=36 Destination Host Unreachable
From 10.66.83.79 icmp_seq=37 Destination Host Unreachable

No response even after migration; after restarting the network inside the guest, it comes back. (Similar to bug 524651, but the guest is using an e1000 NIC in the test.)

A soft lockup was found in dmesg: http://pastebin.test.redhat.com/21126 (harmless, as in bug 512656?)

================
-smp 16 -m 2G :
================
At first it responds to ping; after a while (~1 min) there is no response. Pinging again returns "Destination Host Unreachable". The network does not come back until it is restarted after migration.

Will try smaller configurations (-smp 8 / -m 64) tomorrow.

Thanks,
Lijun Huang
================
-smp 2 -m 2G :
================
1. Mouse/keyboard/network (ping) work well during and after migration.

================
-smp 2 -m 64G :
================
1. Mouse/keyboard hang during migration, but come back after migration.
2. Inside the guest, pinging the host is continuous (but latency is large during migration).
3. From the host, pinging the guest:

64 bytes from 10.66.82.192: icmp_seq=54 ttl=64 time=303 ms
64 bytes from 10.66.82.192: icmp_seq=55 ttl=64 time=343 ms
64 bytes from 10.66.82.192: icmp_seq=56 ttl=64 time=31.9 ms
64 bytes from 10.66.82.192: icmp_seq=57 ttl=64 time=3.00 ms
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available

--- 10.66.82.192 ping statistics ---
73 packets transmitted, 57 received, 21% packet loss, time 91029ms
rtt min/avg/max/mdev = 0.000/320.319/5304.193/1017.810 ms, pipe 6

[root@intel-XE7450-512-1 ~]# ping 10.66.82.192
PING 10.66.82.192 (10.66.82.192) 56(84) bytes of data.
From 10.66.83.79 icmp_seq=9 Destination Host Unreachable
From 10.66.83.79 icmp_seq=10 Destination Host Unreachable
From 10.66.83.79 icmp_seq=11 Destination Host Unreachable

After migration, it comes back:

[root@intel-XE7450-512-1 ~]# ping 10.66.82.192
PING 10.66.82.192 (10.66.82.192) 56(84) bytes of data.
64 bytes from 10.66.82.192: icmp_seq=1 ttl=64 time=2.07 ms
64 bytes from 10.66.82.192: icmp_seq=2 ttl=64 time=12.5 ms
64 bytes from 10.66.82.192: icmp_seq=3 ttl=64 time=0.391 ms
64 bytes from 10.66.82.192: icmp_seq=4 ttl=64 time=0.663 ms
Hi, I need to fix this bug. How can I reproduce it? The machine I have here has only 100G of RAM.
Any chance it triggers at lower amounts of memory, say a 50G guest?
Does it always happen?
Can I somehow get access to a machine with big memory?
Is it still happening for you?
Thanks for the info!
I just sent a patch to rhvirt-patches to fix it. Thanks.
Reproduced, and patches have been posted for it.
*** Bug 601045 has been marked as a duplicate of this bug. ***
For the issue of a large guest (256G RAM + 16 vcpus) hanging during live migration:

Reproduced on kvm-83-164.el5
Verified on kvm-83-217.el5

Steps:
1. Start the VM on the src host:
/usr/libexec/qemu-kvm -m 128G -smp 16 -name VM1 -uuid 438915f2-c0fc-8d6b-bb06-b8ddd28046fa -no-kvm-pit-reinjection -boot c -drive file=/home/tt.img,if=virtio,index=0,boot=on,cache=none -net nic,macaddr=54:52:00:3a:d4:4d,vlan=0,model=virtio -net tap,script=/etc/qemu-ifup,vlan=0 -serial pty -parallel none -usbdevice tablet -k en-us -vnc :2 -monitor stdio
2. Start the listening port on the dst host.
3. Do the live migration.

Actual Results:
With kvm-83-164.el5, the guest hangs during migration.
With kvm-83-217.el5, the guest works well.
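Steps 2 and 3 presumably follow the standard flow; a sketch, with the destination host name (dst-host) and port (5800) again hypothetical:

# Step 2, on the dst host: same command line as in step 1, plus an incoming port
/usr/libexec/qemu-kvm -m 128G -smp 16 <other options as in step 1> -incoming tcp:0:5800

# Step 3, in the src VM's monitor:
(qemu) migrate -d tcp:dst-host:5800
(qemu) info migrate    (should eventually show "Migration status: completed")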
Referring to comment #0:

  Another phenomenon: if the destination VM is quit, 'info migrate' in the
  source VM's QEMU monitor still shows the migration as running. The values
  of "transferred ram" and "total ram" are stuck, while the value of
  "remaining ram" keeps increasing. E.g. (in this test, only 128G RAM was used)

Re-tested on kvm-83-217.el5: this issue still exists. The values of "transferred ram" and "total ram" are stuck; "remaining ram" increases a little, then gets stuck as well.

Expected Results:
'(qemu) info migrate' should show the migration as failed.

Based on the above, re-assigning this issue.
Eventually, the state of the migration will be shown as failed; it needs to detect that the other side has died. This is not a regression; it is something that has always been there. Opening another bugzilla for this is OK.
Moving back to ON_QA, based on comment #22.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0028.html