Bug 632557

Summary: Migration with STRESS caused guest hang
Product: Red Hat Enterprise Linux 6
Reporter: Keqin Hong <khong>
Component: qemu-kvm
Assignee: Juan Quintela <quintela>
Status: CLOSED DUPLICATE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: low
Version: 6.0
CC: bcao, llim, michen, mkenneth, plyons, rwu, tburke, virt-maint
Target Milestone: rc
Keywords: RHELNAK
Target Release: 6.1
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-02-04 12:47:23 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580951    
Attachments:
  kvm_stat log (Flags: none)

Description Keqin Hong 2010-09-10 11:17:11 UTC
Description of problem:
Local migration with stress running could not finish. After terminating the ./stress process, migration completed; however, the guest sometimes hung afterwards.

Version-Release number of selected component (if applicable):
host:
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64
guest:
RHEL5.5.z-64
kernel-2.6.18-194.11.3.el5

How reproducible:
3/20

Steps to Reproduce:
1. start src VM with 4vcpu and 8G mem
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/root/rhel5-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.85.229:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
2. run ./stress --cpu 4 --vm 16 --vm-bytes 256M --verbose
3. start dest VM in listening mode 
   ... --incoming tcp:0:5800
4. migrate from src to dest
5. ^C to terminate ./stress
6. wait for migration to complete
  
Actual results:
Guest hung (desktop and console); its Ethernet interface could not be pinged.

Expected results:
Migration completes and the guest keeps running normally.

Additional info:
Tested on both a 64-CPU Intel box and an AMD box (virtlab: amd-2471-32-1); both showed the problem.
It seems to happen only with the RHEL 5.5.z 64-bit guest (kernel-2.6.18-194.11.3.el5), as I could not reproduce it with a 32-bit guest.

top:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                           
60919 root      20   0 8713m 4.0g 3128 R 11.6  3.7   1:03.07 qemu-kvm                                                                                                                                          
60907 root      20   0 8713m 4.0g 3128 S  9.6  3.7   1:07.83 qemu-kvm                                                                                                                                          
60916 root      20   0 8713m 4.0g 3128 S  0.0  3.7   0:04.34 qemu-kvm                                                                                                                                          
60917 root      20   0 8713m 4.0g 3128 S  0.0  3.7   0:02.52 qemu-kvm                                                                                                                                          
60918 root      20   0 8713m 4.0g 3128 S  0.0  3.7   0:02.28 qemu-kvm                                                                                                                 

# gdb attach 60916
(gdb) bt
#0  0x0000003581ad95f7 in ioctl () from /lib64/libc.so.6
#1  0x000000000042a57f in kvm_run (env=0x16ffd10) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:928
#2  0x000000000042aa09 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1658
#3  0x000000000042b62f in kvm_main_loop_cpu (_env=0x16ffd10) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1900
#4  ap_main_loop (_env=0x16ffd10) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1950
#5  0x00000035822077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003581ae153d in clone () from /lib64/libc.so.6

# strace -p 60916
Process 60916 attached - interrupt to quit
rt_sigtimedwait([BUS RT_6], 0x7ffae001fb70, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigpending([ALRM])                   = 0

# strace -p 60919
Process 60919 attached - interrupt to quit
rt_sigtimedwait([BUS RT_6], 0x7ffade219b70, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigpending([])                       = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
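For reference, the two request numbers that dominate the strace output can be decoded with the kernel's _IOC bit layout (a sketch; assumes the standard definitions in linux/kvm.h, where KVMIO is 0xAE, KVM_RUN is _IO(KVMIO, 0x80) and KVM_IRQ_LINE_STATUS is _IOWR(KVMIO, 0x67, struct kvm_irq_level)). The 0xffffffffc008ae67 in the log is just 0xc008ae67 sign-extended by strace. In other words, the vcpu thread is still entering the guest (KVM_RUN) and interrupts are still being injected, so qemu-kvm itself has not deadlocked:

```python
# Decode Linux ioctl request numbers using the _IOC bit layout:
# dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits).

def ioc(direction, ioc_type, nr, size):
    """Rebuild an ioctl request number from its _IOC fields."""
    return (direction << 30) | (size << 16) | (ioc_type << 8) | nr

IOC_NONE, IOC_WRITE, IOC_READ = 0, 1, 2
KVMIO = 0xAE  # KVM's ioctl type byte in linux/kvm.h

# KVM_RUN is _IO(KVMIO, 0x80): no direction bits, no payload.
KVM_RUN = ioc(IOC_NONE, KVMIO, 0x80, 0)

# KVM_IRQ_LINE_STATUS is _IOWR(KVMIO, 0x67, struct kvm_irq_level),
# an 8-byte struct, hence the 0x0008 in the size field.
KVM_IRQ_LINE_STATUS = ioc(IOC_READ | IOC_WRITE, KVMIO, 0x67, 8)

print(hex(KVM_RUN))              # matches the ioctl(14, 0xae80, 0) calls
print(hex(KVM_IRQ_LINE_STATUS))  # matches the ioctl(5, 0x...c008ae67, ...) calls
```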

Comment 1 RHEL Program Management 2010-09-10 11:17:55 UTC
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 2 Keqin Hong 2010-09-10 11:18:30 UTC
Created attachment 446491 [details]
kvm_stat log

Comment 4 Glauber Costa 2010-09-13 13:02:03 UTC
What is the output of 5 consecutive "info migrate" commands at the qemu monitor console when the migration is stalled?

Comment 5 Keqin Hong 2010-09-14 01:59:29 UTC
(qemu) info migrate 
Migration status: active
transferred ram: 10860216 kbytes
remaining ram: 3079744 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 11212244 kbytes
remaining ram: 3058088 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 11686008 kbytes
remaining ram: 2987196 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 12123908 kbytes
remaining ram: 3125364 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 12189708 kbytes
remaining ram: 3125872 kbytes
total ram: 8405440 kbytes
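Differencing the five samples above makes the stall visible (a sketch; the sampling interval is unknown, so these are per-interval volumes, not true rates). In each interval the guest re-dirties roughly as much RAM as migration manages to send, so "remaining ram" hovers around 3 GB instead of converging:

```python
# Rough dirty-volume estimate from the five "info migrate" samples above.
transferred = [10860216, 11212244, 11686008, 12123908, 12189708]  # kbytes
remaining   = [3079744, 3058088, 2987196, 3125364, 3125872]       # kbytes

for i in range(1, len(transferred)):
    sent = transferred[i] - transferred[i - 1]
    # remaining shrinks by what was sent and grows by what was re-dirtied:
    #   remaining[i] = remaining[i-1] - sent + dirtied
    dirtied = remaining[i] - remaining[i - 1] + sent
    print(f"interval {i}: sent {sent} kB, re-dirtied ~{dirtied} kB")
```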

Comment 6 Glauber Costa 2010-09-14 11:46:32 UTC
Ok, so on the migration side, it really does seem that we are dirtying pages faster than we can transfer them, and not some other mystical reason.

I still don't have a theory on why it hangs after migration finishes.

It would be good to rule out dirty pages as a driver for this. Can you try migrating again, but this time issuing, right before the migration:

(qemu) migrate_set_speed 4G 

This should transfer the pages, regardless of the memory pressure we're seeing...
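Back-of-the-envelope arithmetic for why raising the cap should help (a sketch; assumes a local link that can actually sustain these rates, and the historical 32 MB/s default migration speed limit in qemu-kvm of this era): at 4 GB/s the whole 8 GB of guest RAM copies in about 2 seconds, versus over 4 minutes at the default cap, leaving far less time for the stress workload to re-dirty pages:

```python
# Compare the time to copy all guest RAM once at the default migration
# bandwidth cap vs. after "migrate_set_speed 4G".
total_ram_kb = 8405440          # "total ram" reported by info migrate
default_cap = 32 * 1024**2      # assumed 32 MB/s default cap, bytes/s
boosted_cap = 4 * 1024**3       # migrate_set_speed 4G, bytes/s

for name, cap in [("32M default", default_cap), ("4G boosted", boosted_cap)]:
    seconds = total_ram_kb * 1024 / cap
    print(f"{name}: ~{seconds:.1f} s to copy all RAM once")
```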

Comment 7 Keqin Hong 2010-09-14 15:04:11 UTC
I tried issuing (qemu) migrate_set_speed 4G right before each migration. Migration from A to B first succeeded with no problem, but migration from B to A triggered a guest hang.

(qemu) migrate_set_speed 4G
(qemu) migrate -d tcp:10.66.86.26:5831
(qemu) info migrate
Migration status: active
transferred ram: 3670644 kbytes
remaining ram: 4734932 kbytes
total ram: 8405440 kbytes
(qemu) info migrate
Migration status: completed

Comment 8 Glauber Costa 2010-09-14 15:42:14 UTC
Ok, let me get it straight:

You do set_speed from A -> B, and it works
You *DO NOT* do set_speed from B -> A, and then it hangs.

Is that correct?

Comment 9 Keqin Hong 2010-09-15 01:58:13 UTC
(In reply to comment #8)
> Ok, let me get it straight:
> 
> You do set_speed from A -> B, and it works
> You *DO NOT* do set_speed from B -> A, and then it hangs.
> 
> Is that correct?
No, I did set_speed before both migrations.
From A -> B, migration finished and the guest continued to work. From B -> A, migration also completed, but the guest hung. This may just show that under high memory stress, migration can still hang the guest even with set_speed; it is simply not 100% reproducible.

Thanks.

Comment 10 Dor Laor 2010-10-17 14:03:09 UTC
This is a very good test case for live migration!

When the guest hangs, is there a message? Can you still see the screen over VNC?
Without live migration, does a guest running stress ever hang?

Comment 11 Keqin Hong 2010-10-18 02:53:49 UTC
(In reply to comment #10)
> This is a very good test case for live migration!
> 
> When the guest hangs, is there a message?
No message that I observed.
> Can you still see the screen over VNC?
Yes, I can. But the guest is hung: no network, and no mouse/keyboard input accepted.
> Without live migration, does a guest running stress ever hang?
No, it won't.

Comment 12 Juan Quintela 2011-02-04 12:47:23 UTC

*** This bug has been marked as a duplicate of bug 643970 ***