Bug 632557 - Migration with STRESS caused guest hang
Summary: Migration with STRESS caused guest hang
Keywords:
Status: CLOSED DUPLICATE of bug 643970
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 6.1
Assignee: Juan Quintela
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: Rhel6KvmTier1
 
Reported: 2010-09-10 11:17 UTC by Keqin Hong
Modified: 2013-01-11 03:17 UTC (History)
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-04 12:47:23 UTC
Target Upstream Version:
Embargoed:


Attachments
kvm_stat log (8.38 KB, text/plain): 2010-09-10 11:18 UTC, Keqin Hong

Description Keqin Hong 2010-09-10 11:17:11 UTC
Description of problem:
Local migration with stress running in the guest couldn't finish. After terminating the ./stress process, migration completed; however, the guest sometimes hung.

Version-Release number of selected component (if applicable):
host:
qemu-kvm-0.12.1.2-2.113.el6.x86_64
kernel-2.6.32-71.el6.x86_64
guest:
RHEL5.5.z-64
kernel-2.6.18-194.11.3.el5

How reproducible:
3/20

Steps to Reproduce:
1. start src VM with 4vcpu and 8G mem
/usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 8G -smp 4,sockets=4,cores=1,threads=1 -name rhel5-64 -uuid d1a201e7-7109-507d-cb9a-b010becc6c6b -nodefconfig -nodefaults -monitor stdio -rtc base=utc -boot c -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/root/rhel5-64.img,if=none,id=drive-ide0-0-0,boot=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:94:3f:29,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 10.66.85.229:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
2. run ./stress --cpu 4 --vm 16 --vm-bytes 256M --verbose
3. start dest VM in listening mode 
   ... --incoming tcp:0:5800
4. migrate from src to dest
5. ^C to terminate ./stress
6. wait for migration to complete
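
A scripted sketch of the flow above (assumptions flagged: the monitor is on stdio, so the (qemu) lines are typed by hand, and <dest-host> is a placeholder for the destination address):

# destination host: same command line as step 1, plus the listening option
#   /usr/libexec/qemu-kvm <step-1 options> --incoming tcp:0:5800
# inside the guest: generate CPU and memory load (long-form stress flags)
./stress --cpu 4 --vm 16 --vm-bytes 256M --verbose &
# source monitor:
#   (qemu) migrate -d tcp:<dest-host>:5800
#   (qemu) info migrate            # repeat; stays "active" while under load
# step 5: interrupt stress in the guest, then wait for migration to complete
kill %1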
  
Actual results:
Guest hung (desktop and console); couldn't ping its Ethernet interface

Expected results:
no hang

Additional info:
Tested on both a 64-CPU Intel box and an AMD box (virtlab: amd-2471-32-1); both showed the problem.
It seems to happen only with the RHEL 5.5.z 64-bit guest (kernel-2.6.18-194.11.3.el5), as I didn't reproduce it with a 32-bit guest.

top:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
60919 root      20   0 8713m 4.0g 3128 R 11.6  3.7   1:03.07 qemu-kvm
60907 root      20   0 8713m 4.0g 3128 S  9.6  3.7   1:07.83 qemu-kvm
60916 root      20   0 8713m 4.0g 3128 S  0.0  3.7   0:04.34 qemu-kvm
60917 root      20   0 8713m 4.0g 3128 S  0.0  3.7   0:02.52 qemu-kvm
60918 root      20   0 8713m 4.0g 3128 S  0.0  3.7   0:02.28 qemu-kvm
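
Those five entries are threads of one qemu-kvm process (note the identical VIRT/RES). A per-thread view like this can be captured with, for example (a sketch; <pid> is the qemu-kvm process id):

top -H -p <pid>
ps -L -p <pid> -o lwp,pcpu,stat,comm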

# gdb attach 60916
(gdb) bt
#0  0x0000003581ad95f7 in ioctl () from /lib64/libc.so.6
#1  0x000000000042a57f in kvm_run (env=0x16ffd10) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:928
#2  0x000000000042aa09 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1658
#3  0x000000000042b62f in kvm_main_loop_cpu (_env=0x16ffd10) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1900
#4  ap_main_loop (_env=0x16ffd10) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1950
#5  0x00000035822077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003581ae153d in clone () from /lib64/libc.so.6

# strace -p 60916
Process 60916 attached - interrupt to quit
rt_sigtimedwait([BUS RT_6], 0x7ffae001fb70, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigpending([ALRM])                   = 0

# strace -p 60919
Process 60919 attached - interrupt to quit
rt_sigtimedwait([BUS RT_6], 0x7ffade219b70, {0, 0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigpending([])                       = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
ioctl(14, 0xae80, 0)                    = 0
ioctl(5, 0xffffffffc008ae67, 0x7ffade219bb0) = 0
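
For reference, the raw ioctl numbers above decode to standard KVM calls (a sketch assuming the usual KVM ABI, where KVMIO is 0xAE; fd 14 would then be a vcpu fd and fd 5 the VM fd):

# _IO(0xAE, 0x80) == KVM_RUN: the vcpu thread re-entering guest mode,
# matching the backtrace above that is blocked in kvm_run()
printf '0x%x\n' $(( (0xAE << 8) | 0x80 ))                           # 0xae80
# _IOWR(0xAE, 0x67, 8-byte struct kvm_irq_level) == KVM_IRQ_LINE_STATUS;
# strace sign-extends the int argument, hence the 0xffffffff prefix
printf '0x%x\n' $(( 0xC0000000 | (8 << 16) | (0xAE << 8) | 0x67 ))  # 0xc008ae67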

Comment 1 RHEL Program Management 2010-09-10 11:17:55 UTC
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 2 Keqin Hong 2010-09-10 11:18:30 UTC
Created attachment 446491 [details]
kvm_stat log

Comment 4 Glauber Costa 2010-09-13 13:02:03 UTC
What's the output of 5 consecutive "info migrate" commands at the qemu monitor console when the migration is stalled?

Comment 5 Keqin Hong 2010-09-14 01:59:29 UTC
(qemu) info migrate 
Migration status: active
transferred ram: 10860216 kbytes
remaining ram: 3079744 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 11212244 kbytes
remaining ram: 3058088 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 11686008 kbytes
remaining ram: 2987196 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 12123908 kbytes
remaining ram: 3125364 kbytes
total ram: 8405440 kbytes
(qemu) info migrate 
Migration status: active
transferred ram: 12189708 kbytes
remaining ram: 3125872 kbytes
total ram: 8405440 kbytes
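
A quick delta over those samples (arithmetic sketch using only the numbers above) shows why this never converges: hundreds of MB are transferred per sample, yet remaining ram stays near 3 GB, i.e., pages are re-dirtied about as fast as they are sent:

# growth of "transferred ram" between consecutive samples, in kbytes
echo $(( 11212244 - 10860216 ))   # 352028
echo $(( 11686008 - 11212244 ))   # 473764
echo $(( 12123908 - 11686008 ))   # 437900
echo $(( 12189708 - 12123908 ))   # 65800
# "remaining ram" over the same samples: 3079744 3058088 2987196 3125364 3125872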

Comment 6 Glauber Costa 2010-09-14 11:46:32 UTC
Ok, so on the migration side it really does seem that the reason is that we're dirtying pages faster than we transfer them, and not some other mystical reason.

I still don't have a theory on why it hangs after migration is finished.

It would be good to rule out dirty pages as a driver for this. Can you try migrating again, but this time issue, right before migration:

(qemu) migrate_set_speed 4G 

This should transfer the pages, regardless of the memory pressure we're seeing...

Comment 7 Keqin Hong 2010-09-14 15:04:11 UTC
I tried with (qemu) migrate_set_speed 4G right before migration. Migration first succeeded from A to B with no problem, but the guest hung after the migration from B to A.

(qemu) migrate_set_speed 4G
(qemu) migrate -d tcp:10.66.86.26:5831
(qemu) info migrate
Migration status: active
transferred ram: 3670644 kbytes
remaining ram: 4734932 kbytes
total ram: 8405440 kbytes
(qemu) info migrate
Migration status: completed

Comment 8 Glauber Costa 2010-09-14 15:42:14 UTC
Ok, let me get it straight:

You do set_speed from A -> B, and it works
You *DO NOT* do set_speed from B -> A, and then it hangs.

Is that correct?

Comment 9 Keqin Hong 2010-09-15 01:58:13 UTC
(In reply to comment #8)
> Ok, let me get it straight:
> 
> You do set_speed from A -> B, and it works
> You *DO NOT* do set_speed from B -> A, and then it hangs.
> 
> Is that correct?
No, I did set_speed in both directions before migrating.
From A -> B, migration finished and the guest continued to work. From B -> A, migration also completed, but the guest hung. It may just show that under high memory stress, migration can still cause the guest to hang even with set_speed, just not 100% reproducibly.

Thanks.

Comment 10 Dor Laor 2010-10-17 14:03:09 UTC
This is a very good test case for live migration!

When the guest hangs, is there a message? Can you still see the screen with VNC?
Without live migration, does a guest running stress ever hang?

Comment 11 Keqin Hong 2010-10-18 02:53:49 UTC
(In reply to comment #10)
> This is a very good test case for live migration!
> 
> When the guest hangs, is there a message?
No message that I observed.
> Can you still see the screen with VNC?
Yes, I can. But the guest hung: no network, and no mouse/keyboard input accepted.
> Without live migration, does a guest running stress ever hang?
No, it won't.

Comment 12 Juan Quintela 2011-02-04 12:47:23 UTC

*** This bug has been marked as a duplicate of bug 643970 ***

