Description of problem:
Migrating a guest from the source host to the destination host while an application is running inside it sometimes fails on the destination host. I hit this problem about 4~5 times today, in the following scenarios:
(1) Migrating the guest during image installation (package installation stage).
(2) Migrating the guest while the cdrom is in use:
(in guest) # while true; do cp -r /media/RHEL7\ X86_64/ /home/test; sleep 1; rm -rf /home/test; done
(3) Migrating the guest while stress is running inside it:
(in guest) # stress -m 2

All the above scenarios are with xbzrle=on when I do the migration. I'm not sure whether xbzrle=off could also trigger it, because the bug is not always reproduced and confirming that would take many attempts.

Version-Release number of selected component (if applicable):
kernel-3.10.0-63.el7.x86_64
qemu-kvm-1.5.3-24.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Boot up a guest on the source host:
/usr/libexec/qemu-kvm -cpu SandyBridge -enable-kvm -m 2048 \
-smp 2,sockets=2,cores=1,threads=1 -enable-kvm -name t2-rhel6.4-32 \
-uuid 61b6c504-5a8b-4fe1-8347-6c929b750dde -k en-us \
-rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=input0 \
-drive file=/mnt/installation.qcow2,if=none,id=disk0,format=qcow2,werror=stop,rerror=stop,aio=native \
-device ide-drive,bus=ide.0,unit=1,drive=disk0,id=disk0 \
-drive file=/mnt/boot.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw \
-device ide-drive,drive=drive-ide0-1-0,bus=ide.1,unit=0,id=cdrom \
-netdev tap,id=hostnet0 \
-device rtl8139,netdev=hostnet0,id=net0,mac=44:37:E6:5E:91:85,bus=pci.0,addr=0x5 \
-monitor stdio -qmp tcp:0:6666,server,nowait \
-chardev socket,path=/tmp/isa-serial,server,nowait,id=isa1 \
-device isa-serial,chardev=isa1,id=isa-serial1 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 \
-chardev socket,id=charchannel0,path=/tmp/serial-socket,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm \
-chardev socket,path=/tmp/foo,server,nowait,id=foo \
-device virtconsole,chardev=foo,id=console0 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 \
-vnc :10 -k en-us -boot dc \
-chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
-device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 \
-global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0
2. Enable xbzrle and set the cache size:
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 2G
3. Run one of the following 3 scenarios:
(1) Install the guest (for an empty image).
(2) Read the cdrom inside the guest (for a pre-installed image):
(in guest) # while true; do cp -r /media/RHEL7\ X86_64/ /home/test; sleep 1; rm -rf /home/test; done
(3) Run stress inside the guest:
(in guest) # stress -m 2
4. Migrate the guest:
(qemu) migrate -d tcp:t2:5800

Actual results:
The guest fails to load on the destination host:
(qemu) info status
VM status: paused (inmigrate)
(qemu) qemu: warning: error while loading state section id 2
load of migration failed

Expected results:
The guest should be migrated successfully and work well.

Additional info:
Sometimes the issue happens before the migration has finished, so the guest is still running on the src host. But sometimes the problem happens just as the migration finishes; then the guest is dead.
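For reference, the destination side is not shown above. For a TCP migration as in step 4, the guest on the destination host (t2) would be started with the same command line plus an -incoming option so it waits for the migration stream; a minimal sketch, assuming the port from step 4:

# on destination host t2, same options as in step 1, plus:
/usr/libexec/qemu-kvm <options from step 1> -incoming tcp:0:5800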
Hi,

I suspect this is related to the high memory usage of the XBZRLE feature.
What is the amount of memory the hosts have?
Can you print the memory usage when the migration fails?
Does it still happen when you set the cache size to a smaller value
(migrate_set_cache_size)?

Thanks,
Orit
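One concrete way to collect the data asked for above (a sketch; these are standard HMP commands in this qemu-kvm version, but exact output formats may vary): info migrate_cache_size shows the current xbzrle cache size, info migrate includes xbzrle page and cache-miss counters while a migration is in flight, and migrate_set_cache_size lowers the cache for a retry.

(qemu) info migrate_cache_size
(qemu) info migrate
(qemu) migrate_set_cache_size 512M

And on the host, at the moment of failure:
# free -m; cat /proc/meminfo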
(In reply to Orit Wasserman from comment #2)
> Hi,
>
> I suspect this is related to the high memory usage of the XBZRLE feature.
> What is the amount of memory the hosts have?

The host has 8G mem.

> Can you print the memory usage when the migration fails?

Below is the host memory usage when the migration fails. I re-tested and reproduced it again, still with a 2G migration cache size.

# cat /proc/meminfo
MemTotal:        7911636 kB
MemFree:         4280912 kB
Buffers:              36 kB
Cached:          1333200 kB
SwapCached:          968 kB
Active:          2421036 kB
Inactive:        1050756 kB
Active(anon):    2069740 kB
Inactive(anon):    75688 kB
Active(file):     351296 kB
Inactive(file):   975068 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8273916 kB
SwapFree:        8272240 kB
Dirty:                88 kB
Writeback:             0 kB
AnonPages:       2138048 kB
Mapped:            21272 kB
Shmem:              6852 kB
Slab:              54464 kB
SReclaimable:      19512 kB
SUnreclaim:        34952 kB
KernelStack:        1560 kB
PageTables:        11228 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12229732 kB
Committed_AS:    2567384 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      150644 kB
VmallocChunk:   34359584748 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1992704 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       98072 kB
DirectMap2M:     4091904 kB
DirectMap1G:     4194304 kB

# free -m
             total       used       free     shared    buffers     cached
Mem:          7726       3545       4180          6          0       1301
-/+ buffers/cache:       2243       5482
Swap:         8079          1       8078

> Does it still happen when you set the cache size to a smaller value
> (migrate_set_cache_size)?

I have not reproduced the bug with a smaller value so far (I used a 512M cache size and tried 5 times already).

>
> Thanks,
> Orit
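A single snapshot like the above can miss a transient spike, so it may also help to sample the qemu-kvm process RSS continuously while the migration runs (a sketch using standard tools; the pgrep pattern is an assumption about how the process appears on this host):

# pid=$(pgrep -f qemu-kvm | head -n1)
# while kill -0 "$pid" 2>/dev/null; do grep VmRSS /proc/$pid/status; sleep 1; done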
(In reply to Qunfang Zhang from comment #0)
> How reproducible:
> Always

Sometimes (as in summary).