Bug 697821

Summary: fail to migrate multiple guests at the same time
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Xen Maintainance List <xen-maint>
QA Contact: Virtualization Bugs <virt-bugs>
Reporter: Qixiang Wan <qwan>
CC: drjones, jzheng, leiwang, minovotn, mrezanin, xen-maint, yuzhang, yuzhou
Status: CLOSED WONTFIX
Last Closed: 2011-04-27 14:53:56 UTC
Doc Type: Bug Fix
Bug Blocks: 699611
Attachments:
- xend log from source host
- hypervisor log (xm dmesg) from source host
- xend log from destination host
- hypervisor log (xm dmesg) from destination host

Description Qixiang Wan 2011-04-19 11:37:41 UTC
Description of problem:
When trying to migrate multiple guests at the same time, some of the guests fail to
come up on the destination host.

Version-Release number of selected component (if applicable):
RHEL5.6 (kernel-xen-2.6.18-238.el5, xen-3.0.3-120.el5)
RHEL5.7 (kernel-xen-2.6.18-257.el5, xen-3.0.3-128.el5)

How reproducible:
100%

Steps to Reproduce:
1. boot up several guests on host
I tried 5 HVM guests on a 12G memory host, 512MB memory for each one:
$ cat t1.cfg
name='t1'
maxmem = 512
memory = 512
vcpus = 1
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
boot = "c"
pae = 1
acpi = 1
apic = 1
localtime = 0
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
sdl = 0
vnc = 1
vncunused = 1
vnclisten = "0.0.0.0"
device_model = "/usr/lib64/xen/bin/qemu-dm"
disk = [ "file:/data/export/t1.img,hda,w" ]
vif = [ "mac=00:16:36:40:12:01,bridge=xenbr0,script=vif-bridge" ]
serial = "pty"

$ xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     9387     8 r-----     72.8
t1                                         1      519     1 -b----     37.2
t2                                         2      519     1 -b----     35.4
t3                                         3      519     1 -b----     35.8
t4                                         4      519     1 -b----     36.6
t5                                         5      519     1 -b----     36.8

2. migrate the DomUs to the destination at the same time:
$ for i in t1 t2 t3 t4 t5; do ( xm migrate -l $i $dst_host_ip & );done

3. wait for the migrations to finish
  
Actual results:
[1] 2 of the 5 guests failed to migrate:
-------------------------------------------------------------------------------
Error: /usr/lib64/xen/bin/xc_save 22 2 0 0 5 failed
Usage: xm migrate <Domain> <Host>

Migrate a domain to another machine.

Options:

-h, --help           Print this help.
-l, --live           Use live migration.
-p=portnum, --port=portnum
                     Use specified port for migration.
-r=MBIT, --resource=MBIT
                     Set level of resource usage for migration.

Error: /usr/lib64/xen/bin/xc_save 33 5 0 0 5 failed
Usage: xm migrate <Domain> <Host>

Migrate a domain to another machine.

Options:

-h, --help           Print this help.
-l, --live           Use live migration.
-p=portnum, --port=portnum
                     Use specified port for migration.
-r=MBIT, --resource=MBIT
                     Set level of resource usage for migration.
-------------------------------------------------------------------------------

[2] and on the destination host, the hypervisor keeps printing "(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 512)"

Please refer to the attachments for the xend and hypervisor (xm dmesg) logs from both the source and destination hosts.

Expected results:
all the guests should be migrated to the destination successfully.

Additional info:
[1] the defect can be reproduced on both AMD and Intel hosts
[2] it happens with both PV and HVM guests
[3] on the 12G memory host, migrating 4 guests (512M memory each) sometimes succeeds, but the failure is easily reproduced when migrating 5 guests.
On 4G memory hosts, migrating 3 guests (512M memory each) sometimes succeeds, while the failure is easily reproduced with 4 guests.

Comment 1 Qixiang Wan 2011-04-19 11:39:04 UTC
Created attachment 493159 [details]
xend log from source host

Comment 2 Qixiang Wan 2011-04-19 11:39:57 UTC
Created attachment 493160 [details]
hypervisor log (xm dmesg) from source host

Comment 3 Qixiang Wan 2011-04-19 11:41:15 UTC
Created attachment 493162 [details]
xend log from destination host

Comment 4 Qixiang Wan 2011-04-19 11:41:55 UTC
Created attachment 493163 [details]
hypervisor log (xm dmesg) from destination host

Comment 8 Andrew Jones 2011-04-19 15:10:56 UTC
Doing the migrations in parallel (i.e. backgrounding each command with '&') seems like a stress test to me. Doing each migration sequentially should definitely work, and I hope it does work. The resulting performance (speedup over sequential) of doing all the migrations in parallel is dependent on the host (# cpus, memory topology, network connections). And, most importantly, dom0 needs free memory to transfer the memory of its guests, which is actually seen in the results above; when there was more memory, there were better results. Running out of memory is probably why we're failing when we fail. So I think this BZ is a good candidate for WONTFIX.

Determining a maximum number of parallel migrations based on the host setup and memory allocation of the guests, and then blocking additional migration attempts until one or more migrations complete, sounds like a nice feature request. However, xen isn't really accepting feature requests at this stage in its life...
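
For what it's worth, a minimal sketch of what such a throttle could look like as a shell-level workaround is below (purely illustrative and untested here; the concurrency limit of 2, the guest names, and $dst_host_ip are assumptions taken from the reproduction steps above):

# Hypothetical sketch: keep at most MAX_JOBS migrations in flight instead of
# backgrounding all of them at once. MAX_JOBS=2 is an assumed value; it should
# be tuned to the free memory available in dom0 on both hosts.
MAX_JOBS=2
for dom in t1 t2 t3 t4 t5; do
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 5                          # wait for a running migration to finish
    done
    xm migrate -l "$dom" "$dst_host_ip" &
done
wait                                     # block until the last migration completes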

Comment 9 Michal Novotny 2011-04-20 11:11:42 UTC
According to the log from the source machine, it's failing because the connection can't be established. See this line:

[2009-04-15 13:48:49 xend 4673] INFO (XendCheckpoint:498) Saving memory pages: iter 1   0%ERROR Internal error: Error when writing to state file (5) (errno 2)

and then it fails because the connection is established but is subsequently reset by the peer:

[2009-04-15 13:48:50 xend 4673] INFO (XendCheckpoint:498) Saving memory pages: iter 1   0 [snip] 70%ERROR Internal error: Error when writing to state file (2) (errno 104)

# perror 2
OS error code 2: No such file or directory
# perror 104
OS error code 104: Connection reset by peer

According to the destination log, it seems that domain IDs 2, 3, and 5 are restored successfully and their device models exist (they have PIDs according to the log file); however, domain ID 4 fails because it cannot allocate memory:

[2011-04-19 07:09:31 xend 4782] INFO (XendCheckpoint:498) ERROR Internal error: Failed to allocate memory for batch.!

There's no trace of domain 1's result, and it appears to be failing with the message: "Not enough memory is available, and dom0 cannot be shrunk any further."

The "(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 512)" message seems to be indicating that the guest memory has not been transferred by trying to access the invalid memory blocks.

But basically I agree with Drew's suggestion to WONTFIX this, since it appears to be stress-related.

Michal

Comment 10 Michal Novotny 2011-04-27 14:53:56 UTC
I've been testing this in the Beaker lab on the 32G IBM machines, and it seems that closing as WONTFIX is appropriate, so I'm closing it now.

Michal