Description of problem:
When trying to migrate multiple guests at the same time, some of the guests
fail to come up on the destination host.

Version-Release number of selected component (if applicable):
RHEL5.6 (kernel-xen-2.6.18-238.el5, xen-3.0.3-120.el5)
RHEL5.7 (kernel-xen-2.6.18-257.el5, xen-3.0.3-128.el5)

How reproducible:
100%

Steps to Reproduce:
1. Boot up several guests on the source host. I tried 5 HVM guests on a host
with 12G of memory, 512MB of memory for each one:

$ cat t1.cfg
name = 't1'
maxmem = 512
memory = 512
vcpus = 1
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
boot = "c"
pae = 1
acpi = 1
apic = 1
localtime = 0
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
sdl = 0
vnc = 1
vncunused = 1
vnclisten = "0.0.0.0"
device_model = "/usr/lib64/xen/bin/qemu-dm"
disk = [ "file:/data/export/t1.img,hda,w" ]
vif = [ "mac=00:16:36:40:12:01,bridge=xenbr0,script=vif-bridge" ]
serial = "pty"

$ xm list
Name                 ID  Mem(MiB)  VCPUs  State  Time(s)
Domain-0              0      9387      8  r-----    72.8
t1                    1       519      1  -b----    37.2
t2                    2       519      1  -b----    35.4
t3                    3       519      1  -b----    35.8
t4                    4       519      1  -b----    36.6
t5                    5       519      1  -b----    36.8

2. Migrate the DomUs to the destination at the same time:

$ for i in t1 t2 t3 t4 t5; do ( xm migrate -l $i $dst_host_ip & ); done

3. Wait for the migrations to finish.

Actual results:
[1] 2 of the 5 guests failed to migrate:
-------------------------------------------------------------------------------
Error: /usr/lib64/xen/bin/xc_save 22 2 0 0 5 failed
Usage: xm migrate <Domain> <Host>

Migrate a domain to another machine.

Options:
 -h, --help           Print this help.
 -l, --live           Use live migration.
 -p=portnum, --port=portnum
                      Use specified port for migration.
 -r=MBIT, --resource=MBIT
                      Set level of resource usage for migration.

Error: /usr/lib64/xen/bin/xc_save 33 5 0 0 5 failed
Usage: xm migrate <Domain> <Host>

Migrate a domain to another machine.

Options:
 -h, --help           Print this help.
 -l, --live           Use live migration.
 -p=portnum, --port=portnum
                      Use specified port for migration.
 -r=MBIT, --resource=MBIT
                      Set level of resource usage for migration.
-------------------------------------------------------------------------------

[2] On the destination host, the hypervisor keeps printing:
(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 512)

Please refer to the attachments for the xend and hypervisor (xm dmesg) logs
from both the source and destination hosts.

Expected results:
All the guests should be migrated to the destination successfully.

Additional info:
[1] The defect can be reproduced on both AMD and Intel hosts.
[2] It happens with both PV and HVM guests.
[3] There is a chance that migrating 4 guests (512M of memory each) succeeds
on the 12G host, but the failure is easily reproduced when migrating 5. On 4G
hosts, migrating 3 guests (512M each) sometimes succeeds, and the failure is
easily reproduced with 4.
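Before retrying, it can help to compare the hypervisor's free memory on the
destination against the total memory of the guests being sent; a minimal
check, assuming the xen 3.0.x toolstack, whose 'xm info' output includes
total_memory and free_memory fields (in MiB):

$ xm info | grep -E 'total_memory|free_memory'

If free_memory on the destination is smaller than the sum of the incoming
guests' allocations, at least some of the parallel migrations can be expected
to fail.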
Created attachment 493159: xend log from source host
Created attachment 493160: hypervisor log (xm dmesg) from source host
Created attachment 493162: xend log from destination host
Created attachment 493163: hypervisor log (xm dmesg) from destination host
Doing the migrations in parallel (i.e. backgrounding each command with '&')
seems like a stress test to me. Doing each migration sequentially should
definitely work, and I expect it does. The speedup from running the
migrations in parallel (versus sequentially) depends on the host (number of
CPUs, memory topology, network connections). Most importantly, dom0 needs
free memory to transfer the memory of its guests, which is exactly what the
results above show: more host memory gave better results. Running out of
memory is most likely why the failing migrations fail. So I think this BZ is
a good candidate for WONTFIX.

Determining a maximum number of parallel migrations based on the host setup
and the memory allocation of the guests, and then blocking additional
migration attempts until one or more migrations complete, sounds like a nice
feature request (a client-side sketch of the idea follows). However, xen
isn't really accepting feature requests at this stage in its life...
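For illustration, here is a minimal shell sketch of that throttling idea.
This is not an existing xm feature; MAX_JOBS is a hypothetical cap that
would have to be tuned to the host's free dom0 memory, and $dst_host_ip is
the same variable used in the reproduction steps:

#!/bin/bash
# Client-side throttle: run at most MAX_JOBS concurrent 'xm migrate'
# processes instead of backgrounding all of them at once.
MAX_JOBS=2    # hypothetical cap; tune to the dom0/destination free memory

for i in t1 t2 t3 t4 t5; do
    # Block while MAX_JOBS migrations are still running.
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 1
    done
    xm migrate -l "$i" "$dst_host_ip" &
done
wait    # let the last migrations drain

With MAX_JOBS=1 this degenerates to the sequential case, which should always
work.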
According to the log from the source machine, it first fails because the
connection can't be established. See the line:

[2009-04-15 13:48:49 xend 4673] INFO (XendCheckpoint:498) Saving memory pages: iter 1 0%ERROR Internal error: Error when writing to state file (5) (errno 2)

Later it fails even though the connection is established, because the
connection is reset by the peer:

[2009-04-15 13:48:50 xend 4673] INFO (XendCheckpoint:498) Saving memory pages: iter 1 0 [snip] 70%ERROR Internal error: Error when writing to state file (2) (errno 104)

# perror 2
OS error code 2: No such file or directory
# perror 104
OS error code 104: Connection reset by peer

According to the destination log, domain IDs 2, 3, and 5 are restored
successfully and their device models exist (they have PIDs in the log file),
but domain ID 4 fails because it cannot allocate memory:

[2011-04-19 07:09:31 xend 4782] INFO (XendCheckpoint:498) ERROR Internal error: Failed to allocate memory for batch.!

There's no trace of domain 1's result, and it appears to be failing with the
message:

Not enough memory is available, and dom0 cannot be shrunk any further.

The "(XEN) memory.c:124:d0 Could not allocate order=0 extent: id=0 memflags=0
(0 of 512)" message seems to indicate that the guest's memory pages could not
be allocated on the destination, i.e. the guest memory was never actually
transferred. But basically I agree with Drew's suggestion to WONTFIX this,
since it appears to be stress-related.

Michal
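As an aside, for anyone without the perror utility (it ships with MySQL
rather than the base system), the same errno lookups can be done from the
stock interpreter; a sketch using the python 2 found on RHEL5-era hosts:

$ python -c 'import errno, os; print errno.errorcode[2], "-", os.strerror(2)'
ENOENT - No such file or directory
$ python -c 'import errno, os; print errno.errorcode[104], "-", os.strerror(104)'
ECONNRESET - Connection reset by peer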
I've been testing this in the Beaker lab with the 32G IBM machines, and it
seems that closing as WONTFIX is appropriate, so I'm closing it now.

Michal