Bug 307591
Summary: | XenD doesn't recover from xm migrate failure when ballooning fails | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Mark Nielsen <mnielsen>
Component: | xen | Assignee: | Michal Novotny <minovotn>
Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | low | Docs Contact: |
Priority: | low | |
Version: | 5.0 | CC: | areis, clalance, ghelleks, minovotn, xen-maint
Target Milestone: | --- | |
Target Release: | 5.6 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2010-08-27 18:34:26 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 514500 | |
Description (Mark Nielsen, 2007-09-26 18:16:43 UTC)
libvirt has no involvement with the migration process whatsoever; any failure to report errors is the fault of either xm or XenD. Please provide the version numbers of the kernel-xen and xen RPMs, details of the source and destination hosts, and the type of storage used. Also, please reproduce the problem by running `xm migrate` manually, with all of the conga/rgmanager machinery disabled.

On domain0:

```
[root@tsg03 xen]# xm list
Name                     ID Mem(MiB) VCPUs State   Time(s)
Domain-0                  0    27750     4 r-----   7625.7
Zombie-migrating-test04  37        8     1 ---s-d     13.6
```

This is after I attempted the migration (using Cluster Suite) and then ran `xm destroy` on the domain. Before the `xm destroy` it simply stays in the "migrating-test04" state and never recovers. As seen below, if I attempt the migration with the `xm` commands directly (Cluster Suite not involved), the domain does return to the running state after the migration fails.

```
[root@tsg02 log]# uname -a
Linux tsg02 2.6.18-8.1.14.el5xen #1 SMP Tue Sep 25 11:59:34 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
```

From /var/log/messages:

```
Oct 30 08:06:19 tsg02 clurgmgrd[16000]: <notice> vm:test04 is now running locally
```

I get this message even when the actual migration fails under Cluster Suite.
From /var/log/xen/xend.log (there is nothing in xend-debug.log):

```
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:265) XendDomainInfo.restore(['domain', ['domid', '37'], ['uuid', 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f'], ['vcpus', '1'], ['vcpu_avail', '1'], ['cpu_weight', '1.0'], ['memory', '2048'], ['shadow_memory', '0'], ['maxmem', '2048'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name', 'test04'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['ramdisk', '/var/lib/xen/initrd.4cpUvt'], ['kernel', '/var/lib/xen/vmlinuz.6V6MVa'], ['args', 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'xenbr0'], ['mac', '00:16:3e:45:da:36']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/VolGrp02/LVtest04'], ['mode', 'w']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cpu_time', '13.592719439'], ['online_vcpus', '1'], ['up_time', '1314.110075'], ['start_time', '1193744882.98'], ['store_mfn', '7978517'], ['console_mfn', '7978516']])
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:296) parseConfig: config is ['domain', ['domid', '37'], ['uuid', 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f'], ['vcpus', '1'], ['vcpu_avail', '1'], ['cpu_weight', '1.0'], ['memory', '2048'], ['shadow_memory', '0'], ['maxmem', '2048'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name', 'test04'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['ramdisk', '/var/lib/xen/initrd.4cpUvt'], ['kernel', '/var/lib/xen/vmlinuz.6V6MVa'], ['args', 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'xenbr0'], ['mac', '00:16:3e:45:da:36']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/VolGrp02/LVtest04'], ['mode', 'w']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cpu_time', '13.592719439'], ['online_vcpus', '1'], ['up_time', '1314.110075'], ['start_time', '1193744882.98'], ['store_mfn', '7978517'], ['console_mfn', '7978516']]
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:397) parseConfig: result is {'shadow_memory': 0, 'start_time': 1193744882.98, 'uuid': 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f', 'on_crash': 'restart', 'on_reboot': 'restart', 'localtime': None, 'image': ['linux', ['ramdisk', '/var/lib/xen/initrd.4cpUvt'], ['kernel', '/var/lib/xen/vmlinuz.6V6MVa'], ['args', 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet']], 'on_poweroff': 'destroy', 'bootloader_args': None, 'cpus': None, 'name': 'test04', 'backend': [], 'vcpus': 1, 'cpu_weight': 1.0, 'features': None, 'vcpu_avail': 1, 'memory': 2048, 'device': [('vif', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'xenbr0'], ['mac', '00:16:3e:45:da:36']]), ('vbd', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/VolGrp02/LVtest04'], ['mode', 'w']])], 'bootloader': '/usr/bin/pygrub', 'cpu': None, 'maxmem': 2048}
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:1264) XendDomainInfo.construct: None
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:715) Storing VM details: {'shadow_memory': '0', 'uuid': 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f', 'on_reboot': 'restart', 'start_time': '1193744882.98', 'on_poweroff': 'destroy', 'name': 'test04', 'xend/restart_count': '0', 'vcpus': '1', 'vcpu_avail': '1', 'memory': '2048', 'on_crash': 'restart', 'image': "(linux (ramdisk /var/lib/xen/initrd.4cpUvt) (kernel /var/lib/xen/vmlinuz.6V6MVa) (args 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet'))", 'maxmem': '2048'}
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'mac': '00:16:3e:45:da:36', 'handle': '0', 'state': '1', 'backend': '/local/domain/0/backend/vif/11/0'} to /local/domain/11/device/vif/0.
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:112) DevController: writing {'bridge': 'xenbr0', 'domain': 'test04', 'handle': '0', 'script': '/etc/xen/scripts/vif-bridge', 'state': '1', 'frontend': '/local/domain/11/device/vif/0', 'mac': '00:16:3e:45:da:36', 'online': '1', 'frontend-id': '11'} to /local/domain/0/backend/vif/11/0.
[2007-10-30 08:06:13 xend 15903] DEBUG (blkif:24) exception looking up device number for xvda: [Errno 2] No such file or directory: '/dev/xvda'
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'virtual-device': '51712', 'device-type': 'disk', 'state': '1', 'backend': '/local/domain/0/backend/vbd/11/51712'} to /local/domain/11/device/vbd/51712.
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:112) DevController: writing {'domain': 'test04', 'frontend': '/local/domain/11/device/vbd/51712', 'dev': 'xvda', 'state': '1', 'params': '/dev/VolGrp02/LVtest04', 'mode': 'w', 'online': '1', 'frontend-id': '11', 'type': 'phy'} to /local/domain/0/backend/vbd/11/51712.
[2007-10-30 08:06:14 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:750) Storing domain details: {'console/port': '2', 'name': 'test04', 'console/limit': '1048576', 'vm': '/vm/bace31ed-4bb0-0abc-4dc2-4386b0927b5f', 'domid': '11', 'cpu/0/availability': 'online', 'memory/target': '2097152', 'store/port': '1'}
[2007-10-30 08:06:14 xend 15903] DEBUG (balloon:133) Balloon: 8584 KiB free; 0 to scrub; need 2105344; retries: 25.
```
```
[2007-10-30 08:06:46 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:1463) XendDomainInfo.destroy: domid=11
[2007-10-30 08:06:46 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:1471) XendDomainInfo.destroyDomain(11)
[2007-10-30 08:06:46 xend 15903] ERROR (XendDomain:268) Restore failed
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 263, in domain_restore_fd
    return XendCheckpoint.restore(self, fd)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 150, in restore
    balloon.free(xc.pages_to_kib(nr_pfns))
  File "/usr/lib64/python2.4/site-packages/xen/xend/balloon.py", line 166, in free
    raise VmError(
VmError: I need 2105344 KiB, but dom0_min_mem is 262144 and shrinking to 262144 KiB would leave only 1710728 KiB free.
```

Attempting the migration manually:

```
[root@tsg03 ~]# xm migrate --live test04 tsg02-gfs
Error: /usr/lib64/xen/bin/xc_save 23 7 0 0 1 failed
Usage: xm migrate <Domain> <Host>

Migrate a domain to another machine.

Options:
  -h, --help                  Print this help.
  -l, --live                  Use live migration.
  -p=portnum, --port=portnum  Use specified port for migration.
  -r=MBIT, --resource=MBIT    Set level of resource usage for migration.

[root@tsg03 ~]#
```

At this point I have pretty much the same information in /var/log/xend.log, but the domain properly falls back to running on tsg03. So, to answer your question about running `xm migrate` manually: it does detect that a failure occurred, and the domain returns to the running state on the original node. It is only after a failed migration via conga (or clusvcadm) that the VM does not fall back to the original node.

domainU:

```
Linux localhost.localdomain 2.6.18-8.el5xen #1 SMP Fri Jan 26 14:29:35 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
```

This migration does succeed when there is enough memory available (i.e. when no other domains are running).

OK, comment #3 at least hints at a way to reproduce the issue.
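The arithmetic behind the VmError in the traceback can be sketched as follows. This is a hypothetical illustration, not the actual code in xend's balloon.py: the function name is made up, and the Dom0 size of 1964288 KiB is an assumed figure chosen so that the numbers line up with the log (8584 KiB free, dom0_min_mem of 262144 KiB, 1710728 KiB available after shrinking).

```python
# Hypothetical sketch of the free-memory check behind the VmError above.
# The real implementation is xen/xend/balloon.py; names and the Dom0 size
# below are illustrative assumptions, not the actual xend code.

def balloon_shortfall_kib(need_kib, free_kib, dom0_current_kib, dom0_min_mem_kib):
    """Return the KiB still missing after shrinking Dom0 to its configured
    minimum, or 0 if the incoming domain would fit."""
    reclaimable = max(dom0_current_kib - dom0_min_mem_kib, 0)
    available = free_kib + reclaimable
    return max(need_kib - available, 0)

# Figures from the log: the restore needs 2105344 KiB, only 8584 KiB are
# free, and shrinking Dom0 to its 262144 KiB minimum would leave just
# 1710728 KiB available, so the restore falls short and fails.
shortfall = balloon_shortfall_kib(
    need_kib=2105344,
    free_kib=8584,
    dom0_current_kib=1964288,   # assumed Dom0 size, consistent with the log
    dom0_min_mem_kib=262144,
)
print(shortfall)  # -> 394616
```

When the shortfall is non-zero, xend raises the VmError seen above; the complaint in this bug is about what happens next, not about the check itself.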
The logs show that XenD on the target machine was unable to balloon down Dom0 enough to accept the incoming guest domain. At a guess, the migration code isn't handling this ballooning failure and simply gives up, leaving this still-born zombie domain.

Is it the migration code that's failing? When I attempt the migration manually, as described in comment #3, the domU will at least return to a running state after the failure. It seems to me that Cluster Suite (rgmanager?) is not allowing the domU to return to running on the original node after the migration failure.

(In reply to comment #5)
> is it the migration code that's failing? When I attempt the migration manually
> as stated in comment #3, the domU will at least return to a running state after
> the failure. It seems to me that cluster suite (rgmanager?) is not allowing the
> domU to return to running on the original node after the migration failure.

Mark, I was unable to reproduce this, but I do not have a cluster environment set up. Could you please retest using the latest virttest build of the xen package and the latest kernel-xen package, i.e. kernel-xen-2.6.18-194.el5 and xen-3.0.3-107.el5virttest25 (http://people.redhat.com/mrezanin/xen/)? Also, is there a way to reproduce this in a non-cluster environment? Could you give me the steps to do so?

Thanks,
Michal

Any news regarding this bug?

Michal

Closed: internal old bug, no response to needinfo. Please reopen if the requested information surfaces.
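As background on the fix direction the comments hint at (tearing down the still-born domain when ballooning fails mid-restore, instead of leaving a zombie), here is a minimal sketch of that cleanup pattern. All names here are illustrative stand-ins, not xend's real API; it only shows the try/except shape the comments suggest is missing.

```python
# Hypothetical sketch: if ballooning fails while restoring an incoming
# domain, destroy the partially constructed domain rather than leaving a
# zombie behind. Names are illustrative, not xend's actual API.

class VmError(Exception):
    pass

class FakeDomain:
    """Stand-in for the partially constructed incoming domain."""
    def __init__(self):
        self.destroyed = False
    def destroy(self):
        self.destroyed = True

def free_memory(need_kib, available_kib):
    """Mimic the balloon check: raise VmError when memory cannot be freed."""
    if need_kib > available_kib:
        raise VmError("I need %d KiB, but only %d KiB free." % (need_kib, available_kib))

def restore(dominfo, need_kib, available_kib):
    try:
        free_memory(need_kib, available_kib)
        # ... continue with xc_restore, device setup, etc. ...
        return dominfo
    except VmError:
        # The step the comments say is missing: tear down the half-built
        # domain so it does not linger as a zombie, then re-raise so the
        # caller still sees the failure.
        dominfo.destroy()
        raise

dom = FakeDomain()
try:
    restore(dom, need_kib=2105344, available_kib=1710728)
except VmError:
    pass
assert dom.destroyed  # no zombie left behind
```

Whether the real defect lay in XenD's restore path or in rgmanager's failure handling was never settled before the bug was closed.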