Bug 307591

Summary: XenD doesn't recover from xm migrate failure when ballooning fails
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.0
Target Release: 5.6
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: low
Priority: low
Reporter: Mark Nielsen <mnielsen>
Assignee: Michal Novotny <minovotn>
QA Contact: Virtualization Bugs <virt-bugs>
CC: areis, clalance, ghelleks, minovotn, xen-maint
Doc Type: Bug Fix
Bug Blocks: 514500
Last Closed: 2010-08-27 18:34:26 UTC

Description Mark Nielsen 2007-09-26 18:16:43 UTC
Description of problem:
When attempting to live migrate a VM using conga, rgmanager calls xm migrate.
If the migration to the destination node fails (in my situation, due to lack
of memory), libvirt doesn't provide the information needed to let rgmanager
know the migration has failed. rgmanager thinks the migration succeeded and
reports it as such, leaving a zombie migrating VM on the source node and
nothing running on the destination.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Attempt to migrate a VM to a node in the cluster that can't run it (in this
   case, one whose Domain-0 can't free enough memory for the guest).

Actual results:
The migration fails, but rgmanager thinks it succeeded and reports success to
cluster suite. The source node is left with a "hung" migrating VM, and no new
VM shows up on the destination.

Expected results:
If a migration fails, libvirt reports the failure to rgmanager; whatever action
happens from there would be up to cluster suite, I imagine, depending on
whether your policy is to relocate, disable, or restart.

Additional info:

Comment 1 Daniel Berrangé 2007-09-26 19:28:04 UTC
libvirt has no involvement with the migration process whatsoever. Failure to
report errors is either xm's or XenD's fault.


Comment 2 Daniel Berrangé 2007-09-26 19:43:45 UTC
Please provide the version numbers of the kernel-xen and xen RPMs, details of
the source and destination hosts, and the type of storage used. Also, please
reproduce using 'xm migrate' manually, with all conga/rgmanager involvement
disabled.


Comment 3 Mark Nielsen 2007-10-31 14:47:34 UTC
domain0:

[root@tsg03 xen]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0    27750     4 r-----   7625.7
Zombie-migrating-test04                   37        8     1 ---s-d     13.6

This is after I've attempted the migration (using cluster suite) and then did
an "xm destroy" on the domain. Before the "xm destroy" it simply stays in the
"migrating-test04" state and never recovers. As seen below, if I attempt the
migration using the xm commands directly (cluster suite not involved), the
domain does return to the running state after the migration failure.

[root@tsg02 log]# uname -a
Linux tsg02 2.6.18-8.1.14.el5xen #1 SMP Tue Sep 25 11:59:34 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

from /var/log/messages
Oct 30 08:06:19 tsg02 clurgmgrd[16000]: <notice> vm:test04 is now running locally 
I get this message even when the actual migration fails using cluster suite.

from /var/log/xen/xend.log (there is nothing in xend-debug.log)
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:265) XendDomainInfo.restore(['domain', ['domid', '37'], ['uuid', 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f'], ['vcpus', '1'], ['vcpu_avail', '1'], ['cpu_weight', '1.0'], ['memory', '2048'], ['shadow_memory', '0'], ['maxmem', '2048'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name', 'test04'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['ramdisk', '/var/lib/xen/initrd.4cpUvt'], ['kernel', '/var/lib/xen/vmlinuz.6V6MVa'], ['args', 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'xenbr0'], ['mac', '00:16:3e:45:da:36']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/VolGrp02/LVtest04'], ['mode', 'w']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cpu_time', '13.592719439'], ['online_vcpus', '1'], ['up_time', '1314.110075'], ['start_time', '1193744882.98'], ['store_mfn', '7978517'], ['console_mfn', '7978516']])
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:296) parseConfig: config is ['domain', ['domid', '37'], ['uuid', 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f'], ['vcpus', '1'], ['vcpu_avail', '1'], ['cpu_weight', '1.0'], ['memory', '2048'], ['shadow_memory', '0'], ['maxmem', '2048'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name', 'test04'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['ramdisk', '/var/lib/xen/initrd.4cpUvt'], ['kernel', '/var/lib/xen/vmlinuz.6V6MVa'], ['args', 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'xenbr0'], ['mac', '00:16:3e:45:da:36']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/VolGrp02/LVtest04'], ['mode', 'w']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cpu_time', '13.592719439'], ['online_vcpus', '1'], ['up_time', '1314.110075'], ['start_time', '1193744882.98'], ['store_mfn', '7978517'], ['console_mfn', '7978516']]
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:397) parseConfig: result is {'shadow_memory': 0, 'start_time': 1193744882.98, 'uuid': 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f', 'on_crash': 'restart', 'on_reboot': 'restart', 'localtime': None, 'image': ['linux', ['ramdisk', '/var/lib/xen/initrd.4cpUvt'], ['kernel', '/var/lib/xen/vmlinuz.6V6MVa'], ['args', 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet']], 'on_poweroff': 'destroy', 'bootloader_args': None, 'cpus': None, 'name': 'test04', 'backend': [], 'vcpus': 1, 'cpu_weight': 1.0, 'features': None, 'vcpu_avail': 1, 'memory': 2048, 'device': [('vif', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'xenbr0'], ['mac', '00:16:3e:45:da:36']]), ('vbd', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/VolGrp02/LVtest04'], ['mode', 'w']])], 'bootloader': '/usr/bin/pygrub', 'cpu': None, 'maxmem': 2048}
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:1264) XendDomainInfo.construct: None
[2007-10-30 08:06:13 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:715) Storing VM details: {'shadow_memory': '0', 'uuid': 'bace31ed-4bb0-0abc-4dc2-4386b0927b5f', 'on_reboot': 'restart', 'start_time': '1193744882.98', 'on_poweroff': 'destroy', 'name': 'test04', 'xend/restart_count': '0', 'vcpus': '1', 'vcpu_avail': '1', 'memory': '2048', 'on_crash': 'restart', 'image': "(linux (ramdisk /var/lib/xen/initrd.4cpUvt) (kernel /var/lib/xen/vmlinuz.6V6MVa) (args 'ro root=/dev/VolGrp00/LogVol01 console=xvc0 rhgb quiet'))", 'maxmem': '2048'}
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'mac': '00:16:3e:45:da:36', 'handle': '0', 'state': '1', 'backend': '/local/domain/0/backend/vif/11/0'} to /local/domain/11/device/vif/0.
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:112) DevController: writing {'bridge': 'xenbr0', 'domain': 'test04', 'handle': '0', 'script': '/etc/xen/scripts/vif-bridge', 'state': '1', 'frontend': '/local/domain/11/device/vif/0', 'mac': '00:16:3e:45:da:36', 'online': '1', 'frontend-id': '11'} to /local/domain/0/backend/vif/11/0.
[2007-10-30 08:06:13 xend 15903] DEBUG (blkif:24) exception looking up device number for xvda: [Errno 2] No such file or directory: '/dev/xvda'
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'virtual-device': '51712', 'device-type': 'disk', 'state': '1', 'backend': '/local/domain/0/backend/vbd/11/51712'} to /local/domain/11/device/vbd/51712.
[2007-10-30 08:06:13 xend 15903] DEBUG (DevController:112) DevController: writing {'domain': 'test04', 'frontend': '/local/domain/11/device/vbd/51712', 'dev': 'xvda', 'state': '1', 'params': '/dev/VolGrp02/LVtest04', 'mode': 'w', 'online': '1', 'frontend-id': '11', 'type': 'phy'} to /local/domain/0/backend/vbd/11/51712.
[2007-10-30 08:06:14 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:750) Storing domain details: {'console/port': '2', 'name': 'test04', 'console/limit': '1048576', 'vm': '/vm/bace31ed-4bb0-0abc-4dc2-4386b0927b5f', 'domid': '11', 'cpu/0/availability': 'online', 'memory/target': '2097152', 'store/port': '1'}
[2007-10-30 08:06:14 xend 15903] DEBUG (balloon:133) Balloon: 8584 KiB free; 0 to scrub; need 2105344; retries: 25.
[2007-10-30 08:06:46 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:1463) XendDomainInfo.destroy: domid=11
[2007-10-30 08:06:46 xend.XendDomainInfo 15903] DEBUG (XendDomainInfo:1471) XendDomainInfo.destroyDomain(11)
[2007-10-30 08:06:46 xend 15903] ERROR (XendDomain:268) Restore failed
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 263, in domain_restore_fd
    return XendCheckpoint.restore(self, fd)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 150, in restore
    balloon.free(xc.pages_to_kib(nr_pfns))
  File "/usr/lib64/python2.4/site-packages/xen/xend/balloon.py", line 166, in free
    raise VmError(
VmError: I need 2105344 KiB, but dom0_min_mem is 262144 and shrinking to 262144 KiB would leave only 1710728 KiB free.
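
The check that raises this VmError is the one referenced in the balloon.py frame above:
before restoring the incoming guest, XenD tries to balloon Domain-0 down (never below
dom0-min-mem) until enough memory is free, and gives up when that can't work. A minimal
sketch of that decision, reconstructed from the error message rather than taken from the
RHEL 5 source, with the numbers from this failure plugged in:

# Simplified sketch of xend's balloon-down check (illustrative only, not the
# exact balloon.py source). All sizes are in KiB, matching the VmError above.

class VmError(Exception):
    pass

def check_can_free(need_kib, free_kib, scrub_kib, dom0_cur_kib, dom0_min_kib):
    """Decide whether Domain-0 can be ballooned down far enough for the guest."""
    available = free_kib + scrub_kib              # already free or about to be
    if available >= need_kib:
        return                                    # nothing to balloon
    # The most that shrinking dom0 to its configured floor can ever contribute:
    reclaimable = max(dom0_cur_kib - dom0_min_kib, 0)
    if available + reclaimable < need_kib:
        raise VmError(
            "I need %d KiB, but dom0_min_mem is %d and shrinking to %d KiB "
            "would leave only %d KiB free."
            % (need_kib, dom0_min_kib, dom0_min_kib, available + reclaimable))
    # Otherwise xend lowers dom0's memory target and polls/retries (the
    # "retries: 25" seen in the balloon log line) until enough is free.

# Numbers from this bug: the guest needs 2105344 KiB (2048 MiB plus overhead),
# only 8584 KiB are free, and even at its 262144 KiB floor dom0 would leave
# just 1710728 KiB available -- hence the restore is aborted.
try:
    check_can_free(need_kib=2105344, free_kib=8584, scrub_kib=0,
                   dom0_cur_kib=262144 + (1710728 - 8584),
                   dom0_min_kib=262144)
except VmError as e:
    print("xend aborts the restore:", e)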


[root@tsg03 ~]# xm migrate --live test04 tsg02-gfs
Error: /usr/lib64/xen/bin/xc_save 23 7 0 0 1 failed
Usage: xm migrate <Domain> <Host>

Migrate a domain to another machine.

Options:

-h, --help           Print this help.
-l, --live           Use live migration.
-p=portnum, --port=portnum
                     Use specified port for migration.
-r=MBIT, --resource=MBIT
                     Set level of resource usage for migration.

[root@tsg03 ~]# 

At this point I have pretty much the same information in /var/log/xend.log,
but the domain properly falls back to running on tsg03. So, to answer your
question about running "xm migrate" manually: it does properly determine that
a failure occurred, and the domain returns to the running state on the
original node. It is only after a failed migration using conga (or clusvcadm)
that the VM does not properly fall back to the original node.
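
Given that xm migrate itself does report the failure here, the missing piece appears to
be on the calling side. As a purely hypothetical illustration (this is not rgmanager's
actual vm resource agent), any wrapper around xm migrate has to check the command's exit
status and only report success to the cluster layer when it is zero, assuming xm exits
non-zero on a failed migration as the run above suggests:

# Hypothetical wrapper around "xm migrate" -- not rgmanager's vm.sh agent.
# The point: a failed migration must be propagated, not reported as success.
import subprocess
import sys

def migrate(domain, host, live=True):
    cmd = ["xm", "migrate"]
    if live:
        cmd.append("--live")
    cmd += [domain, host]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # xm prints e.g. "Error: /usr/lib64/xen/bin/xc_save ... failed" here.
        sys.stderr.write(proc.stderr)
        return False
    return True

if __name__ == "__main__":
    ok = migrate("test04", "tsg02-gfs")
    sys.exit(0 if ok else 1)   # the cluster layer should only see success on 0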


domU:

Linux localhost.localdomain 2.6.18-8.el5xen #1 SMP Fri Jan 26 14:29:35 EST 2007
x86_64 x86_64 x86_64 GNU/Linux


This migration does succeed if there is enough memory available on the
destination (i.e. if there are no other domains running there).
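
One rough way to avoid tripping over this in the first place, sketched here as a hedged
helper rather than anything shipped with xen or cluster suite: xm info on the destination
reports free_memory (in MiB), which can be compared against the guest's configured memory
before attempting the migration (it ignores whatever dom0 ballooning could still free):

# Hedged pre-flight helper (not part of xen or cluster suite): compare the
# guest's memory with the destination's free memory before migrating.
import subprocess

def xm_info_field(field, cmd=("xm", "info")):
    # e.g. cmd=("ssh", "tsg02-gfs", "xm", "info") to query the destination
    out = subprocess.run(list(cmd), capture_output=True, text=True,
                         check=True).stdout
    for line in out.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == field:
            return value.strip()
    raise KeyError(field)

free_mib = int(xm_info_field("free_memory"))   # reported in MiB by xm info
guest_mib = 2048                               # 'memory' from the test04 config
print("enough memory" if free_mib >= guest_mib else "destination too small")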

Comment 4 Daniel Berrangé 2007-10-31 14:51:58 UTC
Ok, comment #3 at least hints at a way to reproduce the issue. The logs show
that XenD on the target machine was unable to balloon down Dom0 enough to
accept the incoming guest domain. At a guess, I'd say the migration code isn't
handling this failure to balloon and is just giving up, leaving this
still-born zombie domain.
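
The xend.log in comment #3 shows the destination does tear the half-built domain down
(XendDomainInfo.destroy, then "Restore failed"), so the recovery the summary asks for is
mostly about propagation: the restore path has to clean up locally and re-raise, so the
sending side learns the migration failed and can resume the original guest instead of
leaving it in the migrating state. A rough sketch of that shape, based on the traceback
above and not on the actual XendCheckpoint code:

# Rough sketch only -- not the actual XendCheckpoint.restore() implementation.
# If ballooning dom0 fails, tear down the half-built domain and re-raise so
# the sender learns the migration failed and can resume the original guest.
def restore_incoming(dominfo, fd, nr_pfns, balloon, xc, do_restore):
    try:
        balloon.free(xc.pages_to_kib(nr_pfns))   # raises VmError in this bug
        return do_restore(dominfo, fd)           # placeholder for the real work
    except Exception:
        dominfo.destroy()      # don't leave a still-born zombie domain behind
        raise                  # propagate the failure back over the channel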

Comment 5 Mark Nielsen 2007-11-01 12:18:17 UTC
is it the migration code that's failing? When I attempt the migration manually
as stated in comment #3, the domU will at least return to a running state after
the failure. It seems to me that cluster suite (rgmanager?) is not allowing the
domU to return to running on the original node after the migration failure.

Comment 10 Michal Novotny 2010-05-05 09:14:33 UTC
(In reply to comment #5)
> is it the migration code that's failing? When I attempt the migration manually
> as stated in comment #3, the domU will at least return to a running state after
> the failure. It seems to me that cluster suite (rgmanager?) is not allowing the
> domU to return to running on the original node after the migration failure.    

Mark, I was unable to reproduce it, but I don't have a cluster environment set
up. Could you please retest using the latest virttest version of the xen
package and the latest kernel-xen package, i.e. kernel-xen-2.6.18-194.el5 and
xen-3.0.3-107.el5virttest25 ( http://people.redhat.com/mrezanin/xen/ )? Also,
is there a way to reproduce it in a non-cluster environment? Could you give me
some steps to do it?

Thanks,
Michal

Comment 11 Michal Novotny 2010-05-25 11:00:55 UTC
Any news regarding this bug?

Michal

Comment 13 Bill Burns 2010-08-27 18:34:26 UTC
Closed, internal old bug. No response to needinfo. Please reopen if info surfaces.