Bug 469130 - Xen live migration may fail due to fragmented memory
Xen live migration may fail due to fragmented memory
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen (Show other bugs)
5.2
x86_64 Linux
medium Severity high
: rc
: ---
Assigned To: Jiri Denemark
Red Hat Kernel QE team
:
Depends On:
Blocks: 512913
  Show dependency treegraph
 
Reported: 2008-10-29 22:35 EDT by Sean E. Millichamp
Modified: 2010-10-23 01:28 EDT (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-02 04:47:18 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Proposed patch to fix this bug (12.18 KB, patch)
2009-03-26 10:47 EDT, Jiri Denemark
no flags Details | Diff

  None (edit)
Description Sean E. Millichamp 2008-10-29 22:35:05 EDT
Description of problem:

I installed a simple domU guest running the latest updates.  If vcpus=1 then I perform live migrations between two dom0 hosts apparently as many times as I want with no problems.

When I increased vcpus to 2 I was able to perform a live migration from dom0 A to dom0 B, but then when I ran live migration from dom0 B to dom0 A my domU crashed on dom0 B.  While I reproduced vcpus=2 for this bug report, a coworker has told me that it is even more reproducable when vcpus > 2.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-92.1.13.el5 9 (on both hosts and guest)

How reproducible: Frequently.

Steps to Reproduce:
1. Install a fully updated domU paravirt guest
2. virsh setvcpus livetest 2
3. On dom0a: virsh start livetest
4. Run a while loop in an SSH session to keep livetest doing something I can monitor during the migration.  I used 'while true ; do date +%s ; sleep .1 ; done'
5. virsh migrate --live livetest xen+ssh://dom0b/
6. On dom0b: virsh migrate --live livetest xen+ssh://dom0a/

Do steps 4 and 5 until it crashes.
  
Actual results:

# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
 13 migrating-livetest   crashed

Expected results:

Live migration should work for any number of vcpus - or the tools should prevent you from trying.

Additional info:

dom0a and dom0b are Dell R200s with full BIOS updates.  The shared storage is a clvmd managed volume group.  When vcpus=1 I can perform an apparently unlimited number of live migrations with otherwise identical configurations.

Following the failed migration the Xen environment on dom0b seemed damaged.  I had to 'service xend restart' in order to be able to continue to perform some normal Xen operations.

Logs during a failure when attempting migration from dom0b to dom0a:

dom0b:

[2008-10-29 21:48:08 xend 3102] DEBUG (balloon:133) Balloon: 356 KiB free; 0 to scrub; need 4096; retries: 20.
[2008-10-29 21:48:08 xend 3102] DEBUG (balloon:148) Balloon: setting dom0 target to 2986 MiB.
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:1079) Setting memory target of domain Domain-0 (0) to 2986 MiB.
[2008-10-29 21:48:08 xend 3102] DEBUG (balloon:127) Balloon: 4452 KiB free; need 4096; done.
[2008-10-29 21:48:08 xend 3102] DEBUG (XendCheckpoint:89) [xc_save]: /usr/lib64/xen/bin/xc_save 4 4 0 0 1
[2008-10-29 21:48:08 xend 3102] INFO (XendCheckpoint:351) ERROR Internal error: Couldn't enable shadow mode
[2008-10-29 21:48:08 xend 3102] INFO (XendCheckpoint:351) Save exit rc=1
[2008-10-29 21:48:08 xend 3102] ERROR (XendCheckpoint:133) Save failed on domain livetest (4).
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 110, in save
    forkHelper(cmd, fd, saveInputHandler, False)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 339, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib64/xen/bin/xc_save 4 4 0 0 1 failed
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:1601) XendDomainInfo.resumeDomain(4)
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] WARNING (XendDomainInfo:923) Domain has crashed: name=migrating-livetest id=4.
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:1614) XendDomainInfo.resumeDomain: devices released
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:791) Storing domain details: {'console/ring-ref': '187233', 'console/port': '2', 'name': '
migrating-livetest', 'console/limit': '1048576', 'vm': '/vm/7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'domid': '4', 'cpu/0/availability': 'online', 'memory/target
': '524288', 'store/ring-ref': '171941', 'cpu/1/availability': 'online', 'store/port': '1'}
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] ERROR (XendDomainInfo:1631) XendDomainInfo.resume: xc.domain_resume failed on domain 4.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1625, in resumeDomain
    self.createDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1682, in createDevices
    self.createDevice(n, c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1155, in createDevice
    return self.getDeviceController(deviceClass).createDevice(devconfig)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 106, in createDevice
    raise VmError("Device %s is already connected." % dev_str)
VmError: Device 0 (vkbd) is already connected.
[2008-10-29 21:48:09 xend 3102] DEBUG (XendCheckpoint:136) XendCheckpoint.save: resumeDomain

On dom0a:

[2008-10-29 21:48:08 xend 3104] DEBUG (tcp:66) Match dom0b.example.com ^.*\.example\.com$
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:275) XendDomainInfo.restore(['domain', ['domid',
 '4'], ['uuid', '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f'], ['vcpus', '2'], ['vcpu_avail', '3'], ['cpu_weight', '1.0'], 
['memory', '512'], ['shadow_memory', '0'], ['maxmem', '512'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name
', 'livetest'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['
kernel', '/var/lib/xen/boot_kernel.mmh2Kr'], ['ramdisk', '/var/lib/xen/boot_ramdisk.NGpVbu'], ['args', 'ro root=LABEL
=/ console=tty0 console=xvc0,115200n8r']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge',
 'brbond0'], ['mac', '00:16:3e:16:8e:40']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'ph
y:/dev/mapper/xenguests-livetest--disk0'], ['mode', 'w']]], ['device', ['vkbd', ['backend', '0']]], ['device', ['vfb'
, ['backend', '0'], ['type', 'vnc'], ['vncunused', '1']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cp
u_time', '2.288732157'], ['online_vcpus', '2'], ['up_time', '206.301565886'], ['start_time', '1225331082.4'], ['store
_mfn', '171941'], ['console_mfn', '187233']])
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:306) parseConfig: config is ['domain', ['domid',
 '4'], ['uuid', '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f'], ['vcpus', '2'], ['vcpu_avail', '3'], ['cpu_weight', '1.0'], 
['memory', '512'], ['shadow_memory', '0'], ['maxmem', '512'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name
', 'livetest'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['
kernel', '/var/lib/xen/boot_kernel.mmh2Kr'], ['ramdisk', '/var/lib/xen/boot_ramdisk.NGpVbu'], ['args', 'ro root=LABEL
=/ console=tty0 console=xvc0,115200n8r']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge',
 'brbond0'], ['mac', '00:16:3e:16:8e:40']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'ph
y:/dev/mapper/xenguests-livetest--disk0'], ['mode', 'w']]], ['device', ['vkbd', ['backend', '0']]], ['device', ['vfb'
, ['backend', '0'], ['type', 'vnc'], ['vncunused', '1']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cp
u_time', '2.288732157'], ['online_vcpus', '2'], ['up_time', '206.301565886'], ['start_time', '1225331082.4'], ['store
_mfn', '171941'], ['console_mfn', '187233']]
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:411) parseConfig: result is {'shadow_memory': 0,
 'start_time': 1225331082.4000001, 'uuid': '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'on_crash': 'restart', 'on_reboot'
: 'restart', 'localtime': None, 'image': ['linux', ['kernel', '/var/lib/xen/boot_kernel.mmh2Kr'], ['ramdisk', '/var/l
ib/xen/boot_ramdisk.NGpVbu'], ['args', 'ro root=LABEL=/ console=tty0 console=xvc0,115200n8r']], 'on_poweroff': 'destr
oy', 'bootloader_args': None, 'cpus': None, 'name': 'livetest', 'backend': [], 'vcpus': 2, 'cpu_weight': 1.0, 'featur
es': None, 'vcpu_avail': 3, 'memory': 512, 'device': [('vif', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['b
ridge', 'brbond0'], ['mac', '00:16:3e:16:8e:40']]), ('vbd', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname',
 'phy:/dev/mapper/xenguests-livetest--disk0'], ['mode', 'w']]), ('vkbd', ['vkbd', ['backend', '0']]), ('vfb', ['vfb',
 ['backend', '0'], ['type', 'vnc'], ['vncunused', '1']])], 'bootloader': '/usr/bin/pygrub', 'cpu': None, 'maxmem': 51
2}
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:1350) XendDomainInfo.construct: None
[2008-10-29 21:48:08 xend 3104] DEBUG (balloon:127) Balloon: 528460 KiB free; need 2048; done.
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:756) Storing VM details: {'shadow_memory': '0', 
'uuid': '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'on_reboot': 'restart', 'start_time': '1225331082.4', 'on_poweroff': 
'destroy', 'name': 'livetest', 'vcpus': '2', 'vcpu_avail': '3', 'memory': '512', 'on_crash': 'restart', 'image': "(li
nux (kernel /var/lib/xen/boot_kernel.mmh2Kr) (ramdisk /var/lib/xen/boot_ramdisk.NGpVbu) (args 'ro root=LABEL=/ consol
e=tty0 console=xvc0,115200n8r'))", 'maxmem': '512'}
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'state': '1', 'backend-id': '0', 'b
ackend': '/local/domain/0/backend/vkbd/11/0'} to /local/domain/11/device/vkbd/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'frontend-id': '11', 'domain': 'liv
etest', 'frontend': '/local/domain/11/device/vkbd/0', 'state': '1', 'online': '1'} to /local/domain/0/backend/vkbd/11
/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'mac': '00:16:3e
:16:8e:40', 'handle': '0', 'state': '1', 'backend': '/local/domain/0/backend/vif/11/0'} to /local/domain/11/device/vi
f/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'bridge': 'brbond0', 'domain': 'liv
etest', 'handle': '0', 'script': '/etc/xen/scripts/vif-bridge', 'state': '1', 'frontend': '/local/domain/11/device/vi
f/0', 'mac': '00:16:3e:16:8e:40', 'online': '1', 'frontend-id': '11'} to /local/domain/0/backend/vif/11/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (blkif:24) exception looking up device number for xvda: [Errno 2] No such file 
or directory: '/dev/xvda'
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'virtual-device'
: '51712', 'device-type': 'disk', 'state': '1', 'backend': '/local/domain/0/backend/vbd/11/51712'} to /local/domain/1
1/device/vbd/51712.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'domain': 'livetest', 'frontend': '
/local/domain/11/device/vbd/51712', 'format': 'raw', 'dev': 'xvda', 'state': '1', 'params': '/dev/mapper/xenguests-li
vetest--disk0', 'mode': 'w', 'online': '1', 'frontend-id': '11', 'type': 'phy'} to /local/domain/0/backend/vbd/11/517
12.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'state': '1', 'backend-id': '0', 'b
ackend': '/local/domain/0/backend/vfb/11/0'} to /local/domain/11/device/vfb/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'vncunused': '1', 'domain': 'livete
st', 'frontend': '/local/domain/11/device/vfb/0', 'online': '1', 'frontend-id': '11', 'type': 'vnc', 'state': '1'} to
 /local/domain/0/backend/vfb/11/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (vfbif:70) No VNC passwd configured for vfb access
[2008-10-29 21:48:08 xend 3104] DEBUG (vfbif:11) Spawn: ['/usr/lib64/xen/bin/qemu-dm', '-M', 'xenpv', '-d', '11', '-d
omain-name', 'livetest', '-vnc', '127.0.0.1:0', '-vncunused']
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:791) Storing domain details: {'console/port': '2
', 'name': 'livetest', 'console/limit': '1048576', 'vm': '/vm/7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'domid': '11', '
cpu/0/availability': 'online', 'memory/target': '524288', 'cpu/1/availability': 'online', 'store/port': '1'}
[2008-10-29 21:48:08 xend 3104] DEBUG (XendCheckpoint:198) restore:shadow=0x0, _static_max=0x200, _static_min=0x200, 
[2008-10-29 21:48:08 xend 3104] DEBUG (balloon:127) Balloon: 528452 KiB free; need 524288; done.
[2008-10-29 21:48:08 xend 3104] DEBUG (XendCheckpoint:215) [xc_restore]: /usr/lib64/xen/bin/xc_restore 15 11 1 2 0 0 
0
[2008-10-29 21:48:09 xend 3104] INFO (XendCheckpoint:351) ERROR Internal error: read: p2m_size
[2008-10-29 21:48:09 xend 3104] INFO (XendCheckpoint:351) Restore exit with rc=1
[2008-10-29 21:48:09 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:1560) XendDomainInfo.destroy: domid=11
[2008-10-29 21:48:09 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:1568) XendDomainInfo.destroyDomain(11)
[2008-10-29 21:48:09 xend.XendDomainInfo 3104] ERROR (XendDomainInfo:1575) XendDomainInfo.destroy: xc.domain_destroy 
failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1573, in destroyDomain
    xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2008-10-29 21:48:09 xend 3104] ERROR (XendDomain:278) Restore failed
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 273, in domain_restore_fd
    return XendCheckpoint.restore(self, fd)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 219, in restore
    forkHelper(cmd, fd, handler.handler, True)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 339, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib64/xen/bin/xc_restore 15 11 1 2 0 0 0 failed
Comment 3 Jiri Denemark 2009-03-24 12:07:35 EDT
Uff, it seems I found a way of reproducing this bug reliably even with the most recent changes:

1. reboot a host
2. create a PV guest, 512MB, 4CPUs
3. migrate the guest 5-times to another host and back again
4. trying to live-migrate the guest fails
Comment 4 Chris Lalancette 2009-03-24 14:45:39 EDT
Jiri (or the original reporter),
     Since the tools seem to think this is a crash of the guest, it might be worthwhile to set up xend to automatically generate a core dump when this happens, so we can analyze it.  Just set "enable-dump" to yes in /etc/xen/xend-config.sxp, restart xend, and try the test again, and if it is truly a guest crash (as opposed to a problem with the tools), then we should get a nice core in /var/lib/xen/dump.

Chris Lalancette
Comment 5 Sean E. Millichamp 2009-03-24 15:09:42 EDT
I don't have the resources available anymore to perform any more analysis of this.

I can tell you that after updating the Xen hypervisor to the XenSource 3.2.1 hypervisor (with the same kernel version as reported above) we were unable to reproduce the problem.

Thanks for looking at this.
Comment 6 Jiri Denemark 2009-03-25 04:09:51 EDT
Chris,

The problem is a guest doesn't really crash in my case :-) It just fails to be migrated with the same Internal error: Couldn't enable shadow mode. So I guess whatever was causing a guest to crash has already been fixed in its kernel but the xc_save issue is still there.
Comment 7 Chris Lalancette 2009-03-25 06:06:04 EDT
Jiri,
     Ah, OK.  Well, given Sean's point about the 3.2.1 HV, I have to assume they also updated the userland tools to 3.2.1 as well.  In which case, we have both the HV and the userland tools to look at; probably the kernel is not really involved (anymore?).

Chris Lalancette
Comment 8 Jiri Denemark 2009-03-25 06:39:40 EDT
The following can be observed in xm dmesg when live migration fails:

(XEN) mm.c:625:d0 Non-privileged (6) attempt to map I/O space 00000001
Comment 9 Jiri Denemark 2009-03-26 10:45:36 EDT
Actually, this is a hypervisor issue...
Comment 10 Jiri Denemark 2009-03-26 10:47:45 EDT
Created attachment 336818 [details]
Proposed patch to fix this bug
Comment 11 Chris Lalancette 2009-04-06 04:23:10 EDT
Jiri, and/or the original reporter,
     The backported patch from Comment #10 is now in the virttest15 kernel.  Can someone give that a try, and just (re-)confirm that it fixes the issue?  Then we can go ahead and POST this.

Thanks,
Chris Lalancette
Comment 12 Chris Lalancette 2009-04-06 04:27:49 EDT
Err, I guess I should point out that the virttest kernel is available here:

http://people.redhat.com/clalance/virttest

Chris Lalancette
Comment 13 Jiri Denemark 2009-04-06 05:49:33 EDT
Chris,

My machine has been running virttest15 since Friday and I haven't seen this issue since then even though I tried to reproduce it.
Comment 14 Don Zickus 2009-04-27 11:58:02 EDT
in kernel-2.6.18-141.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 17 Chris Ward 2009-07-03 14:11:49 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 25 errata-xmlrpc 2009-09-02 04:47:18 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Note You need to log in before you can comment on or make changes to this bug.