Description of problem:

I installed a simple domU guest running the latest updates. If vcpus=1, I can perform live migrations between two dom0 hosts apparently as many times as I want with no problems. When I increased vcpus to 2, I was able to perform a live migration from dom0 A to dom0 B, but when I then ran a live migration from dom0 B back to dom0 A, my domU crashed on dom0 B. While I reproduced this with vcpus=2 for this bug report, a coworker tells me it is even more reproducible when vcpus > 2.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-92.1.13.el5 (on both hosts and guest)

How reproducible:
Frequently.

Steps to Reproduce:
1. Install a fully updated paravirt domU guest
2. virsh setvcpus livetest 2
3. On dom0a: virsh start livetest
4. Run a while loop in an SSH session to keep livetest doing something I can monitor during the migration. I used 'while true ; do date +%s ; sleep .1 ; done'
5. virsh migrate --live livetest xen+ssh://dom0b/
6. On dom0b: virsh migrate --live livetest xen+ssh://dom0a/
Repeat steps 5 and 6 until it crashes.

Actual results:
# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
 13 migrating-livetest   crashed

Expected results:
Live migration should work for any number of vcpus - or the tools should prevent you from trying.

Additional info:
dom0a and dom0b are Dell R200s with full BIOS updates. The shared storage is a clvmd-managed volume group. With vcpus=1 I can perform an apparently unlimited number of live migrations with otherwise identical configurations. Following the failed migration, the Xen environment on dom0b seemed damaged; I had to 'service xend restart' in order to be able to continue performing some normal Xen operations.

Logs during a failure when attempting migration from dom0b to dom0a:

dom0b:
[2008-10-29 21:48:08 xend 3102] DEBUG (balloon:133) Balloon: 356 KiB free; 0 to scrub; need 4096; retries: 20.
[2008-10-29 21:48:08 xend 3102] DEBUG (balloon:148) Balloon: setting dom0 target to 2986 MiB.
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:1079) Setting memory target of domain Domain-0 (0) to 2986 MiB.
[2008-10-29 21:48:08 xend 3102] DEBUG (balloon:127) Balloon: 4452 KiB free; need 4096; done.
[2008-10-29 21:48:08 xend 3102] DEBUG (XendCheckpoint:89) [xc_save]: /usr/lib64/xen/bin/xc_save 4 4 0 0 1
[2008-10-29 21:48:08 xend 3102] INFO (XendCheckpoint:351) ERROR Internal error: Couldn't enable shadow mode
[2008-10-29 21:48:08 xend 3102] INFO (XendCheckpoint:351) Save exit rc=1
[2008-10-29 21:48:08 xend 3102] ERROR (XendCheckpoint:133) Save failed on domain livetest (4).
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 110, in save
    forkHelper(cmd, fd, saveInputHandler, False)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 339, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib64/xen/bin/xc_save 4 4 0 0 1 failed
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:1601) XendDomainInfo.resumeDomain(4)
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:08 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] WARNING (XendDomainInfo:923) Domain has crashed: name=migrating-livetest id=4.
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] INFO (XendDomainInfo:1722) Dev 51712 still active, looping...
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:1614) XendDomainInfo.resumeDomain: devices released
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] DEBUG (XendDomainInfo:791) Storing domain details: {'console/ring-ref': '187233', 'console/port': '2', 'name': 'migrating-livetest', 'console/limit': '1048576', 'vm': '/vm/7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'domid': '4', 'cpu/0/availability': 'online', 'memory/target': '524288', 'store/ring-ref': '171941', 'cpu/1/availability': 'online', 'store/port': '1'}
[2008-10-29 21:48:09 xend.XendDomainInfo 3102] ERROR (XendDomainInfo:1631) XendDomainInfo.resume: xc.domain_resume failed on domain 4.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1625, in resumeDomain
    self.createDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1682, in createDevices
    self.createDevice(n, c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1155, in createDevice
    return self.getDeviceController(deviceClass).createDevice(devconfig)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 106, in createDevice
    raise VmError("Device %s is already connected." % dev_str)
VmError: Device 0 (vkbd) is already connected.
[2008-10-29 21:48:09 xend 3102] DEBUG (XendCheckpoint:136) XendCheckpoint.save: resumeDomain

On dom0a:

[2008-10-29 21:48:08 xend 3104] DEBUG (tcp:66) Match dom0b.example.com ^.*\.example\.com$
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:275) XendDomainInfo.restore(['domain', ['domid', '4'], ['uuid', '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f'], ['vcpus', '2'], ['vcpu_avail', '3'], ['cpu_weight', '1.0'], ['memory', '512'], ['shadow_memory', '0'], ['maxmem', '512'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name', 'livetest'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['kernel', '/var/lib/xen/boot_kernel.mmh2Kr'], ['ramdisk', '/var/lib/xen/boot_ramdisk.NGpVbu'], ['args', 'ro root=LABEL=/ console=tty0 console=xvc0,115200n8r']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'brbond0'], ['mac', '00:16:3e:16:8e:40']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/mapper/xenguests-livetest--disk0'], ['mode', 'w']]], ['device', ['vkbd', ['backend', '0']]], ['device', ['vfb', ['backend', '0'], ['type', 'vnc'], ['vncunused', '1']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cpu_time', '2.288732157'], ['online_vcpus', '2'], ['up_time', '206.301565886'], ['start_time', '1225331082.4'], ['store_mfn', '171941'], ['console_mfn', '187233']])
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:306) parseConfig: config is ['domain', ['domid', '4'], ['uuid', '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f'], ['vcpus', '2'], ['vcpu_avail', '3'], ['cpu_weight', '1.0'], ['memory', '512'], ['shadow_memory', '0'], ['maxmem', '512'], ['bootloader', '/usr/bin/pygrub'], ['features'], ['name', 'livetest'], ['on_poweroff', 'destroy'], ['on_reboot', 'restart'], ['on_crash', 'restart'], ['image', ['linux', ['kernel', '/var/lib/xen/boot_kernel.mmh2Kr'], ['ramdisk', '/var/lib/xen/boot_ramdisk.NGpVbu'], ['args', 'ro root=LABEL=/ console=tty0 console=xvc0,115200n8r']]], ['device', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'brbond0'], ['mac', '00:16:3e:16:8e:40']]], ['device', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/mapper/xenguests-livetest--disk0'], ['mode', 'w']]], ['device', ['vkbd', ['backend', '0']]], ['device', ['vfb', ['backend', '0'], ['type', 'vnc'], ['vncunused', '1']]], ['state', '-b----'], ['shutdown_reason', 'poweroff'], ['cpu_time', '2.288732157'], ['online_vcpus', '2'], ['up_time', '206.301565886'], ['start_time', '1225331082.4'], ['store_mfn', '171941'], ['console_mfn', '187233']]
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:411) parseConfig: result is {'shadow_memory': 0, 'start_time': 1225331082.4000001, 'uuid': '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'on_crash': 'restart', 'on_reboot': 'restart', 'localtime': None, 'image': ['linux', ['kernel', '/var/lib/xen/boot_kernel.mmh2Kr'], ['ramdisk', '/var/lib/xen/boot_ramdisk.NGpVbu'], ['args', 'ro root=LABEL=/ console=tty0 console=xvc0,115200n8r']], 'on_poweroff': 'destroy', 'bootloader_args': None, 'cpus': None, 'name': 'livetest', 'backend': [], 'vcpus': 2, 'cpu_weight': 1.0, 'features': None, 'vcpu_avail': 3, 'memory': 512, 'device': [('vif', ['vif', ['backend', '0'], ['script', 'vif-bridge'], ['bridge', 'brbond0'], ['mac', '00:16:3e:16:8e:40']]), ('vbd', ['vbd', ['backend', '0'], ['dev', 'xvda:disk'], ['uname', 'phy:/dev/mapper/xenguests-livetest--disk0'], ['mode', 'w']]), ('vkbd', ['vkbd', ['backend', '0']]), ('vfb', ['vfb', ['backend', '0'], ['type', 'vnc'], ['vncunused', '1']])], 'bootloader': '/usr/bin/pygrub', 'cpu': None, 'maxmem': 512}
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:1350) XendDomainInfo.construct: None
[2008-10-29 21:48:08 xend 3104] DEBUG (balloon:127) Balloon: 528460 KiB free; need 2048; done.
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:756) Storing VM details: {'shadow_memory': '0', 'uuid': '7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'on_reboot': 'restart', 'start_time': '1225331082.4', 'on_poweroff': 'destroy', 'name': 'livetest', 'vcpus': '2', 'vcpu_avail': '3', 'memory': '512', 'on_crash': 'restart', 'image': "(linux (kernel /var/lib/xen/boot_kernel.mmh2Kr) (ramdisk /var/lib/xen/boot_ramdisk.NGpVbu) (args 'ro root=LABEL=/ console=tty0 console=xvc0,115200n8r'))", 'maxmem': '512'}
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'state': '1', 'backend-id': '0', 'backend': '/local/domain/0/backend/vkbd/11/0'} to /local/domain/11/device/vkbd/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'frontend-id': '11', 'domain': 'livetest', 'frontend': '/local/domain/11/device/vkbd/0', 'state': '1', 'online': '1'} to /local/domain/0/backend/vkbd/11/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'mac': '00:16:3e:16:8e:40', 'handle': '0', 'state': '1', 'backend': '/local/domain/0/backend/vif/11/0'} to /local/domain/11/device/vif/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'bridge': 'brbond0', 'domain': 'livetest', 'handle': '0', 'script': '/etc/xen/scripts/vif-bridge', 'state': '1', 'frontend': '/local/domain/11/device/vif/0', 'mac': '00:16:3e:16:8e:40', 'online': '1', 'frontend-id': '11'} to /local/domain/0/backend/vif/11/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (blkif:24) exception looking up device number for xvda: [Errno 2] No such file or directory: '/dev/xvda'
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'backend-id': '0', 'virtual-device': '51712', 'device-type': 'disk', 'state': '1', 'backend': '/local/domain/0/backend/vbd/11/51712'} to /local/domain/11/device/vbd/51712.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'domain': 'livetest', 'frontend': '/local/domain/11/device/vbd/51712', 'format': 'raw', 'dev': 'xvda', 'state': '1', 'params': '/dev/mapper/xenguests-livetest--disk0', 'mode': 'w', 'online': '1', 'frontend-id': '11', 'type': 'phy'} to /local/domain/0/backend/vbd/11/51712.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:110) DevController: writing {'state': '1', 'backend-id': '0', 'backend': '/local/domain/0/backend/vfb/11/0'} to /local/domain/11/device/vfb/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (DevController:112) DevController: writing {'vncunused': '1', 'domain': 'livetest', 'frontend': '/local/domain/11/device/vfb/0', 'online': '1', 'frontend-id': '11', 'type': 'vnc', 'state': '1'} to /local/domain/0/backend/vfb/11/0.
[2008-10-29 21:48:08 xend 3104] DEBUG (vfbif:70) No VNC passwd configured for vfb access
[2008-10-29 21:48:08 xend 3104] DEBUG (vfbif:11) Spawn: ['/usr/lib64/xen/bin/qemu-dm', '-M', 'xenpv', '-d', '11', '-domain-name', 'livetest', '-vnc', '127.0.0.1:0', '-vncunused']
[2008-10-29 21:48:08 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:791) Storing domain details: {'console/port': '2', 'name': 'livetest', 'console/limit': '1048576', 'vm': '/vm/7cc7e4ad-4006-5bd3-83a3-5a05f845cf8f', 'domid': '11', 'cpu/0/availability': 'online', 'memory/target': '524288', 'cpu/1/availability': 'online', 'store/port': '1'}
[2008-10-29 21:48:08 xend 3104] DEBUG (XendCheckpoint:198) restore:shadow=0x0, _static_max=0x200, _static_min=0x200,
[2008-10-29 21:48:08 xend 3104] DEBUG (balloon:127) Balloon: 528452 KiB free; need 524288; done.
[2008-10-29 21:48:08 xend 3104] DEBUG (XendCheckpoint:215) [xc_restore]: /usr/lib64/xen/bin/xc_restore 15 11 1 2 0 0 0
[2008-10-29 21:48:09 xend 3104] INFO (XendCheckpoint:351) ERROR Internal error: read: p2m_size
[2008-10-29 21:48:09 xend 3104] INFO (XendCheckpoint:351) Restore exit with rc=1
[2008-10-29 21:48:09 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:1560) XendDomainInfo.destroy: domid=11
[2008-10-29 21:48:09 xend.XendDomainInfo 3104] DEBUG (XendDomainInfo:1568) XendDomainInfo.destroyDomain(11)
[2008-10-29 21:48:09 xend.XendDomainInfo 3104] ERROR (XendDomainInfo:1575) XendDomainInfo.destroy: xc.domain_destroy failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1573, in destroyDomain
    xc.domain_destroy(self.domid)
Error: (3, 'No such process')
[2008-10-29 21:48:09 xend 3104] ERROR (XendDomain:278) Restore failed
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomain.py", line 273, in domain_restore_fd
    return XendCheckpoint.restore(self, fd)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 219, in restore
    forkHelper(cmd, fd, handler.handler, True)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 339, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib64/xen/bin/xc_restore 15 11 1 2 0 0 0 failed
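For anyone trying to reproduce this, the ping-pong migration from the steps in the report can be sketched as a small shell loop. This is only a sketch: the guest name "livetest" and the hosts dom0a/dom0b are taken from this report, and the script prints the virsh commands (a dry run) rather than executing them, since actually running them requires two Xen hosts with shared storage.

```shell
#!/bin/sh
# Ping-pong live migration loop (steps 5 and 6 of the reproduction).
# Names below come from the bug report; adjust for your environment.
guest=livetest
host_a=dom0a
host_b=dom0b

# One round trip: step 5 (dom0a -> dom0b) followed by step 6 (back again).
round_trip_cmds() {
    printf 'virsh migrate --live %s xen+ssh://%s/\n' "$guest" "$host_b"
    printf 'virsh migrate --live %s xen+ssh://%s/\n' "$guest" "$host_a"
}

# Dry run: print five round trips. To actually migrate, pipe each
# printed line to sh, e.g.:  round_trip_cmds | sh -ex
i=1
while [ "$i" -le 5 ]; do
    round_trip_cmds
    i=$((i + 1))
done
```

With the original bug present, the guest typically ends up in the "crashed" state within a few round trips when it has more than one vcpu.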
Uff, it seems I found a way of reproducing this bug reliably, even with the most recent changes:
1. Reboot a host.
2. Create a PV guest with 512 MB of memory and 4 vcpus.
3. Migrate the guest 5 times to another host and back again.
4. Trying to live-migrate the guest again fails.
Jiri (or the original reporter),

Since the tools seem to think this is a crash of the guest, it might be worthwhile to set up xend to automatically generate a core dump when this happens, so we can analyze it. Just set "enable-dump" to yes in /etc/xen/xend-config.sxp, restart xend, and try the test again. If it is truly a guest crash (as opposed to a problem with the tools), then we should get a nice core in /var/lib/xen/dump.

Chris Lalancette
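For reference, the suggestion above amounts to a one-line change in xend's configuration file. A minimal sketch of the relevant S-expression (the option name and the dump directory are as named in the comment above; the surrounding comments are mine):

```
# /etc/xen/xend-config.sxp
# Have xend dump core images of crashed guests into /var/lib/xen/dump.
# Apply with: service xend restart
(enable-dump yes)
```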
I no longer have the resources available to perform any further analysis of this. I can tell you that after updating the Xen hypervisor to the XenSource 3.2.1 hypervisor (with the same kernel version as reported above), we were unable to reproduce the problem. Thanks for looking at this.
Chris,

The problem is that the guest doesn't really crash in my case :-) It just fails to be migrated, with the same "Internal error: Couldn't enable shadow mode". So I guess whatever was causing the guest to crash has already been fixed in its kernel, but the xc_save issue is still there.
Jiri,

Ah, OK. Well, given Sean's point about the 3.2.1 HV, I have to assume they updated the userland tools to 3.2.1 as well. In which case, we have both the HV and the userland tools to look at; the kernel is probably not really involved (anymore?).

Chris Lalancette
The following can be observed in xm dmesg when live migration fails:

(XEN) mm.c:625:d0 Non-privileged (6) attempt to map I/O space 00000001
Actually, this is a hypervisor issue...
Created attachment 336818 [details] Proposed patch to fix this bug
Jiri, and/or the original reporter,

The backported patch from Comment #10 is now in the virttest15 kernel. Can someone give that a try, and just (re-)confirm that it fixes the issue? Then we can go ahead and POST this.

Thanks,
Chris Lalancette
Err, I guess I should point out that the virttest kernel is available here: http://people.redhat.com/clalance/virttest Chris Lalancette
Chris, My machine has been running virttest15 since Friday and I haven't seen this issue since then even though I tried to reproduce it.
in kernel-2.6.18-141.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results and, if available, update the Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html