Xen guests may hang during resume after a migration or suspend. This predominantly affects HVM guests. The following upstream commits (tagged for stable) fix the hangs: "x86/xen: resume timer irqs early" https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8d5999df35314607c38fbd6bdd709e25c3a4eeab and "xen/manage: Always freeze/thaw processes when suspend/resuming" https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61a734d305e16944b42730ef582a7171dc733321
Can this ticket be made public? We want to link to it in our release notes.
Any update on this?
(In reply to Simon Rowe from comment #4)
> Any update on this?

The patches in question are in 3.16 stable and there is no obstacle to backporting them to RHEL7. Unfortunately we're late in the 7.1 release cycle, so I'd expect them to appear in 7.2.
Not a regression. Although Vitaly has a fix, we prefer to defer to 7.2 and avoid last minute risks.
Hi Vitaly,

I tested 3.10.0-190.el7_bug1141249_nohang.x86_64, and there are still problems with migration and save/restore on RHEL and Fedora20.

On rhel5.11 I can do a regular migration and save/restore, but the guest hangs after several repeated save/restore cycles (usually within 10) or several repeated migrations (I got the hang on the thirteenth).

On fedora20 xen4 I can do a regular save/restore, but migration fails. Also, when I tested save/restore repeatedly, the guest hung on the 30th iteration.

I also ran an automation job with 3.10.0-190.el7_bug1141249_nohang.x86_64, and all the test cases passed:
[ Intel | Linux | rhel5.11 x86_64 ] https://virtlab.englab.nay.redhat.com/job/92053/details/

I created a small test run to test this kernel on Fedora20 xen4. All test cases passed except the migration and save/restore test cases:
[ Acceptance test for RHEL-7.0-20140507.0 64bit HVM guest on Fedora Xen 4 - Manual ] https://tcms.engineering.redhat.com/run/190002/

Here are the error messages on fedora20 xen4:

#xl list
Name ID Mem VCPUs State Time(s)
Domain-0 0 2048 8 r----- 2481.3
hvm-7.0-64-1 39 1019 4 -b---- 18.4
[host-2]#xl migrate 39 localhost
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/526)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/526)
Savefile contains xl domain config
WARNING: ignoring "kernel" directive for HVM guest. Use "firmware_override" instead if you really want a non-default firmware
WARNING: ignoring device_model directive.
WARNING: Use "device_model_override" instead if you really want a non-default device_model
xc: progress: Reloading memory pages: 0/1048575 0%
xc: progress: Reloading memory pages: 53248/1048575 5%
xc: progress: Reloading memory pages: 105472/1048575 10%
xc: progress: Reloading memory pages: 157696/1048575 15%
xc: progress: Reloading memory pages: 209920/1048575 20%
xc: progress: Reloading memory pages: 262144/1048575 25%
libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1075:domcreate_devmodel_started: device model did not start: -3
libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [20793] exited with error status 3
Migration failed, resuming at sender.
xc: error: Cannot resume uncooperative HVM guests: Internal error
libxl: error: libxl.c:408:libxl__domain_resume: xc_domain_resume failed for domain 39: Interrupted system call

#cat /var/log/xen/qemu-dm-hvm-7.0-64-1--incoming.log
domid: 40
-videoram option does not work with cirrus vga device model. Videoram set to 4M.
Using xvda for guest's hda
Strip off blktap sub-type prefix to /var/lib/xen/images/hvm-7.0-64-1.img (drv 'aio')
Using file /var/lib/xen/images/hvm-7.0-64-1.img in read-write mode
Watching /local/domain/0/device-model/40/logdirty/cmd
Watching /local/domain/0/device-model/40/command
Watching /local/domain/40/cpu
char device redirected to /dev/pts/7
qemu_map_cache_init nr_buckets = 10000 size 4194304
shared page at pfn feffd
buffered io page at pfn feffb
Guest uuid = 3c8c2c18-d29a-4f84-b5cd-d25a8d1682f7
Register xen platform.
Done register platform.
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error
cirrus vga map change while on lfb mode
mapping video RAM from f0000000
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.

Here are the error messages on RHEL5.11:

while :; do rm -rf /var/log/xen/* && xm save hvm-7.0-64-1 save && xm restore save; sleep 4;echo $i $(date +%T) | tee result3; ((i++)); done
70 18:03:10
71 18:03:44
72 18:04:17
73 18:04:50
74 18:05:23
75 18:05:56
76 18:06:29
77 18:07:03
78 18:07:37
79 18:08:11
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>
Save a domain state to restore later.
80 18:08:40
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

I checked the guest at this point, and it had hung.
(In reply to Lingfei Kong from comment #10)
> Hi Vitaly,
> I test 3.10.0-190.el7_bug1141249_nohang.x86_64, there still have problems
> when do migration or save/restore on RHEL or Fedora20.
>
> On rhel5.11 i can do regular migration and save/restore. But the guest will
> hang after several (usually within 10 times) save/restore and several repeat
> migration (I get the hang in the thirteenth time).
>
> On fedora20 xen4, i can do regular save/restore operation but failed to do
> migration operation. Also when i test save/restore the guest hang at the 30
> times.
>

I think you hit a different issue here - qemu-dm died for some reason and migration failed:

libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)

Make sure you have enough memory - both in Xen and in Dom0. You can also try doing save/restore with longer sleeps between all operations, as I can see the following in your log:

xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error

It would also be great if you can confirm we're not seeing a regression here (so a guest without these patches behaves the same). To a certain extent we're fine with 'SanityOnly' here.

Thanks!
(In reply to Vitaly Kuznetsov from comment #11)
>
> It would also be great if you can confirm we're not seeing a regression here
> (so a guest without these patches behaves the same). To a certain extent
> we're fine with 'SanityOnly' here.
>
> Thanks!

A guest without these patches also has this problem. I added `sleep 10` between the save/restore and migrate operations, and the rhel7.0 guest still hung after a while on both rhel5.11 and fedora20. With save/restore it hung on the 52nd iteration; with migrate it hung on the 305th. No other problems were found during the test.
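
For anyone wanting to reproduce the iteration counts above, the repeat-until-failure test could be sketched roughly as follows. This is only an illustration, not the script that was actually run: the function names `run_cycles` and `save_restore` and the checkpoint path are hypothetical, and it assumes that a hung guest shows up as a failing `xm save`, as in the RHEL5.11 output in comment #10.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly, pausing between cycles,
# and stop at the first cycle whose command exits non-zero.
#   $1 = maximum number of cycles
#   $2 = pause between cycles, in seconds
#   remaining arguments = command to run each cycle
run_cycles() {
    rc_max=$1
    rc_pause=$2
    shift 2
    rc_i=1
    while [ "$rc_i" -le "$rc_max" ]; do
        if ! "$@"; then
            echo "cycle $rc_i failed at $(date +%T)"
            return 1
        fi
        echo "cycle $rc_i ok at $(date +%T)"
        sleep "$rc_pause"
        rc_i=$((rc_i + 1))
    done
    return 0
}

# Hypothetical helper: one save/restore cycle using the xm commands
# from comment #10.
#   $1 = domain name, $2 = checkpoint file
save_restore() {
    xm save "$1" "$2" && xm restore "$2"
}

# Example invocation, left commented out so that sourcing this file
# does not touch any guest. The checkpoint path is an assumption.
# run_cycles 500 10 save_restore hvm-7.0-64-1 /tmp/hvm-7.0-64-1.save
```

The last "cycle N failed" line gives the iteration at which the cycle broke. The guest still has to be checked by hand to confirm that it really hung rather than failing for another reason, such as the device model problem discussed in comment #11.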
This changeset https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/xen?id=72978b2fe2f2cdf9f319c6c6dcdbe92b38de2be2 is also needed to fully resolve this. Can these two changes be queued for 7.2 now?
The code fixes have not yet been included in 7.2 Beta.
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-451.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2574.html