Bug 1141249 - Xen guests may hang after migration or suspend/resume
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.2
Hardware: All Unspecified
Priority: unspecified  Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Vitaly Kuznetsov
QA Contact: Virtualization Bugs
Depends On:
Blocks: 1301891 1288337
Reported: 2014-09-12 10:16 EDT by Simon Rowe
Modified: 2016-11-03 04:46 EDT

See Also:
Fixed In Version: kernel-3.10.0-451.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 04:46:48 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Simon Rowe 2014-09-12 10:16:48 EDT
Xen guests may hang during resume after a migration or suspend. This predominantly affects HVM guests.

The following upstream commits (tagged for stable) fix the hangs.

"x86/xen: resume timer irqs early"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8d5999df35314607c38fbd6bdd709e25c3a4eeab

and

"xen/manage: Always freeze/thaw processes when suspend/resuming"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61a734d305e16944b42730ef582a7171dc733321
Comment 2 Simon Rowe 2014-09-12 12:36:12 EDT
Can this ticket be made public? We want to link to it in our release notes.
Comment 4 Simon Rowe 2014-10-08 11:17:34 EDT
Any update on this?
Comment 5 Vitaly Kuznetsov 2014-10-08 11:37:00 EDT
(In reply to Simon Rowe from comment #4)
> Any update on this?

The patches in question are in 3.16 stable and there is no obstacle to backporting them to RHEL7. Unfortunately we're late in the 7.1 release cycle, so I'd expect them to appear in 7.2.
Comment 9 Ronen Hod 2014-10-23 08:51:26 EDT
Not a regression.
Although Vitaly has a fix, we prefer to defer to 7.2 and avoid last-minute risks.
Comment 10 Lingfei Kong 2014-10-27 22:09:59 EDT
Hi Vitaly,
I tested 3.10.0-190.el7_bug1141249_nohang.x86_64; there are still problems when doing migration or save/restore on RHEL or Fedora20.

On rhel5.11 I can do a regular migration and save/restore, but the guest hangs after several repeated save/restore cycles (usually within 10) or several repeated migrations (I got the hang on the thirteenth).

On fedora20 xen4, I can do a regular save/restore but migration fails. Also, when I tested repeated save/restore, the guest hung on the 30th cycle.

I also ran an automation job with 3.10.0-190.el7_bug1141249_nohang.x86_64; all the test cases passed:
[ Intel | Linux | rhel5.11 x86_64 ] https://virtlab.englab.nay.redhat.com/job/92053/details/

I created a small test run to test this kernel on fedora20 xen4; all test cases passed except the migration and save/restore ones:
[ Acceptance test for RHEL-7.0-20140507.0 64bit HVM guest on Fedora Xen 4 - Manual ] https://tcms.engineering.redhat.com/run/190002/


Here are the error messages on fedora20 xen4:
#xl list 
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048     8     r-----    2481.3
hvm-7.0-64-1                                39  1019     4     -b----      18.4
[host-2]#xl migrate 39 localhost
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/526)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/526)
 Savefile contains xl domain config
WARNING: ignoring "kernel" directive for HVM guest. Use "firmware_override" instead if you really want a non-default firmware
WARNING: ignoring device_model directive.
WARNING: Use "device_model_override" instead if you really want a non-default device_model
xc: progress: Reloading memory pages: 0/1048575    0%
xc: progress: Reloading memory pages: 53248/1048575    5%
xc: progress: Reloading memory pages: 105472/1048575   10%
xc: progress: Reloading memory pages: 157696/1048575   15%
xc: progress: Reloading memory pages: 209920/1048575   20%
xc: progress: Reloading memory pages: 262144/1048575   25%
libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1075:domcreate_devmodel_started: device model did not start: -3
libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [20793] exited with error status 3
Migration failed, resuming at sender.
xc: error: Cannot resume uncooperative HVM guests: Internal error
libxl: error: libxl.c:408:libxl__domain_resume: xc_domain_resume failed for domain 39: Interrupted system call


#cat /var/log/xen/qemu-dm-hvm-7.0-64-1--incoming.log
domid: 40
-videoram option does not work with cirrus vga device model. Videoram set to 4M.
Using xvda for guest's hda
Strip off blktap sub-type prefix to /var/lib/xen/images/hvm-7.0-64-1.img (drv 'aio')
Using file /var/lib/xen/images/hvm-7.0-64-1.img in read-write mode
Watching /local/domain/0/device-model/40/logdirty/cmd
Watching /local/domain/0/device-model/40/command
Watching /local/domain/40/cpu
char device redirected to /dev/pts/7
qemu_map_cache_init nr_buckets = 10000 size 4194304
shared page at pfn feffd
buffered io page at pfn feffb
Guest uuid = 3c8c2c18-d29a-4f84-b5cd-d25a8d1682f7
Register xen platform.
Done register platform.
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error
cirrus vga map change while on lfb mode
mapping video RAM from f0000000
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.


Here are the error messages on RHEL5.11:
while :; do rm -rf /var/log/xen/* && xm save hvm-7.0-64-1 save && xm restore save; sleep 4;echo $i $(date +%T) | tee result3; ((i++)); done
70 18:03:10
71 18:03:44
72 18:04:17
73 18:04:50
74 18:05:23
75 18:05:56
76 18:06:29
77 18:07:03
78 18:07:37
79 18:08:11
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
80 18:08:40
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

On checking the guest, it had hung.
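
The loop above starts with `i` unset and keeps cycling after `xm save` has failed. A minimal sketch of a stricter variant follows; it is not the command that was run for this report. `run_cycles` and `save_restore` are hypothetical helper names, and the cycle count and pause are placeholders. It relies only on the `xm save` and `xm restore` commands already used above.

```shell
# run_cycles MAX PAUSE CMD [ARGS...]
# Run CMD up to MAX times, sleeping PAUSE seconds between cycles.
# Stops at the first failing cycle and returns 1. (Hypothetical helper.)
run_cycles() {
    rc_max=$1
    rc_pause=$2
    shift 2
    rc_i=1
    while [ "$rc_i" -le "$rc_max" ]; do
        if ! "$@"; then
            echo "cycle $rc_i failed at $(date +%T)" >&2
            return 1
        fi
        # Same "count time" line format as the loop above.
        echo "$rc_i $(date +%T)"
        sleep "$rc_pause"
        rc_i=$((rc_i + 1))
    done
}

# One save/restore cycle: save_restore DOMAIN CHECKPOINT_FILE
save_restore() {
    xm save "$1" "$2" && xm restore "$2"
}
```

On the Dom0 host this could be invoked as `run_cycles 100 10 save_restore hvm-7.0-64-1 save`. The last number printed is then the last cycle that completed before the failure.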
Comment 11 Vitaly Kuznetsov 2014-10-30 09:40:08 EDT
(In reply to Lingfei Kong from comment #10)
> Hi Vitaly,
> I tested 3.10.0-190.el7_bug1141249_nohang.x86_64; there are still problems
> when doing migration or save/restore on RHEL or Fedora20.
> 
> On rhel5.11 I can do a regular migration and save/restore, but the guest
> hangs after several repeated save/restore cycles (usually within 10) or
> several repeated migrations (I got the hang on the thirteenth).
> 
> On fedora20 xen4, I can do a regular save/restore but migration fails.
> Also, when I tested repeated save/restore, the guest hung on the 30th
> cycle.
> 

I think you hit a different issue here - qemu-dm died for some reason and migration failed:

libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)

Make sure you have enough memory - both in Xen and in Dom0. You can also try doing save/restore with longer sleeps between all operations as I can see the following in your log:

xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error

It would also be great if you can confirm we're not seeing a regression here (so a guest without these patches behaves the same). To a certain extent we're fine with 'SanityOnly' here.

Thanks!
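
One way to act on the memory advice is to read the hypervisor's free memory before each migration attempt. A minimal sketch, assuming `xl info` prints a `free_memory : <MiB>` line; `xen_free_mib` is a hypothetical helper name, and the sample text in the comment is illustrative, not output from this report.

```shell
# xen_free_mib: read `xl info` output on stdin and print the free_memory
# value (MiB). Hypothetical helper. Illustrative input line:
#   free_memory            : 5912
xen_free_mib() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}
```

Typical use on the host would be `xl info | xen_free_mib`, comparing the result with the guest's memory size (1019 in the `xl list` output quoted in comment 10). Dom0's own free memory has to be checked separately, for example with `free -m`.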
Comment 12 Lingfei Kong 2014-11-04 21:16:25 EST
(In reply to Vitaly Kuznetsov from comment #11)
> 
> It would also be great if you can confirm we're not seeing a regression here
> (so a guest without these patches behaves the same). To a certain extent
> we're fine with 'SanityOnly' here.
> 
> Thanks!

A guest without these patches also has this problem. I added `sleep 10` between the save/restore and migrate operations; the rhel7.0 guest still hung after a while on both rhel5.11 and fedora20. For save/restore it hung on the 52nd cycle, for migration on the 305th. I found no other problems during the test.
Comment 13 Simon Rowe 2015-03-06 05:41:44 EST
This changeset

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/xen?id=72978b2fe2f2cdf9f319c6c6dcdbe92b38de2be2

is also needed to fully resolve this.

Can these two changes be queued for 7.2 now?
Comment 16 Simon Rowe 2015-09-16 11:50:54 EDT
The code fixes have not yet been included in 7.2 Beta.
Comment 18 Rafael Aquini 2016-06-23 22:37:26 EDT
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Comment 20 Rafael Aquini 2016-06-24 15:28:30 EDT
Patch(es) available on kernel-3.10.0-451.el7
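
A guest can be checked against this build number with a small script. This is a sketch under stated assumptions: `has_fix` is a hypothetical helper, it only understands RHEL 7 style release strings of the form `3.10.0-<build>...`, and it does not account for fixes backported to kernels with a lower build number.

```shell
# has_fix RELEASE: succeed if RELEASE is a 3.10.0-<build> kernel with
# build >= 451 (the "Fixed In Version" of this bug). Hypothetical helper.
has_fix() {
    case $1 in
        3.10.0-*) ;;
        *) return 2 ;;                   # not a RHEL 7 style release string
    esac
    hf_build=${1#3.10.0-}                # drop the base version
    hf_build=${hf_build%%[!0-9]*}        # keep only the leading digits
    [ -n "$hf_build" ] && [ "$hf_build" -ge 451 ]
}
```

Inside the guest, `has_fix "$(uname -r)" && echo "kernel includes the fix"` reports whether the running kernel is at or above kernel-3.10.0-451.el7.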
Comment 24 errata-xmlrpc 2016-11-03 04:46:48 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html
