Bug 1141249

Summary: Xen guests may hang after migration or suspend/resume
Product: Red Hat Enterprise Linux 7 Reporter: Simon Rowe <simon.rowe>
Component: kernel Assignee: Vitaly Kuznetsov <vkuznets>
kernel sub component: Xen QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: ailan, knoel, leiwang, linl, vkuznets
Version: 7.2   
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-3.10.0-451.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-03 08:46:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1288337, 1301891    

Description Simon Rowe 2014-09-12 14:16:48 UTC
Xen guests may hang during resume after a migration or suspend. This predominantly affects HVM guests.

The following upstream commits (tagged for stable) fix the hangs.

"x86/xen: resume timer irqs early"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8d5999df35314607c38fbd6bdd709e25c3a4eeab

and

"xen/manage: Always freeze/thaw processes when suspend/resuming"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61a734d305e16944b42730ef582a7171dc733321

Comment 2 Simon Rowe 2014-09-12 16:36:12 UTC
Can this ticket be made public? We want to link to it in our release notes.

Comment 4 Simon Rowe 2014-10-08 15:17:34 UTC
Any update on this?

Comment 5 Vitaly Kuznetsov 2014-10-08 15:37:00 UTC
(In reply to Simon Rowe from comment #4)
> Any update on this?

The patches in question are in 3.16 stable and there is no obstacle to backporting them to RHEL7. Unfortunately we're late in the 7.1 release cycle, so I'd expect them to appear in 7.2.

Comment 9 Ronen Hod 2014-10-23 12:51:26 UTC
Not a regression.
Although Vitaly has a fix, we prefer to defer to 7.2 and avoid last-minute risks.

Comment 10 Lingfei Kong 2014-10-28 02:09:59 UTC
Hi Vitaly,
I tested 3.10.0-190.el7_bug1141249_nohang.x86_64; there are still problems with migration and save/restore on RHEL and Fedora 20.

On RHEL 5.11 a single migration or save/restore works, but the guest hangs after repeated save/restore cycles (usually within 10 iterations) or repeated migrations (I got the hang on the thirteenth migration).

On Fedora 20 with Xen 4, a single save/restore works but migration fails. When I repeated save/restore, the guest hung on the 30th iteration.

I also ran an automation job with 3.10.0-190.el7_bug1141249_nohang.x86_64; all test cases passed:
[ Intel | Linux | rhel5.11 x86_64 ] https://virtlab.englab.nay.redhat.com/job/92053/details/

I created a small test run for this kernel on Fedora 20 with Xen 4; all test cases passed except the migration and save/restore test cases:
[ Acceptance test for RHEL-7.0-20140507.0 64bit HVM guest on Fedora Xen 4 - Manual ] https://tcms.engineering.redhat.com/run/190002/


Here are the error messages on Fedora 20 with Xen 4:
#xl list 
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048     8     r-----    2481.3
hvm-7.0-64-1                                39  1019     4     -b----      18.4
[host-2]#xl migrate 39 localhost
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/526)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/526)
 Savefile contains xl domain config
WARNING: ignoring "kernel" directive for HVM guest. Use "firmware_override" instead if you really want a non-default firmware
WARNING: ignoring device_model directive.
WARNING: Use "device_model_override" instead if you really want a non-default device_model
xc: progress: Reloading memory pages: 0/1048575    0%
xc: progress: Reloading memory pages: 53248/1048575    5%
xc: progress: Reloading memory pages: 105472/1048575   10%
xc: progress: Reloading memory pages: 157696/1048575   15%
xc: progress: Reloading memory pages: 209920/1048575   20%
xc: progress: Reloading memory pages: 262144/1048575   25%
libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1075:domcreate_devmodel_started: device model did not start: -3
libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [20793] exited with error status 3
Migration failed, resuming at sender.
xc: error: Cannot resume uncooperative HVM guests: Internal error
libxl: error: libxl.c:408:libxl__domain_resume: xc_domain_resume failed for domain 39: Interrupted system call


#cat /var/log/xen/qemu-dm-hvm-7.0-64-1--incoming.log
domid: 40
-videoram option does not work with cirrus vga device model. Videoram set to 4M.
Using xvda for guest's hda
Strip off blktap sub-type prefix to /var/lib/xen/images/hvm-7.0-64-1.img (drv 'aio')
Using file /var/lib/xen/images/hvm-7.0-64-1.img in read-write mode
Watching /local/domain/0/device-model/40/logdirty/cmd
Watching /local/domain/0/device-model/40/command
Watching /local/domain/40/cpu
char device redirected to /dev/pts/7
qemu_map_cache_init nr_buckets = 10000 size 4194304
shared page at pfn feffd
buffered io page at pfn feffb
Guest uuid = 3c8c2c18-d29a-4f84-b5cd-d25a8d1682f7
Register xen platform.
Done register platform.
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error
cirrus vga map change while on lfb mode
mapping video RAM from f0000000
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.


Here are the error messages on RHEL5.11:
while :; do rm -rf /var/log/xen/* && xm save hvm-7.0-64-1 save && xm restore save; sleep 4;echo $i $(date +%T) | tee result3; ((i++)); done
70 18:03:10
71 18:03:44
72 18:04:17
73 18:04:50
74 18:05:23
75 18:05:56
76 18:06:29
77 18:07:03
78 18:07:37
79 18:08:11
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
80 18:08:40
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

On checking the guest at this point, it had hung.
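
For anyone repeating this test, the loop above can be restructured so that it stops at the first failing cycle and reports the iteration number. This is only a sketch, not the script that was actually run: the function name cycle_until_failure and the SAVE_CMD, RESTORE_CMD and CHECKPOINT variables are hypothetical, and the defaults simply reuse the xm commands and the "save" file name from the loop above.

```shell
#!/bin/sh
# Sketch only: repeat save/restore until a cycle fails, then report which one.
# SAVE_CMD, RESTORE_CMD and CHECKPOINT are hypothetical knobs; the defaults
# mirror the xm commands used in the loop above.
SAVE_CMD=${SAVE_CMD:-xm save}
RESTORE_CMD=${RESTORE_CMD:-xm restore}
CHECKPOINT=${CHECKPOINT:-save}

# cycle_until_failure DOMAIN MAX_CYCLES PAUSE_SECONDS
# Returns 0 if all cycles succeed, 1 at the first failed save or restore.
cycle_until_failure() {
    domain=$1
    max=$2
    pause=$3
    i=1
    while [ "$i" -le "$max" ]; do
        # The commands are left unquoted on purpose so "xm save" splits
        # into a command and its subcommand.
        if ! $SAVE_CMD "$domain" "$CHECKPOINT" || ! $RESTORE_CMD "$CHECKPOINT"; then
            echo "cycle $i failed at $(date +%T)"
            return 1
        fi
        echo "cycle $i ok at $(date +%T)"
        sleep "$pause"
        i=$((i + 1))
    done
    return 0
}

# Example (not executed here): cycle_until_failure hvm-7.0-64-1 100 4
```

Unlike the original loop, this does not delete /var/log/xen/* between cycles, so the logs from the failing iteration stay available. A failed save is not proof of a guest hang by itself; the guest still has to be checked by hand.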

Comment 11 Vitaly Kuznetsov 2014-10-30 13:40:08 UTC
(In reply to Lingfei Kong from comment #10)
> Hi Vitaly,
> I tested 3.10.0-190.el7_bug1141249_nohang.x86_64; there are still problems
> with migration and save/restore on RHEL and Fedora 20.
> 
> On RHEL 5.11 a single migration or save/restore works, but the guest hangs
> after repeated save/restore cycles (usually within 10 iterations) or
> repeated migrations (I got the hang on the thirteenth migration).
> 
> On Fedora 20 with Xen 4, a single save/restore works but migration fails.
> When I repeated save/restore, the guest hung on the 30th iteration.
> 

I think you hit a different issue here - qemu-dm died for some reason and migration failed:

libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)

Make sure you have enough memory - both in Xen and in Dom0. You can also try doing save/restore with longer sleeps between all operations as I can see the following in your log:

xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error

It would also be great if you can confirm we're not seeing a regression here (so a guest without these patches behaves the same). To a certain extent we're fine with 'SanityOnly' here.

Thanks!
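
One possible way to do the memory check suggested above before each run is to read the hypervisor's free memory from the toolstack's info output and Dom0's free memory from /proc/meminfo. This is a sketch under assumptions: the helper names xen_free_mb and dom0_free_mb are hypothetical, and it assumes the info output contains a "free_memory : N" line in MB, as `xl info` prints.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and read their input on stdin,
# so they can be fed from a live command or from a saved log.

# Print the hypervisor's free memory in MB from "xl info"-style output.
xen_free_mb() {
    awk -F: '$1 ~ /^free_memory[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Print Dom0's free memory in MB from /proc/meminfo-style input (MemFree is in kB).
dom0_free_mb() {
    awk '$1 == "MemFree:" { print int($2 / 1024); exit }'
}

# Example (not executed here):
#   xl info | xen_free_mb
#   dom0_free_mb < /proc/meminfo
```

How much free memory counts as "enough" is not stated in this report; the guest in the xl list output above has 1019 MB, which gives a lower bound for what the receiving side needs.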

Comment 12 Lingfei Kong 2014-11-05 02:16:25 UTC
(In reply to Vitaly Kuznetsov from comment #11)
> 
> It would also be great if you can confirm we're not seeing a regression here
> (so a guest without these patches behaves the same). To a certain extent
> we're fine with 'SanityOnly' here.
> 
> Thanks!

A guest without these patches has the same problem. I added `sleep 10` between the save/restore and migrate operations; the RHEL 7.0 guest still hangs after a while on both RHEL 5.11 and Fedora 20. With save/restore it hung on the 52nd iteration; with migration it hung on the 305th. I found no other problems during the test.

Comment 13 Simon Rowe 2015-03-06 10:41:44 UTC
This changeset

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/xen?id=72978b2fe2f2cdf9f319c6c6dcdbe92b38de2be2

is also needed to fully resolve this.

Can these two changes be queued for 7.2 now?

Comment 16 Simon Rowe 2015-09-16 15:50:54 UTC
The code fixes have not yet been included in 7.2 Beta.

Comment 18 Rafael Aquini 2016-06-24 02:37:26 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 20 Rafael Aquini 2016-06-24 19:28:30 UTC
Patch(es) available on kernel-3.10.0-451.el7
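
A quick way to tell whether a guest is running a build at or after the one named above is to compare the build number in the kernel release string. This is a rough sketch: the function name kernel_has_fix is hypothetical, and it only looks at the number after "3.10.0-", so it cannot tell whether a z-stream kernel with a lower build number carries a backport of the fix.

```shell
#!/bin/sh
# Sketch only: kernel_has_fix is a hypothetical helper.
# Returns 0 if RELEASE is a 3.10.0 kernel with build number >= 451,
#         1 if it is a 3.10.0 kernel with a lower build number,
#         2 if the release string is not in the expected 3.10.0-N form.
kernel_has_fix() {
    release=$1
    case $release in
        3.10.0-*) ;;
        *) return 2 ;;
    esac
    build=${release#3.10.0-}   # drop the base version
    build=${build%%.*}         # keep only the leading build number
    case $build in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$build" -ge 451 ]
}

# Example (not executed here): kernel_has_fix "$(uname -r)"
```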

Comment 24 errata-xmlrpc 2016-11-03 08:46:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html