Bug 1141249 - Xen guests may hang after migration or suspend/resume
Summary: Xen guests may hang after migration or suspend/resume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.2
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Vitaly Kuznetsov
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1288337 1301891
Reported: 2014-09-12 14:16 UTC by Simon Rowe
Modified: 2016-11-03 08:46 UTC (History)

Fixed In Version: kernel-3.10.0-451.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-03 08:46:48 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:2574 0 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2016-11-03 12:06:10 UTC

Description Simon Rowe 2014-09-12 14:16:48 UTC
Xen guests may hang during resume after a migration or suspend. This predominantly affects HVM guests.

The following upstream commits (tagged for stable) fix the hangs.

"x86/xen: resume timer irqs early"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8d5999df35314607c38fbd6bdd709e25c3a4eeab

and

"xen/manage: Always freeze/thaw processes when suspend/resuming"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61a734d305e16944b42730ef582a7171dc733321

Comment 2 Simon Rowe 2014-09-12 16:36:12 UTC
Can this ticket be made public? We want to link to it in our release notes.

Comment 4 Simon Rowe 2014-10-08 15:17:34 UTC
Any update on this?

Comment 5 Vitaly Kuznetsov 2014-10-08 15:37:00 UTC
(In reply to Simon Rowe from comment #4)
> Any update on this?

The patches in question are in 3.16 stable and there is no obstacle to backporting them to RHEL7. Unfortunately we're late in the 7.1 release cycle, so I'd expect them to appear in 7.2.

Comment 9 Ronen Hod 2014-10-23 12:51:26 UTC
Not a regression.
Although Vitaly has a fix, we prefer to defer to 7.2 and avoid last minute risks.

Comment 10 Lingfei Kong 2014-10-28 02:09:59 UTC
Hi Vitaly,
I tested 3.10.0-190.el7_bug1141249_nohang.x86_64; there are still problems when doing migration or save/restore on RHEL or Fedora 20 hosts.

On RHEL 5.11 I can do regular migration and save/restore, but the guest hangs after several save/restore cycles (usually within 10) or several repeated migrations (I got the hang on the thirteenth).

On Fedora 20 (Xen 4) I can do regular save/restore operations, but migration fails. Also, when I tested save/restore, the guest hung on the 30th iteration.

I also ran an automation job with 3.10.0-190.el7_bug1141249_nohang.x86_64; all the test cases passed:
[ Intel | Linux | rhel5.11 x86_64 ] https://virtlab.englab.nay.redhat.com/job/92053/details/

I created a small test run to test this kernel on Fedora 20 (Xen 4); all test cases passed except the migration and save/restore test case:
[ Acceptance test for RHEL-7.0-20140507.0 64bit HVM guest on Fedora Xen 4 - Manual ] https://tcms.engineering.redhat.com/run/190002/


Here are the error messages on Fedora 20 (Xen 4):
#xl list 
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048     8     r-----    2481.3
hvm-7.0-64-1                                39  1019     4     -b----      18.4
[host-2]#xl migrate 39 localhost
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/526)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/526)
 Savefile contains xl domain config
WARNING: ignoring "kernel" directive for HVM guest. Use "firmware_override" instead if you really want a non-default firmware
WARNING: ignoring device_model directive.
WARNING: Use "device_model_override" instead if you really want a non-default device_model
xc: progress: Reloading memory pages: 0/1048575    0%
xc: progress: Reloading memory pages: 53248/1048575    5%
xc: progress: Reloading memory pages: 105472/1048575   10%
xc: progress: Reloading memory pages: 157696/1048575   15%
xc: progress: Reloading memory pages: 209920/1048575   20%
xc: progress: Reloading memory pages: 262144/1048575   25%
libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1075:domcreate_devmodel_started: device model did not start: -3
libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [20793] exited with error status 3
Migration failed, resuming at sender.
xc: error: Cannot resume uncooperative HVM guests: Internal error
libxl: error: libxl.c:408:libxl__domain_resume: xc_domain_resume failed for domain 39: Interrupted system call


#cat /var/log/xen/qemu-dm-hvm-7.0-64-1--incoming.log
domid: 40
-videoram option does not work with cirrus vga device model. Videoram set to 4M.
Using xvda for guest's hda
Strip off blktap sub-type prefix to /var/lib/xen/images/hvm-7.0-64-1.img (drv 'aio')
Using file /var/lib/xen/images/hvm-7.0-64-1.img in read-write mode
Watching /local/domain/0/device-model/40/logdirty/cmd
Watching /local/domain/0/device-model/40/command
Watching /local/domain/40/cpu
char device redirected to /dev/pts/7
qemu_map_cache_init nr_buckets = 10000 size 4194304
shared page at pfn feffd
buffered io page at pfn feffb
Guest uuid = 3c8c2c18-d29a-4f84-b5cd-d25a8d1682f7
Register xen platform.
Done register platform.
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is rw state.
xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error
cirrus vga map change while on lfb mode
mapping video RAM from f0000000
platform_fixed_ioport: changed ro/rw state of ROM memory area. now is ro state.


Here are the error messages on RHEL5.11:
while :; do rm -rf /var/log/xen/* && xm save hvm-7.0-64-1 save && xm restore save; sleep 4;echo $i $(date +%T) | tee result3; ((i++)); done
70 18:03:10
71 18:03:44
72 18:04:17
73 18:04:50
74 18:05:23
75 18:05:56
76 18:06:29
77 18:07:03
78 18:07:37
79 18:08:11
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
80 18:08:40
Error: /usr/lib64/xen/bin/xc_save 22 33 0 0 4 0 failed
Usage: xm save <Domain> <CheckpointFile>

Checking the guest at this point shows that it has hung.

Comment 11 Vitaly Kuznetsov 2014-10-30 13:40:08 UTC
(In reply to Lingfei Kong from comment #10)
> Hi Vitaly,
> I tested 3.10.0-190.el7_bug1141249_nohang.x86_64; there are still problems
> when doing migration or save/restore on RHEL or Fedora 20 hosts.
> 
> On RHEL 5.11 I can do regular migration and save/restore, but the guest
> hangs after several save/restore cycles (usually within 10) or several
> repeated migrations (I got the hang on the thirteenth).
> 
> On Fedora 20 (Xen 4) I can do regular save/restore operations, but
> migration fails. Also, when I tested save/restore, the guest hung on the
> 30th iteration.
> 

I think you hit a different issue here - qemu-dm died for some reason and migration failed:

libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 40 device model: spawn failed (rc=-3)

Make sure you have enough memory - both in Xen and in Dom0. You can also try doing save/restore with longer sleeps between all operations as I can see the following in your log:

xs_read(/local/domain/0/device-model/40/xen_extended_power_mgmt): read error

It would also be great if you can confirm we're not seeing a regression here (so a guest without these patches behaves the same). To a certain extent we're fine with 'SanityOnly' here.

Thanks!

Comment 12 Lingfei Kong 2014-11-05 02:16:25 UTC
(In reply to Vitaly Kuznetsov from comment #11)
> 
> It would also be great if you can confirm we're not seeing a regression here
> (so a guest without these patches behaves the same). To a certain extent
> we're fine with 'SanityOnly' here.
> 
> Thanks!

A guest without these patches also has this problem. I added `sleep 10` between save/restore and migrate operations; the RHEL 7.0 guest still hangs after a while on both RHEL 5.11 and Fedora 20 hosts. With save/restore it hung on the 52nd iteration, and with migrate on the 305th. No other problems were found during the test.
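
For anyone repeating this kind of test, a loop that stops at the first failing iteration and reports its number could look roughly like the sketch below. This is only an illustration under stated assumptions: `repeat_until_failure` and `save_restore_cycle` are hypothetical helper names that do not appear in this bug, and the `xm` commands and domain name are copied from comment 10. It only detects a command that exits non-zero; a guest that hangs while `xm` still returns success would have to be checked separately.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from this bug): run a command repeatedly until it
# fails or the iteration limit is reached.
# Usage: repeat_until_failure MAX_ITERATIONS PAUSE_SECONDS COMMAND [ARGS...]
repeat_until_failure() {
    local max=$1 pause=$2
    shift 2
    local i
    for ((i = 1; i <= max; i++)); do
        if ! "$@"; then
            # Report the first failing iteration with a timestamp.
            echo "failed at iteration $i ($(date +%T))"
            return 1
        fi
        sleep "$pause"
    done
    echo "completed $max iterations without failure"
    return 0
}

# One save/restore cycle, using the xm commands and domain name from comment 10.
save_restore_cycle() {
    xm save hvm-7.0-64-1 save && xm restore save
}

# Example invocation (needs a Xen host, so it is left commented out):
# repeat_until_failure 500 10 save_restore_cycle
```

The pause is a parameter so that the `sleep 4` and `sleep 10` variants mentioned in this thread can be compared without editing the loop.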

Comment 13 Simon Rowe 2015-03-06 10:41:44 UTC
This changeset

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/xen?id=72978b2fe2f2cdf9f319c6c6dcdbe92b38de2be2

is also needed to fully resolve this.

Can these two changes be queued for 7.2 now?

Comment 16 Simon Rowe 2015-09-16 15:50:54 UTC
The code fixes have not yet been included in 7.2 Beta.

Comment 18 Rafael Aquini 2016-06-24 02:37:26 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 20 Rafael Aquini 2016-06-24 19:28:30 UTC
Patch(es) available on kernel-3.10.0-451.el7
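
A rough way to check whether a given RHEL 7 kernel release string is at or above the build named here is sketched below. It is a heuristic under stated assumptions: `kernel_at_least_451` is a hypothetical helper name, and the check compares only the build number after `3.10.0-`. A kernel from an earlier minor-release stream with a lower build number may or may not carry a backport of these patches, and this comparison cannot tell.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from this bug): succeed if a RHEL 7 kernel release
# string has a build number of at least 451, the build named in this bug.
# Returns 0 if at or above 451, 1 if below, 2 if the string is not recognised.
kernel_at_least_451() {
    local release=$1
    # Expect strings such as "3.10.0-451.el7.x86_64".
    if [[ $release =~ ^3\.10\.0-([0-9]+)[.-].*el7 ]]; then
        # Force base 10 so a build number with leading zeros is not read as octal.
        (( 10#${BASH_REMATCH[1]} >= 451 ))
    else
        return 2
    fi
}

# Example: check the running kernel.
# kernel_at_least_451 "$(uname -r)" && echo "build is 451 or later"
```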

Comment 24 errata-xmlrpc 2016-11-03 08:46:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2574.html

