Bug 566927 - xm save hangs with kernel 2.6.31.12-174.2.19(and .22).fc12.i686.PAE running as PV guest under CentOS5.4
Keywords:
Status: CLOSED DUPLICATE of bug 566930
Alias: None
Product: Fedora
Classification: Fedora
Component: xen
Version: 12
Hardware: i686
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Xen Maintainance List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-02-20 13:58 UTC by Kyle
Modified: 2010-02-22 13:55 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-02-22 08:35:42 UTC
Type: ---
Embargoed:




Links
System: Red Hat Bugzilla
ID: 523971
Private: 0
Priority: low
Status: CLOSED
Summary: xm save hangs with kernel-2.6.31-14.fc12 running as a PV guest under RHEL-5.4
Last Updated: 2021-02-22 00:41:40 UTC

Description Kyle 2010-02-20 13:58:47 UTC
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6

Attempting to save a Xen Fedora 12 guest (kernel versions 2.6.31.12-174.2.19.fc12.i686.PAE and 2.6.31.12-174.2.22.fc12.i686.PAE) causes the save process to stall and the guest to apparently hang: it does not respond to pings, ssh sessions time out, and typing at the console produces no output, i.e. it's frozen. I tried attaching to the console with both xm console <domain> and xm create <domain> -c (following the entire boot process up to the login prompt). xm list shows the guest name with "migrating-" in front of it. (If the guest name is fedora, then xm list shows migrating-fedora.)

However, other PV guests (CentOS 5.4) can be saved and resumed normally, and shutting the Fedora guest down normally (with shutdown -h now or init 0) works fine.

Reproducible: Always

Steps to Reproduce:
1. Create a Fedora 12 PV guest. (My config is 20GB disk - LV, 512MB RAM, bridged networking)
2. Run xm save <domain> <savefilepath>
3. Watch it freeze up.
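
Step 2 can be wrapped in a timeout so that the hang is detected automatically instead of by watching the terminal. The following is only a rough sketch, not part of the original report: the helper name `save_completes` and the 120-second limit are illustrative assumptions, and it needs Python 3.5 or later for `subprocess.run` (the dom0 in this report ships Python 2.4, so it would have to be adapted or run from a newer interpreter).

```python
import subprocess
import sys


def save_completes(cmd, timeout_s):
    """Run cmd and report how it ended.

    Returns the command's exit code if it finished within timeout_s seconds,
    or None if it timed out (subprocess.run kills the child in that case).
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


# Hypothetical usage on dom0 (domain name and save path are placeholders):
#   rc = save_completes(["xm", "save", "fedora", "/var/tmp/fedora.save"], 120)
#   print("save hung" if rc is None else "xm save exited with %d" % rc)
```

A timeout only stops the xm command itself. The guest may still be left in the "migrating-" state and need manual cleanup.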
Actual Results:  
Output in xend.log:
[2010-02-20 21:12:39 xend 2826] DEBUG (XendCheckpoint:89) [xc_save]: /usr/lib/xen/bin/xc_save 22 7 0 0 0
[2010-02-20 21:12:39 xend 2826] DEBUG (XendCheckpoint:324) suspend
[2010-02-20 21:12:39 xend 2826] DEBUG (XendCheckpoint:92) In saveInputHandler suspend
[2010-02-20 21:12:39 xend 2826] DEBUG (XendCheckpoint:94) Suspending 7 ...
[2010-02-20 21:12:39 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:1249) XendDomainInfo.handleShutdownWatch
[2010-02-20 21:12:39 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:1249) XendDomainInfo.handleShutdownWatch


Expected Results:  
Output of a working CentOS5.4 guest:
[2010-02-20 21:27:30 xend 2826] DEBUG (XendCheckpoint:89) [xc_save]: /usr/lib/xen/bin/xc_save 26 2 0 0 0
[2010-02-20 21:27:30 xend 2826] DEBUG (XendCheckpoint:324) suspend
[2010-02-20 21:27:30 xend 2826] DEBUG (XendCheckpoint:92) In saveInputHandler suspend
[2010-02-20 21:27:30 xend 2826] DEBUG (XendCheckpoint:94) Suspending 2 ...
[2010-02-20 21:27:30 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:1249) XendDomainInfo.handleShutdownWatch
[2010-02-20 21:27:30 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:1249) XendDomainInfo.handleShutdownWatch
[2010-02-20 21:27:30 xend.XendDomainInfo 2826] INFO (XendDomainInfo:1206) Domain has shutdown: name=migrating-centos5 id=2 reason=suspend.
[2010-02-20 21:27:30 xend 2826] INFO (XendCheckpoint:99) Domain 2 suspended.
[2010-02-20 21:27:30 xend 2826] DEBUG (XendCheckpoint:108) Written done
[2010-02-20 21:27:30 xend 2826] INFO (XendCheckpoint:353) Had 0 unexplained entries in p2m table
 1: sent 131021, skipped 0, delta 9553ms, dom0 57%, target 0%, sent 449Mb/s, dirtied 0Mb/s 0 pages
[2010-02-20 21:27:40 xend 2826] INFO (XendCheckpoint:353) Total pages sent= 131021 (0.98x)
[2010-02-20 21:27:40 xend 2826] INFO (XendCheckpoint:353) (of which 0 were fixups)
[2010-02-20 21:27:40 xend 2826] INFO (XendCheckpoint:353) All memory is saved
[2010-02-20 21:27:41 xend 2826] INFO (XendCheckpoint:353) Save exit rc=0
[2010-02-20 21:27:41 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:2130) XendDomainInfo.destroy: domid=2
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] INFO (XendDomainInfo:2291) Dev 51712 still active, looping...
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:2055) UUID Created: True
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:2056) Devices to release: [], domid = 2
[2010-02-20 21:27:42 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:2068) Releasing PVFB backend devices ...
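
The visible difference between the two logs above is that the working guest produces a "Domain has shutdown: ... reason=suspend" line after "Suspending <id> ...", while the hanging guest never does. A minimal sketch of checking for that mechanically follows. The function name `suspend_acknowledged` is made up for illustration, and the patterns are taken only from the log lines quoted in this report, so other xend versions may word these messages differently.

```python
import re

# Pattern based on the line logged for the working CentOS guest, e.g.
# "Domain has shutdown: name=migrating-centos5 id=2 reason=suspend."
SHUTDOWN_RE = re.compile(r"Domain has shutdown: name=(\S+) id=(\d+) reason=(\w+)")


def suspend_acknowledged(log_text, domid):
    """Return True if, after a 'Suspending <domid> ...' line, the log shows
    domain `domid` shutting down with reason=suspend."""
    suspending_seen = False
    for line in log_text.splitlines():
        if "Suspending %d ..." % domid in line:
            suspending_seen = True
        match = SHUTDOWN_RE.search(line)
        if suspending_seen and match:
            if int(match.group(2)) == domid and match.group(3) == "suspend":
                return True
    return False
```

Fed the xend.log text and the domain id, this would return False for the Fedora guest excerpt (domain 7) and True for the CentOS excerpt (domain 2).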


yum [root@dom0 ~]# xm info
host                   : dom0
release                : 2.6.18-164.11.1.el5xen
version                : #1 SMP Wed Jan 20 08:53:10 EST 2010
machine                : i686
nr_cpus                : 4
nr_nodes               : 1
sockets_per_node       : 2
cores_per_socket       : 1
threads_per_core       : 2
cpu_mhz                : 2392
hw_caps                : bfebfbff:00000000:00000000:00000080:00004400
total_memory           : 3071
free_memory            : 990
node_to_cpu            : node0:0-3
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-164.11.1.el5
xen_caps               : xen-3.0-x86_32p
xen_pagesize           : 4096
platform_params        : virt_start=0xf5800000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
cc_compile_by          : mockbuild
cc_compile_domain      : centos.org
cc_compile_date        : Wed Jan 20 07:31:16 EST 2010
xend_config_format     : 2

Destroying the guest (xm destroy <domain>) makes this appear in xend.log:
[2010-02-20 21:44:20 xend 2826] INFO (XendCheckpoint:99) Domain 7 suspended.
[2010-02-20 21:44:20 xend 2826] DEBUG (XendCheckpoint:108) Written done
[2010-02-20 21:44:20 xend 2826] INFO (XendCheckpoint:353) ERROR Internal error: domain is dying
[2010-02-20 21:44:20 xend 2826] INFO (XendCheckpoint:353) ERROR Internal error: Domain appears not to have suspended
[2010-02-20 21:44:20 xend 2826] INFO (XendCheckpoint:353) Save exit rc=1
[2010-02-20 21:44:20 xend 2826] ERROR (XendCheckpoint:133) Save failed on domain fedora (7).
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 110, in save
    forkHelper(cmd, fd, saveInputHandler, False)
  File "/usr/lib/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 341, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib/xen/bin/xc_save 22 7 0 0 0 failed
[2010-02-20 21:44:20 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:2165) XendDomainInfo.resumeDomain(7)
[2010-02-20 21:44:20 xend.XendDomainInfo 2826] DEBUG (XendDomainInfo:2178) XendDomainInfo.resumeDomain: devices released
[2010-02-20 21:44:20 xend.XendDomainInfo 2826] ERROR (XendDomainInfo:2220) Exception in evtcnh_reset(7)
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2218, in _resetChannels
    return xc.evtchn_reset(dom = self.domid)
Error: (1, 'Internal error', 'do_evtchn_op: HYPERVISOR_event_channel_op failed: -1')
[2010-02-20 21:44:20 xend.XendDomainInfo 2826] ERROR (XendDomainInfo:2195) XendDomainInfo.resume: xc.domain_resume failed on domain 7.
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2180, in resumeDomain
    self._resetChannels()
  File "/usr/lib/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2218, in _resetChannels
    return xc.evtchn_reset(dom = self.domid)
Error: (1, 'Internal error', 'do_evtchn_op: HYPERVISOR_event_channel_op failed: -1')
[2010-02-20 21:44:20 xend 2826] DEBUG (XendCheckpoint:136) XendCheckpoint.save: resumeDomain

Note: This bug (https://bugzilla.redhat.com/show_bug.cgi?id=523971) seems to be very similar, except that it's against a later kernel and rawhide.

Comment 1 Kyle 2010-02-20 14:03:44 UTC
I started to add a note about yum (hence the stray "yum" before the xm info output), went to check that yum had in fact updated everything, and forgot to finish the thought. yum check-update shows no new updates after updating the kernel to the .22 release.

Also, the output under "Actual Results" just ends there. There is no "Domain has shutdown" message like there is for the CentOS5.4 guest.

Comment 2 Andrew Jones 2010-02-20 16:07:01 UTC
Later kernels (at least starting with 2.6.32.7, and maybe earlier) can save and restore properly. There was some interest in upstream xen development to backport whatever the necessary patch set is to the 2.6.31 stable tree, but I'm not sure anybody is currently working on it.
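
As a rough illustration of the version threshold mentioned in this comment, a guest kernel release string can be compared against 2.6.32.7. This is a sketch only: the helper names are invented here, the threshold is just the "at least starting with" figure from the comment, and a version comparison alone does not prove that the relevant patches are present in a given build.

```python
def kernel_base_version(release):
    """Extract the upstream version tuple from a kernel release string.

    Everything from the first "-" onward (the distribution build suffix)
    is ignored, so "2.6.31.12-174.2.19.fc12.i686.PAE" gives (2, 6, 31, 12).
    """
    base = release.split("-", 1)[0]
    return tuple(int(part) for part in base.split("."))


# Assumption: threshold copied from the comment ("at least starting with 2.6.32.7").
REPORTED_GOOD_FROM = (2, 6, 32, 7)


def at_or_after_reported_good(release):
    """True if the release's base version is >= the reported working version."""
    return kernel_base_version(release) >= REPORTED_GOOD_FROM
```

For the two kernels named in the description, `at_or_after_reported_good` is False, which is consistent with them hanging on save.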

Comment 3 Kyle 2010-02-21 04:43:15 UTC
(In reply to comment #2)
> Later kernels (at least starting with 2.6.32.7, and maybe earlier) can save and
> restore properly. There was some interest in upstream xen development to
> backport whatever the necessary patch set is to the 2.6.31 stable tree, but I'm
> not sure anybody is currently working on it.    

I presume the later kernels are available in rawhide? Now's as good a time as any to try it, I guess. *is off to set up a PV rawhide guest*

Comment 4 Kyle 2010-02-21 15:22:34 UTC
(In reply to comment #2)
> Later kernels (at least starting with 2.6.32.7, and maybe earlier) can save and
> restore properly. There was some interest in upstream xen development to
> backport whatever the necessary patch set is to the 2.6.31 stable tree, but I'm
> not sure anybody is currently working on it.    

Uh... strange. Kernel version 2.6.33-0.48.rc8.git1.fc14.i686.PAE can't be saved either. Next thing I'm trying is an install of F13, not rawhide. And if that doesn't work, I'll try F11, see if it's a regression or not.

Comment 5 Andrew Jones 2010-02-22 08:35:42 UTC
I'm duping this bz to another bz opened for the same issue. They were both opened about the same time, but the other one has some more testing details.

*** This bug has been marked as a duplicate of bug 566930 ***

Comment 6 Kyle 2010-02-22 13:55:55 UTC
(In reply to comment #5)
> I'm duping this bz to another bz opened for the same issue. They were both
> opened about the same time, but the other one has some more testing details.
> 
> *** This bug has been marked as a duplicate of bug 566930 ***    

Yeah, there is more testing info there, so it makes sense to. I'll keep an eye on it, because rawhide seems to have the same problem. That doesn't make much sense to me: rawhide is on 2.6.33, so I would have thought the bug would have been fixed there too.

