Bug 250420

Summary: [RHEL5.1]: Off-line (non-live) migrate of a RHEL5.1 PV guest panics the guest
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.1
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Regression
Reporter: Chris Lalancette <clalance>
Assignee: Chris Lalancette <clalance>
QA Contact: Martin Jenner <mjenner>
CC: i-kitayama, xen-maint
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 19:57:21 UTC
Attachments:
  Config file for PV guest
  Patch that fixes the crash for me

Description Chris Lalancette 2007-08-01 14:34:31 UTC
Description of problem:
I've been testing out migration with various PV guests.  Here's my setup:

machine1 - i686, running 5.1 Beta bits in dom0; exports /var/lib/xen/images via NFS; this is where the guests are started
machine2 - i686, running 5.1 Beta bits in dom0; mounts machine1:/var/lib/xen/images on /var/lib/xen/images

Both machines have had their relocation servers turned on, iptables disabled, etc.
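
For reference, "relocation servers turned on" means roughly the following in /etc/xen/xend-config.sxp on both dom0s (a sketch of my setup, not a verbatim copy of the files; the hosts-allow value in particular is just illustrative):

    (xend-relocation-server yes)
    (xend-relocation-port 8002)
    # wide open for testing only; normally you'd restrict this
    (xend-relocation-hosts-allow '')

and machine1 exports the image directory with an /etc/exports entry along the lines of:

    /var/lib/xen/images *(rw,sync,no_root_squash)

followed by a "service xend restart" on both boxes so the relocation listener actually comes up.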

If I start up a RHEL5 GA PV guest on machine1 and run:

xm migrate rhel5pv machine2

It migrates fine, and comes up on machine2.

If I take that very same guest and install the RHEL-5.1 Beta kernel (2.6.18-37 as
of this writing), then:

xm migrate rhel5pv machine2

completes, but when I connect to the console on machine2 (xm console rhel5pv), I
see:

------------[ cut here ]------------
kernel BUG at drivers/xen/core/smpboot.c:417!
invalid opcode: 0000 [#1]
SMP 
last sysfs file: /block/dm-1/range
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc xennet
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 parport_pc lp parport pcspkr dm_snapshot dm_zero dm_mirror dm_mod
xenblk ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0061:[<c05432c6>]    Not tainted VLI
EFLAGS: 00010282   (2.6.18-37.el5xen #1) 
EIP is at __cpu_up+0x7b/0x8a
eax: ffffffea   ebx: 00000001   ecx: 00000001   edx: 00000000
esi: 00000001   edi: 00000000   ebp: c06d0900   esp: ed71ef10
ds: 007b   es: 007b   ss: 0069
Process suspend (pid: 1982, ti=ed71e000 task=c0c03000 task.ti=ed71e000)
Stack: 00000001 fffffff0 c0624c07 ed71ef30 c0431f78 00000001 ed71ef37 c0542176 
       696c6e6f c000656e c0c2ba00 00000000 c0546863 c067cdb0 c05fab04 00000000 
       2f757063 00000031 00000000 c053be62 c067d08c c067d090 c07a6640 00007ff0 
Call Trace:
 [<c0431f78>] cpu_up+0x84/0xd9
 [<c0542176>] vcpu_hotplug+0x87/0xcc
 [<c0546863>] watch_otherend+0x16/0x19
 [<c05fab04>] klist_next+0xc/0x43
 [<c053be62>] bus_for_each_dev+0x4f/0x59
 [<c05421cf>] smp_resume+0x14/0x29
 [<c0542e5e>] __do_suspend+0x3c9/0x3d5
 [<c0416993>] complete+0x2b/0x3d
 [<c0542a95>] __do_suspend+0x0/0x3d5
 [<c042cca9>] kthread+0xc0/0xeb
 [<c042cbe9>] kthread+0x0/0xeb
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: 28 6d c0 89 04 b5 00 29 6d c0 c6 44 2a 12 01 89 f0 e8 d3 fe ff ff f0 0f ab
35 44 66 7a c0 89 f1 89 fa e8 3e e0 eb ff 85 c0 74 08 <0f> 0b a1 01 d0 1d 63 c0
89 f8 5b 5e 5f 5d c3 e8 92 8e ec ff e8 
EIP: [<c05432c6>] __cpu_up+0x7b/0x8a SS:ESP 0069:ed71ef10
 <0>Kernel panic - not syncing: Fatal exception

Preliminary investigation shows that this is a failed VCPUOP_up hypercall; I'm
not quite sure why that is failing now, but it is.
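
For context, the BUG at drivers/xen/core/smpboot.c:417 is the check right after the VCPUOP_up hypercall in __cpu_up(); roughly (paraphrasing from memory, this is a sketch, not the literal RHEL source):

    /* tail of __cpu_up() in drivers/xen/core/smpboot.c (paraphrased sketch) */
    rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
    BUG_ON(rc);                     /* fires because the hypercall returns non-zero */
    cpu_set(cpu, cpu_online_map);

The register dump (eax: ffffffea) suggests the hypercall is returning -EINVAL.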

The rhel5pv guest in question has 1500MB of memory, 4 vCPUs, and is using PVFB.
I'll attach the full configuration file.
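
For anyone not pulling the attachment, it's a bog-standard xm config file along these lines (the names and paths below are placeholders; the real values are in the attached file):

    name = "rhel5pv"
    memory = 1500
    vcpus = 4
    bootloader = "/usr/bin/pygrub"
    # disk path is illustrative; the real image lives under /var/lib/xen/images
    disk = [ "file:/var/lib/xen/images/rhel5pv.img,xvda,w" ]
    vif = [ "bridge=xenbr0" ]
    vfb = [ "type=vnc" ]
    on_reboot = "restart"
    on_crash = "restart"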

Comment 1 Chris Lalancette 2007-08-01 14:35:30 UTC
Created attachment 160417 [details]
Config file for PV guest

Comment 3 Chris Lalancette 2007-08-01 14:49:30 UTC
danpb pointed out that this is equivalent to just a "save" and "restore", so I
broke it down to a more basic case. On machine1, I just ran:

xm create -c rhel5pv # domain running -37 kernel
xm save rhel5pv /var/lib/xen/save/rhel5pv-save
xm restore /var/lib/xen/save/rhel5pv-save

And I got the same crash. So it doesn't really have to do with migration at
all, just with the restore path.

Chris Lalancette

Comment 4 Chris Lalancette 2007-08-01 15:24:10 UTC
Grr.  I may have missed a patch when pushing the save/restore fixes to Gerd for
5.1.  Going to test out adding that patch back in, and see if things are better.

Chris Lalancette

Comment 5 Chris Lalancette 2007-08-01 15:56:47 UTC
Yep, that was it. I added that patch back in, and things started working again.
I'm still seeing "softlockup" warnings after the restore, but I also saw those
on RHEL5 GA, so that is not something new. I'll roll up the patch and post it soon.

Chris Lalancette

Comment 8 Chris Lalancette 2007-08-01 20:58:32 UTC
Created attachment 160464 [details]
Patch that fixes the crash for me

Comment 9 Don Zickus 2007-08-15 19:06:02 UTC
Fixed in 2.6.18-40.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
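
Assuming the usual layout of that page, installing it is something like the following (the exact package file name there may differ):

    rpm -ivh kernel-xen-2.6.18-40.el5.i686.rpm
    # then reboot and pick the new kernel-xen entry in grub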

Comment 12 errata-xmlrpc 2007-11-07 19:57:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html