Bug 174192 - kernel-xen-hypervisor unstable (lockups) on HP Proliant G3 (single CPU, 1G RAM)
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel-xen
Version: rawhide
Hardware: i386 Linux
Priority: medium Severity: medium
Assigned To: Juan Quintela
QA Contact: Brian Brock
Depends On:
Blocks: 179599
Reported: 2005-11-25 12:26 EST by Tim Small
Modified: 2007-11-30 17:11 EST (History)

Doc Type: Bug Fix
Last Closed: 2006-02-07 08:45:36 EST
Description Tim Small 2005-11-25 12:26:09 EST
Description of problem:

Xen (both just running the hypervisor kernel, and also running guests as well)
seems unstable on the HP Proliant DL380 G3.  Regular non-xen kernels are fine.

Version-Release number of selected component (if applicable):

xen-3.0-0.20051109.fc5.3  kernel-xen-hypervisor-2.6.12-1.13_FC5

How reproducible:

Lock up usually occurs within 10 minutes of boot.

Additional info:

I've got this stack trace once:

login: kernel BUG at:arch/xen/i386/mm/hypervisor.c:381 (xen_create_contiguous_region)!        s        es:
 [<c011aaed>] xen_create_contiguous_region+0x24d/0x390
 [<c01072ed>] skbuff_ctor+0x6d/0x80
 [<c015833b>] cache_alloc_debugcheck_after+0x7b/0x1b0
 [<c0158824>] kmem_cache_alloc+0xa4/0x100
 [<c0322f17>] alloc_skb_from_cache+0x47/0x100
 [<c0322f17>] alloc_skb_from_cache+0x47/0x100
 [<c0107190>] alloc_skb+0x40/0xb0
 [<c0352e38>] tcp_sendmsg+0x3f8/0x1100
 [<c015277c>] __alloc_pages+0xdc/0x490
 [<c014d2ed>] find_get_page+0x3d/0x50
 [<c014ea13>] filemap_nopage+0x3a3/0x490
 [<c016107f>] do_anonymous_page+0x8f/0x240

I have also seen two other lockups (no network response, console dead), and one
occasion where the box responded to pings but TCP connections hung and the
console was wedged (this was before I disabled SMP).

Grub entry:

title Fedora Core (2.6.12-1.13_FC5.small_thypervisor)
        root (hd0,0)
        kernel /boot/xen.gz-2.6.12-1.13_FC5.small_t com1=115200,8n1,0x408,4 nosmp watchdog
        module /boot/vmlinuz-2.6.12-1.13_FC5.small_thypervisor ro root=LABEL=/ norhgb
        module /boot/initrd-2.6.12-1.13_FC5.small_thypervisor.img

The kernel is built from the SRPMS at
http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/
except that the kernel config has been modified to build the cciss driver into
the kernel (the stock config builds it as a module). This change was made as
part of some earlier debugging.

$ uname -a ; cat /proc/meminfo ; cat /proc/cpuinfo ; lspci ; cat /proc/devices
Linux xyz.xyz 2.6.12-1.13_FC5.xyzhypervisor #1 SMP Thu Nov 24 15:31:00 GMT 2005 i686 i686 i386 GNU/Linux
MemTotal:       393216 kB
MemFree:        308632 kB
Buffers:          6116 kB
Cached:          36784 kB
SwapCached:          0 kB
Active:          30832 kB
Inactive:        23184 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       393216 kB
LowFree:        308632 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:              12 kB
Writeback:           0 kB
Mapped:          18512 kB
Slab:             9092 kB
CommitLimit:    196608 kB
Committed_AS:    90792 kB
PageTables:        860 kB
VmallocTotal:   121752 kB
VmallocUsed:      2064 kB
VmallocChunk:   119252 kB
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 7
cpu MHz         : 2785.128
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : yes
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 5557.45

00:00.0 Host bridge: Broadcom (formerly ServerWorks) CMIC-LE Host Bridge (GC-LE chipset) (rev 33)
00:00.1 Host bridge: Broadcom (formerly ServerWorks) CMIC-LE Host Bridge (GC-LE chipset)
00:00.2 Host bridge: Broadcom (formerly ServerWorks) CMIC-LE Host Bridge (GC-LE chipset)
00:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:04.0 System peripheral: Compaq Computer Corporation Integrated Lights Out Controller (rev 01)
00:04.2 System peripheral: Compaq Computer Corporation Integrated Lights Out Processor (rev 01)
00:0f.0 ISA bridge: Broadcom (formerly ServerWorks) CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: Broadcom (formerly ServerWorks) CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: Broadcom (formerly ServerWorks) OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: Broadcom (formerly ServerWorks) CSB5 LPC bridge
00:10.0 Host bridge: Broadcom (formerly ServerWorks) CIOB-X2 PCI-X I/O Bridge (rev 05)
00:10.2 Host bridge: Broadcom (formerly ServerWorks) CIOB-X2 PCI-X I/O Bridge (rev 05)
00:11.0 Host bridge: Broadcom (formerly ServerWorks) CIOB-X2 PCI-X I/O Bridge (rev 05)
00:11.2 Host bridge: Broadcom (formerly ServerWorks) CIOB-X2 PCI-X I/O Bridge (rev 05)
01:03.0 RAID bus controller: Compaq Computer Corporation Smart Array 5i/532 (rev 01)
02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
03:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
03:01.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
06:02.0 USB Controller: NEC Corporation USB (rev 43)
06:02.1 USB Controller: NEC Corporation USB (rev 43)
06:02.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
06:1e.0 PCI Hot-plug controller: Compaq Computer Corporation PCI Hotplug Controller (rev 14)
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  6 lp
  7 vcs
  9 st
 10 misc
 13 input
 21 sg
 29 fb
128 ptm
136 pts
180 usb
216 rfcomm
254 pcmcia

Block devices:
  1 ramdisk
  3 ide0
  8 sd
  9 md
 11 sr
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
104 cciss0
128 sd
129 sd
130 sd
131 sd
132 sd
133 sd
134 sd
135 sd
253 device-mapper
254 mdp
Comment 1 Tim Small 2005-11-28 06:11:07 EST
I have rebuilt, with this patch:

http://lists.xensource.com/archives/html/xen-devel/2005-11/binZFz3dFCIXF.bin

and a UP kernel, and I'm currently testing it.
Comment 2 Tim Small 2005-11-29 05:46:38 EST
The new UP kernel doesn't seem to have helped much.  I've not seen the original
stack trace recur, but domain0 on the machine locks up, usually within 15
minutes of boot up.  Xen itself is still running, I think.

If I can usefully collect any more debug info for this machine, it'll need to be
done today, if possible, as I won't have access to the box for the next 2 weeks.

Thanks,

Tim.
Comment 3 Stephen Tweedie 2006-01-14 00:12:57 EST
The latest Xen kernels include a couple of fixes regarding swiotlb enabling,
which should result in the provision of a small swiotlb on all dom0s.  Xen
should be able to use this as a fallback for cases where it can't generate
physically contiguous dma map requests.  Can you see if the problem still
persists, please?
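
To illustrate the fallback idea described above, here is a toy Python model. It is only a sketch and not the Xen or Linux swiotlb code: the names BouncePool, is_machine_contiguous and dma_map are hypothetical, and the frame numbers are made up. The assumption it encodes is that a buffer whose machine frames happen to be contiguous can be handed to the device directly, while any other buffer is copied into a small, pre-reserved contiguous pool.

```python
class BouncePool:
    """Toy stand-in for a small, pre-reserved machine-contiguous region."""

    def __init__(self, base_frame, num_frames):
        self.base_frame = base_frame
        self.num_frames = num_frames
        self.next_free = 0
        self.data = {}  # machine frame number -> bytes copied into it

    def alloc(self, num_frames):
        # Hand out the next run of frames; the pool is contiguous by construction.
        if self.next_free + num_frames > self.num_frames:
            raise MemoryError("bounce pool exhausted")
        start = self.base_frame + self.next_free
        self.next_free += num_frames
        return start


def is_machine_contiguous(frames):
    """True if each machine frame directly follows the previous one."""
    return all(b == a + 1 for a, b in zip(frames, frames[1:]))


def dma_map(frames, pages, pool):
    """Return (first_machine_frame, bounced) for a buffer.

    frames: machine frame numbers backing the buffer, in order
    pages:  the page contents, in the same order
    """
    if is_machine_contiguous(frames):
        return frames[0], False  # the device can use the buffer as it is
    # Fallback: copy the data into the contiguous pool and use that instead.
    start = pool.alloc(len(frames))
    for i, page in enumerate(pages):
        pool.data[start + i] = page
    return start, True


pool = BouncePool(base_frame=1000, num_frames=16)
print(dma_map([40, 41], [b"a", b"b"], pool))  # (40, False)
print(dma_map([40, 97], [b"a", b"b"], pool))  # (1000, True)
```

In this model the second buffer is scattered across machine memory, so it is served from the pool rather than requiring a new contiguous region to be created on demand.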
Comment 4 Tim Small 2006-01-30 08:36:38 EST
I still get lockups (maximum uptime 15 minutes, on an idle system) on these
machines, using:

xen-3.0-0.20060110.fc5.4
kernel-xen-hypervisor-2.6.15-1.29_FC5

It looks like the hypervisor is locking up, as I can't seem to get anything out
of it on the serial console (although I'm limited to getting serial console I/O
using HP '"Intelligent" lights out' so I don't have 100% confidence in this).

I'm booting with:

kernel /boot/xen.gz-2.6.15-1.29_FC5.xxx com1=115200,8n1,0x408,4 nosmp watchdog debug

but haven't seen any more debug traces since the original posting (it may be
that the two issues are unrelated).  Any further suggestions to get more debug
info would be welcome, although I have limited time to get more info from this
machine, as I'm under pressure to put it back in service for user sessions (i.e.
try using Xen2, or dump Xen altogether).
Comment 5 Stephen Tweedie 2006-01-30 20:37:09 EST
Yes, a hard lockup definitely does not sound like the oops you were generating
initially --- sounds like a new problem.

I've just pushed a rebased, more recent linux-2.6-merge.hg and hypervisor to
rawhide, and that should show up tomorrow as
kernel-xen-hypervisor-2.6.15-1.33_FC5.  Does that show anything different?

We don't currently build the hypervisor with the debug options enabled ---
that's something I'm likely to turn on soon, which may help here.
Comment 6 Tim Small 2006-02-07 08:38:11 EST
Have tested 

http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/kernel-xen-2.6.15-1.40_FC5.src.rpm

and

http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/xen-3.0-0.20060130.fc5.3.src.rpm

and good results so far - uptime of ~4 hours, and a couple of guests created.  I
haven't done any serious stress testing yet, but so far, it looks good!

Thanks!  Might be an idea to close this bug, and I'll reopen if I get any
recurrence.

BTW, previous version tested (which exhibited this bug) was 

xen-3.0-0.20060110.fc5.4
kernel-xen-hypervisor-2.6.15-1.29_FC5

If you would like me to try and nail this bug to a particular change, please let
me know, and I'll see what I can do.
Comment 7 Stephen Tweedie 2006-02-07 08:45:36 EST
OK.  I've got reasonable confidence that the networking contiguous-region bug
you originally reported should now be fixed, so I'll go ahead and close this. 
If you do get further occurrences of the hang you saw later on, it would
probably be best to open a separate bug for that to distinguish it from the
original networking bug.

Thanks for the testing!
