Bug 443052 - kernel failed to boot and dropped to xmon
Summary: kernel failed to boot and dropped to xmon
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6.z
Hardware: All
OS: Linux
urgent
high
Target Milestone: rc
: ---
Assignee: Vitaly Mayatskikh
QA Contact: Martin Jenner
URL:
Whiteboard: GSSApproved
Depends On:
Blocks: 450865 456630
TreeView+ depends on / blocked
 
Reported: 2008-04-18 13:00 UTC by Qian Cai
Modified: 2008-07-25 04:21 UTC (History)
5 users (show)

Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:29:07 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
fix irq double free (576 bytes, patch)
2008-04-21 17:30 UTC, Vitaly Mayatskikh
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2008:0665 0 normal SHIPPED_LIVE Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7 2008-07-24 16:41:06 UTC

Description Qian Cai 2008-04-18 13:00:14 UTC
Description of problem:
While booting the latest RHEL4U6 Z-stream kernel on ppc64 machines, it failed
and dropped to xmon,

		Welcome to Red Hat Enterprise Linux AS
		Press 'I' to enter interactive startup.
Setting clock  (utc): Tue Apr 15 21:13:42 EDT 2008 [  OK  ]

Starting udev:  [  OK  ]

Initializing hardware...  storage network audiocpu 0x0: Vector: 300 (Data
Access) at [c0000000020372e0]
    pc: c000000000304cec: ._spin_lock_irqsave+0x30/0xac
    lr: c0000000000130c4: .free_irq+0x4c/0x174
    sp: c000000002037560
   msr: 8000000000001032
   dar: c0000080003b0fa4
 dsisr: 40010000
  current = 0xc000000002031140
  paca    = 0xc000000000401800
    pid   = 1383, comm = modprobe
enter ? for help
0:mon>

I could not capture a vmcore, as it failed before start of netdump daemon.

I have seen the failure 3 times in around 700 reboots occurred only on two
machines, 
squad6-lp1.lab.boston.redhat.com
squad3.lab.boston.redhat.com

failure logs,
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2704394
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2732116
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2733824

system info and normal boot logs,
squad6-lp1.lab.boston.redhat.com
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2696713

squad3.lab.boston.redhat.com
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2727498

Version-Release number of selected component (if applicable):
both 2.6.9-67.0.12.EL and 2.6.9-67.0.13.EL kernel

How reproducible:
3 failures in 700 reboots

Additional info:
From Vitaly Mayatskikh,

I think this is a new bug in IBM's EHEA network adapter. As I've understood that
data access fault, the irq argument for free_irq() was out of bounds and
get_irq_desc() (which free_irq() calls), have no any checks:

#define get_irq_desc(irq) (&irq_desc[(irq)])

There are a lot of irq validity checks in upstream code, e.g.:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c2b5a251b9feca727661f1a3278cafb1de4c80f3
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;f=kernel/irq/manage.c;h=81c49a4d679eb03e54b9be8bb70d5f8f98447a41;hp=3bd7226d15fa10b5c4d6183506cff3368d4b7dfd;hb=c2b5a251b9feca727661f1a3278cafb1de4c80f3;hpb=2d0ebb36038c0626cde662a3b06da9787cfb68c3

but RHEL-4 kernel is missing these checks.

Comment 1 Qian Cai 2008-04-21 03:01:04 UTC
I could not reproduce this bug on both -67.EL and -67.0.7.EL kernel with 700
reboots each, and I can still reproduce this reliably on new kernel, so it looks
like the bug is introduced by new patches.

With Shell command trace on, It is actually failing in /etc/rc.sysinit here,

+ modprobe ohci-hcd
cpu 0x5: Vector: 300 (Data Access) at [c0000000020632e0]
    pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac
    lr: c0000000000130b4: .free_irq+0x4c/0x174
    sp: c000000002063560
   msr: 8000000000001032
   dar: c0000080003acfa4
 dsisr: 40010000
  current = 0xc0000000f0e98540
  paca    = 0xc0000000003fd000
    pid   = 2371, comm = modprobe
enter ? for help
5:mon>

The following are some outputs within xmon,
5:mon> ?
Commands:
  b	show breakpoints
  bd	set data breakpoint
  bi	set instruction breakpoint
  bc	clear breakpoint
  c	print cpus stopped in xmon
  c#	try to switch to cpu number h (in hex)
  C	checksum
  d	dump bytes
  di	dump instructions
  df	dump float values
  dd	dump double values
  e	print exception information
  f	flush cache
  la	lookup symbol+offset of specified address
  ls	lookup address of specified symbol
  m	examine/change memory
  mm	move a block of memory
  ms	set a block of memory
  md	compare two blocks of memory
  ml	locate a block of memory
  mz	zero a block of memory
  mi	show information about memory allocation
  p 	show the task list
  r	print registers
  s	single step
  S	print special registers
  t	print backtrace
  T	Enable/Disable PPCDBG flags
  x	exit monitor and recover
  X	exit monitor and dont recover
  u	dump segment table or SLB
  ?	help
  zr	reboot
  zh	halt
5:mon> c
cpus stopped: 0-7
5:mon> t
[c0000000020635f0] c0000000000130b4 .free_irq+0x4c/0x174
[c0000000020636a0] d00000000015b8f8 .ohci_stop+0x60/0xc0 [ohci_hcd]
[c000000002063730] c00000000024f0ac .usb_hcd_pci_remove+0xd0/0x1c8
[c0000000020637c0] c00000000024ef9c .usb_hcd_pci_probe+0x518/0x558
[c000000002063890] c000000000181c74 .pci_device_probe_static+0x6c/0xa8
[c000000002063920] c000000000181cf4 .__pci_device_probe+0x44/0x7c
[c0000000020639b0] c000000000181d68 .pci_device_probe+0x3c/0x6c
[c000000002063a40] c0000000001e0084 .bus_match+0x94/0xd8
[c000000002063ad0] c0000000001e0230 .driver_attach+0x70/0xe4
[c000000002063b60] c0000000001e0a44 .bus_add_driver+0xf4/0x158
[c000000002063c00] c0000000001e1140 .driver_register+0x38/0x4c
[c000000002063c80] c00000000018219c .pci_register_driver+0x80/0xcc
[c000000002063d10] d00000000015bbe4 .ohci_hcd_pci_init+0x40/0x70 [ohci_hcd]
[c000000002063d90] c00000000008535c .sys_init_module+0x1f0/0x460
[c000000002063e30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff75728
SP (ffffe800) is in userspace
5:mon> r
R00 = 0000000080000005   R16 = 0000000000000000
R01 = c000000002063560   R17 = 0000000000000000
R02 = c0000000004f7c40   R18 = 0000000000000000
R03 = c0000080003acfa4   R19 = 0000000010001180
R04 = c0000000074b6478   R20 = 0000000000000003
R05 = 0000000000000008   R21 = 00000000100292a8
R06 = 0000000000000000   R22 = 0000000000000000
R07 = 00000000c0000000   R23 = c000000002063830
R08 = 00000000f0e98540   R24 = e000000082189000
R09 = e000000082189000   R25 = c0000000074b6478
R10 = c0000000004ff260   R26 = c0000001e3d28e00
R11 = c000000000013068   R27 = 00000000ffffffff
R12 = d00000000015c098   R28 = 0000007fffffff80
R13 = c0000000003fd000   R29 = c0000080003acfa4
R14 = 0000000010022030   R30 = c00000000047c550
R15 = 0000000000000000   R31 = 8000000000009032
pc  = c000000000304224 ._spin_lock_irqsave+0x30/0xac
lr  = c0000000000130b4 .free_irq+0x4c/0x174
msr = 8000000000001032   cr  = 24002424
ctr = c000000000013068   xer = 0000000000000001   trap =      300
5:mon> S
msr  = 8000000000001032  sprg0= 0000000000000000
pvr  = 00000000003b0200  sprg1= c0000000003fd000
dec  = 000000007f6dcfd1  sprg2= c0000000003fd000
sp   = c000000002062cc0  sprg3= c0000000003fd000
toc  = c0000000004f7c40  dar  = 0000000000000000
srr0 = c000000000037a6c  srr1 = 8000000000001032
5:mon> u
SLB contents of cpu 5
00 c000000008000000 0000408f92c94580
01 d000000008000000 0000f09b89af5480
02 c0000001e8000000 000056fe823be580
03 0000000008000000 000070b79835dc00
04 00000000f8000000 00007bef0fef2c00
05 0000000048000000 000073b551bc9c00
06 c0000000e8000000 00004b079c20e580
07 c000008008000000 00003c02a049a580
08 0000000000000000 0000000000000000
09 0000000000000000 0000000000000000
10 0000000000000000 0000000000000000
11 0000000000000000 0000000000000000
12 0000000000000000 0000000000000000
13 0000000000000000 0000000000000000
14 0000000000000000 0000000000000000
15 0000000000000000 0000000000000000
16 0000000000000000 0000000000000000
17 0000000000000000 0000000000000000
18 0000000000000000 0000000000000000
19 0000000000000000 0000000000000000
20 0000000000000000 0000000000000000
21 0000000000000000 0000000000000000
22 0000000000000000 0000000000000000
23 0000000000000000 0000000000000000
24 0000000000000000 0000000000000000
25 0000000000000000 0000000000000000
26 0000000000000000 0000000000000000
27 0000000000000000 0000000000000000
28 0000000000000000 0000000000000000
29 0000000000000000 0000000000000000
30 0000000000000000 0000000000000000
31 0000000000000000 0000000000000000
32 0000000000000000 0000000000000000
33 0000000000000000 0000000000000000
34 0000000000000000 0000000000000000
35 0000000000000000 0000000000000000
36 0000000000000000 0000000000000000
37 0000000000000000 0000000000000000
38 0000000000000000 0000000000000000
39 0000000000000000 0000000000000000
40 0000000000000000 0000000000000000
41 0000000000000000 0000000000000000
42 0000000000000000 0000000000000000
43 0000000000000000 0000000000000000
44 0000000000000000 0000000000000000
45 0000000000000000 0000000000000000
46 0000000000000000 0000000000000000
47 0000000000000000 0000000000000000
48 0000000000000000 0000000000000000
49 0000000000000000 0000000000000000
50 0000000000000000 0000000000000000
51 0000000000000000 0000000000000000
52 0000000000000000 0000000000000000
53 0000000000000000 0000000000000000
54 0000000000000000 0000000000000000
55 0000000000000000 0000000000000000
56 0000000000000000 0000000000000000
57 0000000000000000 0000000000000000
58 c0000000f8000000 00004bc70a829580
59 e000000088000000 0000a6a2f3a2f480
60 0000000000000000 0000000000000000
61 0000000000000000 0000000000000000
62 0000000000000000 0000000000000000
63 0000000000000000 0000000000000000
5:mon> e
cpu 0x5: Vector: 300 (Data Access) at [c0000000020632e0]
    pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac
    lr: c0000000000130b4: .free_irq+0x4c/0x174
    sp: c000000002063560
   msr: 8000000000001032
   dar: c0000080003acfa4
 dsisr: 40010000
  current = 0xc0000000f0e98540
  paca    = 0xc0000000003fd000
    pid   = 2371, comm = modprobe


Comment 2 Qian Cai 2008-04-21 09:42:48 UTC
It looks like the problem can be reproduced by load and unload ohci-hcd module
in a loop in Runlevel 1,

init 1
i=0; while :; do echo $((i++)); rmmod ohci-hcd; modprobe ohci-hcd; done

It takes approximately half a hour to fail into xmon,

ohci_hcd 0000:c8:01.0: remove, state 1
usb usb2: USB disconnect, address 1
ohci_hcd 0000:c8:01.0: USB bus 2 deregistered
ohci_hcd 0000:c8:01.1: remove, state 1
usb usb3: USB disconnect, address 1
ohci_hcd 0000:c8:01.1: USB bus 3 deregistered
ohci_hcd 0000:c8:01.0: OHCI Host Controller
ohci_hcd 0000:c8:01.0: irq 101, pci mem e000000082189000
ohci_hcd 0000:c8:01.0: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:c8:01.1: OHCI Host Controller
ohci_hcd 0000:c8:01.1: irq 101, pci mem e00000008218a000
ohci_hcd 0000:c8:01.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:c8:01.1: init err
ohci_hcd 0000:c8:01.1: can't start
ohci_hcd 0000:c8:01.1: init error -16
ohci_hcd 0000:c8:01.1: remove, state 0
cpu 0x1: Vector: 300 (Data Access) at [c0000001dc9d72e0]
    pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac
    lr: c0000000000130b4: .free_irq+0x4c/0x174
    sp: c0000001dc9d7560
   msr: 8000000000001032
   dar: c0000080003acfa4
 dsisr: 40010000
  current = 0xc0000000e41ab3a0
  paca    = 0xc0000000003fb000
    pid   = 12968, comm = modprobe
enter ? for help
1:mon>

Comment 3 Vitaly Mayatskikh 2008-04-21 17:05:29 UTC
I was able to trigger this bug in ~10 minutes on kernel 2.6.9-67.EL (plain
RHEL4u6, so it's not a regression in upcoming 4.6.z release):

Red Hat Enterprise Linux AS release 4 (Nahant Update 6)
Kernel 2.6.9-67.EL on an ppc64

squad3.lab.boston.redhat.com login: ohci_hcd 0000:c8:01.1: init err
ohci_hcd 0000:c8:01.1: can't start
ohci_hcd 0000:c8:01.1: init error -16
cpu 0x4: Vector: 300 (Data Access) at [c0000001df2af2e0]
    pc: c000000000303a00: ._spin_lock_irqsave+0x30/0xac
    lr: c0000000000130b4: .free_irq+0x4c/0x174
    sp: c0000001df2af560
   msr: 8000000000001032
   dar: c0000080003acfa4
 dsisr: 40010000
  current = 0xc0000001e1ede3c0
  paca    = 0xc0000000003fc800
    pid   = 19226, comm = modprobe
enter ? for help
4:mon> t
[c0000001df2af5f0] c0000000000130b4 .free_irq+0x4c/0x174
[c0000001df2af6a0] d0000000002318f8 .ohci_stop+0x60/0xc0 [ohci_hcd]
[c0000001df2af730] c00000000024eaac .usb_hcd_pci_remove+0xd0/0x1c8
[c0000001df2af7c0] c00000000024e99c .usb_hcd_pci_probe+0x518/0x558
[c0000001df2af890] c000000000181674 .pci_device_probe_static+0x6c/0xa8
[c0000001df2af920] c0000000001816f4 .__pci_device_probe+0x44/0x7c
[c0000001df2af9b0] c000000000181768 .pci_device_probe+0x3c/0x6c
[c0000001df2afa40] c0000000001dfa84 .bus_match+0x94/0xd8
[c0000001df2afad0] c0000000001dfc30 .driver_attach+0x70/0xe4
[c0000001df2afb60] c0000000001e0444 .bus_add_driver+0xf4/0x158
[c0000001df2afc00] c0000000001e0b40 .driver_register+0x38/0x4c
[c0000001df2afc80] c000000000181b9c .pci_register_driver+0x80/0xcc
[c0000001df2afd10] d000000000231be4 .ohci_hcd_pci_init+0x40/0x70 [ohci_hcd]
[c0000001df2afd90] c000000000085350 .sys_init_module+0x1f0/0x460
[c0000001df2afe30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff75728
SP (ffffe580) is in userspace
4:mon> ls ._spin_lock_irqsave
._spin_lock_irqsave: c0000000003039d0
4:mon> dic0000000003039d0 
c0000000003039d0  7c0802a6	mflr	r0
c0000000003039d4  fba1ffe8	std	r29,-24(r1)
c0000000003039d8  fb81ffe0	std	r28,-32(r1)
c0000000003039dc  fbe1fff8	std	r31,-8(r1)
c0000000003039e0  7c7d1b78	mr	r29,r3
c0000000003039e4  f8010010	std	r0,16(r1)
c0000000003039e8  f821ff71	stdu	r1,-144(r1)
c0000000003039ec  7fe000a6	mfmsr	r31
c0000000003039f0  7be08042	rldicl	r0,r31,48,1
c0000000003039f4  78008000	rotldi	r0,r0,16
c0000000003039f8  7c010164	mtmsrd	r0,1
c0000000003039fc  800d0010	lwz	r0,16(r13)
c000000000303a00  7d20e828	lwarx	r9,r0,r29
c000000000303a04  2c090000	cmpwi	r9,0
c000000000303a08  40820010	bne	c000000000303a18	# ._spin_lock_irqsave+0x48/0xac
c000000000303a0c  7c00e92d	stwcx.	r0,r0,r29
4:mon> r
R00 = 0000000080000004   R16 = 0000000000000000
R01 = c0000001df2af560   R17 = 0000000000000000
R02 = c0000000004f7bb0   R18 = 0000000000000000
R03 = c0000080003acfa4   R19 = 0000000010001180
R04 = c0000000006c5c78   R20 = 0000000000000003
R05 = 0000000000000008   R21 = 0000000010029298
R06 = 0000000000000000   R22 = 0000000000000000
R07 = 00000000c0000001   R23 = c0000001df2af830
R08 = 00000000e1ede3c0   R24 = e00000008218a000
R09 = e00000008218a000   R25 = c0000000006c5c78
R10 = c0000000004ff260   R26 = c0000001e3d26780
R11 = c000000000013068   R27 = 00000000ffffffff
R12 = d000000000232098   R28 = 0000007fffffff80
R13 = c0000000003fc800   R29 = c0000080003acfa4
R14 = 0000000010022028   R30 = c00000000047c4c0
R15 = 0000000000000000   R31 = 8000000000009032
pc  = c000000000303a00 ._spin_lock_irqsave+0x30/0xac
lr  = c0000000000130b4 .free_irq+0x4c/0x174
msr = 8000000000001032   cr  = 24002424
ctr = c000000000013068   xer = 0000000000000007   trap =      300
4:mon> dc0000080003acfa4
c0000080003acfa4 **************** ****************  |                |

Comment 4 Vitaly Mayatskikh 2008-04-21 17:30:38 UTC
Created attachment 303178 [details]
fix irq double free

When h/w is busy, ohci_pci_start() fails and calls ohci_stop(). ohci_stop()
frees allocated irq. After that, on toplevel, usb_hcd_pci_probe() calls
usb_hcd_pci_remove(), which itself again calls ohci_stop() though
hcd->driver->stop (hcd) and tries to free irq again. Ohci_stop() doesn't check
if irq was already freed => oops

Comment 5 Linda Wang 2008-04-21 17:53:34 UTC
Raised as an exception in 4.7:

Easy to reproduce the crash, by just loading/unloading the
ohci-hcd module.  Customer has found this issue in 4.6.z
stream as well; would be good to fix it in the next 4.6.z;
therefore, proposed as an exception.

p.s. per zstream maintainer:
this bug can be triggered on every machine with RHEL-4 and OHCI usb host
controller, not only on PPC64. therefore, reset HARDWARE component to
"all"

Comment 6 Vitaly Mayatskikh 2008-04-23 11:02:53 UTC
The same issue with ohci-omap, ohci-lh7a404 and ohci-sa1111 controllers. We need
to fix it if we support machines with these h/w

Comment 12 Vivek Goyal 2008-06-12 18:31:24 UTC
Committed in 73.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 14 Vitaly Mayatskikh 2008-07-08 11:13:16 UTC
On x86 kernels this bug cause a kernel panic, see bz452706.

Comment 15 Vitaly Mayatskikh 2008-07-08 11:13:35 UTC
*** Bug 452706 has been marked as a duplicate of this bug. ***

Comment 16 Vitaly Mayatskikh 2008-07-09 15:55:40 UTC
Please, ignore c#14 and c#15, this is a different issue.

Comment 18 errata-xmlrpc 2008-07-24 19:29:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html


Note You need to log in before you can comment on or make changes to this bug.