Bug 443052
| Summary: | kernel failed to boot and dropped to xmon | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Qian Cai <qcai> | ||||
| Component: | kernel | Assignee: | Vitaly Mayatskikh <vmayatsk> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 4.6.z | CC: | dhoward, dhowells, jplans, vgoyal, vmayatsk | ||||
| Target Milestone: | rc | Keywords: | ZStream | ||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | GSSApproved | ||||||
| Fixed In Version: | RHSA-2008-0665 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2008-07-24 19:29:07 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 450865, 456630 | ||||||
| Attachments: |
|
||||||
I could not reproduce this bug on both -67.EL and -67.0.7.EL kernel with 700
reboots each, and I can still reproduce this reliably on new kernel, so it looks
like the bug is introduced by new patches.
With Shell command trace on, It is actually failing in /etc/rc.sysinit here,
+ modprobe ohci-hcd
cpu 0x5: Vector: 300 (Data Access) at [c0000000020632e0]
pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac
lr: c0000000000130b4: .free_irq+0x4c/0x174
sp: c000000002063560
msr: 8000000000001032
dar: c0000080003acfa4
dsisr: 40010000
current = 0xc0000000f0e98540
paca = 0xc0000000003fd000
pid = 2371, comm = modprobe
enter ? for help
5:mon>
The following are some outputs within xmon,
5:mon> ?
Commands:
b show breakpoints
bd set data breakpoint
bi set instruction breakpoint
bc clear breakpoint
c print cpus stopped in xmon
c# try to switch to cpu number h (in hex)
C checksum
d dump bytes
di dump instructions
df dump float values
dd dump double values
e print exception information
f flush cache
la lookup symbol+offset of specified address
ls lookup address of specified symbol
m examine/change memory
mm move a block of memory
ms set a block of memory
md compare two blocks of memory
ml locate a block of memory
mz zero a block of memory
mi show information about memory allocation
p show the task list
r print registers
s single step
S print special registers
t print backtrace
T Enable/Disable PPCDBG flags
x exit monitor and recover
X exit monitor and dont recover
u dump segment table or SLB
? help
zr reboot
zh halt
5:mon> c
cpus stopped: 0-7
5:mon> t
[c0000000020635f0] c0000000000130b4 .free_irq+0x4c/0x174
[c0000000020636a0] d00000000015b8f8 .ohci_stop+0x60/0xc0 [ohci_hcd]
[c000000002063730] c00000000024f0ac .usb_hcd_pci_remove+0xd0/0x1c8
[c0000000020637c0] c00000000024ef9c .usb_hcd_pci_probe+0x518/0x558
[c000000002063890] c000000000181c74 .pci_device_probe_static+0x6c/0xa8
[c000000002063920] c000000000181cf4 .__pci_device_probe+0x44/0x7c
[c0000000020639b0] c000000000181d68 .pci_device_probe+0x3c/0x6c
[c000000002063a40] c0000000001e0084 .bus_match+0x94/0xd8
[c000000002063ad0] c0000000001e0230 .driver_attach+0x70/0xe4
[c000000002063b60] c0000000001e0a44 .bus_add_driver+0xf4/0x158
[c000000002063c00] c0000000001e1140 .driver_register+0x38/0x4c
[c000000002063c80] c00000000018219c .pci_register_driver+0x80/0xcc
[c000000002063d10] d00000000015bbe4 .ohci_hcd_pci_init+0x40/0x70 [ohci_hcd]
[c000000002063d90] c00000000008535c .sys_init_module+0x1f0/0x460
[c000000002063e30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff75728
SP (ffffe800) is in userspace
5:mon> r
R00 = 0000000080000005 R16 = 0000000000000000
R01 = c000000002063560 R17 = 0000000000000000
R02 = c0000000004f7c40 R18 = 0000000000000000
R03 = c0000080003acfa4 R19 = 0000000010001180
R04 = c0000000074b6478 R20 = 0000000000000003
R05 = 0000000000000008 R21 = 00000000100292a8
R06 = 0000000000000000 R22 = 0000000000000000
R07 = 00000000c0000000 R23 = c000000002063830
R08 = 00000000f0e98540 R24 = e000000082189000
R09 = e000000082189000 R25 = c0000000074b6478
R10 = c0000000004ff260 R26 = c0000001e3d28e00
R11 = c000000000013068 R27 = 00000000ffffffff
R12 = d00000000015c098 R28 = 0000007fffffff80
R13 = c0000000003fd000 R29 = c0000080003acfa4
R14 = 0000000010022030 R30 = c00000000047c550
R15 = 0000000000000000 R31 = 8000000000009032
pc = c000000000304224 ._spin_lock_irqsave+0x30/0xac
lr = c0000000000130b4 .free_irq+0x4c/0x174
msr = 8000000000001032 cr = 24002424
ctr = c000000000013068 xer = 0000000000000001 trap = 300
5:mon> S
msr = 8000000000001032 sprg0= 0000000000000000
pvr = 00000000003b0200 sprg1= c0000000003fd000
dec = 000000007f6dcfd1 sprg2= c0000000003fd000
sp = c000000002062cc0 sprg3= c0000000003fd000
toc = c0000000004f7c40 dar = 0000000000000000
srr0 = c000000000037a6c srr1 = 8000000000001032
5:mon> u
SLB contents of cpu 5
00 c000000008000000 0000408f92c94580
01 d000000008000000 0000f09b89af5480
02 c0000001e8000000 000056fe823be580
03 0000000008000000 000070b79835dc00
04 00000000f8000000 00007bef0fef2c00
05 0000000048000000 000073b551bc9c00
06 c0000000e8000000 00004b079c20e580
07 c000008008000000 00003c02a049a580
08 0000000000000000 0000000000000000
09 0000000000000000 0000000000000000
10 0000000000000000 0000000000000000
11 0000000000000000 0000000000000000
12 0000000000000000 0000000000000000
13 0000000000000000 0000000000000000
14 0000000000000000 0000000000000000
15 0000000000000000 0000000000000000
16 0000000000000000 0000000000000000
17 0000000000000000 0000000000000000
18 0000000000000000 0000000000000000
19 0000000000000000 0000000000000000
20 0000000000000000 0000000000000000
21 0000000000000000 0000000000000000
22 0000000000000000 0000000000000000
23 0000000000000000 0000000000000000
24 0000000000000000 0000000000000000
25 0000000000000000 0000000000000000
26 0000000000000000 0000000000000000
27 0000000000000000 0000000000000000
28 0000000000000000 0000000000000000
29 0000000000000000 0000000000000000
30 0000000000000000 0000000000000000
31 0000000000000000 0000000000000000
32 0000000000000000 0000000000000000
33 0000000000000000 0000000000000000
34 0000000000000000 0000000000000000
35 0000000000000000 0000000000000000
36 0000000000000000 0000000000000000
37 0000000000000000 0000000000000000
38 0000000000000000 0000000000000000
39 0000000000000000 0000000000000000
40 0000000000000000 0000000000000000
41 0000000000000000 0000000000000000
42 0000000000000000 0000000000000000
43 0000000000000000 0000000000000000
44 0000000000000000 0000000000000000
45 0000000000000000 0000000000000000
46 0000000000000000 0000000000000000
47 0000000000000000 0000000000000000
48 0000000000000000 0000000000000000
49 0000000000000000 0000000000000000
50 0000000000000000 0000000000000000
51 0000000000000000 0000000000000000
52 0000000000000000 0000000000000000
53 0000000000000000 0000000000000000
54 0000000000000000 0000000000000000
55 0000000000000000 0000000000000000
56 0000000000000000 0000000000000000
57 0000000000000000 0000000000000000
58 c0000000f8000000 00004bc70a829580
59 e000000088000000 0000a6a2f3a2f480
60 0000000000000000 0000000000000000
61 0000000000000000 0000000000000000
62 0000000000000000 0000000000000000
63 0000000000000000 0000000000000000
5:mon> e
cpu 0x5: Vector: 300 (Data Access) at [c0000000020632e0]
pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac
lr: c0000000000130b4: .free_irq+0x4c/0x174
sp: c000000002063560
msr: 8000000000001032
dar: c0000080003acfa4
dsisr: 40010000
current = 0xc0000000f0e98540
paca = 0xc0000000003fd000
pid = 2371, comm = modprobe
It looks like the problem can be reproduced by load and unload ohci-hcd module
in a loop in Runlevel 1,
init 1
i=0; while :; do echo $((i++)); rmmod ohci-hcd; modprobe ohci-hcd; done
It takes approximately half a hour to fail into xmon,
ohci_hcd 0000:c8:01.0: remove, state 1
usb usb2: USB disconnect, address 1
ohci_hcd 0000:c8:01.0: USB bus 2 deregistered
ohci_hcd 0000:c8:01.1: remove, state 1
usb usb3: USB disconnect, address 1
ohci_hcd 0000:c8:01.1: USB bus 3 deregistered
ohci_hcd 0000:c8:01.0: OHCI Host Controller
ohci_hcd 0000:c8:01.0: irq 101, pci mem e000000082189000
ohci_hcd 0000:c8:01.0: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:c8:01.1: OHCI Host Controller
ohci_hcd 0000:c8:01.1: irq 101, pci mem e00000008218a000
ohci_hcd 0000:c8:01.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:c8:01.1: init err
ohci_hcd 0000:c8:01.1: can't start
ohci_hcd 0000:c8:01.1: init error -16
ohci_hcd 0000:c8:01.1: remove, state 0
cpu 0x1: Vector: 300 (Data Access) at [c0000001dc9d72e0]
pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac
lr: c0000000000130b4: .free_irq+0x4c/0x174
sp: c0000001dc9d7560
msr: 8000000000001032
dar: c0000080003acfa4
dsisr: 40010000
current = 0xc0000000e41ab3a0
paca = 0xc0000000003fb000
pid = 12968, comm = modprobe
enter ? for help
1:mon>
I was able to trigger this bug in ~10 minutes on kernel 2.6.9-67.EL (plain
RHEL4u6, so it's not a regression in upcoming 4.6.z release):
Red Hat Enterprise Linux AS release 4 (Nahant Update 6)
Kernel 2.6.9-67.EL on an ppc64
squad3.lab.boston.redhat.com login: ohci_hcd 0000:c8:01.1: init err
ohci_hcd 0000:c8:01.1: can't start
ohci_hcd 0000:c8:01.1: init error -16
cpu 0x4: Vector: 300 (Data Access) at [c0000001df2af2e0]
pc: c000000000303a00: ._spin_lock_irqsave+0x30/0xac
lr: c0000000000130b4: .free_irq+0x4c/0x174
sp: c0000001df2af560
msr: 8000000000001032
dar: c0000080003acfa4
dsisr: 40010000
current = 0xc0000001e1ede3c0
paca = 0xc0000000003fc800
pid = 19226, comm = modprobe
enter ? for help
4:mon> t
[c0000001df2af5f0] c0000000000130b4 .free_irq+0x4c/0x174
[c0000001df2af6a0] d0000000002318f8 .ohci_stop+0x60/0xc0 [ohci_hcd]
[c0000001df2af730] c00000000024eaac .usb_hcd_pci_remove+0xd0/0x1c8
[c0000001df2af7c0] c00000000024e99c .usb_hcd_pci_probe+0x518/0x558
[c0000001df2af890] c000000000181674 .pci_device_probe_static+0x6c/0xa8
[c0000001df2af920] c0000000001816f4 .__pci_device_probe+0x44/0x7c
[c0000001df2af9b0] c000000000181768 .pci_device_probe+0x3c/0x6c
[c0000001df2afa40] c0000000001dfa84 .bus_match+0x94/0xd8
[c0000001df2afad0] c0000000001dfc30 .driver_attach+0x70/0xe4
[c0000001df2afb60] c0000000001e0444 .bus_add_driver+0xf4/0x158
[c0000001df2afc00] c0000000001e0b40 .driver_register+0x38/0x4c
[c0000001df2afc80] c000000000181b9c .pci_register_driver+0x80/0xcc
[c0000001df2afd10] d000000000231be4 .ohci_hcd_pci_init+0x40/0x70 [ohci_hcd]
[c0000001df2afd90] c000000000085350 .sys_init_module+0x1f0/0x460
[c0000001df2afe30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff75728
SP (ffffe580) is in userspace
4:mon> ls ._spin_lock_irqsave
._spin_lock_irqsave: c0000000003039d0
4:mon> dic0000000003039d0
c0000000003039d0 7c0802a6 mflr r0
c0000000003039d4 fba1ffe8 std r29,-24(r1)
c0000000003039d8 fb81ffe0 std r28,-32(r1)
c0000000003039dc fbe1fff8 std r31,-8(r1)
c0000000003039e0 7c7d1b78 mr r29,r3
c0000000003039e4 f8010010 std r0,16(r1)
c0000000003039e8 f821ff71 stdu r1,-144(r1)
c0000000003039ec 7fe000a6 mfmsr r31
c0000000003039f0 7be08042 rldicl r0,r31,48,1
c0000000003039f4 78008000 rotldi r0,r0,16
c0000000003039f8 7c010164 mtmsrd r0,1
c0000000003039fc 800d0010 lwz r0,16(r13)
c000000000303a00 7d20e828 lwarx r9,r0,r29
c000000000303a04 2c090000 cmpwi r9,0
c000000000303a08 40820010 bne c000000000303a18 # ._spin_lock_irqsave+0x48/0xac
c000000000303a0c 7c00e92d stwcx. r0,r0,r29
4:mon> r
R00 = 0000000080000004 R16 = 0000000000000000
R01 = c0000001df2af560 R17 = 0000000000000000
R02 = c0000000004f7bb0 R18 = 0000000000000000
R03 = c0000080003acfa4 R19 = 0000000010001180
R04 = c0000000006c5c78 R20 = 0000000000000003
R05 = 0000000000000008 R21 = 0000000010029298
R06 = 0000000000000000 R22 = 0000000000000000
R07 = 00000000c0000001 R23 = c0000001df2af830
R08 = 00000000e1ede3c0 R24 = e00000008218a000
R09 = e00000008218a000 R25 = c0000000006c5c78
R10 = c0000000004ff260 R26 = c0000001e3d26780
R11 = c000000000013068 R27 = 00000000ffffffff
R12 = d000000000232098 R28 = 0000007fffffff80
R13 = c0000000003fc800 R29 = c0000080003acfa4
R14 = 0000000010022028 R30 = c00000000047c4c0
R15 = 0000000000000000 R31 = 8000000000009032
pc = c000000000303a00 ._spin_lock_irqsave+0x30/0xac
lr = c0000000000130b4 .free_irq+0x4c/0x174
msr = 8000000000001032 cr = 24002424
ctr = c000000000013068 xer = 0000000000000007 trap = 300
4:mon> dc0000080003acfa4
c0000080003acfa4 **************** **************** | |
Created attachment 303178 [details]
fix irq double free
When h/w is busy, ohci_pci_start() fails and calls ohci_stop(). ohci_stop()
frees allocated irq. After that, on toplevel, usb_hcd_pci_probe() calls
usb_hcd_pci_remove(), which itself again calls ohci_stop() though
hcd->driver->stop (hcd) and tries to free irq again. Ohci_stop() doesn't check
if irq was already freed => oops
Raised as an exception in 4.7: Easy to reproduce the crash, by just loading/unloading the ohci-hcd module. Customer has found this issue in 4.6.z stream as well; would be good to fix it in the next 4.6.z; therefore, proposed as an exception. p.s. per zstream maintainer: this bug can be triggered on every machine with RHEL-4 and OHCI usb host controller, not only on PPC64. therefore, reset HARDWARE component to "all" The same issue with ohci-omap, ohci-lh7a404 and ohci-sa1111 controllers. We need to fix it if we support machines with these h/w Committed in 73.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/ On x86 kernels this bug cause a kernel panic, see bz452706. *** Bug 452706 has been marked as a duplicate of this bug. *** Please, ignore c#14 and c#15, this is a different issue. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html |
Description of problem: While booting the latest RHEL4U6 Z-stream kernel on ppc64 machines, it failed and dropped to xmon, Welcome to Red Hat Enterprise Linux AS Press 'I' to enter interactive startup. Setting clock (utc): Tue Apr 15 21:13:42 EDT 2008 [ OK ] Starting udev: [ OK ] Initializing hardware... storage network audiocpu 0x0: Vector: 300 (Data Access) at [c0000000020372e0] pc: c000000000304cec: ._spin_lock_irqsave+0x30/0xac lr: c0000000000130c4: .free_irq+0x4c/0x174 sp: c000000002037560 msr: 8000000000001032 dar: c0000080003b0fa4 dsisr: 40010000 current = 0xc000000002031140 paca = 0xc000000000401800 pid = 1383, comm = modprobe enter ? for help 0:mon> I could not capture a vmcore, as it failed before start of netdump daemon. I have seen the failure 3 times in around 700 reboots occurred only on two machines, squad6-lp1.lab.boston.redhat.com squad3.lab.boston.redhat.com failure logs, http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2704394 http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2732116 http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2733824 system info and normal boot logs, squad6-lp1.lab.boston.redhat.com http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2696713 squad3.lab.boston.redhat.com http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2727498 Version-Release number of selected component (if applicable): both 2.6.9-67.0.12.EL and 2.6.9-67.0.13.EL kernel How reproducible: 3 failures in 700 reboots Additional info: From Vitaly Mayatskikh, I think this is a new bug in IBM's EHEA network adapter. As I've understood that data access fault, the irq argument for free_irq() was out of bounds and get_irq_desc() (which free_irq() calls), have no any checks: #define get_irq_desc(irq) (&irq_desc[(irq)]) There are a lot of irq validity checks in upstream code, e.g.: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c2b5a251b9feca727661f1a3278cafb1de4c80f3 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;f=kernel/irq/manage.c;h=81c49a4d679eb03e54b9be8bb70d5f8f98447a41;hp=3bd7226d15fa10b5c4d6183506cff3368d4b7dfd;hb=c2b5a251b9feca727661f1a3278cafb1de4c80f3;hpb=2d0ebb36038c0626cde662a3b06da9787cfb68c3 but RHEL-4 kernel is missing these checks.