Bug 443052
Summary: | kernel failed to boot and dropped to xmon | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Qian Cai <qcai> | ||||
Component: | kernel | Assignee: | Vitaly Mayatskikh <vmayatsk> | ||||
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 4.6.z | CC: | dhoward, dhowells, jplans, vgoyal, vmayatsk | ||||
Target Milestone: | rc | Keywords: | ZStream | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | GSSApproved | ||||||
Fixed In Version: | RHSA-2008-0665 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2008-07-24 19:29:07 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 450865, 456630 | ||||||
Attachments: |
|
Description
Qian Cai
2008-04-18 13:00:14 UTC
I could not reproduce this bug on both -67.EL and -67.0.7.EL kernel with 700 reboots each, and I can still reproduce this reliably on new kernel, so it looks like the bug is introduced by new patches. With Shell command trace on, It is actually failing in /etc/rc.sysinit here, + modprobe ohci-hcd cpu 0x5: Vector: 300 (Data Access) at [c0000000020632e0] pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac lr: c0000000000130b4: .free_irq+0x4c/0x174 sp: c000000002063560 msr: 8000000000001032 dar: c0000080003acfa4 dsisr: 40010000 current = 0xc0000000f0e98540 paca = 0xc0000000003fd000 pid = 2371, comm = modprobe enter ? for help 5:mon> The following are some outputs within xmon, 5:mon> ? Commands: b show breakpoints bd set data breakpoint bi set instruction breakpoint bc clear breakpoint c print cpus stopped in xmon c# try to switch to cpu number h (in hex) C checksum d dump bytes di dump instructions df dump float values dd dump double values e print exception information f flush cache la lookup symbol+offset of specified address ls lookup address of specified symbol m examine/change memory mm move a block of memory ms set a block of memory md compare two blocks of memory ml locate a block of memory mz zero a block of memory mi show information about memory allocation p show the task list r print registers s single step S print special registers t print backtrace T Enable/Disable PPCDBG flags x exit monitor and recover X exit monitor and dont recover u dump segment table or SLB ? help zr reboot zh halt 5:mon> c cpus stopped: 0-7 5:mon> t [c0000000020635f0] c0000000000130b4 .free_irq+0x4c/0x174 [c0000000020636a0] d00000000015b8f8 .ohci_stop+0x60/0xc0 [ohci_hcd] [c000000002063730] c00000000024f0ac .usb_hcd_pci_remove+0xd0/0x1c8 [c0000000020637c0] c00000000024ef9c .usb_hcd_pci_probe+0x518/0x558 [c000000002063890] c000000000181c74 .pci_device_probe_static+0x6c/0xa8 [c000000002063920] c000000000181cf4 .__pci_device_probe+0x44/0x7c [c0000000020639b0] c000000000181d68 .pci_device_probe+0x3c/0x6c [c000000002063a40] c0000000001e0084 .bus_match+0x94/0xd8 [c000000002063ad0] c0000000001e0230 .driver_attach+0x70/0xe4 [c000000002063b60] c0000000001e0a44 .bus_add_driver+0xf4/0x158 [c000000002063c00] c0000000001e1140 .driver_register+0x38/0x4c [c000000002063c80] c00000000018219c .pci_register_driver+0x80/0xcc [c000000002063d10] d00000000015bbe4 .ohci_hcd_pci_init+0x40/0x70 [ohci_hcd] [c000000002063d90] c00000000008535c .sys_init_module+0x1f0/0x460 [c000000002063e30] c000000000011280 syscall_exit+0x0/0x18 --- Exception: c01 (System Call) at 000000000ff75728 SP (ffffe800) is in userspace 5:mon> r R00 = 0000000080000005 R16 = 0000000000000000 R01 = c000000002063560 R17 = 0000000000000000 R02 = c0000000004f7c40 R18 = 0000000000000000 R03 = c0000080003acfa4 R19 = 0000000010001180 R04 = c0000000074b6478 R20 = 0000000000000003 R05 = 0000000000000008 R21 = 00000000100292a8 R06 = 0000000000000000 R22 = 0000000000000000 R07 = 00000000c0000000 R23 = c000000002063830 R08 = 00000000f0e98540 R24 = e000000082189000 R09 = e000000082189000 R25 = c0000000074b6478 R10 = c0000000004ff260 R26 = c0000001e3d28e00 R11 = c000000000013068 R27 = 00000000ffffffff R12 = d00000000015c098 R28 = 0000007fffffff80 R13 = c0000000003fd000 R29 = c0000080003acfa4 R14 = 0000000010022030 R30 = c00000000047c550 R15 = 0000000000000000 R31 = 8000000000009032 pc = c000000000304224 ._spin_lock_irqsave+0x30/0xac lr = c0000000000130b4 .free_irq+0x4c/0x174 msr = 8000000000001032 cr = 24002424 ctr = c000000000013068 xer = 0000000000000001 trap = 300 5:mon> S msr = 8000000000001032 sprg0= 0000000000000000 pvr = 00000000003b0200 sprg1= c0000000003fd000 dec = 000000007f6dcfd1 sprg2= c0000000003fd000 sp = c000000002062cc0 sprg3= c0000000003fd000 toc = c0000000004f7c40 dar = 0000000000000000 srr0 = c000000000037a6c srr1 = 8000000000001032 5:mon> u SLB contents of cpu 5 00 c000000008000000 0000408f92c94580 01 d000000008000000 0000f09b89af5480 02 c0000001e8000000 000056fe823be580 03 0000000008000000 000070b79835dc00 04 00000000f8000000 00007bef0fef2c00 05 0000000048000000 000073b551bc9c00 06 c0000000e8000000 00004b079c20e580 07 c000008008000000 00003c02a049a580 08 0000000000000000 0000000000000000 09 0000000000000000 0000000000000000 10 0000000000000000 0000000000000000 11 0000000000000000 0000000000000000 12 0000000000000000 0000000000000000 13 0000000000000000 0000000000000000 14 0000000000000000 0000000000000000 15 0000000000000000 0000000000000000 16 0000000000000000 0000000000000000 17 0000000000000000 0000000000000000 18 0000000000000000 0000000000000000 19 0000000000000000 0000000000000000 20 0000000000000000 0000000000000000 21 0000000000000000 0000000000000000 22 0000000000000000 0000000000000000 23 0000000000000000 0000000000000000 24 0000000000000000 0000000000000000 25 0000000000000000 0000000000000000 26 0000000000000000 0000000000000000 27 0000000000000000 0000000000000000 28 0000000000000000 0000000000000000 29 0000000000000000 0000000000000000 30 0000000000000000 0000000000000000 31 0000000000000000 0000000000000000 32 0000000000000000 0000000000000000 33 0000000000000000 0000000000000000 34 0000000000000000 0000000000000000 35 0000000000000000 0000000000000000 36 0000000000000000 0000000000000000 37 0000000000000000 0000000000000000 38 0000000000000000 0000000000000000 39 0000000000000000 0000000000000000 40 0000000000000000 0000000000000000 41 0000000000000000 0000000000000000 42 0000000000000000 0000000000000000 43 0000000000000000 0000000000000000 44 0000000000000000 0000000000000000 45 0000000000000000 0000000000000000 46 0000000000000000 0000000000000000 47 0000000000000000 0000000000000000 48 0000000000000000 0000000000000000 49 0000000000000000 0000000000000000 50 0000000000000000 0000000000000000 51 0000000000000000 0000000000000000 52 0000000000000000 0000000000000000 53 0000000000000000 0000000000000000 54 0000000000000000 0000000000000000 55 0000000000000000 0000000000000000 56 0000000000000000 0000000000000000 57 0000000000000000 0000000000000000 58 c0000000f8000000 00004bc70a829580 59 e000000088000000 0000a6a2f3a2f480 60 0000000000000000 0000000000000000 61 0000000000000000 0000000000000000 62 0000000000000000 0000000000000000 63 0000000000000000 0000000000000000 5:mon> e cpu 0x5: Vector: 300 (Data Access) at [c0000000020632e0] pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac lr: c0000000000130b4: .free_irq+0x4c/0x174 sp: c000000002063560 msr: 8000000000001032 dar: c0000080003acfa4 dsisr: 40010000 current = 0xc0000000f0e98540 paca = 0xc0000000003fd000 pid = 2371, comm = modprobe It looks like the problem can be reproduced by load and unload ohci-hcd module in a loop in Runlevel 1, init 1 i=0; while :; do echo $((i++)); rmmod ohci-hcd; modprobe ohci-hcd; done It takes approximately half a hour to fail into xmon, ohci_hcd 0000:c8:01.0: remove, state 1 usb usb2: USB disconnect, address 1 ohci_hcd 0000:c8:01.0: USB bus 2 deregistered ohci_hcd 0000:c8:01.1: remove, state 1 usb usb3: USB disconnect, address 1 ohci_hcd 0000:c8:01.1: USB bus 3 deregistered ohci_hcd 0000:c8:01.0: OHCI Host Controller ohci_hcd 0000:c8:01.0: irq 101, pci mem e000000082189000 ohci_hcd 0000:c8:01.0: new USB bus registered, assigned bus number 2 hub 2-0:1.0: USB hub found hub 2-0:1.0: 3 ports detected ohci_hcd 0000:c8:01.1: OHCI Host Controller ohci_hcd 0000:c8:01.1: irq 101, pci mem e00000008218a000 ohci_hcd 0000:c8:01.1: new USB bus registered, assigned bus number 3 ohci_hcd 0000:c8:01.1: init err ohci_hcd 0000:c8:01.1: can't start ohci_hcd 0000:c8:01.1: init error -16 ohci_hcd 0000:c8:01.1: remove, state 0 cpu 0x1: Vector: 300 (Data Access) at [c0000001dc9d72e0] pc: c000000000304224: ._spin_lock_irqsave+0x30/0xac lr: c0000000000130b4: .free_irq+0x4c/0x174 sp: c0000001dc9d7560 msr: 8000000000001032 dar: c0000080003acfa4 dsisr: 40010000 current = 0xc0000000e41ab3a0 paca = 0xc0000000003fb000 pid = 12968, comm = modprobe enter ? for help 1:mon> I was able to trigger this bug in ~10 minutes on kernel 2.6.9-67.EL (plain RHEL4u6, so it's not a regression in upcoming 4.6.z release): Red Hat Enterprise Linux AS release 4 (Nahant Update 6) Kernel 2.6.9-67.EL on an ppc64 squad3.lab.boston.redhat.com login: ohci_hcd 0000:c8:01.1: init err ohci_hcd 0000:c8:01.1: can't start ohci_hcd 0000:c8:01.1: init error -16 cpu 0x4: Vector: 300 (Data Access) at [c0000001df2af2e0] pc: c000000000303a00: ._spin_lock_irqsave+0x30/0xac lr: c0000000000130b4: .free_irq+0x4c/0x174 sp: c0000001df2af560 msr: 8000000000001032 dar: c0000080003acfa4 dsisr: 40010000 current = 0xc0000001e1ede3c0 paca = 0xc0000000003fc800 pid = 19226, comm = modprobe enter ? for help 4:mon> t [c0000001df2af5f0] c0000000000130b4 .free_irq+0x4c/0x174 [c0000001df2af6a0] d0000000002318f8 .ohci_stop+0x60/0xc0 [ohci_hcd] [c0000001df2af730] c00000000024eaac .usb_hcd_pci_remove+0xd0/0x1c8 [c0000001df2af7c0] c00000000024e99c .usb_hcd_pci_probe+0x518/0x558 [c0000001df2af890] c000000000181674 .pci_device_probe_static+0x6c/0xa8 [c0000001df2af920] c0000000001816f4 .__pci_device_probe+0x44/0x7c [c0000001df2af9b0] c000000000181768 .pci_device_probe+0x3c/0x6c [c0000001df2afa40] c0000000001dfa84 .bus_match+0x94/0xd8 [c0000001df2afad0] c0000000001dfc30 .driver_attach+0x70/0xe4 [c0000001df2afb60] c0000000001e0444 .bus_add_driver+0xf4/0x158 [c0000001df2afc00] c0000000001e0b40 .driver_register+0x38/0x4c [c0000001df2afc80] c000000000181b9c .pci_register_driver+0x80/0xcc [c0000001df2afd10] d000000000231be4 .ohci_hcd_pci_init+0x40/0x70 [ohci_hcd] [c0000001df2afd90] c000000000085350 .sys_init_module+0x1f0/0x460 [c0000001df2afe30] c000000000011280 syscall_exit+0x0/0x18 --- Exception: c01 (System Call) at 000000000ff75728 SP (ffffe580) is in userspace 4:mon> ls ._spin_lock_irqsave ._spin_lock_irqsave: c0000000003039d0 4:mon> dic0000000003039d0 c0000000003039d0 7c0802a6 mflr r0 c0000000003039d4 fba1ffe8 std r29,-24(r1) c0000000003039d8 fb81ffe0 std r28,-32(r1) c0000000003039dc fbe1fff8 std r31,-8(r1) c0000000003039e0 7c7d1b78 mr r29,r3 c0000000003039e4 f8010010 std r0,16(r1) c0000000003039e8 f821ff71 stdu r1,-144(r1) c0000000003039ec 7fe000a6 mfmsr r31 c0000000003039f0 7be08042 rldicl r0,r31,48,1 c0000000003039f4 78008000 rotldi r0,r0,16 c0000000003039f8 7c010164 mtmsrd r0,1 c0000000003039fc 800d0010 lwz r0,16(r13) c000000000303a00 7d20e828 lwarx r9,r0,r29 c000000000303a04 2c090000 cmpwi r9,0 c000000000303a08 40820010 bne c000000000303a18 # ._spin_lock_irqsave+0x48/0xac c000000000303a0c 7c00e92d stwcx. r0,r0,r29 4:mon> r R00 = 0000000080000004 R16 = 0000000000000000 R01 = c0000001df2af560 R17 = 0000000000000000 R02 = c0000000004f7bb0 R18 = 0000000000000000 R03 = c0000080003acfa4 R19 = 0000000010001180 R04 = c0000000006c5c78 R20 = 0000000000000003 R05 = 0000000000000008 R21 = 0000000010029298 R06 = 0000000000000000 R22 = 0000000000000000 R07 = 00000000c0000001 R23 = c0000001df2af830 R08 = 00000000e1ede3c0 R24 = e00000008218a000 R09 = e00000008218a000 R25 = c0000000006c5c78 R10 = c0000000004ff260 R26 = c0000001e3d26780 R11 = c000000000013068 R27 = 00000000ffffffff R12 = d000000000232098 R28 = 0000007fffffff80 R13 = c0000000003fc800 R29 = c0000080003acfa4 R14 = 0000000010022028 R30 = c00000000047c4c0 R15 = 0000000000000000 R31 = 8000000000009032 pc = c000000000303a00 ._spin_lock_irqsave+0x30/0xac lr = c0000000000130b4 .free_irq+0x4c/0x174 msr = 8000000000001032 cr = 24002424 ctr = c000000000013068 xer = 0000000000000007 trap = 300 4:mon> dc0000080003acfa4 c0000080003acfa4 **************** **************** | | Created attachment 303178 [details]
fix irq double free
When h/w is busy, ohci_pci_start() fails and calls ohci_stop(). ohci_stop()
frees allocated irq. After that, on toplevel, usb_hcd_pci_probe() calls
usb_hcd_pci_remove(), which itself again calls ohci_stop() though
hcd->driver->stop (hcd) and tries to free irq again. Ohci_stop() doesn't check
if irq was already freed => oops
Raised as an exception in 4.7: Easy to reproduce the crash, by just loading/unloading the ohci-hcd module. Customer has found this issue in 4.6.z stream as well; would be good to fix it in the next 4.6.z; therefore, proposed as an exception. p.s. per zstream maintainer: this bug can be triggered on every machine with RHEL-4 and OHCI usb host controller, not only on PPC64. therefore, reset HARDWARE component to "all" The same issue with ohci-omap, ohci-lh7a404 and ohci-sa1111 controllers. We need to fix it if we support machines with these h/w Committed in 73.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/ On x86 kernels this bug cause a kernel panic, see bz452706. *** Bug 452706 has been marked as a duplicate of this bug. *** Please, ignore c#14 and c#15, this is a different issue. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html |