
Bug 518160

Summary: [FOCUS] Boot hang with x3950 using MRG's -108 kernel
Product: Red Hat Enterprise MRG
Reporter: IBM Bug Proxy <bugproxy>
Component: realtime-kernel
Assignee: Luis Claudio R. Goncalves <lgoncalv>
Status: CLOSED ERRATA
QA Contact: David Sommerseth <davids>
Severity: high
Priority: low
Version: 1.1
CC: bhu, lgoncalv, ovasik
Target Milestone: 1.1.9
Hardware: x86_64
OS: All
Doc Type: Bug Fix
Last Closed: 2009-11-03 18:22:32 UTC
Attachments:
  Description                   Flags
  full bootlog on x3950         none
  USB patch from Alan Stern     none

Description IBM Bug Proxy 2009-08-19 08:40:49 UTC
=Comment: #0=================================================
John G. Stultz <johnstul.com> - 

When booting 2.6.24.7-108.el5rt on an x3950, I see the following panic:

Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP: 
 [<ffffffff811c460d>] usb_kick_khubd+0xb/0x20
PGD 20102a4067 PUD 20102a5067 PMD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 31 
Modules linked in: uhci_hcd ohci_hcd ehci_hcd
Pid: 1324, comm: insmod Not tainted 2.6.24.7-108.el5rt #1
RIP: 0010:[<ffffffff811c460d>]  [<ffffffff811c460d>] usb_kick_khubd+0xb/0x20
RSP: 0018:ffff81200f483be8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff814000579800 RCX: 0000000000000000
RDX: ffff81200f4804c0 RSI: 0000000000000000 RDI: ffff81400009a000
RBP: ffff81200f483be8 R08: 00000000ffffffff R09: ffff81200f483858
R10: ffff81200f483b28 R11: 0000000000000002 R12: ffff812012e15070
R13: 0000000000000000 R14: ffff814000579800 R15: ffff81400009a000
FS:  0000000000a94850(0063) GS:ffff8140038505c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 000000201029e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 1324, threadinfo ffff81200f482000, task ffff81200f4804c0)
Stack:  ffff81200f483c08 ffffffff811c5b8b ffff81400009a000 ffff812012e15260
 ffff81200f483c58 ffffffff811c6714 00000000000000a0 0000007612e15000
 ffff812012e15070 ffff812012e150e0 ffff814000579800 ffff812012e15000
Call Trace:
 [<ffffffff811c5b8b>] usb_hc_died+0x5e/0x6f
 [<ffffffff811c6714>] usb_add_hcd+0x4e5/0x5bc
 [<ffffffff811d0ad3>] usb_hcd_pci_probe+0x1ec/0x299
 [<ffffffff8114a317>] pci_device_probe+0xda/0x141
 [<ffffffff811b357d>] driver_probe_device+0xfa/0x17e
 [<ffffffff811b3650>] __driver_attach+0x4f/0x79
 [<ffffffff811b3601>] ? __driver_attach+0x0/0x79
 [<ffffffff811b2972>] bus_for_each_dev+0x49/0x7a
 [<ffffffff811b3392>] driver_attach+0x1c/0x1e
 [<ffffffff811b2d3f>] bus_add_driver+0x86/0x1d6
 [<ffffffff811b37c7>] driver_register+0x72/0x76
 [<ffffffff8114a524>] __pci_register_driver+0x71/0xaa
 [<ffffffff8801f081>] :uhci_hcd:uhci_hcd_init+0x81/0xb2
 [<ffffffff81064493>] sys_init_module+0x1675/0x17ad
 [<ffffffff8100c22e>] system_call_ret+0x0/0x5


Code: c0 76 d4 e8 60 ee ff ff 31 c0 c9 c3 55 48 8b bf 90 02 00 00 48 89 e5 e8 06 ec ff ff c9 31 c0
c3 55 48 8b 87 e0 03 00 00 48 89 e5 <48> 8b 80 98 00 00 00 48 8b b8 90 02 00 00 e8 68 e6 ff ff c9 c3
                                        
RIP  [<ffffffff811c460d>] usb_kick_khubd+0xb/0x20
 RSP <ffff81200f483be8>
CR2: 0000000000000098
Kernel panic - not syncing: Fatal exception
Pid: 1324, comm: insmod Tainted: G      D  2.6.24.7-108.el5rt #1

Call Trace:
 [<ffffffff8103dcf8>] panic+0xaf/0x160
 [<ffffffff8119eaa2>] ? do_unblank_screen+0xf/0x11e
 [<ffffffff8119ebbc>] ? unblank_screen+0xb/0xd
 [<ffffffff812895bb>] oops_end+0x54/0x5d
 [<ffffffff8128b0f4>] do_page_fault+0x67e/0x76d
 [<ffffffff81060bbc>] ? rt_down_trylock+0x16/0x3f
 [<ffffffff811b367a>] ? __device_attach+0x0/0xb
 [<ffffffff81284a59>] ? klist_iter_exit+0x1a/0x26
 [<ffffffff81137a04>] ? kobject_get+0x1a/0x21
 [<ffffffff81289249>] error_exit+0x0/0x51
 [<ffffffff811c460d>] ? usb_kick_khubd+0xb/0x20
 [<ffffffff811c5b8b>] ? usb_hc_died+0x5e/0x6f
 [<ffffffff811c6714>] ? usb_add_hcd+0x4e5/0x5bc
 [<ffffffff811d0ad3>] ? usb_hcd_pci_probe+0x1ec/0x299
 [<ffffffff8114a317>] ? pci_device_probe+0xda/0x141
 [<ffffffff811b357d>] ? driver_probe_device+0xfa/0x17e
 [<ffffffff811b3650>] ? __driver_attach+0x4f/0x79
 [<ffffffff811b3601>] ? __driver_attach+0x0/0x79
 [<ffffffff811b2972>] ? bus_for_each_dev+0x49/0x7a
 [<ffffffff811b3392>] ? driver_attach+0x1c/0x1e
 [<ffffffff811b2d3f>] ? bus_add_driver+0x86/0x1d6
 [<ffffffff811b37c7>] ? driver_register+0x72/0x76
 [<ffffffff8114a524>] ? __pci_register_driver+0x71/0xaa
 [<ffffffff8801f081>] ? :uhci_hcd:uhci_hcd_init+0x81/0xb2
 [<ffffffff81064493>] ? sys_init_module+0x1675/0x17ad
 [<ffffffff8100c22e>] ? system_call_ret+0x0/0x5

hub 9-0:1.0: hub_port_status failed (err = -19)
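
The first oops can be read straight from its "Code:" line (wrapped across two lines above). In x86 oops output the byte in angle brackets is the one at RIP, so the faulting instruction is 48 8b 80 98 00 00 00, i.e. mov 0x98(%rax),%rax. With RAX = 0 that load touches address 0x98, which is exactly the CR2 value reported. The marked byte also sits 0xb bytes after the 55 (push %rbp) that appears to open usb_kick_khubd, consistent with RIP = usb_kick_khubd+0xb. Below is a minimal Python sketch of that reading; the helper names are hypothetical, and it recognizes only this one instruction form rather than acting as a general disassembler.

```python
from typing import List, Optional


def faulting_bytes(code_line: str) -> List[int]:
    """Return the bytes of an oops "Code:" line starting at the <..>-marked byte (RIP)."""
    # Drop the "Code:" prefix; the remaining tokens are hex bytes, one wrapped in <>.
    tokens = code_line.split(":", 1)[1].split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [int(tok[1:-1], 16)] + [int(t, 16) for t in tokens[i + 1:]]
    raise ValueError("no <..> marker found in Code: line")


def rax_disp32_load(insn: List[int]) -> Optional[int]:
    """Return disp32 if insn is `mov disp32(%rax),%rax` (48 8b 80 + 4-byte disp), else None."""
    # 48 = REX.W (64-bit operand), 8b = MOV r64, r/m64,
    # 80 = ModRM mod=10 (disp32 follows), reg=000 (%rax), rm=000 (%rax).
    if len(insn) >= 7 and insn[:3] == [0x48, 0x8B, 0x80]:
        return int.from_bytes(bytes(insn[3:7]), "little", signed=True)
    return None


# Code: line from the oops above, rejoined onto one line.
CODE = (
    "Code: c0 76 d4 e8 60 ee ff ff 31 c0 c9 c3 55 48 8b bf 90 02 00 00 48 89 e5 "
    "e8 06 ec ff ff c9 31 c0 c3 55 48 8b 87 e0 03 00 00 48 89 e5 <48> 8b 80 98 "
    "00 00 00 48 8b b8 90 02 00 00 e8 68 e6 ff ff c9 c3"
)
RAX = 0x0   # from the register dump
CR2 = 0x98  # faulting address reported in the oops

disp = rax_disp32_load(faulting_bytes(CODE))
if disp is not None:
    print(f"load from %rax+{disp:#x} = {RAX + disp:#x}; matches CR2: {RAX + disp == CR2}")
```

Run as-is, it prints "load from %rax+0x98 = 0x98; matches CR2: True", which shows that the NULL pointer was the value loaded into RAX by the instruction just before the marked one.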
=Comment: #1=================================================
John G. Stultz <johnstul.com> - 

full bootlog on x3950


=Comment: #5=================================================
John G. Stultz <johnstul.com> - 
Also tried booting w/ the older 2.6.24.7-74.el5rt, and it also panicked in the USB stack (well, it first panicked because it ran out of lowmem, which is weird on 64-bit, but booting w/ mem=2G moved the boot along far enough for it to panic in USB).

Comment 1 IBM Bug Proxy 2009-08-19 08:40:55 UTC
Created attachment 357902
full bootlog on x3950

Comment 2 IBM Bug Proxy 2009-08-19 09:20:41 UTC
------- Comment From sripathik.com 2009-08-19 05:11 EDT-------
Looks very similar to this: http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-07/msg02315.html

Comment 3 IBM Bug Proxy 2009-08-19 22:00:47 UTC
Created attachment 358000
USB patch from Alan Stern


------- Comment on attachment From johnstul.com 2009-08-19 17:50 EDT-------


Dug out this patch from Alan Stern (linked to above by Sripathi), and patched a -108 kernel with it.

It booted further than it had before, but I got the following hang:
Calgary: DMA error on CalIOC2 PHB 0x3
Calgary: 0x02000000@CSR 0x00008000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xf6850000@0x820 0xf6850000@0x830 0x00000000@0x840 0x06000000@0x850 0x00000000@0x860 0x00000000@0x870                                 
Calgary: 0x48000000@0xcb0


Booting with calgary=disable didn't seem to do anything either.

Comment 4 IBM Bug Proxy 2009-08-19 22:10:45 UTC
------- Comment From johnstul.com 2009-08-19 18:05 EDT-------
Booting with iommu=soft got me a little further, now I've hit the following:

audit(1250693650.914:2): enforcing=1 old_enforcing=0 auid=4294967295
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.24.7-108ibmrt2.1.08.prejohn #1

[<ffffffff8103aaa3>] ? __wake_up+0x3a/0x5b
[<ffffffff81191e06>] ? tty_write+0x1f8/0x213
[<ffffffff81056e76>] ? blocking_notifier_call_chain+0xf/0x11
[<ffffffff8104152d>] do_exit+0x8d/0x84e
[<ffffffff81041d7d>] sys_exit_group+0x0/0x14
[<ffffffff81041d8f>] sys_exit_group+0x12/0x14

------- Comment From johnstul.com 2009-08-19 18:09 EDT-------
Booting with audit=0 iommu=soft didn't seem to help get any further.

Comment 5 IBM Bug Proxy 2009-08-19 23:10:44 UTC
------- Comment From johnstul.com 2009-08-19 19:03 EDT-------
So booting with iommu=soft on 2.6.24.7-126.el5rt also gets further (it seems to solve the USB issue), but I still see:

EXT3-fs: mounted filesystem with ordered data mode.
audit(1250697504.726:2): enforcing=1 old_enforcing=0 auid=4294967295
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.24.7-126.el5rt #1

[<ffffffff8103dce0>] panic+0xaf/0x160
[<ffffffff8103aa8c>] ? __wake_up+0x3a/0x5b
[<ffffffff81191b66>] ? tty_write+0x1f8/0x213
[<ffffffff81056d9a>] ? blocking_notifier_call_chain+0xf/0x11
[<ffffffff810414fa>] do_exit+0x8d/0x840
[<ffffffff81041d3c>] sys_exit_group+0x0/0x14
[<ffffffff81041d4e>] sys_exit_group+0x12/0x14
[<ffffffff8100c23e>] system_call_ret+0x0/0x5

Comment 6 IBM Bug Proxy 2009-08-20 21:12:18 UTC
------- Comment From johnstul.com 2009-08-20 17:00 EDT-------
Trying to boot a vanilla 2.6.30 kernel, I got a different but similar oops at the same spot:

type=1404 audit(1250776316.775:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.30 #1
Call Trace:
[<ffffffff810444aa>] panic+0xaa/0x170
[<ffffffff81789e77>] ? _write_lock_irq+0x17/0x30
[<ffffffff81789ee6>] ? _write_unlock_irq+0x16/0x40
[<ffffffff8104ddd7>] ? exit_ptrace+0xa7/0x120
[<ffffffff81789e77>] ? _write_lock_irq+0x17/0x30
[<ffffffff81047dea>] do_exit+0x68a/0x7d0
[<ffffffff81047f6e>] do_group_exit+0x3e/0xb0
[<ffffffff81047ff2>] sys_exit_group+0x12/0x20
[<ffffffff8100bd6b>] system_call_fastpath+0x16/0x1b

Comment 7 IBM Bug Proxy 2009-08-21 01:31:21 UTC
------- Comment From johnstul.com 2009-08-20 21:16 EDT-------
Ugh. So this ends up being a very unhelpful message. After a lot of brute-forcing of boot options, I found that an SELinux-related error was being printed to the console. Rebooting w/ selinux=off made the box boot further, but there were still lots of error messages because the disk was read-only.

It turns out that since this is a multi-node system, there are *two* sets of disks that have the "/" partition label. This confuses SELinux and causes the problem.

So after correcting the partition label issue, I was able to boot 2.6.24.7-126.el5rt by adding iommu=soft.
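
A duplicate "/" label like the one described above can be spotted before rebooting. This is a minimal Python sketch, assuming blkid's default output format (DEVICE: KEY="value" ...); the function name and the device names in the sample are hypothetical. It reports every filesystem LABEL that is carried by more than one device.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches KEY="value" pairs in blkid's default output format.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def duplicate_labels(blkid_output: str) -> Dict[str, List[str]]:
    """Map each filesystem LABEL carried by more than one device to those devices."""
    devices_by_label: Dict[str, List[str]] = defaultdict(list)
    for line in blkid_output.splitlines():
        device, sep, attrs = line.partition(":")
        if not sep:
            continue  # not a "DEVICE: ..." line
        label = dict(PAIR.findall(attrs)).get("LABEL")
        if label is not None:
            devices_by_label[label].append(device.strip())
    return {lbl: devs for lbl, devs in devices_by_label.items() if len(devs) > 1}


# Hypothetical output for a two-node box where each node's root disk is labeled "/".
SAMPLE = """\
/dev/sda1: LABEL="/boot" UUID="aaaa" TYPE="ext3"
/dev/sda2: LABEL="/" UUID="bbbb" TYPE="ext3"
/dev/sdc2: LABEL="/" UUID="cccc" TYPE="ext3"
"""

for label, devices in duplicate_labels(SAMPLE).items():
    print(f'LABEL="{label}" is on {len(devices)} devices: {", ".join(devices)}')
```

For the sample it flags only "/" (on /dev/sda2 and /dev/sdc2); "/boot" appears once and is not reported.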

------- Comment From johnstul.com 2009-08-20 21:23 EDT-------
To avoid needing iommu=soft, CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT would have to be disabled.

Clark: Do you think that config change could be made?

Comment 8 IBM Bug Proxy 2009-10-09 15:51:17 UTC
------- Comment From sripathik.com 2009-10-09 11:45 EDT-------
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set in 2.6.31.2-rt13.21.el5rt.

Comment 9 IBM Bug Proxy 2009-10-14 05:31:24 UTC
------- Comment From sripathik.com 2009-10-14 01:25 EDT-------
In 2.6.31-rc5.3.el5rt CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
In 2.6.31-rt10.18.el5rt CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
In 2.6.31.2-rt13.21.el5rt CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set

So it looks like the problem has been fixed in recent kernels. Thanks, RH!

Comment 10 Beth Uptagrafft 2009-10-26 22:36:07 UTC
Included in kernel -133.

Comment 11 David Sommerseth 2009-10-29 13:42:51 UTC
Verified by config review on kernel-rt-2.6.24.7-137:

[root@hp-dl585g2-01 ~]# grep CONFIG_CALGARY_IOMMU /boot/config-2.6.24.7-137.el5rt
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
[root@hp-dl585g2-01 ~]#
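
The config review above (and the per-kernel comparison in Comment 9) comes down to reading one option's state from a kernel config file. This is a minimal Python sketch, assuming the standard .config format (OPTION=value or "# OPTION is not set"); the helper name is hypothetical. Unlike a substring grep, it reports the state of exactly one option and distinguishes "not set" from "absent".

```python
from typing import Optional


def config_state(config_text: str, option: str) -> Optional[str]:
    """Return an option's value ("y", "m", ...), "not set", or None if it is absent."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == f"# {option} is not set":
            return "not set"
        # Match "OPTION=" exactly so CONFIG_CALGARY_IOMMU does not also match
        # CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT (grep's substring match shows both).
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# The two config lines shown by the grep above.
CONFIG_137 = """\
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
"""

for opt in ("CONFIG_CALGARY_IOMMU", "CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT"):
    print(opt, "->", config_state(CONFIG_137, opt))
```

Against a real system, the text would come from reading a file such as /boot/config-2.6.24.7-137.el5rt.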

Comment 13 errata-xmlrpc 2009-11-03 18:22:32 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1540.html