Bug 518160
| Summary: | [FOCUS] Boot hang with x3950 using MRG's -108 kernel | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | IBM Bug Proxy <bugproxy> |
| Component: | realtime-kernel | Assignee: | Luis Claudio R. Goncalves <lgoncalv> |
| Status: | CLOSED ERRATA | QA Contact: | David Sommerseth <davids> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 1.1 | CC: | bhu, lgoncalv, ovasik |
| Target Milestone: | 1.1.9 | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-11-03 18:22:32 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Created attachment 357902 [details]
full bootlog on x3950
------- Comment From sripathik.com 2009-08-19 05:11 EDT-------
Looks very similar to this: http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-07/msg02315.html
Created attachment 358000 [details]
USB patch from Alan Stern
------- Comment on attachment From johnstul.com 2009-08-19 17:50 EDT-------
Dug out this patch from Alan Stern (linked to above by Sripathi), and patched a -108 kernel with it.
It booted further than it had before, but I got the following hang:
Calgary: DMA error on CalIOC2 PHB 0x3
Calgary: 0x02000000@CSR 0x00008000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xf6850000@0x820 0xf6850000@0x830 0x00000000@0x840 0x06000000@0x850 0x00000000@0x860 0x00000000@0x870
Calgary: 0x48000000@0xcb0
Booting with calgary=disable didn't seem to do anything either.
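When a boot option such as calgary=disable or iommu=soft appears to have no effect, one thing worth ruling out is that the option never reached the kernel. The following is a minimal sketch, not part of the original report, that splits a kernel command line into parameters so the presence of an option can be checked. The sample command line is hypothetical; on a running system the text would normally come from /proc/cmdline. The simple whitespace split is an assumption and does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value}.

    Bare flags (no '=') map to None. Only the first '=' separates the
    name from the value, so 'root=LABEL=/' yields 'LABEL=/'.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Hypothetical command line for illustration only.
sample_cmdline = "ro root=LABEL=/ iommu=soft calgary=disable"
params = parse_cmdline(sample_cmdline)

for option in ("iommu", "calgary", "selinux"):
    if option in params:
        print("%s is present with value %r" % (option, params[option]))
    else:
        print("%s is not on the command line" % option)
```

To check a live system, the same function could be fed the contents of /proc/cmdline instead of the sample string.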
------- Comment From johnstul.com 2009-08-19 18:05 EDT-------
Booting with iommu=soft got me a little further, now I've hit the following:
audit(1250693650.914:2): enforcing=1 old_enforcing=0 auid=4294967295
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.24.7-108ibmrt2.1.08.prejohn #1
[<ffffffff8103aaa3>] ? __wake_up+0x3a/0x5b
[<ffffffff81191e06>] ? tty_write+0x1f8/0x213
[<ffffffff81056e76>] ? blocking_notifier_call_chain+0xf/0x11
[<ffffffff8104152d>] do_exit+0x8d/0x84e
[<ffffffff81041d7d>] sys_exit_group+0x0/0x14
[<ffffffff81041d8f>] sys_exit_group+0x12/0x14
------- Comment From johnstul.com 2009-08-19 18:09 EDT-------
Booting with audit=0 iommu=soft didn't seem to help get any further.
------- Comment From johnstul.com 2009-08-19 19:03 EDT-------
So booting with iommu=soft on 2.6.24.7-126.el5rt also boots further (seems to solve the usb issue), but I still see:
EXT3-fs: mounted filesystem with ordered data mode.
audit(1250697504.726:2): enforcing=1 old_enforcing=0 auid=4294967295
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.24.7-126.el5rt #1
[<ffffffff8103dce0>] panic+0xaf/0x160
[<ffffffff8103aa8c>] ? __wake_up+0x3a/0x5b
[<ffffffff81191b66>] ? tty_write+0x1f8/0x213
[<ffffffff81056d9a>] ? blocking_notifier_call_chain+0xf/0x11
[<ffffffff810414fa>] do_exit+0x8d/0x840
[<ffffffff81041d3c>] sys_exit_group+0x0/0x14
[<ffffffff81041d4e>] sys_exit_group+0x12/0x14
[<ffffffff8100c23e>] system_call_ret+0x0/0x5
------- Comment From johnstul.com 2009-08-20 17:00 EDT-------
Trying to boot a vanilla 2.6.30 kernel I got a different but similar oops at the same spot:
type=1404 audit(1250776316.775:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.30 #1
Call Trace:
[<ffffffff810444aa>] panic+0xaa/0x170
[<ffffffff81789e77>] ? _write_lock_irq+0x17/0x30
[<ffffffff81789ee6>] ? _write_unlock_irq+0x16/0x40
[<ffffffff8104ddd7>] ? exit_ptrace+0xa7/0x120
[<ffffffff81789e77>] ? _write_lock_irq+0x17/0x30
[<ffffffff81047dea>] do_exit+0x68a/0x7d0
[<ffffffff81047f6e>] do_group_exit+0x3e/0xb0
[<ffffffff81047ff2>] sys_exit_group+0x12/0x20
[<ffffffff8100bd6b>] system_call_fastpath+0x16/0x1b
------- Comment From johnstul.com 2009-08-20 21:16 EDT-------
Ugh.. So this ends up being a very unhelpful message. After lots of brute forcing options, I found that there was a selinux related error being printed to the console. Rebooting w/ selinux=off made the box boot further, but still had lots of error messages due to the disk being read-only. It turns out that since this is a multi-node system, there are *two* sets of disks that have the "/" partition label. This confuses selinux and causes the problem. So after correcting the partition label issue, I was able to boot 2.6.24.7-126.el5rt by adding iommu=soft.
------- Comment From johnstul.com 2009-08-20 21:23 EDT-------
To avoid the need for iommu=soft, CALGARY_IOMMU_ENABLED_BY_DEFAULT would need to be disabled. Clark: Do you think that config change could be made?
------- Comment From sripathik.com 2009-10-09 11:45 EDT-------
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set in 2.6.31.2-rt13.21.el5rt.
------- Comment From sripathik.com 2009-10-14 01:25 EDT-------
In 2.6.31-rc5.3.el5rt CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
In 2.6.31-rt10.18.el5rt CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
In 2.6.31.2-rt13.21.el5rt CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
So it looks like the problem has been fixed in recent kernels. Thanks RH!

in kernel -133

Verified by config review on kernel-rt-2.6.24.7-137:
[root@hp-dl585g2-01 ~]# grep CONFIG_CALGARY_IOMMU /boot/config-2.6.24.7-137.el5rt
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
[root@hp-dl585g2-01 ~]#

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-1540.html
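The config review in this report was done by grepping the installed kernel config for CONFIG_CALGARY_IOMMU. As a hedged illustration, the same check can be scripted so that it distinguishes "set", "explicitly not set", and "absent". The function name `config_option_state` is made up for this sketch; the sample text reproduces the two lines shown in the verification output, and the `/boot/config-<version>` path mentioned afterwards is the one used in that verification.

```python
import re


def config_option_state(config_text, option):
    """Report how a kernel config option appears in config_text.

    Returns the value after '=' (for example 'y'), the string 'not set'
    for a '# OPTION is not set' line, or None if the option is absent.
    """
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_re = re.compile(r"^# %s is not set$" % re.escape(option))
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if unset_re.match(line):
            return "not set"
    return None


# The two lines from the verification output in this report.
sample_config = """CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
"""

for name in ("CONFIG_CALGARY_IOMMU", "CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT"):
    print(name, "->", config_option_state(sample_config, name))
```

On a real system the text would be read from a file such as /boot/config-2.6.24.7-137.el5rt. Anchoring the pattern on the full option name avoids the prefix match that a plain grep for CONFIG_CALGARY_IOMMU produces.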
=Comment: #0=================================================
John G. Stultz <johnstul.com> -
When booting 2.6.24.7-108.el5rt on an x3950, I see the following panic:
Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff811c460d>] usb_kick_khubd+0xb/0x20
PGD 20102a4067 PUD 20102a5067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 31
Modules linked in: uhci_hcd ohci_hcd ehci_hcd
Pid: 1324, comm: insmod Not tainted 2.6.24.7-108.el5rt #1
RIP: 0010:[<ffffffff811c460d>] [<ffffffff811c460d>] usb_kick_khubd+0xb/0x20
RSP: 0018:ffff81200f483be8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff814000579800 RCX: 0000000000000000
RDX: ffff81200f4804c0 RSI: 0000000000000000 RDI: ffff81400009a000
RBP: ffff81200f483be8 R08: 00000000ffffffff R09: ffff81200f483858
R10: ffff81200f483b28 R11: 0000000000000002 R12: ffff812012e15070
R13: 0000000000000000 R14: ffff814000579800 R15: ffff81400009a000
FS: 0000000000a94850(0063) GS:ffff8140038505c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 000000201029e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 1324, threadinfo ffff81200f482000, task ffff81200f4804c0)
Stack: ffff81200f483c08 ffffffff811c5b8b ffff81400009a000 ffff812012e15260
ffff81200f483c58 ffffffff811c6714 00000000000000a0 0000007612e15000
ffff812012e15070 ffff812012e150e0 ffff814000579800 ffff812012e15000
Call Trace:
[<ffffffff811c5b8b>] usb_hc_died+0x5e/0x6f
[<ffffffff811c6714>] usb_add_hcd+0x4e5/0x5bc
[<ffffffff811d0ad3>] usb_hcd_pci_probe+0x1ec/0x299
[<ffffffff8114a317>] pci_device_probe+0xda/0x141
[<ffffffff811b357d>] driver_probe_device+0xfa/0x17e
[<ffffffff811b3650>] __driver_attach+0x4f/0x79
[<ffffffff811b3601>] ? __driver_attach+0x0/0x79
[<ffffffff811b2972>] bus_for_each_dev+0x49/0x7a
[<ffffffff811b3392>] driver_attach+0x1c/0x1e
[<ffffffff811b2d3f>] bus_add_driver+0x86/0x1d6
[<ffffffff811b37c7>] driver_register+0x72/0x76
[<ffffffff8114a524>] __pci_register_driver+0x71/0xaa
[<ffffffff8801f081>] :uhci_hcd:uhci_hcd_init+0x81/0xb2
[<ffffffff81064493>] sys_init_module+0x1675/0x17ad
[<ffffffff8100c22e>] system_call_ret+0x0/0x5
Code: c0 76 d4 e8 60 ee ff ff 31 c0 c9 c3 55 48 8b bf 90 02 00 00 48 89 e5 e8 06 ec ff ff c9 31 c0 c3 55 48 8b 87 e0 03 00 00 48 89 e5 <48> 8b 80 98 00 00 00 48 8b b8 90 02 00 00 e8 68 e6 ff ff c9 c3
RIP [<ffffffff811c460d>] usb_kick_khubd+0xb/0x20
RSP <ffff81200f483be8>
CR2: 0000000000000098
Kernel panic - not syncing: Fatal exception
Pid: 1324, comm: insmod Tainted: G D 2.6.24.7-108.el5rt #1
Call Trace:
[<ffffffff8103dcf8>] panic+0xaf/0x160
[<ffffffff8119eaa2>] ? do_unblank_screen+0xf/0x11e
[<ffffffff8119ebbc>] ? unblank_screen+0xb/0xd
[<ffffffff812895bb>] oops_end+0x54/0x5d
[<ffffffff8128b0f4>] do_page_fault+0x67e/0x76d
[<ffffffff81060bbc>] ? rt_down_trylock+0x16/0x3f
[<ffffffff811b367a>] ? __device_attach+0x0/0xb
[<ffffffff81284a59>] ? klist_iter_exit+0x1a/0x26
[<ffffffff81137a04>] ? kobject_get+0x1a/0x21
[<ffffffff81289249>] error_exit+0x0/0x51
[<ffffffff811c460d>] ? usb_kick_khubd+0xb/0x20
[<ffffffff811c5b8b>] ? usb_hc_died+0x5e/0x6f
[<ffffffff811c6714>] ? usb_add_hcd+0x4e5/0x5bc
[<ffffffff811d0ad3>] ? usb_hcd_pci_probe+0x1ec/0x299
[<ffffffff8114a317>] ? pci_device_probe+0xda/0x141
[<ffffffff811b357d>] ? driver_probe_device+0xfa/0x17e
[<ffffffff811b3650>] ? __driver_attach+0x4f/0x79
[<ffffffff811b3601>] ? __driver_attach+0x0/0x79
[<ffffffff811b2972>] ? bus_for_each_dev+0x49/0x7a
[<ffffffff811b3392>] ? driver_attach+0x1c/0x1e
[<ffffffff811b2d3f>] ? bus_add_driver+0x86/0x1d6
[<ffffffff811b37c7>] ? driver_register+0x72/0x76
[<ffffffff8114a524>] ? __pci_register_driver+0x71/0xaa
[<ffffffff8801f081>] ? :uhci_hcd:uhci_hcd_init+0x81/0xb2
[<ffffffff81064493>] ? sys_init_module+0x1675/0x17ad
[<ffffffff8100c22e>] ? system_call_ret+0x0/0x5
hub 9-0:1.0: hub_port_status failed (err = -19)
=Comment: #1=================================================
John G. Stultz <johnstul.com> -
full bootlog on x3950
=Comment: #5=================================================
John G. Stultz <johnstul.com> -
Tried also booting w/ the older 2.6.24.7-74.el5rt and it also panicked in the USB stack (well, it first panicked because it ran out of lowmem - on 64bit? weird - but booting w/ mem=2G moved the boot along so it could panic at usb).