=Comment: #0================================================= TIMOTHY R. CHAVEZ <chavezt.com> - 2008-04-16 16:12 EDT

Problem description:
During a test-boot of a diskless LS21 using the 2.6.24.4-32ibmrt2.2debug kernel with a modified LSI MPP/RDAC driver, I got a "BUG: MAX_STACK_TRACE_ENTRIES too low!" followed by a stack trace (attached). With the exception of the RDAC driver, the 2.6.24.4-32ibmrt2.2debug kernel is effectively the same kernel as the standard MRG 2.6.24.4-32debug kernel (no custom patches). However, it should be noted that a standard MRG 2.6.24.5-32debug kernel has not been test-booted yet. The machine does not hang and appears to be operational / responsive.

If this is not an installation problem, describe any custom patches installed:
No custom patches applied to the kernel. However, a custom LSI/MPP RDAC driver was built and installed for this kernel.

Provide output from "uname -a", if possible:
Linux elm3c31 2.6.24.4-32ibmrt2.2debug #1 SMP PREEMPT RT Wed Apr 16 00:47:35 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

Hardware Environment
Machine type (p650, x235, SF2, etc.): LS21
CPU type (Power4, Power5, IA-64, etc.): Dual-Core AMD Opteron(tm) Processor
Describe any special hardware you think might be relevant to this problem: Possibly the dual QLogic 4GB HBA cards attached to the machine(?), but I've not test-booted the debug kernel on any other configuration, so...

Please provide contact information if the submitter is not the primary contact:
tim.chavez.ibm.com 512-838-1317

Is this reproducible? Yes
If so, how long does it (did it) take to reproduce it? Describe the steps:
Boot the system with the 2.6.24.4-32ibmrt2.2debug kernel.

Is the system (not just the application) hung? No
If so, describe how you determined this:

Additional information:
This environment is an LS21 attached to a DS4700 via a couple of QLogic 4GB HBA cards (thus the need for RDAC) and has no local storage.

=Comment: #1================================================= TIMOTHY R. CHAVEZ <chavezt.com> - 2008-04-16 16:22 EDT

The "dmesg" output showing the BUG() and resulting stack trace.

=Comment: #2================================================= TIMOTHY R. CHAVEZ <chavezt.com> - 2008-04-16 17:29 EDT

I booted the vanilla, trace, and rt kernels on this same system / hardware configuration without hitting this bug. I'll attempt to boot the debug kernel on a system with a local storage configuration tomorrow morning and report my findings.

=Comment: #3================================================= TIMOTHY R. CHAVEZ <chavezt.com> - 2008-04-22 10:44 EDT

Just a note: Red Hat is also seeing this in testing. From Clark Williams @ Red Hat: This isn't a CONFIG_ option. It's a value defined in lockdep_internals.h and is currently defined as:

#define MAX_STACK_TRACE_ENTRIES 262144UL

That's pretty big...
Created attachment 306805 [details] The "dmesg" output showing the BUG() and resulting stack trace
------- Comment From jstultz.com 2008-06-18 19:02 EDT------- Has this issue been seen recently?
------- Comment From chavezt.com 2008-06-30 12:00 EDT------- I haven't seen it, but then again, I haven't been booting from SAN recently. Maybe Keith has seen it? I'm adding him to the CC list.
I have seen this problem on a non-SAN machine while trying to recreate bug 46204. The system took a really long time (45 minutes) to come up. The BUG message seen was:

BUG: MAX_STACK_TRACE_ENTRIES too low!
turning off the locking correctness validator.
Pid: 2112, comm: ip Not tainted 2.6.24.7-74ibmrt2.5debug #1

Call Trace:
 [<ffffffff810146b5>] ? save_stack_trace+0x2a/0x49
 [<ffffffff8105d851>] save_trace+0x93/0x9b
 [<ffffffff8105d8d7>] add_lock_to_list+0x7e/0xac
 [<ffffffff81060eb9>] __lock_acquire+0xb43/0xcdc
 [<ffffffff81067443>] ? rt_mutex_slowtrylock+0x18/0x85
 [<ffffffff810610e0>] lock_acquire+0x8e/0xb2
 [<ffffffff81067443>] ? rt_mutex_slowtrylock+0x18/0x85
 [<ffffffff812a7bd2>] __spin_lock_irqsave+0x40/0x73
 [<ffffffff81067443>] rt_mutex_slowtrylock+0x18/0x85
 [<ffffffff812a5694>] rt_mutex_trylock+0x9/0xb
 [<ffffffff812a7105>] rt_spin_lock+0x31/0x56
 [<ffffffff8127e194>] ip_mc_inc_group+0x176/0x232
 [<ffffffff8127e296>] ip_mc_up+0x46/0x64
 [<ffffffff81279947>] inetdev_event+0x263/0x470
 [<ffffffff810882d0>] ? __rcu_read_unlock+0x8c/0x95
 [<ffffffff812aa943>] notifier_call_chain+0x33/0x5b
 [<ffffffff81058001>] __raw_notifier_call_chain+0x9/0xb
 [<ffffffff81058012>] raw_notifier_call_chain+0xf/0x11
 [<ffffffff812305c2>] call_netdevice_notifiers+0x16/0x18
 [<ffffffff81231f2f>] dev_open+0x80/0x88
 [<ffffffff8123072f>] dev_change_flags+0xaf/0x16b
 [<ffffffff81279ee3>] devinet_ioctl+0x267/0x5f2
 [<ffffffff8127a686>] inet_ioctl+0x82/0xa0
 [<ffffffff81223dc1>] sock_ioctl+0x1e7/0x20c
 [<ffffffff810ce955>] do_ioctl+0x2d/0x83
 [<ffffffff810cec20>] vfs_ioctl+0x275/0x292
 [<ffffffff812a6a5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff810cec94>] sys_ioctl+0x57/0x7b
 [<ffffffff8100c248>] ? system_call+0xb8/0xef
 [<ffffffff8100c27f>] system_call_ret+0x0/0x6d
INFO: lockdep is turned off.
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff812a7bb5>] .... __spin_lock_irqsave+0x23/0x73
.....[<ffffffff81067443>] ..   ( <= rt_mutex_slowtrylock+0x18/0x85)
While working on bug #46204 (RH459478), Peter Zijlstra suggested trying a few patches recently committed to Linus' tree to see if they would solve this problem. They did not. He then asked me to try higher values of MAX_STACK_TRACE_ENTRIES. I changed MAX_STACK_TRACE_ENTRIES to 1.25 times its current value (327680) and still saw the problem. When I made it 1.5 times (393216), I did not see the problem. I have reported these results to Peter in an e-mail as well. He needs to decide whether it is okay to increase this value.
Created attachment 314559 [details] Patch to increase MAX_STACK_TRACE_ENTRIES Added a patch to increase MAX_STACK_TRACE_ENTRIES by 1.5x (to 393216), as per Sripathi's tests. This should go into our -78 kernel build.
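The attached patch is not quoted in this thread; based on the values reported above, it presumably amounts to a one-line change along these lines (file path as in 2.6.24-era kernels, exact whitespace assumed):

```diff
--- a/kernel/lockdep_internals.h
+++ b/kernel/lockdep_internals.h
@@
-#define MAX_STACK_TRACE_ENTRIES	262144UL
+#define MAX_STACK_TRACE_ENTRIES	393216UL
```

The trade-off is memory: the stack_trace array is an array of unsigned long, so on x86_64 the increase costs an extra 131072 entries * 8 bytes = 1 MiB of static kernel memory, which is only relevant on lockdep-enabled debug kernels anyway.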
Verified that patch (https://bugzilla.redhat.com/attachment.cgi?id=314559) is implemented as mrg-rt.git commit 842bb285febde3ae296de13c8c50da52e56878f7. Available in mrg-rt-2.6.24.7-81. Bug reproduced using 2.6.24.7-74 and went away with 2.6.24.7-81.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0857.html