Bug 163176
Summary: | Endless loop printing traceback during kernel OOPs | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Issue Tracker <tao> |
Component: | kernel | Assignee: | Kiersten (Kerri) Anderson <kanderso> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | anderson, bstevens, dff, dhowells, gavin, havill, kanderso, kreilly, lwang, lwoodman, pcormier, peterm, petrides, staubach, tao, tburke |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHSA-2006-0144 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2006-03-15 16:13:11 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 168424 |
Description
Issue Tracker
2005-07-13 19:08:39 UTC
Please put some useful information in this bug report. All that's indicated so far is that OpenAFS crashes the system during shutdown (which is not a RHEL problem). If there's no evidence of a RHEL kernel bug, then please close this as NOTABUG.

KevinK/Guy, I don't know how we can support OpenAFS -- it's not in our source tree and we don't have anyone here with expertise in it.

I'm closing this bugzilla report because there's nothing indicating a bug in RHEL3. I've also put appropriate Engineering managers on the cc: list, and if this does get reopened, it will be KevinA's responsibility to reassign it appropriately.

I still do not know where the AFS sources are. Can someone help me locate them?
Larry Woodman

What we have done so far to help debug this problem, and to prevent the system from hanging when kernel stack pointer corruption causes a panic, is to fix show_trace() so that it does not hang. This will provide useful debugging information, allow the system to take a crash dump and allow it to reboot without manual intervention.

Here is the patch that fixes show_trace() and show_stack():
--------------------------------------------------------------
--- linux-2.4.21/arch/i386/kernel/traps.c.orig
+++ linux-2.4.21/arch/i386/kernel/traps.c
@@ -141,6 +141,7 @@ void show_trace(unsigned long * stack)
 	unsigned long addr;
 	/* static to not take up stackspace; if we race here too bad */
 	static char buffer[512];
+	unsigned long limit;
 
 	if (!stack)
 		stack = (unsigned long*)&stack;
@@ -163,7 +164,8 @@ void show_trace(unsigned long * stack)
 out:
 #else
 	i = 1;
-	while (((long) stack & (THREAD_SIZE-1)) != 0) {
+	limit = ((unsigned long)stack & ~(THREAD_SIZE - 1)) + THREAD_SIZE - 3;
+	while ((unsigned long)stack < limit) {
 		addr = *stack++;
 		if (kernel_text_address(addr)) {
 			lookup_symbol(addr, buffer, 512);
@@ -189,6 +191,7 @@ void show_stack(unsigned long * esp)
 {
 	unsigned long *stack;
 	int i;
+	unsigned long limit;
 
 	// debugging aid: "show_stack(NULL);" prints the
 	// back trace for this cpu.
@@ -197,8 +200,9 @@ void show_stack(unsigned long * esp)
 		esp=(unsigned long*)&esp;
 
 	stack = esp;
+	limit = ((unsigned long)stack & ~(THREAD_SIZE - 1)) + THREAD_SIZE - 3;
 	for(i=0; i < kstack_depth_to_print; i++) {
-		if (((long) stack & (THREAD_SIZE-1)) == 0)
+		if ((unsigned long)stack > limit)
 			break;
 		if (i && ((i % 8) == 0))
 			printk("\n ");
--------------------------------------------------------------------------

http://www.openafs.org or you can grab it by AFS when IS/IT stop blocking UDP packets at the RH firewalls.

Gavin/Adrian, the shortcoming in the oops traceback mechanism (which has been resolved in -37.2.EL as the fix to bug 165412) *DID NOT CAUSE* the crash reported in this bugzilla! It simply led to getting stuck dumping the stack after the crash was caused for some other reason. I suspect that the underlying cause of the crash is in AFS, but that has not yet been proven. Larry's oops traceback fix might help debug the underlying cause, but I wish you folks would understand that Larry's fix will not prevent such crashes.

Another important note is that the exported global symbol "___strtok" is not actually a function, but rather that very last data symbol. The EIP for the crash originally reported in IT 75840 is a bogus address in memory far off from legitimate kernel text. Snooping at the first 24 words dumped in the kernel stack, and using the System.map from -15.9.1.ELhugemem, I've pieced together what I think is the real top-of-stack:

02134b0f __run_timers + 182
02134922 timer_bh + 98
0212f6b2 __run_task_queue + 106

This suggests that the underlying cause of the crash is some consumer of the add_timer/mod_timer/del_timer facility in the kernel, and that the consumer code (maybe AFS) left a pending timer and then freed the associated timer_list structure (which contains a pointer to the function to be invoked).
It's probably the case that an overwritten timer_list struct was then used to fetch the trashed func pointer for the dispatch, and then __run_timers() called off into invalid instruction space (causing the crash).

Ernie,
I was not clear in my last note. No one believes that the kernel OOPs traceback
mechanism caused the problem in AFS.
> It simply led to getting stuck dumping the stack after the crash was caused
for some other reason.
According to Adrian, this "getting stuck dumping" _is_ the problem that they
need solved.
The fact that AFS has a problem while shutting down the machine is an annoyance,
but the fact that it gets stuck in the OOPS traceback turns that annoyance into
a major headache.
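
For readers trying to see why the traceback could get stuck, here is a minimal user-space sketch of the two loop tests from the traps.c patch quoted in this report. It only does the pointer arithmetic and never dereferences anything. The 8 KiB THREAD_SIZE, the 4-byte word step, the MAX_STEPS safety cap, the helper names walk_old(), walk_new() and self_check(), and the idea that the corrupted stack pointer is misaligned are all assumptions made for illustration; nothing in this report says which kind of corruption was actually seen.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192u   /* assumed: 8 KiB kernel stack */
#define MAX_STEPS   32768L  /* cap so the simulation itself always ends */

/* Old test: stop only when the pointer lands exactly on a THREAD_SIZE
 * boundary. Returns the number of words walked, or -1 if the cap is hit. */
static long walk_old(uintptr_t sp)
{
    long steps = 0;
    while ((sp & (THREAD_SIZE - 1)) != 0) {
        sp += sizeof(uint32_t);          /* models "addr = *stack++" */
        if (++steps >= MAX_STEPS)
            return -1;                   /* would have kept going */
    }
    return steps;
}

/* New test: compute the top of the current stack area once and stop as
 * soon as a full word no longer fits below it. */
static long walk_new(uintptr_t sp)
{
    uintptr_t limit = (sp & ~(uintptr_t)(THREAD_SIZE - 1)) + THREAD_SIZE - 3;
    long steps = 0;
    while (sp < limit) {
        sp += sizeof(uint32_t);
        if (++steps >= MAX_STEPS)
            return -1;
    }
    return steps;
}

/* Compare both tests on an aligned and on a misaligned stack pointer. */
static void self_check(void)
{
    uintptr_t base = 0x10000000u;        /* hypothetical stack base */
    uintptr_t aligned = base + THREAD_SIZE - 64;
    uintptr_t skewed  = aligned + 2;     /* hypothetical corrupted value */

    assert(walk_old(aligned) == walk_new(aligned)); /* same on good input */
    assert(walk_old(skewed) == -1);      /* never hits the exact boundary */
    assert(walk_new(skewed) >= 0);       /* bounded by the limit instead */
}
```

With an aligned pointer both tests stop at the same place. With the skewed pointer the old equality test steps over the boundary and never matches it, while the limit comparison still ends the walk inside the stack area. Calling self_check() from a small main() is enough to exercise the sketch.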
Gavin/Adrian, in response to comment #45: Then *please* change the summary of this bug accordingly and close this as a dup of bug 165412. Thanks in advance. If this bug had an appropriate summary or description, then we wouldn't have wasted so much time on it.

*** This bug has been marked as a duplicate of 165412 ***

A fix for this problem was committed to the RHEL3 U7 patch pool on 14-Sep-2005 (in kernel version 2.4.21-37.2.EL). The patch that was committed is functionally equivalent to what's in bug 165412 comment #51.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html