Bug 163176

Summary:	Endless loop printing traceback during kernel OOPs
Product:	Red Hat Enterprise Linux 3	Reporter:	Issue Tracker <tao>
Component:	kernel	Assignee:	Kiersten (Kerri) Anderson <kanderso>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.0	CC:	anderson, bstevens, dff, dhowells, gavin, havill, kanderso, kreilly, lwang, lwoodman, pcormier, peterm, petrides, staubach, tao, tburke
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHSA-2006-0144	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-03-15 16:13:11 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	168424

Description Issue Tracker 2005-07-13 19:08:39 UTC

Escalated to Bugzilla from IssueTracker

Comment 7 Ernie Petrides 2005-07-14 20:59:16 UTC

Please put some useful information in this bug report.  All that's
indicated so far is that OpenAFS crashes the system during shutdown
(which is not a RHEL problem).  If there's no evidence of a RHEL kernel
bug, then please close this as NOTABUG.

Comment 10 Ernie Petrides 2005-07-18 18:13:05 UTC

KevinK/Guy, I don't know how we can support OpenAFS -- it's not in our
source tree and we don't have anyone here with expertise in it.  I'm
closing this bugzilla report because there's nothing indicating a bug
in RHEL3.  I've also put appropriate Engineering managers on the cc:
list, and if this does get reopened, it will be KevinA's responsibility
to reassign it appropriately.

Comment 15 Larry Woodman 2005-07-29 19:21:22 UTC

I still do not know where the AFS sources are.  Can someone help me locate them?

Larry Woodman

Comment 34 Larry Woodman 2005-08-24 14:47:59 UTC

What we have done so far to help debug this problem, and to prevent the system
from hanging when kernel stack pointer corruption causes a panic, is to fix
show_trace() so that it does not hang.  This will provide useful debugging
information, allow the system to take a crash dump, and allow it to reboot
without manual intervention.

Here is the patch that fixes show_trace() and show_stack():

--------------------------------------------------------------
--- linux-2.4.21/arch/i386/kernel/traps.c.orig
+++ linux-2.4.21/arch/i386/kernel/traps.c
@@ -141,6 +141,7 @@ void show_trace(unsigned long * stack)
        unsigned long addr;
        /* static to not take up stackspace; if we race here too bad */
        static char buffer[512];
+       unsigned long limit;
                                                                               
                
        if (!stack)
                stack = (unsigned long*)&stack;
@@ -163,7 +164,8 @@ void show_trace(unsigned long * stack)
 out:
 #else
        i = 1;
-       while (((long) stack & (THREAD_SIZE-1)) != 0) {
+       limit = ((unsigned long)stack & ~(THREAD_SIZE - 1)) + THREAD_SIZE - 3;
+       while ((unsigned long)stack < limit) {
                addr = *stack++;
                if (kernel_text_address(addr)) {
                        lookup_symbol(addr, buffer, 512);
@@ -189,6 +191,7 @@ void show_stack(unsigned long * esp)
 {
        unsigned long *stack;
        int i;
+       unsigned long limit;
                                                                               
                
        // debugging aid: "show_stack(NULL);" prints the
        // back trace for this cpu.
@@ -197,8 +200,9 @@ void show_stack(unsigned long * esp)
                esp=(unsigned long*)&esp;
                                                                               
                
        stack = esp;
+       limit = ((unsigned long)stack & ~(THREAD_SIZE - 1)) + THREAD_SIZE - 3;
        for(i=0; i < kstack_depth_to_print; i++) {
-               if (((long) stack & (THREAD_SIZE-1)) == 0)
+               if ((unsigned long)stack > limit)
                        break;
                if (i && ((i % 8) == 0))
                        printk("\n       ");
--------------------------------------------------------------------------
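
To illustrate why the old termination test can hang, here is a small user-space
sketch (not kernel code) that models the two loop conditions on plain integer
addresses.  It assumes 8 KB kernel stacks (THREAD_SIZE = 8192) and 4-byte stack
words as on i386; the function names old_walk/new_walk and the max_steps cap
are made up for this illustration.  If a corrupted stack pointer is not
word-aligned, stepping by one word never lands exactly on a THREAD_SIZE
boundary, so the old test never becomes false, while the new test stops at the
end of the current stack area.

```c
#define THREAD_SIZE 8192UL  /* assumption: 8 KB kernel stack, as on i386 */
#define WORD_SIZE   4UL     /* assumption: sizeof(unsigned long) on i386 */

/*
 * Old test: keep going until the address is an exact multiple of
 * THREAD_SIZE.  Returns the number of words visited, or max_steps if
 * the walk had to be cut off (the real loop has no such cap).
 */
unsigned long old_walk(unsigned long sp, unsigned long max_steps)
{
        unsigned long steps = 0;

        while ((sp & (THREAD_SIZE - 1)) != 0 && steps < max_steps) {
                sp += WORD_SIZE;        /* models "addr = *stack++" */
                steps++;
        }
        return steps;
}

/*
 * New test from the patch: compute the end of the current stack area
 * once, minus 3 so that a full 4-byte read always stays inside it,
 * and stop as soon as the pointer reaches that limit.
 */
unsigned long new_walk(unsigned long sp, unsigned long max_steps)
{
        unsigned long limit = (sp & ~(THREAD_SIZE - 1)) + THREAD_SIZE - 3;
        unsigned long steps = 0;

        while (sp < limit && steps < max_steps) {
                sp += WORD_SIZE;
                steps++;
        }
        return steps;
}
```

With a word-aligned pointer 16 bytes below the top of the stack area, both
versions visit 4 words.  With the same pointer shifted by one byte, old_walk()
only stops because of the artificial cap, whereas new_walk() visits 3 words and
stops.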

Comment 40 David Howells 2005-09-19 12:23:25 UTC

http://www.openafs.org or you can grab it by AFS when IS/IT stop blocking UDP 
packets at the RH firewalls.

Comment 44 Ernie Petrides 2005-09-24 01:10:49 UTC

Gavin/Adrian, the shortcoming in the oops traceback mechanism (which has
been resolved in -37.2.EL as the fix to bug 165412) *DID NOT CAUSE* the
crash reported in this bugzilla!

It simply led to getting stuck dumping the stack after the crash was caused
for some other reason.

I suspect that the underlying cause of the crash is in AFS, but that has not
yet been proven.  Larry's oops traceback fix might help debug the underlying
cause, but I wish you folks would understand that Larry's fix will not prevent
such crashes.

Another important note is that the exported global symbol "___strtok" is not
actually a function, but rather the very last data symbol.  The EIP for the
crash originally reported in IT 75840 is a bogus address in memory far off
from legitimate kernel text.  Snooping at the first 24 words dumped in the
kernel stack, and using the System.map from -15.9.1.ELhugemem, I've pieced
together what I think is the real top-of-stack:

  02134b0f  __run_timers + 182
  02134922  timer_bh + 98
  0212f6b2  __run_task_queue + 106

This suggests that the underlying cause of the crash is some consumer of the
add_timer/mod_timer/del_timer facility in the kernel, and that the consumer
code (maybe AFS) left a pending timer and then freed the associated timer_list
structure (which contains a pointer to the function to be invoked).  It's
probably the case that an overwritten timer_list struct was then used to
fetch the trashed func pointer for the dispatch, and then __run_timers()
called off into invalid instruction space (causing the crash).
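
The suspected pattern can be sketched with a small user-space model (this is
not the kernel's timer code and not taken from AFS).  The names timer_list,
add_timer, del_timer and the function/data fields mirror the 2.4 interface
mentioned above; run_timers, struct conn, conn_create and conn_destroy are
hypothetical and exist only for this illustration.  The point is the teardown
order: the consumer has to remove its pending timer before freeing the memory
that embeds the timer_list, otherwise the dispatcher later fetches a function
pointer from freed (and possibly overwritten) memory.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified model of a 2.4-style timer: callback plus opaque argument. */
struct timer_list {
        struct timer_list *next;
        unsigned long expires;
        void (*function)(unsigned long);
        unsigned long data;
};

static struct timer_list *pending;      /* singly linked list of armed timers */

void add_timer(struct timer_list *t)
{
        t->next = pending;
        pending = t;
}

/* Unlink a timer if it is still pending; returns 1 if it was, else 0. */
int del_timer(struct timer_list *t)
{
        struct timer_list **pp;

        for (pp = &pending; *pp; pp = &(*pp)->next) {
                if (*pp == t) {
                        *pp = t->next;
                        t->next = NULL;
                        return 1;
                }
        }
        return 0;
}

/* Dispatch every expired timer through its stored function pointer. */
int run_timers(unsigned long now)
{
        struct timer_list **pp = &pending;
        int fired = 0;

        while (*pp) {
                struct timer_list *t = *pp;

                if (t->expires <= now) {
                        *pp = t->next;
                        t->next = NULL;
                        t->function(t->data);   /* garbage if t was freed */
                        fired++;
                } else {
                        pp = &t->next;
                }
        }
        return fired;
}

/* Hypothetical consumer that embeds a timer in a heap object. */
struct conn {
        struct timer_list timer;
        int ticks;
};

static void conn_tick(unsigned long data)
{
        /* pointer passed as unsigned long, as in the 2.4 callback style */
        ((struct conn *)data)->ticks++;
}

struct conn *conn_create(unsigned long expires)
{
        struct conn *c = calloc(1, sizeof(*c));

        if (!c)
                return NULL;
        c->timer.expires = expires;
        c->timer.function = conn_tick;
        c->timer.data = (unsigned long)c;
        add_timer(&c->timer);
        return c;
}

/* Correct teardown: cancel the timer first, then free the object. */
void conn_destroy(struct conn *c)
{
        del_timer(&c->timer);
        free(c);
}
```

Omitting the del_timer() call in conn_destroy() would reproduce the suspected
failure mode in this model: run_timers() would later dereference a timer_list
that lives in freed memory.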

Comment 45 Gavin Romig-Koch 2005-09-26 13:23:49 UTC

Ernie,
I was not clear in my last note.  No one believes that the kernel OOPs traceback
mechanism caused the problem in AFS.

> It simply led to getting stuck dumping the stack after the crash was caused
> for some other reason.

According to Adrian, this "getting stuck dumping" _is_ the problem that they
need solved.

The fact that AFS has a problem while shutting down the machine is an annoyance,
but the fact that it gets stuck in the OOPS traceback turns that annoyance into
a major headache.

Comment 48 Ernie Petrides 2005-09-26 19:57:58 UTC

Gavin/Adrian, in response to comment #45:

Then *please* change the summary of this bug accordingly and close
this as a dup of bug 165412.  Thanks in advance.

If this bug had an appropriate summary or description, then we wouldn't
have wasted so much time on it.

Comment 49 Eido Inoue 2005-09-26 20:44:52 UTC


*** This bug has been marked as a duplicate of 165412 ***

Comment 50 Ernie Petrides 2005-09-26 21:13:41 UTC

A fix for this problem was committed to the RHEL3 U7 patch pool
on 14-Sep-2005 (in kernel version 2.4.21-37.2.EL).

The patch that was committed is functionally equivalent to what's
in bug 165412 comment #51.

Comment 54 Red Hat Bugzilla 2006-03-15 16:13:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html