151295 – stack overflow message is alarmist and confusing

Keywords:
Status:	CLOSED DUPLICATE of bug 151226
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-03-16 19:20 UTC by craig harmer
Modified:	2015-01-04 22:17 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-03-16 19:22:06 UTC
Target Upstream Version:
Embargoed:

Description craig harmer 2005-03-16 19:20:13 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8b) Gecko/20050217

Description of problem:
oh boy!  i know how this bug report is going to be received ...

the message produced by the linux kernel when we may be in danger of a stack
overflow looks like:

        do_IRQ: stack overflow: 4968
         [<c0106e0f>] dump_stack+0x16/0x18
         [<c0108934>] do_IRQ+0x4f/0x1b5
         [<c02dbe9c>] common_interrupt+0x18/0x20
         [<fa90cef9>] xted_unlockmap_check+0x18a/0x192 [vxfs]
         [<fa77a5f2>] vx_unlockmap+0x23/0x38b [vxfs]
         [<fa77a4b7>] vx_holdmap+0x177/0x17f [vxfs]
         [<fa755853>] vx_extmaptran+0x8b/0x96 [vxfs]
         [<fa755797>] vx_extmapchange+0x24e/0x27f [vxfs]
         [<fa75182a>] vx_extfind+0x3d5/0x3e1 [vxfs]
         ...
                                                                                
there are two problems here.  the first is that we haven't actually overflowed,
we're only at risk of overflowing.  the second is that the stack traceback omits
useful information for investigating the problem.
                                                                                
i'd like the message to look something like:
                                                                                
        do_IRQ: stack overflow risk: 4968 bytes left
         [<c0106e0f>] [<d9277950>] dump_stack+0x16/0x18
         [<c0108934>] [<d9277a04>] do_IRQ+0x4f/0x1b5
         ...
                                                                                
where the initial message makes clear that we're at risk of a stack overflow with
4968 bytes left, but have not actually had a stack overflow.

the stack "trace" includes the stack address at which each return address was
found, as an aid to estimating the stack consumption of each function. it's also
useful when deciphering stack traces: if you know that a particular function
appears in the trace and roughly how large its stack frame is, it's easier to
skip over stale symbols that lie within that frame.
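
as a rough illustration of the estimate described above, the gaps between consecutive stack addresses can be computed as below. this is only a user-space sketch: frame_gaps() is a hypothetical helper, not anything in the kernel, and each gap is only an approximation of the stack used by the function named on the inner line (without frame pointers, stale entries will distort it).

```c
#include <stddef.h>

/*
 * hypothetical helper, user-space only.
 *
 * slots[] holds the stack addresses from the proposed second column,
 * innermost call first (the order the trace prints them).  the stack
 * grows down on x86, so addresses increase toward the outer callers.
 *
 * writes n - 1 gaps (in bytes) into gaps[] and returns their sum.
 * gaps[i] roughly corresponds to the function named on line i.
 */
static unsigned long frame_gaps(const unsigned long *slots, size_t n,
                                unsigned long *gaps)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i + 1 < n; i++) {
        gaps[i] = slots[i + 1] - slots[i];
        total += gaps[i];
    }
    return total;
}
```

fed the two illustrative addresses from the example output (d9277950 and d9277a04), this reports a single gap of 0xb4 = 180 bytes between the dump_stack and do_IRQ entries.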

(suggestions for alternative formats are welcome).
                                                                                
the reason the message was produced with 4968 bytes left is that we've "cranked
up" both the stack size and the warning level in the kernels we use internally
at veritas.
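
to make the arithmetic behind the number concrete, here is a user-space model of the kind of check that produces the message. it is a sketch under assumptions, not the kernel code: the function names are made up, and the stack size, the size of struct thread_info and the warning level are passed in as parameters because the values used in the modified kernels are not stated here.

```c
/*
 * user-space model of the do_IRQ stack check; not kernel code.
 * thread_size must be a power of two.  all function names here are
 * hypothetical.
 */

/* offset of esp from the bottom of its thread_size-aligned stack area */
static unsigned long stack_offset(unsigned long esp, unsigned long thread_size)
{
    return esp & (thread_size - 1);
}

/* bytes still free above struct thread_info, which sits at the bottom */
static long stack_bytes_left(unsigned long esp, unsigned long thread_size,
                             unsigned long thread_info_size)
{
    return (long)stack_offset(esp, thread_size) - (long)thread_info_size;
}

/* nonzero when the warning would fire: free space is below warn_bytes */
static int stack_at_risk(unsigned long esp, unsigned long thread_size,
                         unsigned long thread_info_size,
                         unsigned long warn_bytes)
{
    return stack_bytes_left(esp, thread_size, thread_info_size)
           < (long)warn_bytes;
}
```

for example, with a hypothetical 16384-byte stack, a 52-byte thread_info and an 8192-byte warning level, an esp that is 5020 bytes above the bottom of the stack area leaves 4968 bytes and trips the warning. with a 1024-byte warning level the same esp would not trip it. in both cases nothing has overflowed yet.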

the code changes needed would be (as pseudo-diffs, since our code base is
further modified):
                                                                                
arch/i386/kernel/irq.c in do_IRQ():
<                       printk("do_IRQ: stack overflow: %ld\n",
>                       printk("do_IRQ: stack overflow risk: %ld bytes left\n",
                                esp - sizeof(struct thread_info));
                                                                                
arch/i386/kernel/traps.c in print_context_stack():

#ifdef  CONFIG_FRAME_POINTER
        while (valid_stack_ptr(tinfo, (void *)ebp)) {
                addr = *(unsigned long *)(ebp + 4);
<               printk(" [<%08lx>] ", addr);
>               printk(" [<%08lx>] [<%08lx>] ", addr, ebp + 4);
                print_symbol("%s", addr);
                printk("\n");
                ebp = *(unsigned long *)ebp;
        }
#else
        while (valid_stack_ptr(tinfo, stack)) {
                addr = *stack++;
                if (__kernel_text_address(addr)) {
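<                       printk(" [<%08lx>]", addr);
>                       /* stack was post-incremented above, so addr was read from stack[-1] */
>                       printk(" [<%08lx>] [<%08lx>]", addr,
>                              (unsigned long)(stack - 1));
                        print_symbol(" %s", addr);
                        printk("\n");
                }
        }
#endif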

this change only affects the x86 kernel.  do we need to produce a similar patch
for the other kernel architectures that Red Hat supports?

i have not investigated the effect of this output change on ksymoops.  we would
need to take that into account and choose a suitable format.
                                                                                
                                                                    


Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.EL

How reproducible:
Always

Steps to Reproduce:
1. load a driver that uses a large but not excessive amount of stack space
2. run the driver and wait for an interrupt to occur while the stack is deep
3. wait for the customer to call customer support and try to explain to them
that they haven't had an actual stack overflow, just close.
    

Actual Results:  i see the output that i included at the beginning of the
description.

Expected Results:  i would have liked to see the output i included in the middle
of the description.

Additional info:

we'd like to see this changed, since it will help our debugging and anyone else
who stares at deep stack messages.  but it's not a hot issue for us.

Comment 1 Dave Jones 2005-03-16 19:22:06 UTC


*** This bug has been marked as a duplicate of 151226 ***
