151226 – stack overflow message is alarmist and confusing

Summary: stack overflow message is alarmist and confusing

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ingo Molnar
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	151295
Depends On:
Blocks:

Reported:	2005-03-16 04:35 UTC by craig harmer
Modified:	2012-06-20 16:03 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-06-20 16:03:34 UTC
Target Upstream Version:
Embargoed:

Description craig harmer 2005-03-16 04:35:19 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8b) Gecko/20050217

Description of problem:
oh boy!  i know how this bug report is going to be received ...

the message produced by the linux kernel when we may be in danger of a
stack overflow looks like:

        do_IRQ: stack overflow: 4968
         [<c0106e0f>] dump_stack+0x16/0x18
         [<c0108934>] do_IRQ+0x4f/0x1b5
         [<c02dbe9c>] common_interrupt+0x18/0x20
         [<fa90cef9>] xted_unlockmap_check+0x18a/0x192 [vxfs]
         [<fa77a5f2>] vx_unlockmap+0x23/0x38b [vxfs]
         [<fa77a4b7>] vx_holdmap+0x177/0x17f [vxfs]
         [<fa755853>] vx_extmaptran+0x8b/0x96 [vxfs]
         [<fa755797>] vx_extmapchange+0x24e/0x27f [vxfs]
         [<fa75182a>] vx_extfind+0x3d5/0x3e1 [vxfs]
         ...

there are two problems here.  the first is that we haven't actually
overflowed; we're only at risk of overflowing.  the second is that the
stack traceback omits information that would help in investigating the
problem, such as where on the stack each entry was found.

i'd like the message to look something like:

        do_IRQ: stack overflow risk: 4968 bytes left
         [<c0106e0f>] [<0xd9277950>] dump_stack+0x16/0x18
         [<c0108934>] [<0xd9277a04>] do_IRQ+0x4f/0x1b5
         ...

where the first line makes clear that we're at risk of a stack overflow,
with 4968 bytes left, but have not actually had one.

the second column of the proposed stack "trace" is the stack address at
which each return address was found, as an aid to estimating how much
stack each function consumes.  it's also useful when deciphering traces
that contain stale symbols: if you know that a particular function really
is on the stack, and roughly how big its stack frame is, it's easier to
skip over any stale symbols whose stack addresses fall within that frame.
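As a rough illustration of how that second column could be used, here is a minimal user-space sketch in C. It assumes a downward-growing stack (as on x86) and no stale entries between consecutive lines; under those assumptions the frame of the function named on line i spans from its own stack address up to the stack address on line i+1. struct trace_entry, frame_size_estimate() and print_frame_sizes() are hypothetical names for illustration, not kernel or ksymoops code.

```c
#include <stddef.h>
#include <stdio.h>

/* One line of the proposed trace format (hypothetical struct). */
struct trace_entry {
    unsigned long ret_addr;   /* first column: return address */
    unsigned long stack_addr; /* second column: stack slot it was read from */
    const char *symbol;
};

/* Estimate the stack used by the function on line i as the distance to
 * the next (outer) line's stack slot.  Assumes a downward-growing stack
 * and no stale entries in between.  Returns 0 for the outermost entry
 * or when the addresses do not increase as expected. */
unsigned long frame_size_estimate(const struct trace_entry *t,
                                  size_t n, size_t i)
{
    if (i + 1 >= n || t[i + 1].stack_addr <= t[i].stack_addr)
        return 0;
    return t[i + 1].stack_addr - t[i].stack_addr;
}

/* Print an approximate per-function stack usage table. */
void print_frame_sizes(const struct trace_entry *t, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        printf("%-24s ~%lu bytes\n", t[i].symbol,
               frame_size_estimate(t, n, i));
}
```

Fed the two sample lines from the proposed output, frame_size_estimate() gives 0xd9277a04 - 0xd9277950 = 0xb4 (180) bytes for dump_stack; the addresses in that sample are illustrative, so the number is too.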

(suggestions for alternative formats are welcome.)

the reason the message was produced with 4968 bytes still left is that
we've "cranked up" both the stack size and the warning threshold in the
kernels we use internally at Veritas.
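For context on how "bytes left" and the warning threshold relate, here is a minimal user-space sketch of the kind of check do_IRQ performs, with the reworded message. It assumes the i386 layout in which the kernel stack is THREAD_SIZE-aligned with struct thread_info at its lowest address. THREAD_SIZE, THREAD_INFO_SIZE and STACK_WARN below are hypothetical values chosen for the example (the real ones depend on the configuration and on local changes like ours), and stack_bytes_left() and check_stack() are illustrative names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical values: the real ones depend on the kernel configuration
 * (4k vs 8k stacks, the actual size of struct thread_info, and any local
 * changes to the warning level). */
#define THREAD_SIZE      8192UL
#define THREAD_INFO_SIZE 56UL              /* assumed sizeof(struct thread_info) */
#define STACK_WARN       (THREAD_SIZE / 8) /* assumed warning threshold */

/* The stack is THREAD_SIZE-aligned with thread_info at its lowest
 * address, so masking esp gives its offset from the base of the stack
 * area; subtracting the thread_info size leaves the usable bytes. */
long stack_bytes_left(unsigned long esp)
{
    /* the mask trick only works for a power-of-two stack size */
    assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);
    return (long)(esp & (THREAD_SIZE - 1)) - (long)THREAD_INFO_SIZE;
}

/* Same shape as the do_IRQ check, using the reworded message.
 * Returns 1 if the warning was printed, 0 otherwise. */
int check_stack(unsigned long esp)
{
    long left = stack_bytes_left(esp);

    if (left < (long)STACK_WARN) {
        printf("do_IRQ: stack overflow risk: %ld bytes left\n", left);
        return 1;
    }
    return 0;
}
```

With these assumed constants, check_stack(0xd9276200UL) sees an offset of 0x200 (512) bytes, 456 of them usable, and prints the warning; a stack pointer near the top of the stack, such as 0xd9277f00UL, does not trigger it.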

the code changes needed to make this happen would be as follows (shown
as pseudo-diffs, since our code base has further local modifications):

arch/i386/kernel/irq.c in do_IRQ():
<                       printk("do_IRQ: stack overflow: %ld\n",
>                       printk("do_IRQ: stack overflow risk: %ld bytes left\n",
                                esp - sizeof(struct thread_info));

arch/i386/kernel/traps.c in print_context_stack():

#ifdef  CONFIG_FRAME_POINTER
        while (valid_stack_ptr(tinfo, (void *)ebp)) {
                addr = *(unsigned long *)(ebp + 4);
<               printk(" [<%08lx>] ", addr);
>               /* ebp + 4 is the stack slot that holds the return address */
>               printk(" [<%08lx>] [<%08lx>] ", addr, ebp + 4);
                print_symbol("%s", addr);
                printk("\n");
                ebp = *(unsigned long *)ebp;
        }
#else
        while (valid_stack_ptr(tinfo, stack)) {
                addr = *stack++;
                if (__kernel_text_address(addr)) {
<                       printk(" [<%08lx>]", addr);
>                       /* stack was post-incremented: addr came from stack - 1 */
>                       printk(" [<%08lx>] [<%08lx>]", addr,
>                              (unsigned long)(stack - 1));
                        print_symbol(" %s", addr);
                        printk("\n");
                }
        }
#endif
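The pointer arithmetic in the non-frame-pointer branch is easy to get wrong: stack is an unsigned long * that has already been post-incremented, so the slot that held addr is stack - 1, while subtracting 4 from the pointer would step back four words (16 bytes on i386), not four bytes. The following user-space sketch scans a fake stack the same way and records the slot for each hit. looks_like_text() is a hypothetical stand-in for __kernel_text_address() (a simple address-range check), and scan_stack() is an illustrative name.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for __kernel_text_address(): treat anything in
 * [text_start, text_end) as a kernel text address. */
static int looks_like_text(unsigned long addr, unsigned long text_start,
                           unsigned long text_end)
{
    return addr >= text_start && addr < text_end;
}

/* Scan words in [stack, end) like the non-frame-pointer branch and record,
 * for each text address found, the slot it was read from.  Because stack
 * is post-incremented, that slot is stack - 1.  Returns the number of
 * slots recorded (at most max_slots). */
size_t scan_stack(unsigned long *stack, unsigned long *end,
                  unsigned long text_start, unsigned long text_end,
                  unsigned long **slots, size_t max_slots)
{
    size_t n = 0;

    while (stack < end && n < max_slots) {
        unsigned long addr = *stack++;

        if (looks_like_text(addr, text_start, text_end)) {
            slots[n++] = stack - 1; /* where addr was read from */
            printf(" [<%08lx>] [<%08lx>]\n", addr,
                   (unsigned long)(stack - 1));
        }
    }
    return n;
}
```

For a six-word array holding text addresses at indices 1 and 4, the recorded slots are exactly &fake[1] and &fake[4]; with stack - 4 the second column would point 16 bytes (on i386) below the real slot.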

these changes only affect the x86 (i386) kernel.  do we need to produce
a similar patch for the other kernel architectures that Red Hat supports?

i have not investigated the effect of this output change on ksymoops.
we would need to take that into account and choose a suitable format.
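One way to keep trace consumers working during such a format change is to match on the [<...>] brackets rather than on column positions. The sketch below is a hypothetical helper, not part of ksymoops (whose actual parsing we have not checked): parse_trace_line() accepts both the current " [<addr>] symbol" form and the proposed " [<addr>] [<stackaddr>] symbol" form. It relies on sscanf's %lx accepting an optional 0x prefix, so the "[<0xd9277950>]" spelling used in the example output above also parses.

```c
#include <stdio.h>

/* Hypothetical helper (not ksymoops code): extract the return address
 * and, if present, the stack address from one trace line in either the
 * current or the proposed format.  Returns the number of addresses
 * parsed: 0 (not a trace line), 1 (old format) or 2 (new format). */
int parse_trace_line(const char *line, unsigned long *ret_addr,
                     unsigned long *stack_addr)
{
    int consumed = 0;

    /* first bracketed address; %n records how far we got */
    if (sscanf(line, " [<%lx>]%n", ret_addr, &consumed) != 1 ||
        consumed == 0)
        return 0;

    /* optional second bracketed address (the stack slot) */
    if (sscanf(line + consumed, " [<%lx>]", stack_addr) == 1)
        return 2;
    return 1;
}
```

An old-format line such as " [<c0106e0f>] dump_stack+0x16/0x18" yields 1, the proposed form yields 2, and a header line like "do_IRQ: stack overflow: 4968" yields 0.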


Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.EL

How reproducible:
Always

Steps to Reproduce:
1. load a driver that uses a large but not excessive amount of stack space
2. run the driver and wait for an interrupt to occur while the stack
is deep
3. wait for the customer to call customer support, and try to explain
to them that they haven't actually had a stack overflow, only come close
to one.

Actual Results:  i see the output that i included at the beginning of
the description.

Expected Results:  i would have liked to see the output i included in
the middle of the description.

Additional info:

we'd like to see this changed, since it will help our debugging and
anyone else who stares at deep stack messages.  but it's not a hot
issue for us.

Comment 1 Dave Jones 2005-03-16 19:22:20 UTC

*** Bug 151295 has been marked as a duplicate of this bug. ***

Comment 2 Eric Sandeen 2008-09-25 01:36:08 UTC

Another problem is that, at least with 4k stacks, printing the stack overflow warning pretty much ensures that you will *actually* overflow, because the warning itself consumes the remaining stack, and more. I think there's another bug addressing that, though.

I agree that the stack warning message could be improved; it should probably get fixed upstream first.

Comment 3 Jiri Pallich 2012-06-20 16:03:34 UTC

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.
