Bug 162548

Summary: interrupt handlers run on thread's kernel stack
Product: Red Hat Enterprise Linux 4
Reporter: craig harmer <craig>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: linux26port
Hardware: i386
OS: Linux
Fixed In Version: RHSA-2005-514
Doc Type: Bug Fix
Last Closed: 2005-10-05 13:39:35 UTC
Bug Blocks: 156322    

Description craig harmer 2005-07-06 06:23:48 UTC
Description of problem:

In several discussions, Red Hat engineers told us (Veritas) that Red Hat EL 4.0
would be based on the 2.6 kernel and would move to a 4 Kbyte stack size, but
would process hardware interrupts on a separate stack.
                                                                                
but it turns out that it's not true!  Interrupts are still being processed on
the thread's kernel stack.
                                                                                
this is a huge problem for Veritas.
                                                                                
here's an example of an interrupt handler running on the thread stack caught by
our deep stack tracking kernel:
                                                                                
Comm: find (0xea4)
[kernel]     sys_getdents64                 (+0x64  =  0x00064)
[kernel]     vfs_readdir                    (+0x24  =  0x00088)
[nfs]        nfs_readdir                    (+0x1a4 =  0x0022c)
[nfs]        readdir_search_pagecache       (+0x18  =  0x00244)
[nfs]        find_dirent_page               (+0x14  =  0x00258)
[kernel]     read_cache_page                (+0x24  =  0x0027c)
[kernel]     __read_cache_page              (+0x24  =  0x002a0)
[nfs]        nfs_readdir_filler             (+0x28  =  0x002c8)
[nfs]        nfs3_proc_readdir              (+0x104 =  0x003cc)
[nfs]        nfs3_rpc_wrapper               (+0x28  =  0x003f4)
[sunrpc]     rpc_call_sync                  (+0x28  =  0x0041c)
[sunrpc]     rpc_execute                    (+0x14  =  0x00430)
[sunrpc]     __rpc_execute                  (+0x64  =  0x00494)
[sunrpc]     call_transmit                  (+0x10  =  0x004a4)
[sunrpc]     xprt_transmit                  (+0x24  =  0x004c8)
[sunrpc]     xprt_sendmsg                   (+0x28  =  0x004f0)
[sunrpc]     xdr_sendpages                  (+0x94  =  0x00584)
[kernel]     kernel_sendmsg                 (+0x24  =  0x005a8)
[kernel]     sock_sendmsg                   (+0xec  =  0x00694)
[kernel]     __sock_sendmsg                 (+0x24  =  0x006b8)
[kernel]     inet_sendmsg                   (+0x20  =  0x006d8)
[kernel]     tcp_sendmsg                    (+0x58  =  0x00730)
[kernel]     tcp_push                       (+0x28  =  0x00758)
[kernel]     __tcp_push_pending_frames      (+0x30  =  0x00788)
[kernel]     tcp_write_xmit                 (+0x24  =  0x007ac)
[kernel]     tcp_transmit_skb               (+0x2c  =  0x007d8)
[kernel]     ip_queue_xmit                  (+0xc4  =  0x0089c)
[kernel]     dst_output                     (+0x10  =  0x008ac)
[kernel]     ip_output                      (+0x14  =  0x008c0)
[kernel]     ip_finish_output               (+0x14  =  0x008d4)
[kernel]     nf_hook_slow                   (+0x38  =  0x0090c)
[kernel]     nf_iterate                     (+0x34  =  0x00940)
[kernel]     selinux_ipv4_postroute_last    (+0x20  =  0x00960)
[kernel]     selinux_ip_postroute_last      (+0x94  =  0x009f4)
[kernel]     avc_has_perm                   (+0x48  =  0x00a3c)
[kernel]     avc_has_perm_noaudit           (+0x5c  =  0x00a98)
[kernel]     avc_lookup                     (+0x24  =  0x00abc)
[kernel]     avc_search_node                (+0x28  =  0x00ae4)
[kernel]     avc_hash                       (+0x1c  =  0x00b00)
                                                                                
====> CDROM interrupt occurs here with ~1,200 bytes remaining <===
[kernel]     do_IRQ                         (+0x74  =  0x00b74)
[kernel]     handle_IRQ_event               (+0x20  =  0x00b94)
[kernel]     ide_intr                       (+0x28  =  0x00bbc)
[kernel]     cdrom_read_intr                (+0x20  =  0x00bdc)
[kernel]     ide_end_request                (+0x24  =  0x00c00)
[kernel]     __ide_end_request              (+0x28  =  0x00c28)
[kernel]     end_that_request_first         (+0x18  =  0x00c40)
[kernel]     __end_that_request_first       (+0x2c  =  0x00c6c)
[kernel]     bio_endio                      (+0x20  =  0x00c8c)
[kernel]     bounce_end_io_read             (+0x1c  =  0x00ca8)
[kernel]     __bounce_end_io_read           (+0x18  =  0x00cc0)
[kernel]     bounce_end_io                  (+0x24  = *0x00ce4)
[kernel]     bio_endio                      (+0x20  = *0x00d04)
[kernel]     end_bio_bh_io_sync             (+0x20  = *0x00d24)
[kernel]     end_buffer_async_read          (+0x24  = *0x00d48)
[kernel]     unlock_page                    (+0xc   = *0x00d54)
[kernel]     wake_up_page                   (+0x14  = *0x00d68)
[kernel]     __wake_up                      (+0x1c  = *0x00d84)
[kernel]     __wake_up_common               (+0x28  = *0x00dac)
[kernel]     page_wake_function             (+0x1c  = *0x00dc8)
[kernel]     autoremove_wake_function       (+0x20  = *0x00de8)
[kernel]     default_wake_function          (+0x1c  = *0x00e04)
[kernel]     try_to_wake_up                 (+0x48  = *0x00e4c)
[kernel]     wake_idle                      (+0x20  = *0x00e6c)
[kernel]     find_next_bit                  (+0x38  = *0x00ea4)
                                                                                
(CDROM interrupt consumes 0xea4 - 0xb00 + 0x74 = 1,048 bytes of stack.)
                                                                                
(Note that frame sizes and stack depths shown here are roughly 10% larger than on
a production Red Hat kernel, since our deep stack tracking kernel is compiled with
frame pointers and with "-mregparm=0" to make our debugging easier.)
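
The arithmetic around the interrupt can be checked mechanically from the trace. The sketch below is only an illustration: the regular expression, the helper name parse_trace and the three-line sample are assumptions made for this example and are not part of the deep stack tracking tool. It assumes one 4 Kbyte stack and reads the right-hand column as cumulative depth. Read that way, the depth grows from 0xb00 to 0xea4 across the interrupt; the formula quoted above adds do_IRQ's 0x74 frame on top of that difference.

```python
import re

# One trace line in the report looks like:
#   [module]  function  (+0xFRAME = 0xDEPTH)   (a '*' may precede the depth)
TRACE_RE = re.compile(
    r"\[(\w+)\]\s+(\w+)\s+\(\+0x([0-9a-f]+)\s*=\s*\*?0x([0-9a-f]+)\)"
)

STACK_SIZE = 4096  # assumed: one 4 Kbyte kernel stack


def parse_trace(text):
    """Return (module, function, frame_bytes, cumulative_depth) tuples."""
    entries = []
    for line in text.splitlines():
        m = TRACE_RE.search(line)
        if m:
            module, func, frame, depth = m.groups()
            entries.append((module, func, int(frame, 16), int(depth, 16)))
    return entries


# Three lines copied from the trace: the last frame before the
# interrupt, the interrupt entry, and the deepest frame.
SAMPLE = """\
[kernel]     avc_hash                       (+0x1c  =  0x00b00)
[kernel]     do_IRQ                         (+0x74  =  0x00b74)
[kernel]     find_next_bit                  (+0x38  = *0x00ea4)
"""

entries = parse_trace(SAMPLE)
depth_at_irq = entries[0][3]   # cumulative depth at avc_hash
do_irq_frame = entries[1][2]   # frame size of do_IRQ
deepest = entries[-1][3]       # cumulative depth at find_next_bit

remaining_at_irq = STACK_SIZE - depth_at_irq
depth_growth = deepest - depth_at_irq
report_figure = depth_growth + do_irq_frame  # formula quoted in the report

print(f"remaining when the interrupt hit: {remaining_at_irq} bytes")
print(f"depth growth during the interrupt: {depth_growth} bytes")
print(f"report's formula (growth + 0x74): {report_figure} bytes")
print(f"left at the deepest point: {STACK_SIZE - deepest} bytes")
```

Feeding the full trace to parse_trace instead of the sample should give the same numbers, since only the first, the do_IRQ and the last entries are used.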
                                                                                
Also note that interrupts on Linux can nest.  the ~1,000 bytes consumed by the
CDROM interrupt could easily have had another ~500 bytes added to it by the
ethernet driver and another ~500 bytes added to it by the QLogic FC driver.  So,
under the right confluence of events this could have been a stack overflow
involving only kernel code shipped by Red Hat.
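
A rough worst-case estimate for the nesting scenario can be written down as follows. This is a back-of-the-envelope sketch, not a measurement: the function name nested_peak is made up for this example, the ethernet and QLogic numbers are the rough estimates from the paragraph above, and the model ignores struct thread_info, which on i386 shares the stack allocation and makes the usable space slightly smaller.

```python
STACK_SIZE = 4096  # 4 Kbyte kernel stack


def nested_peak(thread_depth, irq_costs):
    """Peak depth if every listed interrupt nests on the thread's stack."""
    return thread_depth + sum(irq_costs)


thread_depth = 0xb00   # depth at avc_hash when the CDROM interrupt arrived
irq_costs = {
    "cdrom": 1048,     # figure quoted in the report
    "ethernet": 500,   # rough estimate from the report
    "qlogic_fc": 500,  # rough estimate from the report
}

peak = nested_peak(thread_depth, irq_costs.values())
overrun = peak - STACK_SIZE

print(f"peak depth {peak} bytes, stack size {STACK_SIZE} bytes")
print("overflow by" if overrun > 0 else "headroom of", abs(overrun), "bytes")
```

All inputs come from the debug kernel, whose frames are roughly 10% larger than production, so the result is an upper bound on that particular code path rather than a prediction.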
                                                                                
you're probably wondering why we're only reporting this problem now ...
                                                                                
the problem is that Veritas does most of its testing using custom kernels built
with kdb, frame-pointers, "-mregparm=0", and an 8 Kbyte stack. because we
have larger stack frames due to passing arguments on the stack, extra debugging
code, and kdb we need additional stack space (it really sucks when dropping into
kdb causes a stack overrun; in addition, we used to have problems with deep
stacks in our production code, although we believe they've all been resolved).
                                                                                
when we built our custom kernels, we used "#define CONFIG_4KSTACKS" because it
enables the interrupt stack switching code and because we *assumed* that's what
Red Hat was doing to get 4 Kbyte kernel stacks.
                                                                                
that was a mistake.  it turns out Red Hat builds their kernels with a custom
patch that enables 4 Kbyte stacks but disables interrupt stack switching.  that
patch is:
                                                                                
        linux-2.6.5-x86-nostack.patch
                                                                                
it strips out every "#ifdef CONFIG_4KSTACKS" in the kernel *except* for the
#ifdef around the interrupt stack switching code in do_IRQ() (in
arch/i386/kernel/irq.c), which explains why we're in this situation.
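
The effect of leaving the stack switch out can be illustrated with a toy model. This is not kernel code and does not reproduce do_IRQ(); the Stack class, run_irq and thread_peak are hypothetical names invented for this sketch. It uses the depths from the trace above (0xb00 on entry, 0xea4 at the deepest point) and only shows where the handler's frames land in each configuration.

```python
class Stack:
    """Toy stack that only tracks current and peak depth in bytes."""

    def __init__(self, name, size=4096):
        self.name = name
        self.size = size
        self.depth = 0
        self.peak = 0

    def push(self, nbytes):
        self.depth += nbytes
        self.peak = max(self.peak, self.depth)
        if self.depth > self.size:
            raise OverflowError(f"{self.name} stack overflow")

    def pop(self, nbytes):
        self.depth -= nbytes


def run_irq(thread_stack, irq_stack, handler_bytes, switch_stacks):
    """Run one interrupt handler; return the name of the stack it used."""
    target = irq_stack if switch_stacks else thread_stack
    target.push(handler_bytes)
    target.pop(handler_bytes)
    return target.name


def thread_peak(switch_stacks, thread_bytes=0xb00, handler_bytes=0xea4 - 0xb00):
    """Peak depth of the thread's stack for one syscall plus one interrupt."""
    thread, irq = Stack("thread"), Stack("irq")
    thread.push(thread_bytes)  # syscall path down to avc_hash
    run_irq(thread, irq, handler_bytes, switch_stacks)
    return thread.peak


print("with stack switching:   ", thread_peak(True))
print("without stack switching:", thread_peak(False))
```

With switching enabled the thread stack's peak stays at the syscall depth; without it the handler's frames are added on top, which is the situation the trace shows.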
 
i'd really like to know why that patch was added.
                                                                                
Veritas has done some limited testing on Red Hat production kernels (most
recently the rhel4 Update 1 RC 1 drop) and hasn't seen any actual stack
overflows, or even any stack overflow warning messages.  but our stack depth
tracking kernels were being built using CONFIG_4KSTACKS so we weren't exploring
this issue with most of our testing.
                                                                                
at this point it's difficult to know what the actual risk is, but currently we
don't think we can release our products for the i386 (or i686) with this
increased risk of stack overflow (since we do know overflow *might* occur if the
conditions were right).
                                                                                
so we're urgently looking for Red Hat to make kernels available that actually
perform hardware interrupt handling on a different stack.


Version-Release number of selected component (if applicable):
kernel-2.6.9-11.EL

How reproducible: every time


Steps to Reproduce:
1. build a kernel with stack depth tracking
2. run an i/o intensive test like SpecSFS
  
Actual results:
interrupts are handled on thread's kernel stack, not interrupt stack

Expected results:
interrupts handled on dedicated interrupt stack


Additional info:

Comment 1 Jason Baron 2005-07-06 16:06:01 UTC
Hi Craig, Thanks for the bug report. This was simply an oversight, and we have
corrected this issue by re-enabling 4k irq stacks for U2. The bug noting
the issue is 162257. Thus, i'm closing this one as a duplicate of that. thanks.

-Jason

*** This bug has been marked as a duplicate of 162257 ***

Comment 2 Red Hat Bugzilla 2005-10-05 13:39:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html