102678 – LTC3926-[XAPIC] /etc/init.d/irqbalance causes panics, ext3 corruption

Summary: LTC3926-[XAPIC] /etc/init.d/irqbalance causes panics, ext3 corruption

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	97942

Reported:	2003-08-19 19:53 UTC by IBM Bug Proxy
Modified:	2007-11-30 22:06 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2003-09-02 15:04:15 UTC
Target Upstream Version:
Embargoed:

Attachments
odd error message followed by panic (16.99 KB, text/plain) 2003-08-19 19:58 UTC, IBM Bug Proxy	no flags
RFC'ed 2.4.22-rc2 patch sent to lkml (5.07 KB, patch) 2003-08-23 02:03 UTC, john stultz	no flags
get_irq_list() overflow fix (3.89 KB, patch) 2003-08-23 07:50 UTC, Ingo Molnar	no flags

Description IBM Bug Proxy 2003-08-19 19:53:57 UTC

The following has been reported by IBM LTC:
[BETA] /etc/init.d/irqbalance causes panics, ext3 corruption
Please fill in each of the sections below. 
 
Hardware Environment: 
x440 16way 
 
Software Environment: 
distro: RHAS 3.0b1 
kernel: 2.4.21-1.1931.2.389.entbigmem 
 
Steps to Reproduce: 
1. Disable the irqbalance service 
2. Boot the specified kernel 
3. Note system seems stable. Kernel compiles, bk pulls, etc work fine 
4. Start irqbalance service 
5. Start kernel build 
 
Actual Results: 
Usually odd error messages are displayed, followed shortly by a panic.
 
Expected Results: 
System behaves normally, no panics 
 
Additional Information: 
 
This one resulted in file corruption: 
 
elm3a16 login: EXT3-fs error (device sd(8,3)): ext3_readdir: bad entry in directory #2706163: directory entry across blocks - offset=0, inode=942879541, rec_len=8224, name_len=32
EXT3-fs error (device sd(8,3)): ext3_readdir: bad entry in directory #2706168: directory entry across blocks - offset=0, inode=538976288, rec_len=8224, name_len=32
 
0215c5c6                                                                         
*pde = 00000000                                                                  
Oops: 000b                                                                       
iptable_filter ip_tables parport_pc lp parport ide-cd cdrom autofs tg3 floppy keybdev mousedev hid input usb-uhci usbcore ext3 jbd qla2200 qla2300 aic7xxx sd_
CPU:    17                                                                       
EIP:    0060:[<0215c5c6>]    Not tainted                                         
EFLAGS: 00010206                                                                 
                                                                                 
EIP is at page_referenced [kernel] 0x316 (2.4.21-1.1931.2.389.entbigmem)         
eax: 00000005   ebx: ffd74020   ecx: 02008ba0   edx: 20302163                    
esi: 7b78b4a4   edi: 0300002c   ebp: 00000020   esp: e0bbff78                    
ds: 0068   es: 0068   ss: 0068                                                   
Process kscand (pid: 68, stackpage=e0bbf000)                                     
Stack: df62b680 0000000e 00000000 df786200 00000000 00000000 e0bbffb4 06c956cc
       06c956cc 06c95544 0239f358 0239e280 02152852 06c64f40 00000003 00000001
       0239e280 0239f358 00000003 e0bbe000 02153d60 0239e280 00000003 0239f358
Call Trace:   [<02152852>] scan_active_list [kernel] 0xa2 (0xe0bbffa8)           
[<02153d60>] kscand [kernel] 0xa0 (0xe0bbffc8)                                   
[<02153cc0>] kscand [kernel] 0x0 (0xe0bbffe0)                                    
[<0210956d>] kernel_thread_helper [kernel] 0x5 (0xe0bbfff0)                    
                                                            
Code: f0 0f b3 03 19 c0 8b 54 24 14 42 85 c0 0f 44 54 24 14 81 fb

alternate panic:
 
 
Oops: 0000                                                                       
iptable_filter ip_tables parport_pc lp parport ide-cd cdrom autofs tg3 floppy keybdev mousedev hid input usb-uhci usbcore ext3 jbd qla2200 qla2300 aic7xxx sd_
CPU:    23                                                                       
EIP:    0060:[<0217a341>]    Not tainted                                         
EFLAGS: 00010203                                                                 
                                                                                 
EIP is at d_lookup [kernel] 0x71 (2.4.21-1.1931.2.389.entbigmem)                 
eax: e18203f0   ebx: 00000000   ecx: 00000013   edx: e1800000                    
esi: fffffff0   edi: 00000000   ebp: 0141dc5c   esp: d9327ee0                    
ds: 0068   es: 0068   ss: 0068                                                   
Process make (pid: 15821, stackpage=d9327000)                                    
Stack: fee8709c 00000009 0215e618 d9326000 fffffff0 e18203f0 dd723013 00000004
       dd723013 dd723017 00000000 d9327f40 0216ebfb db31e280 d9327f40 dd723013
       0216f334 db31e280 d9327f40 00000000 00000009 00000000 d9327f98 00000000
Call Trace:   [<0215e618>] str_vm [kernel] 0x1f8 (0xd9327ee8)                    
[<0216ebfb>] cached_lookup [kernel] 0x1b (0xd9327f10)                            
[<0216f334>] link_path_walk [kernel] 0x424 (0xd9327f20)                          
[<0216f8b9>] path_lookup [kernel] 0x39 (0xd9327f60)                              
[<0216fc09>] __user_walk [kernel] 0x49 (0xd9327f70)                              
[<0216ad8f>] sys_stat64 [kernel] 0x1f (0xd9327f8c)                               
                                                                                 
                                                                                 
Code:  Bad EIP value.

Created an attachment (id=1377)
odd error message, followed by panic

Yet another panic. The error messages that preceded it were quite lengthy, so
I made this an attachment.

Also reproduced using kernel 2.4.21-1.1931.2.389.entsmp
 
Unable to handle kernel paging request at virtual address 20202054 
 printing eip:                                                     
c0179661       
*pde = 32a03001 
*pte = 00000000 
Oops: 0000      
iptable_filter ip_tables parport_pc lp parport ide-cd cdrom autofs tg3 floppy microcode keybdev mousedev hid input usb-uhci usbcore ext3 jbd qla2200 qla2300 a
CPU:    5                                                                      
EIP:    0060:[<c0179661>]    Not tainted 
EFLAGS: 00010203                         
                 
EIP is at d_lookup [kernel] 0x71 (2.4.21-1.1931.2.389.entsmp) 
eax: f7919c00   ebx: 20202020   ecx: 00000013   edx: f7800000 
esi: 20202010   edi: 00000000   ebp: 4eba9cb7   esp: f2129ee0 
ds: 0068   es: 0068   ss: 0068                                
Process find (pid: 9825, stackpage=f2129000) 
Stack: f2129f90 c0178ab0 f3ccfd80 c044c780 20202010 f7919c00 f5366000 0000000a
       f5366000 f536600a 00000000 f2129f40 c016dd5b f3ce6e80 f2129f40 f5366000
       c016e495 f3ce6e80 f2129f40 00000000 00000008 00000000 f2129f98 00000000
Call Trace:   [<c0178ab0>] dput [kernel] 0x30 (0xf2129ee4)                      
[<c016dd5b>] cached_lookup [kernel] 0x1b (0xf2129f10)      
[<c016e495>] link_path_walk [kernel] 0x425 (0xf2129f20) 
[<c016ea19>] path_lookup [kernel] 0x39 (0xf2129f60)     
[<c016ed69>] __user_walk [kernel] 0x49 (0xf2129f70) 
[<c0169f2f>] sys_lstat64 [kernel] 0x1f (0xf2129f8c) 
                                                    
 
Code: 39 6e 44 8b 1b 75 e8 8b 7c 24 34 39 7e 0c 75 df 8b 57 4c 85

Does not seem to be reproducible on similar 4-way or 8-way x440 configurations.

Greg/Glen - this is a RHEL3 beta1 bug, so please submit this to Red Hat.

John - since this does not happen on 4-way or 8-way x440 configs and only
       happens on a 16-way, are you (or other members of your team)
       actively investigating this?

Comment 1 IBM Bug Proxy 2003-08-19 19:58:36 UTC

Created attachment 93759
odd error message followed by panic

Comment 2 Ingo Molnar 2003-08-20 11:59:23 UTC

Can you trigger the memory corruption if you keep irqbalance disabled, but set
the IRQ affinities explicitly, via /proc/irq/<irqnr>/smp_affinity? [this is what
irqbalance does as well]

By default the IRQs will be routed to CPU0 (check this in /proc/interrupts).
Does the error happen if you try the affinity masks 0x0001 ... 0x8000 to route
the irq(s) to CPU0 ... CPU15? Also, please check that the affinities actually
work as expected - i.e. in /proc/interrupts you should see the given irq source
being redirected to the CPU you specify in the affinity mask.
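
For anyone repeating this test, the single-CPU masks can be generated instead of typed by hand. The following is only a minimal Python sketch, assuming a box with 16 CPUs; the helper names and the IRQ number 24 are made up for illustration, and set_affinity() would need root to actually write the file.

```python
def affinity_mask(cpu):
    """Hex bitmask selecting exactly one CPU (bit N set for CPU N)."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    return format(1 << cpu, "04x")


def affinity_path(irq):
    """Path of the per-IRQ affinity file, /proc/irq/<irqnr>/smp_affinity."""
    return "/proc/irq/%d/smp_affinity" % irq


def set_affinity(irq, cpu):
    """Route one IRQ to one CPU (hypothetical helper; requires root)."""
    with open(affinity_path(irq), "w") as f:
        f.write(affinity_mask(cpu) + "\n")


if __name__ == "__main__":
    # Dry run only: show what would be written for CPU0 .. CPU15.
    # IRQ 24 is a placeholder; pick a real one from /proc/interrupts.
    for cpu in range(16):
        print("echo %s > %s" % (affinity_mask(cpu), affinity_path(24)))
```

After each change, the counter for that IRQ in /proc/interrupts should only keep growing in the column of the selected CPU.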

Comment 3 IBM Bug Proxy 2003-08-20 19:03:15 UTC

Please leave LTC3926 in header to aid in tracking on IBM side. Thanks.

Comment 4 Arjan van de Ven 2003-08-21 18:24:46 UTC

Question: can you find out which of qla2200/qla2300 is used on the corrupted
partition?

Comment 6 john stultz 2003-08-22 23:29:37 UTC

Further investigation seems to point to /proc/interrupts causing memory corruption
on systems w/ 32 cpus.  Similar panics and odd machine behavior can be seen by
just repeatedly catting /proc/interrupts.  It looks like get_irq_list() takes a
pointer to a page but does not do any bounds checking, thus causing overflows on
high cpu count boxes.  Prematurely returning from get_irq_list() seems to resolve
the issue.

I'm guessing irqbalance reads /proc/interrupts to decide where to route interrupts,
which is why it seems to trigger the problem as well.
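
A rough back-of-the-envelope check of why the CPU count matters: every active IRQ gets one counter column per CPU, so the listing grows with both numbers while the buffer stays one page. The Python sketch below only estimates the size. The field widths, the 16 made-up IRQ names, and the "IO-APIC-level" label are assumptions chosen to resemble the usual /proc/interrupts layout, not values taken from this report or from the kernel source.

```python
PAGE_SIZE = 4096  # one i386 page, the buffer get_irq_list() is handed


def render_interrupts(ncpus, irq_names):
    """Build a /proc/interrupts-like listing, one counter column per CPU.

    Field widths are illustrative assumptions, not the kernel's exact format.
    """
    lines = [" " * 11 + "".join("CPU%-8d" % j for j in range(ncpus))]
    for irq, name in enumerate(irq_names):
        counters = "".join("%10d " % 0 for _ in range(ncpus))
        lines.append("%3d: %s %14s  %s" % (irq, counters, "IO-APIC-level", name))
    return "\n".join(lines) + "\n"


def overflow_bytes(ncpus, irq_names):
    """Bytes by which the listing would run past a single page (0 if it fits)."""
    return max(0, len(render_interrupts(ncpus, irq_names)) - PAGE_SIZE)


if __name__ == "__main__":
    names = ["dev%02d" % i for i in range(16)]  # hypothetical active IRQs
    for ncpus in (8, 16, 32):
        size = len(render_interrupts(ncpus, names))
        print("%2d cpus: %5d bytes, overflow %d" % (ncpus, size, overflow_bytes(ncpus, names)))
```

Under these assumptions the listing fits in a page at lower CPU counts and runs past it at 32, which would match the problem only showing up on the 16-way (32 with HT) box.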

Comment 7 Rik van Riel 2003-08-23 00:14:14 UTC

oh dear, /proc/interrupts again

We actually ran into this bug before, for PPC64 pSeries, and fixed it just for
that platform because we weren't quite sure of the bugfix and wanted to test it
there first before enabling it on the other architectures.

We'll enable the seq_file /proc/interrupts ASAP.

Comment 8 john stultz 2003-08-23 02:03:14 UTC

Created attachment 93873
RFC'ed 2.4.22-rc2 patch sent to lkml

posted to lkml:
"	Recently I've been seeing memory corruption issues related to
/proc/interrupts on a 16way (32x w/ HT) x440. Basically get_irq_list() does not
do any bounds checking on the page it is given, and can easily overrun the
buffer when the cpu and interrupt count is high enough. 

This patch backports the 2.5 seq_file implementation of /proc/interrupts for
CONFIG_X86 (hopefully leaving other arches alone). "
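
The general idea behind a seq_file style reader, as opposed to formatting everything into one page, is to emit one record at a time and stop before the buffer is exceeded. The Python sketch below illustrates only that idea. It is not the kernel interface or the attached patch, and the function name and sizes are made up. As far as I understand, the real seq_file code retries with a larger buffer when a single record does not fit, whereas this sketch just raises.

```python
def bounded_chunks(records, bufsize):
    """Yield chunks of whole records, each chunk at most bufsize characters.

    Illustrative only: a record that cannot fit in an empty buffer is an error.
    """
    chunk = ""
    for rec in records:
        if len(rec) > bufsize:
            raise ValueError("record larger than buffer")
        if len(chunk) + len(rec) > bufsize:
            # Buffer would overflow: hand back what fits, resume with this record.
            yield chunk
            chunk = ""
        chunk += rec
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Hypothetical records standing in for per-IRQ lines.
    lines = ["irq %2d: %s\n" % (i, "0 " * 32) for i in range(20)]
    for n, chunk in enumerate(bounded_chunks(lines, 512)):
        print("read %d returned %d bytes" % (n, len(chunk)))
```

Each read returns at most one buffer of complete lines, so a higher CPU or interrupt count means more reads rather than a write past the end of the page.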

Comment 9 Ingo Molnar 2003-08-23 07:50:56 UTC

Created attachment 93878
get_irq_list() overflow fix

Could you please give the attached patch a go? The taroon sources differ
significantly from upstream 2.4 in this area, so the patch is quite different
as well - but taroon has the same problem.

Comment 10 Arjan van de Ven 2003-08-26 12:32:14 UTC

fix merged; pending confirmation

Comment 11 john stultz 2003-08-26 18:55:20 UTC

Tested w/ 2.4.21-1.1931.2.411.entsmp and all looks well

Comment 12 Jay Turner 2003-09-02 15:04:15 UTC

Closing out on confirmation from reporter.  The cables to chain together the
x445 machines we have in the lab haven't arrived from IBM, so we are depending
on them for confirmation that this issue is resolved.

Comment 13 IBM Bug Proxy 2004-02-12 19:59:15 UTC

----- Additional Comments From jstultz.com(prefers email via johnstul.com)  2004-02-12 15:00 -------
This is stale. The issue was fixed. (see comment #16)
