110889 – SMP race fixes from rmap 15k are missing

Summary: SMP race fixes from rmap 15k are missing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:	http://linuxvm.bkbits.net:8080/linux-...
Whiteboard:
Depends On:
Blocks:

Reported:	2003-11-25 10:04 UTC by Martin Wilck
Modified:	2015-01-04 22:04 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-01-05 03:44:16 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2003:394	0	normal	SHIPPED_LIVE	Updated 2.4 kernel fixes various bugs	2003-12-23 05:00:00 UTC

Description Martin Wilck 2003-11-25 10:04:12 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030624

Description of problem:
The latest rmap patch (rmap-15k) contains at least two fixes for SMP
race conditions (BK changesets
http://linuxvm.bkbits.net:8080/linux-2.4-rmap/cset@1.930.150.29
and http://linuxvm.bkbits.net:8080/linux-2.4-rmap/cset@1.930.150.30)
that are not yet included in the latest kernel update.

We and our partners at Fujitsu have experienced several different
kernel panics lately that originate from corrupted VM data structures.
These problems seem to be fixed when the two rmap fixes mentioned
above are applied to the 2.4.20-20.9 kernel source.

When will Red Hat publish an errata kernel for RH9 that contains the
above fixes?


Version-Release number of selected component (if applicable):
2.4.20-20.9

How reproducible:
Always

Steps to Reproduce:
1. Run RH9 with 2.4.20-20.9smp or 2.4.20-20.9enterprise on a system
with Intel Pentium IV, Hyperthreading enabled.
2. Run an I/O-intensive stress test.
 

Actual Results:  System freezes after several minutes to hours, panic
message indicates corrupt VM data structures.


Expected Results:  Test should run forever.

Additional info:

Comment 1 Martin Wilck 2003-11-25 10:10:17 UTC

Here is a sample panic:

CPU:    0
EIP:    0060:[<c01496b2>]    Tainted: P
EFLAGS: 00010202

EIP is at rmqueue [kernel] 0x312 (2.4.20-20.9smp)
eax: 01040088   ebx: 0000efd0   ecx: 00001000   edx: 000054c9
esi: c1000030   edi: c0343400   ebp: c1128c28   esp: c6233e80
ds: 0068   es: 0068   ss: 0068
Process Bonnie (pid: 2676, stackpage=c6233000)
Stack: 00001000 c6232000 00000000 000044c9 000044c8 00000203 00000000 c0343400
       c0343400 c0345924 00000001 00000001 c01497b7 c034592c 00000000 000001d2
       00000000 c01498f1 c0345920 00000000 00000001 00000001

The panic is raised by the DEBUG_LRU_PAGE() macro in rmqueue, which
finds that the page flags (%eax) have the PG_inactive_dirty flag set.

Comment 2 Martin Wilck 2003-11-25 10:13:48 UTC

Here is another one, this time in
lru_cache_del()/del_page_from_inactive_clean_list()
(invalid next pointer in the list):

     ==>   next->prev=prev        
0xc0145656 <__lru_cache_del+742>:       mov    %edx,0x4(%eax)

*pde = 00000000
Oops: 0002
parport_pc lp parport autofs nfs lockd sunrpc e1000 keybdev mousedev
hid input usb-ohci usbcore ext3 jbd aic79xx sd_mod scsi_mod  
CPU:    1
EIP:    0060:[<c0145656>]    Not tainted
EFLAGS: 00210206

EIP is at __lru_cache_del [kernel] 0x2e6 (2.4.20-20.9smp)
eax: 00000000   ebx: c0344680   ecx: c1cc176c   edx: 00000000
esi: c1cc1750   edi: 000001fe   ebp: 00000000   esp: f6475e00
ds: 0068   es: 0068   ss: 0068

Process tdnum (pid: 5846, stackpage=f6475000)
Stack: c1cc1750 00000000 c0145724 c1cc1750 c014904f 00200296 f6474000 c1cc1750
       000001d6 c013c4b8 140ac000 00000000 f6474000 00000000 00000000 c1cc1750
       00000000 000001fe c0344680 c014704f c1cc1750 000001f4 c0345840 c01477cc
Call Trace:   [<c0145724>] lru_cache_del [kernel] 0x44 (0xf6475e08))
[<c014904f>] __free_pages_ok [kernel] 0x3f (0xf6475e10))
[<c013c4b8>] wait_on_page_timeout [kernel] 0xc8 (0xf6475e24))
[<c014704f>] rebalance_laundry_zone [kernel] 0x11f (0xf6475e4c))
[<c01477cc>] rebalance_dirty_zone [kernel] 0x9c (0xf6475e5c))
[<c01478d5>] rebalance_inactive_zone [kernel] 0x85 (0xf6475e7c))
[<c0147988>] rebalance_inactive [kernel] 0x48 (0xf6475e9c))
[<c01479ef>] do_try_to_free_pages [kernel] 0x1f (0xf6475ec0))
[<c01480f1>] try_to_free_pages [kernel] 0x51 (0xf6475ed4))
[<c0149957>] __alloc_pages [kernel] 0x167 (0xf6475ee4))
[<c0156d2c>] generic_commit_write [kernel] 0x8c (0xf6475f00))
[<c013f1b4>] generic_file_write [kernel] 0x394 (0xf6475f24))
[<c0152e07>] sys_write [kernel] 0x97 (0xf6475f94))
[<c01098cf>] system_call [kernel] 0x33 (0xf6475fc0))

        ==> prev->next=next
0xc0145659 <__lru_cache_del+745>:       mov    %eax,(%edx)
        ==> entry->next=entry->prev=NULL ;
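
The annotated disassembly above can be read as a list deletion running on an entry whose pointers were already cleared: %eax (next) and %edx (prev) are both 0, so the store `mov %edx,0x4(%eax)` writes to address 4, and "Oops: 0002" on i386 denotes a kernel-mode write to a not-present page. The following user-space sketch illustrates that reading only. It is not the kernel source: `list_del_checked` and `demo_double_delete` are hypothetical names, and the two deletions that would race on separate CPUs are simply run one after the other.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as a kernel list node: next at offset 0,
 * prev at offset sizeof(void *) (4 on i386). */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry directly after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Hypothetical checked deletion: returns 0 on success, or -1 when the
 * entry's pointers are already NULL. The unchecked sequence shown in
 * the disassembly would fault on the first store in that case. */
static int list_del_checked(struct list_head *entry)
{
    struct list_head *next = entry->next;
    struct list_head *prev = entry->prev;

    if (next == NULL || prev == NULL)
        return -1;

    next->prev = prev;   /* corresponds to: mov %edx,0x4(%eax) */
    prev->next = next;   /* corresponds to: mov %eax,(%edx)    */
    entry->next = NULL;  /* entry->next = entry->prev = NULL   */
    entry->prev = NULL;
    return 0;
}

/* Two deletions of the same entry, as could happen if two CPUs removed
 * one page from an LRU list without proper locking (serialised here).
 * Returns the result of the second deletion. */
static int demo_double_delete(void)
{
    struct list_head lru, page;

    list_init(&lru);
    list_add(&page, &lru);

    if (list_del_checked(&page) != 0)
        return 1;        /* first deletion should always succeed */
    return list_del_checked(&page);
}
```

Calling `demo_double_delete()` returns -1: the second deletion sees NULL pointers, which is the state the register dump shows. Whether the actual crash was caused by a double deletion or by some other corruption of the list cannot be determined from the oops alone.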

Comment 3 Martin Wilck 2003-12-02 10:20:50 UTC

I just looked at 2.4.20-24.9. Contrary to what I had hoped, it does NOT
include the fixes I mentioned above. I am disappointed. This is a real
bug that crashes real systems!!!

Comment 4 Giuseppe Raimondi 2003-12-03 12:37:25 UTC

Our customer would like to know more about the expected time frame for
fixing this bug.
Thanks,
Giuseppe

Comment 5 Dave Jones 2003-12-03 13:38:34 UTC

2.4.20-24.9 was released to fix the recent do_brk security bug, and no
non-security fixes went into that tree. A separate 'bug fix' update is
going to be released very soon.

I'll look into these patches for that update.
