Bug 110889 - SMP race fixes from rmap 15k are missing
Summary: SMP race fixes from rmap 15k are missing
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel   
Version: 9
Hardware: i686 Linux
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL: http://linuxvm.bkbits.net:8080/linux-...
Depends On:
Reported: 2003-11-25 10:04 UTC by Martin Wilck
Modified: 2015-01-04 22:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2004-01-05 03:44:16 UTC

External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2003:394
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated 2.4 kernel fixes various bugs
Last Updated: 2003-12-23 05:00:00 UTC

Description Martin Wilck 2003-11-25 10:04:12 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030624

Description of problem:
The latest rmap patch (rmap-15k) contains at least two fixes for SMP
race conditions (BK changesets
and http://linuxvm.bkbits.net:8080/linux-2.4-rmap/cset@1.930.150.30)
that are not yet included in the latest kernel update.

We and our partners at Fujitsu have experienced several different
kernel panics lately that originate from corrupted VM data structures.
These problems seem to be fixed when the two rmap fixes mentioned
above are applied to the 2.4.20-20.9 kernel source.

When will Red Hat publish an errata kernel for RH9 that contains the
above fixes?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run RH9 with 2.4.20-20.9smp or 2.4.20-20.9enterprise on a system
with Intel Pentium IV, Hyperthreading enabled.
2. Run an I/O-intensive stress test.

Actual Results:  System freezes after several minutes to hours; the
panic message indicates corrupt VM data structures.

Expected Results:  Test should run forever.

Additional info:

Comment 1 Martin Wilck 2003-11-25 10:10:17 UTC
Here is a sample panic:

CPU:    0
EIP:    0060:[<c01496b2>]    Tainted: P
EFLAGS: 00010202

EIP is at rmqueue [kernel] 0x312 (2.4.20-20.9smp)
eax: 01040088   ebx: 0000efd0   ecx: 00001000   edx: 000054c9
esi: c1000030   edi: c0343400   ebp: c1128c28   esp: c6233e80
ds: 0068   es: 0068   ss: 0068
Process Bonnie (pid: 2676, stackpage=c6233000)
Stack: 00001000 c6232000 00000000 000044c9 000044c8 00000203 00000000
       c0343400 c0345924 00000001 00000001 c01497b7 c034592c 00000000
       00000000 c01498f1 c0345920 00000000 00000001 00000001

The bug happens in the DEBUG_LRU_PAGE() macro in rmqueue when it is
found that the page flags (%eax) have the PG_inactive_dirty flag set.
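
To make the flag check above concrete, here is a minimal user-space sketch in C that lists which bits are set in the %eax value from this panic (0x01040088). The bit index used for PG_inactive_dirty is an assumption chosen for illustration only; the real value is defined in the headers of the rmap-patched kernel tree and may differ. The helper names are hypothetical and are not kernel functions.

```c
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: hypothetical bit index for PG_inactive_dirty. The real
 * definition lives in the rmap-patched kernel headers and may differ. */
#define HYPOTHETICAL_PG_INACTIVE_DIRTY 7

/* Return nonzero if bit `bit` is set in a page flags word. */
int flag_is_set(uint32_t flags, unsigned bit)
{
    return (int)((flags >> bit) & 1u);
}

/* Store the indices of all set bits in out[] (room for 32 entries)
 * and return how many were found. */
int set_bits(uint32_t flags, unsigned out[32])
{
    int n = 0;
    for (unsigned bit = 0; bit < 32; bit++)
        if (flag_is_set(flags, bit))
            out[n++] = bit;
    return n;
}

/* Print a one-line decode of a flags word, such as the %eax value
 * taken from an oops register dump. */
void print_flags(uint32_t flags)
{
    unsigned bits[32];
    int n = set_bits(flags, bits);

    printf("flags 0x%08x: set bits:", (unsigned)flags);
    for (int i = 0; i < n; i++)
        printf(" %u", bits[i]);
    printf("\n");
}
```

Calling print_flags(0x01040088) lists the raw bit positions; mapping them to PG_* names requires the headers of the exact kernel build that produced the panic.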

Comment 2 Martin Wilck 2003-11-25 10:13:48 UTC
Here is another one, this time in
(invalid next pointer in list)

     ==>   next->prev=prev        
0xc0145656 <__lru_cache_del+742>:       mov    %edx,0x4(%eax)

*pde = 00000000
Oops: 0002
parport_pc lp parport autofs nfs lockd sunrpc e1000 keybdev mousedev
hid input usb-ohci usbcore ext3 jbd aic79xx sd_mod scsi_mod  
CPU:    1
EIP:    0060:[<c0145656>]    Not tainted
EFLAGS: 00210206

EIP is at __lru_cache_del [kernel] 0x2e6 (2.4.20-20.9smp)
eax: 00000000   ebx: c0344680   ecx: c1cc176c   edx: 00000000
esi: c1cc1750   edi: 000001fe   ebp: 00000000   esp: f6475e00
ds: 0068   es: 0068   ss: 0068

Process tdnum (pid: 5846, stackpage=f6475000)
Stack: c1cc1750 00000000 c0145724 c1cc1750 c014904f 00200296 f6474000
       000001d6 c013c4b8 140ac000 00000000 f6474000 00000000 00000000
       00000000 000001fe c0344680 c014704f c1cc1750 000001f4 c0345840
Call Trace:   [<c0145724>] lru_cache_del [kernel] 0x44 (0xf6475e08))
[<c014904f>] __free_pages_ok [kernel] 0x3f (0xf6475e10))
[<c013c4b8>] wait_on_page_timeout [kernel] 0xc8 (0xf6475e24))
[<c014704f>] rebalance_laundry_zone [kernel] 0x11f (0xf6475e4c))
[<c01477cc>] rebalance_dirty_zone [kernel] 0x9c (0xf6475e5c))
[<c01478d5>] rebalance_inactive_zone [kernel] 0x85 (0xf6475e7c))
[<c0147988>] rebalance_inactive [kernel] 0x48 (0xf6475e9c))
[<c01479ef>] do_try_to_free_pages [kernel] 0x1f (0xf6475ec0))
[<c01480f1>] try_to_free_pages [kernel] 0x51 (0xf6475ed4))
[<c0149957>] __alloc_pages [kernel] 0x167 (0xf6475ee4))
[<c0156d2c>] generic_commit_write [kernel] 0x8c (0xf6475f00))
[<c013f1b4>] generic_file_write [kernel] 0x394 (0xf6475f24))
[<c0152e07>] sys_write [kernel] 0x97 (0xf6475f94))
[<c01098cf>] system_call [kernel] 0x33 (0xf6475fc0))

        ==> prev->next=next
0xc0145659 <__lru_cache_del+745>:       mov    %eax,(%edx)
        ==> entry->next=entry->prev=NULL ; 
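
The three annotated steps in this comment (next->prev=prev, prev->next=next, entry->next=entry->prev=NULL) read like an inlined doubly linked list unlink. The following is a user-space sketch in C under that assumption; struct list_head here is a minimal stand-in for the kernel type, and list_unlink_checked is a hypothetical helper that is not part of the kernel or of the rmap patch. With next == NULL, as in the register dump (eax is 0), the first store would target offset 4 from address 0 on i386, which matches a write fault.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Unchecked unlink: the same three steps annotated in the disassembly.
 * If entry->next is NULL, the first store dereferences a null pointer. */
void list_unlink(struct list_head *entry)
{
    struct list_head *next = entry->next;
    struct list_head *prev = entry->prev;

    next->prev = prev;                 /* ==> next->prev=prev */
    prev->next = next;                 /* ==> prev->next=next */
    entry->next = entry->prev = NULL;  /* ==> entry->next=entry->prev=NULL */
}

/* Hypothetical checked variant: refuse to unlink when the neighbours do
 * not point back at the entry. Returns 0 on success, -1 if the list
 * looks corrupt (for example, the entry was already unlinked). */
int list_unlink_checked(struct list_head *entry)
{
    if (entry->next == NULL || entry->prev == NULL)
        return -1;
    if (entry->next->prev != entry || entry->prev->next != entry)
        return -1;
    list_unlink(entry);
    return 0;
}
```

A second unlink of the same entry, which is one way an unlocked race on the LRU lists could show up, is rejected by the checked variant instead of faulting.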

Comment 3 Martin Wilck 2003-12-02 10:20:50 UTC
Just looked at 2.4.20-24.9, it does NOT include the fixes I mention
above, as I had hoped. I am disappointed. This is a real bug that
crashes real systems!!!

Comment 4 Giuseppe Raimondi 2003-12-03 12:37:25 UTC
The customer would like to know more about when a fix for this bug
can be expected.

Comment 5 Dave Jones 2003-12-03 13:38:34 UTC
2.4.20-24.9 was released to fix the recent do_brk security bug, and no
non-security fixes went into that tree. A separate 'bug fix' update is
going to be released very soon.

I'll look into these patches for that update.
