From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030624

Description of problem:
The latest rmap patch (rmap-15k) contains at least two fixes for SMP race conditions (BK changesets http://linuxvm.bkbits.net:8080/linux-2.4-rmap/cset@1.930.150.29 and http://linuxvm.bkbits.net:8080/linux-2.4-rmap/cset@1.930.150.30) that are not yet included in the latest kernel update.

We and our partners at Fujitsu have recently experienced several different kernel panics that originate from corrupted VM data structures. These problems seem to be fixed when the two rmap fixes mentioned above are applied to the 2.4.20-20.9 kernel source. When will Red Hat publish an errata kernel for RH9 that contains the above fixes?

Version-Release number of selected component (if applicable):
2.4.20-20.9

How reproducible:
Always

Steps to Reproduce:
1. Run RH9 with 2.4.20-20.9smp or 2.4.20-20.9enterprise on a system with an Intel Pentium IV, Hyperthreading enabled.
2. Run an I/O-intensive stress test.

Actual Results:
System freezes after several minutes to hours; the panic message indicates corrupt VM data structures.

Expected Results:
Test should run forever.

Additional info:
Here is a sample panic:

CPU:    0
EIP:    0060:[<c01496b2>]    Tainted: P
EFLAGS: 00010202
EIP is at rmqueue [kernel] 0x312 (2.4.20-20.9smp)
eax: 01040088   ebx: 0000efd0   ecx: 00001000   edx: 000054c9
esi: c1000030   edi: c0343400   ebp: c1128c28   esp: c6233e80
ds: 0068   es: 0068   ss: 0068
Process Bonnie (pid: 2676, stackpage=c6233000)
Stack: 00001000 c6232000 00000000 000044c9 000044c8 00000203 00000000 c0343400
       c0343400 c0345924 00000001 00000001 c01497b7 c034592c 00000000 000001d2
       00000000 c01498f1 c0345920 00000000 00000001 00000001

The bug happens in the DEBUG_LRU_PAGE() macro in rmqueue when it is found that the page flags (%eax) have the PG_inactive_dirty flag set.
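To check a flags word like the one in %eax against the PG_* definitions of the patched tree, it can help to list which bit positions are set. The following is only a small user-space sketch; the helper name set_bits is hypothetical, and the report does not say which bit number PG_inactive_dirty has in the rmap patch, so the mapping from bit index to flag name has to be taken from the actual kernel headers.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: store the index of every set bit
 * of a 32-bit flags word into out[] (lowest bit first) and return how
 * many bits were set. out[] must have room for 32 entries.
 */
static int set_bits(uint32_t flags, int out[32])
{
    int count = 0;
    int bit;

    for (bit = 0; bit < 32; bit++) {
        if (flags & ((uint32_t)1 << bit))
            out[count++] = bit;
    }
    return count;
}
```

Calling set_bits(0x01040088, out) with the %eax value from the panic fills out[] with the set bit positions, which can then be compared with the PG_* bit numbers in the headers of the kernel that produced the panic.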
Here is another one, this time in lru_cache_del()/del_page_from_inactive_clean_list() (invalid next pointer in list):

==> next->prev=prev
0xc0145656 <__lru_cache_del+742>:  mov %edx,0x4(%eax)

*pde = 00000000
Oops: 0002
parport_pc lp parport autofs nfs lockd sunrpc e1000 keybdev mousedev hid input usb-ohci usbcore ext3 jbd aic79xx sd_mod scsi_mod
CPU:    1
EIP:    0060:[<c0145656>]    Not tainted
EFLAGS: 00210206
EIP is at __lru_cache_del [kernel] 0x2e6 (2.4.20-20.9smp)
eax: 00000000   ebx: c0344680   ecx: c1cc176c   edx: 00000000
esi: c1cc1750   edi: 000001fe   ebp: 00000000   esp: f6475e00
ds: 0068   es: 0068   ss: 0068
Process tdnum (pid: 5846, stackpage=f6475000)
Stack: c1cc1750 00000000 c0145724 c1cc1750 c014904f 00200296 f6474000 c1cc1750
       000001d6 c013c4b8 140ac000 00000000 f6474000 00000000 00000000 c1cc1750
       00000000 000001fe c0344680 c014704f c1cc1750 000001f4 c0345840 c01477cc
Call Trace:
[<c0145724>] lru_cache_del [kernel] 0x44 (0xf6475e08))
[<c014904f>] __free_pages_ok [kernel] 0x3f (0xf6475e10))
[<c013c4b8>] wait_on_page_timeout [kernel] 0xc8 (0xf6475e24))
[<c014704f>] rebalance_laundry_zone [kernel] 0x11f (0xf6475e4c))
[<c01477cc>] rebalance_dirty_zone [kernel] 0x9c (0xf6475e5c))
[<c01478d5>] rebalance_inactive_zone [kernel] 0x85 (0xf6475e7c))
[<c0147988>] rebalance_inactive [kernel] 0x48 (0xf6475e9c))
[<c01479ef>] do_try_to_free_pages [kernel] 0x1f (0xf6475ec0))
[<c01480f1>] try_to_free_pages [kernel] 0x51 (0xf6475ed4))
[<c0149957>] __alloc_pages [kernel] 0x167 (0xf6475ee4))
[<c0156d2c>] generic_commit_write [kernel] 0x8c (0xf6475f00))
[<c013f1b4>] generic_file_write [kernel] 0x394 (0xf6475f24))
[<c0152e07>] sys_write [kernel] 0x97 (0xf6475f94))
[<c01098cf>] system_call [kernel] 0x33 (0xf6475fc0))

==> prev->next=next
0xc0145659 <__lru_cache_del+745>:  mov %eax,(%edx)
==> entry->next=entry->prev=NULL ;
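For readers who want to see why a NULL next pointer faults at exactly that instruction, the three annotated steps can be sketched in user-space C. This is an illustration under assumptions, not the kernel's code: the struct mirrors the layout of a two-pointer list node, and list_del_checked is a hypothetical name for a variant that validates the neighbours first. In the oops, %eax (next) is 0, so the store to 0x4(%eax), which is next->prev on a 32-bit system, is a kernel-mode write to an unmapped address near zero, which is what Oops: 0002 should indicate.

```c
#include <stddef.h>

/* Two-pointer list node; on i686 'prev' sits at offset 4 after 'next'. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/*
 * Hypothetical checked unlink (illustration only). Returns 0 if the entry
 * was unlinked, -1 if the list around it looks corrupt. An unchecked
 * version performs the same three stores without the tests and faults on
 * the first one when 'next' is NULL.
 */
static int list_del_checked(struct list_head *entry)
{
    struct list_head *next = entry->next;
    struct list_head *prev = entry->prev;

    if (next == NULL || prev == NULL)
        return -1;              /* the case seen in the oops: next == 0 */
    if (next->prev != entry || prev->next != entry)
        return -1;              /* neighbours do not point back at us */

    next->prev = prev;          /* ==> next->prev=prev */
    prev->next = next;          /* ==> prev->next=next */
    entry->next = NULL;         /* ==> entry->next=entry->prev=NULL */
    entry->prev = NULL;
    return 0;
}
```

A second unlink of the same entry returns -1 here, because its pointers were already set to NULL. An entry whose next pointer is already NULL when it is unlinked is one way such a double removal, for example from two CPUs racing on the same page, could show up as the crash above.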
I just looked at 2.4.20-24.9. Contrary to what I had hoped, it does NOT include the fixes I mentioned above. I am disappointed. This is a real bug that crashes real systems!!!
The customer would like to know more about the expected time frame for fixing this bug. Thanks, Giuseppe
2.4.20-24.9 was released to fix the recent do_brk security bug, and no non-security fixes went into that tree. A separate 'bug fix' update is going to be released very soon. I'll look into these patches for that update.