From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624

Description of problem:
Under stress test (using the stress tool with 8 cpu threads, 4 io threads, 3 vm threads, and 3 hdd threads), a pagefault occurs in the kernel after approximately 14 hours of load. In addition, there is a problem in the page fault routines as well. This is the second crash we have experienced while testing with this tool (the first appeared to be in the SCSI modules, but not enough instrumentation was being used to capture the fault). Here is the stack trace:

invalid kernel-mode pagefault 2! [addr:00000004, eip:02154c51]
Pid/TGid: 560/560, comm: stress
EIP: 0060:[<02154c51>] CPU: 0
EIP is at __free_pages_ok [kernel] 0x2b1 (2.4.21-1.1931.2.349.2.2.entsmp)
ESP: 0002:02399f80 EFLAGS: 00010046 Not tainted
EAX: 00000000 EBX: 03778dcc ECX: 03778e08 EDX: 00000000
ESI: 02399f80 EDI: 0001ee19 EBP: 0300002c
DS: 0068 ES: 0068 FS: 0000 GS: 0033
CR0: 8005003b CR2: 00000004 CR3: 00101000 CR4: 000006f0
Call Trace:
[<02145339>] wait_on_page_timeout [kernel] 0xc9 (0x20a73dac)
[<021524ef>] rebalance_laundry_zone [kernel] 0x12f (0x20a73de8)
[<02153054>] do_try_to_free_pages [kernel] 0x134 (0x20a73e1c)
[<02153691>] try_to_free_pages [kernel] 0x51 (0x20a73e38)
[<021553a7>] __alloc_pages [kernel] 0x167 (0x20a73e48)
[<02140070>] do_anonymous_page [kernel] 0xf0 (0x20a73e88)
[<02140b63>] handle_mm_fault [kernel] 0xf3 (0x20a73ec0)
[<0211f5cc>] do_page_fault [kernel] 0x1bc (0x20a73ef4)
[<0212518e>] context_switch [kernel] 0x9e (0x20a73f40)
[<02123377>] schedule [kernel] 0x2f7 (0x20a73f5c)
[<0211f410>] do_page_fault [kernel] 0x0 (0x20a73fa0)
[<0211f410>] do_page_fault [kernel] 0x0 (0x20a73fb0)

invalid operand: 0000
iptable_filter ip_tables ide-cd cdrom autofs eepro100 mii microcode keybdev mousedev hid input usbcore ext3 jbd ips aic7xxx sd_mod scsi_mod
CPU: 0
EIP: 0060:[<0211f488>] Not tainted
EFLAGS: 00010006
EIP is at do_page_fault [kernel] 0x78 (2.4.21-1.1931.2.349.2.2.entsmp)
eax: 00000001 ebx: 00000004 ecx: 00000001 edx: 02375e14
esi: 00000002 edi: 0211f410 ebp: 00000002 esp: 20a73ca4
ds: 0068 es: 0068 ss: 0068
Process stress (pid: 560, stackpage=20a73000)
Stack: 20a74000 00000002 00000004 02154c51 00000080 00000001 02109dcb 00000004
       00000008 00000400 0372f320 0372f2e4 0241f000 20a9e580 fffd7d8d 0242d180
       00000000 00000000 20a9e000 20a9e000 20a73d70 0242d200 21c70068 02420068
Call Trace:
[<02154c51>] __free_pages_ok [kernel] 0x2b1 (0x20a73cb0)
[<02109dcb>] __switch_to [kernel] 0x2fb (0x20a73cbc)
[<021233a7>] schedule [kernel] 0x327 (0x20a73d08)
[<021cbd8c>] submit_bh_rsector [kernel] 0x4c (0x20a73d20)
[<02121b80>] wake_up_cpu [kernel] 0x20 (0x20a73d48)
[<0211f410>] do_page_fault [kernel] 0x0 (0x20a73d5c)
[<02154c51>] __free_pages_ok [kernel] 0x2b1 (0x20a73d98)
[<02145339>] wait_on_page_timeout [kernel] 0xc9 (0x20a73dac)
[<021524ef>] rebalance_laundry_zone [kernel] 0x12f (0x20a73de8)
[<02153054>] do_try_to_free_pages [kernel] 0x134 (0x20a73e1c)
[<02153691>] try_to_free_pages [kernel] 0x51 (0x20a73e38)
[<021553a7>] __alloc_pages [kernel] 0x167 (0x20a73e48)
[<02140070>] do_anonymous_page [kernel] 0xf0 (0x20a73e88)
[<02140b63>] handle_mm_fault [kernel] 0xf3 (0x20a73ec0)
[<0211f5cc>] do_page_fault [kernel] 0x1bc (0x20a73ef4)
[<0212518e>] context_switch [kernel] 0x9e (0x20a73f40)
[<02123377>] schedule [kernel] 0x2f7 (0x20a73f5c)
[<0211f410>] do_page_fault [kernel] 0x0 (0x20a73fa0)
[<0211f410>] do_page_fault [kernel] 0x0 (0x20a73fb0)
Code: Bad EIP value.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-1.1931.2.349.2.2.ent

How reproducible:
Didn't try

Steps to Reproduce:
1. Get the stress tool from http://weather.ou.edu/~apw/projects/stress/
2. Run stress with command line options "-c 8 -i 4 -m 3 -d 3"
3. Wait for crash (approx. 14-15 hours)

Actual Results:
After 14 hours have elapsed, the kernel will pagefault with the above trace.
Expected Results:
Stress tool should have continued until interrupted.

Additional info:
Hardware: IBM x330, dual P3 1.2GHz, 512 MB, ServeRAID 4MX RAID card with dual 36 GB drives attached (RAID 1)

Will update further. Currently attempting to reproduce.
This bug is reproducible. The second attempt produced the same pagefault, followed by the invalid operand in the pagefault routines, after less than 5 hours.
Can you reproduce this with the latest kernel available via RHN?
Yes, but it now appears as a NULL pointer dereference, rather than a pagefault. This is reproducible as well (first time happened after 13 hours, second after less than 5). Here's the dump:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip: c0154ac1
*pde = 11db7001
*pte = 1fc2e067
Oops: 0002
iptable_filter ip_tables ide-cd cdrom autofs e100 microcode keybdev mousedev hid input usbcore ext3 jbd ips aic7xxx sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c0154ac1>] Not tainted
EFLAGS: 00010046
EIP is at __free_pages_ok [kernel] 0x2b1 (2.4.21-1.1931.2.393.entsmp)
eax: 00000000 ebx: c17508cc ecx: c1750908 edx: 00000000
esi: c03a0f80 edi: 0001e359 ebp: c100002c esp: dc6cbd8c
ds: 0068 es: 0068 ss: 0068
Process sleep (pid: 1006, stackpage=dc6cb000)
Stack: c03a0f80 00000002 c129f558 c03a3e10 c1781330 c03a21dc c1750908 c03a0f80
       c103c02c c03a2158 00000282 ffffffff 0000f1ac c1750908 00000000 00000011
       c03a0f80 c015235f c1750908 000001f4 00000000 c03a2148 c0152c0c c137a7f0
Call Trace:
[<c015235f>] rebalance_laundry_zone [kernel] 0x12f (0xdc6cbdd0)
[<c0152c0c>] rebalance_dirty_zone [kernel] 0x9c (0xdc6cbde4)
[<c0152ec4>] do_try_to_free_pages [kernel] 0x134 (0xdc6cbe04)
[<c0153501>] try_to_free_pages [kernel] 0x51 (0xdc6cbe20)
[<c0155217>] __alloc_pages [kernel] 0x167 (0xdc6cbe30)
[<c0140c08>] do_no_page [kernel] 0x398 (0xdc6cbe70)
[<c0143b9d>] unmap_fixup [kernel] 0x11d (0xdc6cbea0)
[<c0140f11>] handle_mm_fault [kernel] 0xd1 (0xdc6cbec0)
[<c011f5ec>] do_page_fault [kernel] 0x13c (0xdc6cbef4)
[<c014343d>] do_mmap_pgoff [kernel] 0x4ad (0xdc6cbf08)
[<c0112eb5>] old_mmap [kernel] 0x105 (0xdc6cbf64)
[<c011f4b0>] do_page_fault [kernel] 0x0 (0xdc6cbfb0)
Code: 89 50 04 89 02 c7 43 04 00 00 00 00 c7 03 00 00 00 00 d1 64
Rik: what's the version number of the kernel that got the flags update patch?
The page->flags atomic update patch went into kernel 2.4.21-1.1931.2.399. The symptoms of this bug report suggest that the problem may be fixed by the page->flags fix; however, I am not 100% sure. Todd, could you please test kernel .399 or newer to verify whether the bug still exists?

Thank you,
Rik
Tested with the .399 kernel. The oops moved slightly, and it took 34 hours to generate, but it is still basically the same problem.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip: c0154c31
*pde = 1e6ee001
*pte = 1e6e8067
Oops: 0002
iptable_filter ip_tables ide-cd cdrom autofs e100 microcode keybdev mousedev hid input usbcore ext3 jbd ips aic7xxx sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c0154c31>] Not tainted
EFLAGS: 00010046
EIP is at __free_pages_ok [kernel] 0x2c1 (2.4.21-1.1931.2.399.entsmp)
eax: 00000000 ebx: c119bff8 ecx: c119bfbc edx: 00000000
esi: c03a0f80 edi: 00005ddc ebp: c100002c esp: de6edda4
ds: 0068 es: 0068 ss: 0068
Process stress (pid: 584, stackpage=de6ed000)
Stack: c03a0f80 00000002 c0145a09 c03a3e10 c1781b70 c03a21dc c119bfbc c03a0f80
       c103c02c c03a2158 00000282 ffffffff 00002eee c119bfbc 00000000 0000003a
       c03a0f80 c01524af c119bfbc 000001f4 00000000 c03a2148 c1632664 c03a0f80
Call Trace:
[<c0145a09>] wait_on_page_timeout [kernel] 0xc9 (0xde6eddac)
[<c01524af>] rebalance_laundry_zone [kernel] 0x12f (0xde6edde8)
[<c0153024>] do_try_to_free_pages [kernel] 0x134 (0xde6ede1c)
[<c0153661>] try_to_free_pages [kernel] 0x51 (0xde6ede38)
[<c0155387>] __alloc_pages [kernel] 0x167 (0xde6ede48)
[<c0140470>] do_anonymous_page [kernel] 0xf0 (0xde6ede88)
[<c0140f31>] handle_mm_fault [kernel] 0xd1 (0xde6edec0)
[<c011f60c>] do_page_fault [kernel] 0x13c (0xde6edef4)
[<e080e461>] scsi_finish_command [scsi_mod] 0x81 (0xde6edf44)
[<e080e1b6>] scsi_softirq_handler [scsi_mod] 0x76 (0xde6edf58)
[<c010dbd8>] do_IRQ [kernel] 0x148 (0xde6edf98)
[<c011f4d0>] do_page_fault [kernel] 0x0 (0xde6edfb0)
Code: 89 50 04 89 02 c7 43 04 00 00 00 00 d1 64
I just started the crash tool on my test system (also a dual CPU system with 512MB RAM). I am running the .411 kernel. I'll let you know if/when I reproduce the crash.
Is this bug reproducible on any other machine, or has it only been seen on this one system? I've been running the test here now and all that happens is that my little test system gets so overloaded the cron jobs can't finish in time for new ones to be started, and system load spirals higher and higher. There are no crashes, though...
I had been testing a couple of other systems, but power problems from the storms have made it difficult to keep the systems online during the tests. I will start a test today, to run over the weekend (or until the systems crash) on several platforms.
We have reproduced an eerily similar (quite likely the same) bug on an AMD64 system here. We know which line of code is causing the oops. What we don't yet know is why... The oops is happening in the list_del() in the following piece of code from __free_pages_ok():

		if (BAD_RANGE(zone,buddy1))
			BUG();
		if (BAD_RANGE(zone,buddy2))
			BUG();
		list_del(&buddy1->list);
		mask <<= 1;
		area++;
		index >>= 1;
		page_idx &= mask;
	}

I am about to audit the VM code (and our individual VM patches) to figure out what could cause this problem.
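For anyone following along: the faulting address 00000004 in the oopses above appears consistent with list_del() writing through a NULL neighbour pointer, since prev sits at offset 4 in a 32-bit list_head (the "89 50 04" in the Code: bytes is a mov to 4(%eax)). Below is a minimal user-space sketch of 2.4-style list primitives -- the same shape as the kernel's list.h, but not the actual kernel code -- showing why a corrupted neighbour pointer faults at a small offset:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal 2.4-style circular doubly linked list (illustrative only). */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* list_del writes through entry->prev and entry->next.  If either
 * pointer is garbage (e.g. NULL), the first store below faults at a
 * tiny offset such as 0x4 -- which is exactly the kind of address
 * seen in the oopses in this report. */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

static int list_empty(const struct list_head *h) { return h->next == h; }
```

On i386, offsetof(struct list_head, prev) is 4, which is why a NULL next pointer produces a fault at virtual address 00000004 rather than 0.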
Created attachment 94076 [details] Description of bug-hunting session

Short form of the story: somehow the zone free area lists are getting screwed up such that the self-pointers for two lists end up getting swapped, with bad results. Read the attachment for details. I booby-trapped list_add() to try to catch this state of affairs at its creation; I'm currently running stress tests to try to make it happen.
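To make the booby-trap idea concrete, here is a user-space sketch (not the actual patch; the name and return-value convention are illustrative): check that the head's neighbours point back at it before linking, so swapped self-pointers are caught at list_add() time rather than at some much later list_del():

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Booby-trapped list_add: validate list consistency before linking.
 * In the kernel this check would be a BUG(); here we return -1 so it
 * can be exercised from user space. */
static int list_add_checked(struct list_head *new, struct list_head *head)
{
	/* If two lists' self-pointers got swapped, head's neighbours no
	 * longer point back at head -- catch that here, at creation. */
	if (head->next->prev != head || head->prev->next != head)
		return -1;
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
	return 0;
}
```

The check is cheap (two loads and compares per insertion), which is what makes it practical to leave armed during a multi-hour stress run.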
*** Bug 101946 has been marked as a duplicate of this bug. ***
Another run; this time the zone free lists weren't corrupted, but one of the buddies in buddy coalescing was:

(gdb) print *((struct page *)$r9)
$18 = {list = {next = 0x100015fc4a0, prev = 0xffffffff8045f628}, mapping = 0x0,
  index = 0x3893a, next_hash = 0x10002101f98, count = {counter = 0x0},
  flags = 0x100000000000000, lru = {next = 0x0, prev = 0x0},
  pte = {chain = 0x0, direct = 0x0}, age = 0xfe, pprev_hash = 0x0, buffers = 0x0}

Here the page's list.prev is pointing at one of the free area entries rather than at a valid page. It could be that the corruption is happening on page lists and these lists eventually get linked into the free area lists... Still investigating.
Here is a mail message Jim sent me, discussing how to reproduce the issue on x86. Note: I was unable to reproduce the problem on pro5 (dual xeon w/HT). -Jeff
--------------------------------------------------------------------------------
Here's the x86 version of the test I used to reproduce BZ#102282:

First, obtain and install the following RPMs from ~jparadis/QA:

    contest-170403-4.i386.rpm
    stress-kernel-1.2.15-16.3.i386.rpm

If they complain about missing pieces, install them with --nodeps... they oughta still work.

Next, open up two shell windows. In the first window, cd to a Linux source tree (e.g. /usr/src/linux-2.4) and issue the following command:

    % contest -n 10 io_load mem_load

In the second window, cd to /usr/bin/ctcs and issue the command:

    % ./hell-hound.sh

Select the default (0) for additional memory, then answer "no" to all of the tests *except* the memory test. Answer "yes" to "Proceed".

Hopefully, your system will crash in about half an hour... Let me know how it goes!

--jim
Actually, that was my formula for reproducing it on one of the hammer boxes. I've since tried it on taroon-latest with other hammer boxes and been unable to reproduce. jmoyer has tried on x86 and also been unable to reproduce. I'm going to try different loads (maybe run them over the weekend) and see if this recurs or not...
I've been able to replicate once this morning on a single-proc i686 with 512M RAM running contest and stress in parallel. Unfortunately I didn't have a serial console attached when I did it, but I'm attempting to replicate again with one attached and will post the results.
Can we get a module list from every system that reproduced this? (And see if there's something in common that's not on the systems where we tried hard but failed to reproduce.)
I re-ran my tests over the weekend on i686 hardware. I was able to duplicate the issue (using just the stress tool originally mentioned) on 3 systems: a Xeon 2proc with HT enabled, and 2 separate P3 2proc systems. All systems were running the 421 SMP kernel with all of the latest Taroon packages installed (as of Friday afternoon). OS configurations were identical.

The P3 systems run with the following modules (according to /proc/modules):
  in use: e100, usbcore, ext3, jbd, ips, sd_mod, scsi_mod
  loaded: ide-cd, cdrom, autofs, microcode, keybdev, mousedev, hid, input, aic7xxx

The Xeon systems have the following:
  in use: tg3, usbcore, ext3, jbd, mptscsih, mptbase, sd_mod, scsi_mod
  loaded: autofs, microcode, keybdev, mousedev, hid, input, mptctl
I'm getting a slightly different oops, but it appears to be related to the same thing. I've seen this repeatedly on a couple of machines here in Centennial (both UP and SMP). In addition, I was able to replicate on a kernel with CONFIG_HUGETLBFS off. I have some debug code from Ingo that I'm going to apply and will post results of that as soon as the machines fall over.

VM: reclaim_page, found unknown page
Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip: c0144cb0
*pde = 00000000
Oops: 0002
parport_pc lp parport nfs lockd sunrpc e100 floppy microcode keybdev mousedev hid input ehci-hcd usb-uhci usbcore ext3 jbd
CPU: 0
EIP: 0060:[<c0144cb0>] Not tainted
EFLAGS: 00010246
EIP is at __lru_cache_del [kernel] 0x1e0 (2.4.21-1.1931.2.423.ent)
eax: 00000000 ebx: c14e9240 ecx: c14e925c edx: 00000000
esi: c0349e80 edi: 0000003f ebp: 00000000 esp: c1d85d98
ds: 0068 es: 0068 ss: 0068
Process stress (pid: 3638, stackpage=c1d85000)
Stack: c1d84000 c14e9240 00000000 c0144d65 c0148461 c14e9240 00000141 c013b612
       c14e9240 c16fb0d0 00000000 c1d84000 00000000 00000000 00000000 c1019dd8
       c01571fe d9dd2988 00000000 c14e9240 0000003f 00000000 c0146239 c14e9240
Call Trace:
[<c0144d65>] lru_cache_del [kernel] 0x5 (0xc1d85da4)
[<c0148461>] __free_pages_ok [kernel] 0x31 (0xc1d85da8)
[<c013b612>] wait_on_page_timeout [kernel] 0xc2 (0xc1d85db4)
[<c01571fe>] try_to_free_buffers [kernel] 0x8e (0xc1d85dd8)
[<c0146239>] rebalance_laundry_zone [kernel] 0xd9 (0xc1d85df0)
[<c0146c04>] do_try_to_free_pages [kernel] 0x134 (0xc1d85e14)
[<c0147231>] try_to_free_pages [kernel] 0x51 (0xc1d85e30)
[<c0148dd7>] __alloc_pages [kernel] 0x167 (0xc1d85e40)
[<c013b150>] add_to_page_cache_unique [kernel] 0x50 (0xc1d85e54)
[<c0149d2c>] read_swap_cache_async [kernel] 0xac (0xc1d85e84)
[<c01367c1>] swapin_readahead [kernel] 0x51 (0xc1d85ea4)
[<c0136a2f>] do_swap_page [kernel] 0x24f (0xc1d85ec0)
[<c0137344>] handle_mm_fault [kernel] 0xf4 (0xc1d85edc)
[<c011a80c>] do_page_fault [kernel] 0x13c (0xc1d85f0c)
[<e0850e12>] rh_init_int_timer [usb-uhci] 0x62 (0xc1d85f2c)
[<e0850d60>] rh_int_timer_do [usb-uhci] 0x0 (0xc1d85f34)
[<c012b66e>] __run_timers [kernel] 0xae (0xc1d85f38)
[<c012b327>] timer_bh [kernel] 0x47 (0xc1d85f64)
[<c012a63d>] tqueue_bh [kernel] 0x1d (0xc1d85f6c)
[<c011e59b>] context_switch [kernel] 0x7b (0xc1d85f84)
[<c011d105>] schedule [kernel] 0x125 (0xc1d85fa0)
[<c011a6d0>] do_page_fault [kernel] 0x0 (0xc1d85fb0)
Code: 89 50 04 89 02 c7 41 04 00 00 00 00 c7 43 1c 00 00 00 00 0f
Kernel panic: Fatal exception
Could you please also try a test with swap turned off, if that is possible without OOM-ing quickly?
Ingo: ~jparadis/QA is on the Boston share... I'll copy the rpms over to an equivalent place on the Centennial share.
Here's another possible way to reproduce this bug a bit faster. I've only run this once and lost the output (durn screen blanking), but I *think* it works: run "stress" (not stress-kernel, but the original stress test that tpalino used) opposite "contest -c -n 10 io_load mem_load". The "-c" flag prevents the swapon/swapoff that contest usually does; without it, stress gets oom-killed too much to be useful. I tried this last nite and logged uptime to a file, then went home. Came in the next day and discovered that my system (dual-Xeon DELL WS, 512Mb) only stayed up 38 minutes after I left...
Jay, is there any chance you can capture a vmcore dump of one of these failures?
I will indeed try. To this point, I've had code running on three boxes since yesterday and nothing has fallen over. It must know that we're looking for it :-)
I reproduced this on a UP celeron w/ 128MB of RAM running a UP kernel with Ingo's patch and hugetlbfs turned off. The bug was reproduced using the instructions from Jim's email posted above. I'll attach the log output.
Created attachment 94370 [details] log from test run
Ditto for me, also on a UP system but with 512M RAM. I have the netdump log, vmcore and serial console output if anyone is interested (already handed off vmcore and log to sct.)
Can someone make Jay's latest vmcore, log and console output available?

Larry
yakko.test.redhat.com:/var/crash (root/standard testlab password)
Crashed my SMP ix86 box again running the 431 kernel. vmcore and log file on yakko.test.redhat.com:/var/crash (root/standard testlab password)
Ingo asked a few comments above whether we could reproduce without swap. Well, my last 4 reproducer runs all died with solid lockups after exhausting swap, so I've been forced to add another 1G LVM swap on my 256MB test box. Looks like swapless reproducer is a non-starter, at least with the reproducer recipes we've been using so far.
Created attachment 94438 [details] Page-free debug patch

First of two debug booby-trap patches I've been using while chasing this: it catches any double-frees simply by tracking the free state of a page independently of page->count: set the free bit in __free_pages_ok(), clear it in rmqueue(), and BUG() if it's already set/clear.
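The idea in this patch can be sketched in user space as follows (the struct layout, bit name, and return-code convention are all illustrative, not the actual patch -- in the kernel the error paths would be BUG()):

```c
#include <assert.h>

#define PG_free_debug 0  /* illustrative flag bit, not a real 2.4 page flag */

struct page { unsigned long flags; };

/* Mark the page free; trap if it was already free (double free). */
static int set_page_free(struct page *p)     /* hooked into __free_pages_ok() */
{
	if (p->flags & (1UL << PG_free_debug))
		return -1;                    /* double free caught */
	p->flags |= 1UL << PG_free_debug;
	return 0;
}

/* Mark the page allocated; trap if it wasn't free to begin with. */
static int clear_page_free(struct page *p)   /* hooked into rmqueue() */
{
	if (!(p->flags & (1UL << PG_free_debug)))
		return -1;                    /* allocating a non-free page */
	p->flags &= ~(1UL << PG_free_debug);
	return 0;
}

static int page_is_free(const struct page *p)
{
	return !!(p->flags & (1UL << PG_free_debug));
}
```

Because the bit is tracked independently of page->count, a double free is caught even when a stale reference has bumped the count back up in between.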
Created attachment 94439 [details] VM list-operation debug patch

Second patch; relies on the page-free debug patch. It traps any list add/del operations in the VM if the page is already free, and also adds a BUG() to a couple of VM corner cases that are marked as "can't happen" and currently only printk a warning.
Bug was found, fix is currently undergoing test.
I built a kernel which included Arjan's vmdebug.patch and Stephen's reclaim-fix.patch against 431. That kernel just oops'd on an SMP machine. Looking at the trace, it appears that the swapoff process got OOMkilled, which led to an oops from the swapoff process. Definitely not the same footprint as the earlier oops, but I'm not sure if it's something to be worried about or not. When OOMkill kicks in, all bets are kind of off for what's going to happen. Anyway, the full log, the kernel in question (2.4.21-1.1931.2.431.jkt3entsmp) as well as a bzip'd vmcore are available on yakko.test.redhat.com (root/standard lab password).
EIP is at atomic_dec_and_lock [kernel] 0x10 (2.4.21-1.1931.2.431.jkt3entsmp)
Call Trace:
[<c0178950>] dput [kernel] 0x30 (0xcaa21f58)
[<c017df0d>] __mntput [kernel] 0x1d (0xcaa21f6c)
[<c0156d42>] sys_swapoff [kernel] 0x252 (0xcaa21f7c)
[<c01443fb>] sys_munmap [kernel] 0x4b (0xcaa21fa4)
Just another datapoint: I applied arjan's and sct's patches to a .431 kernel and tried my stress tests on the particular hammer box that tends to reproduce the problem quickly. Without the patches it crashed in under 40 minutes. With the patches the system ran the same stress tests for twelve hours and stayed up afterwards. I'd say that's a strong indication that this one is fixed.
Amending my previous comment to state that the *original* issue appears to be fixed; the new swapoff issue is obviously still open.
Closing MODIFIED bugs as fixed. Please reopen if the problem persists.