Description of problem:

Version-Release number of selected component (if applicable):

How reproducible: Every time, with a particular database.

Steps to Reproduce:
1. Export data from Oracle 9 (our specific data set) to an Oracle 8 machine.
2. Import to Oracle 8.
3. Watch the system consume all memory/swap and become unresponsive.

Actual results: The database imports, and while integrity constraints are being applied, the kernel keeps allowing memory to be consumed until the system dies. This is on a patched RHEL 3 system. I watched the free memory/swap go away with top.

Expected results: I would expect overcommit_memory to be set, or something to stop the process from consuming ALL of memory. This has worked in the past. I don't know what has changed in the data set; however, the kernel still should not allow oracle to keep spawning threads that consume all of memory.

Additional info: Unfortunately, the data set has patient information and we cannot provide it.
Since this refers to a RHEL3 system, not a RHEL4 one, I'm modifying the version number accordingly.
Please provide lots more information so I can start debugging this problem: top output, processor type, exact kernel version string, AltSysrq-M outputs, etc. Larry Woodman
Here is the uname output:
Linux beatle.verinform.com 2.4.21-27.0.2.EL #1 Wed Jan 12 23:46:37 EST 2005 i686 i686 i386 GNU/Linux
And the AltSysrq-M output will be attached. I'm unclear about what else you want. Do you want the top output before the system hangs?
Created attachment 111542: AltSysRqM output
Hisashi, the AltSysrq-M output does not show any problems with memory. Please get me several AltSysrq-P outputs, one AltSysrq-W output, and one AltSysrq-T output when the system is hung so I can see what is running on each CPU and what each process is blocked on. Thanks, Larry Woodman
OK, I think I see the problem here: On one CPU kswapd calls launder_page(), which increments page->count and calls page_cache_release() with zone->lru_lock held when that page is being re-activated. On another CPU, if the last process that maps that page calls exit, page_cache_release() gets called for the same page. If that's the last reference to the page and it races with kswapd, launder_page() will call __free_pages_ok() with zone->lru_lock held and deadlock. This patch fixes this problem:

-------------------------------------------------------------------
--- linux-2.4.21/mm/vmscan.c.orig
+++ linux-2.4.21/mm/vmscan.c
@@ -315,7 +315,9 @@ int launder_page(zone_t * zone, int gfp_
     if (cache_ratio(zone) > cache_limits.max && page_anon(page) && free_min(zone) < 0) {
         add_page_to_active_list(page, INITIAL_AGE);
+        lru_unlock(zone);
         page_cache_release(page);
+        lru_lock(zone);
         return 0;
     }
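The locking pattern in the patch can be illustrated outside the kernel. What follows is a user-space analogy under stated assumptions, not kernel code: a non-recursive pthread mutex stands in for zone->lru_lock, and the names lru_mutex, struct fake_page, free_page_locked_path(), put_page_ref(), and launder_sketch() are hypothetical, invented for this illustration. It also assumes that the free path takes the same lock itself, which is why the final reference drop has to happen with the lock released.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for zone->lru_lock (non-recursive, like a spinlock). */
static pthread_mutex_t lru_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for struct page: a reference count and a freed flag.
 * The count is a plain int because this sketch is single-threaded. */
struct fake_page {
    int count;
    int freed;
};

/* Stand-in for the free path; assumed to take the LRU lock itself. */
void free_page_locked_path(struct fake_page *page)
{
    pthread_mutex_lock(&lru_mutex);
    page->freed = 1;
    pthread_mutex_unlock(&lru_mutex);
}

/* Stand-in for page_cache_release(): drop one reference, free on the last. */
void put_page_ref(struct fake_page *page)
{
    assert(page->count > 0);
    if (--page->count == 0)
        free_page_locked_path(page);
}

/* Mirrors the patched path: the lock is released around the reference drop.
 * Without the unlock/lock pair, a final reference drop here would try to
 * take lru_mutex while it is already held by the same thread. */
int launder_sketch(struct fake_page *page)
{
    pthread_mutex_lock(&lru_mutex);
    /* ... re-activation work done under the lock would go here ... */
    pthread_mutex_unlock(&lru_mutex);   /* corresponds to lru_unlock(zone) */
    put_page_ref(page);
    pthread_mutex_lock(&lru_mutex);     /* corresponds to lru_lock(zone) */
    pthread_mutex_unlock(&lru_mutex);   /* caller-side unlock, folded in */
    return 0;
}
```

Calling launder_sketch() on a page whose count is 1 models the race described above, where the other process has already exited and the reference held here is the last one. The call returns 0 and the page ends up freed instead of blocking on the mutex.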
I have some more AltSysRq output, but now the process dies properly. It would help if you could forward my errors to Oracle somehow, since their imp triggered this kernel bug.
Created attachment 113241: AltSysRq[PWT] output from syslog
Hisashi, from looking at this AltSysrq-M output it appears that the system hung because 182589 pages of anonymous memory were VM_LOCK'd:
>>>aa:182592 ac:1665 id:54 il:0 ic:0 fr:636
Is your application mlock()ing memory or something? Larry Woodman
Unfortunately, all I'm doing is running "imp" from Oracle 8.1.7.0. We mere mortals aren't privy to the inner workings of Oracle programs.
Ah, wait! You have no swap space free! That's the problem!
>>>Free swap: 0kB
Fix that and the problem will go away. Larry Woodman
Please read the opening bug. Having all my swap consumed is what this bug is all about.
OK, sorry. What is /proc/sys/vm/overcommit_memory? Larry
OK, to clarify: the latest Alt-SysRq info was from a PATCHED system, using the patch sent by Larry via the web page. Larry's fix now causes the program to crash after consuming all memory, which is better than the old behavior, which was a hang forever.
/proc/sys/vm/overcommit_memory is the default, 0.
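For anyone checking the same setting programmatically, here is a minimal sketch that reads the value from a file such as /proc/sys/vm/overcommit_memory. The function names read_overcommit_mode() and overcommit_mode_name() are hypothetical. The mode descriptions follow the mainline Linux documentation (0 heuristic, 1 always, 2 strict), and it is an assumption that the RHEL 3 2.4.21 kernel interprets the values the same way.

```c
#include <assert.h>
#include <stdio.h>

/* Read an integer overcommit mode from a sysctl-style file.
 * Returns the value read, or -1 if the file cannot be opened or parsed. */
int read_overcommit_mode(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}

/* Describe a mode using the mainline meanings (assumed, not confirmed
 * for the RHEL 3 kernel discussed in this report). */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit";
    case 1:  return "always overcommit";
    case 2:  return "strict accounting";
    default: return "unknown";
    }
}
```

With the default reported above, read_overcommit_mode("/proc/sys/vm/overcommit_memory") would return 0, which overcommit_mode_name() describes as heuristic overcommit.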
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.2.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html