From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0

Description of problem:
We have a 64-node cluster. We run a scientific job that depends heavily on memory and CPU. Here is the uname output from a node:

Linux mach-0-0 2.4.18-17.7.xsmp #6 Tue Dec 17 16:41:44 PST 2002 i686 unknown

The error below can be triggered by any process (bash, sh, kswapd, etc.). I also turned off SMP and ran the test without a single crash. When I turned SMP back on, the nodes started to die. We lose between 5 and 10 nodes out of 64 on each run, usually within the first 10-15 minutes.

Nov 22 18:51:59 mach-0-35 kernel: kernel BUG at page_alloc.c:220!
Nov 22 18:51:59 mach-0-35 kernel: invalid operand: 0000
Nov 22 18:51:59 mach-0-35 kernel: CPU: 0
Nov 22 18:51:59 mach-0-35 kernel: EIP: 0010:[rmqueue+525/592] Not tainted
Nov 22 18:51:59 mach-0-35 kernel: EIP: 0010:[<c0132c6d>] Not tainted
Nov 22 18:51:59 mach-0-35 kernel: EFLAGS: 00010202
Nov 22 18:51:59 mach-0-35 kernel: eax: 00000040 ebx: c23bc8f0 ecx: 00038000 edx: 0006942f
Nov 22 18:51:59 mach-0-35 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: efe31dcc
Nov 22 18:51:59 mach-0-35 kernel: ds: 0018 es: 0018 ss: 0018
Nov 22 18:51:59 mach-0-35 kernel: Process mlsl2 (pid: 1928, stackpage=efe31000)
Nov 22 18:51:59 mach-0-35 kernel: Stack: 00038000 0003142f 00000296 00000000 c028b128 c028b200 000001ff 00000000
Nov 22 18:51:59 mach-0-35 kernel: 00000025 c0132f01 c028b128 c028b1fc 000001d2 00000018 00104025 00000000
Nov 22 18:51:59 mach-0-35 kernel: 00000001 00000025 c0127ded 69430025 00000000 f69451c0 f61bec60 efef2118
Nov 22 18:51:59 mach-0-35 kernel: Call Trace: [__alloc_pages+81/384] [do_anonymous_page+93/368] [do_no_page+71/576] [it_real_fn+16/80] [handle_mm_fault+154/288]
Nov 22 18:51:59 mach-0-35 kernel: Call Trace: [<c0132f01>] [<c0127ded>] [<c0127f47>] [<c011c5e0>] [<c01281da>]
Nov 22 18:51:59 mach-0-35 kernel: [<c011d57b>] [<c011d431>] [<c012900a>] [<c010a64d>] [<c011472a>] [<c012939b>]
Nov 22 18:51:59 mach-0-35 kernel: [<c01293ab>] [<c010ea9e>] [<c0114570>] [<c0108bfc>]
Nov 22 18:51:59 mach-0-35 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. We have a program that processes satellite data using PVM.
2. Tried it with and without PVM. Same results.

Actual Results: 5-10 nodes would die.

Expected Results: No crash.

Additional info:
64-node cluster configuration. The drives are IDE; we used RedHat 7.2, ext3, 2 GB virtual memory and 4 GB swap.
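When several nodes die in one run, it can help to tally which hosts hit which BUG site from the collected syslog output. The following is only a minimal sketch, assuming the syslog line format shown in the oops above; the function name `find_kernel_bugs` and the `sample` list are illustrative and not part of the original report.

```python
import re
from collections import defaultdict

# Matches syslog lines such as:
#   Nov 22 18:51:59 mach-0-35 kernel: kernel BUG at page_alloc.c:220!
# The timestamp/host layout is assumed from the log excerpt in this report.
BUG_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+"
    r"kernel BUG at (?P<file>[^:\s]+):(?P<line>\d+)!"
)


def find_kernel_bugs(lines):
    """Map each 'file:line' BUG site to the set of hosts that reported it."""
    sites = defaultdict(set)
    for text in lines:
        m = BUG_RE.match(text)
        if m:
            sites[f"{m['file']}:{m['line']}"].add(m["host"])
    return dict(sites)


# Two lines copied from the oops above; only the first is a BUG marker.
sample = [
    "Nov 22 18:51:59 mach-0-35 kernel: kernel BUG at page_alloc.c:220!",
    "Nov 22 18:51:59 mach-0-35 kernel: invalid operand: 0000",
]
print(find_kernel_bugs(sample))
```

Feeding it the concatenated /var/log/messages from all nodes would show whether every failure is at the same page_alloc.c line or at several sites.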
Ack, sorry, ignore this one. I had the wrong window open and hit Enter. I must have recreated the same bug as #79924.
*** This bug has been marked as a duplicate of 79924 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.