From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0

Description of problem:
We have a 64-node cluster. We run a scientific job that depends heavily on memory and CPU. Here is the uname output from a node:

Linux mach-0-0 2.4.18-17.7.xsmp #6 Tue Dec 17 16:41:44 PST 2002 i686 unknown

The error below can be triggered by any process (bash, sh, kswapd, etc.). I also turned off SMP and ran the test without a single crash. When I turned SMP back on, the nodes started to die again. We lose between 5 and 10 nodes out of 64 on each run, usually within the first 10-15 minutes.

Nov 22 18:51:59 mach-0-35 kernel: kernel BUG at page_alloc.c:220!
Nov 22 18:51:59 mach-0-35 kernel: invalid operand: 0000
Nov 22 18:51:59 mach-0-35 kernel: CPU: 0
Nov 22 18:51:59 mach-0-35 kernel: EIP: 0010:[rmqueue+525/592] Not tainted
Nov 22 18:51:59 mach-0-35 kernel: EIP: 0010:[<c0132c6d>] Not tainted
Nov 22 18:51:59 mach-0-35 kernel: EFLAGS: 00010202
Nov 22 18:51:59 mach-0-35 kernel: eax: 00000040 ebx: c23bc8f0 ecx: 00038000 edx: 0006942f
Nov 22 18:51:59 mach-0-35 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: efe31dcc
Nov 22 18:51:59 mach-0-35 kernel: ds: 0018 es: 0018 ss: 0018
Nov 22 18:51:59 mach-0-35 kernel: Process mlsl2 (pid: 1928, stackpage=efe31000)
Nov 22 18:51:59 mach-0-35 kernel: Stack: 00038000 0003142f 00000296 00000000 c028b128 c028b200 000001ff 00000000
Nov 22 18:51:59 mach-0-35 kernel: 00000025 c0132f01 c028b128 c028b1fc 000001d2 00000018 00104025 00000000
Nov 22 18:51:59 mach-0-35 kernel: 00000001 00000025 c0127ded 69430025 00000000 f69451c0 f61bec60 efef2118
Nov 22 18:51:59 mach-0-35 kernel: Call Trace: [__alloc_pages+81/384] [do_anonymous_page+93/368] [do_no_page+71/576] [it_real_fn+16/80] [handle_mm_fault+154/288]
Nov 22 18:51:59 mach-0-35 kernel: Call Trace: [<c0132f01>] [<c0127ded>] [<c0127f47>] [<c011c5e0>] [<c01281da>]
Nov 22 18:51:59 mach-0-35 kernel: [<c011d57b>] [<c011d431>] [<c012900a>] [<c010a64d>] [<c011472a>] [<c012939b>]
Nov 22 18:51:59 mach-0-35 kernel: [<c01293ab>] [<c010ea9e>] [<c0114570>] [<c0108bfc>]
Nov 22 18:51:59 mach-0-35 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. We have a program that processes satellite data using PVM.
2. Tried it with and without PVM. Same results.

Actual Results: 5-10 nodes would die.

Expected Results: No crash.

Additional info:
64-node cluster configuration. The drives are IDE; we used Red Hat 7.2, ext3, 2 GB virtual memory and 4 GB swap.
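Since the report covers 5-10 failing nodes per run, it may help to tally the BUG lines per node before comparing individual oopses. The following is only a minimal Python sketch: the regular expression assumes the classic syslog layout seen in the excerpt above, and the names `BUG_RE`, `tally_bugs` and `sample` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Nov 22 18:51:59 mach-0-35 kernel: kernel BUG at page_alloc.c:220!
# (assumed layout: month, day, time, host, "kernel:", message)
BUG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"kernel BUG at (?P<file>[\w.]+):(?P<line>\d+)!"
)

def tally_bugs(lines):
    """Count 'kernel BUG' events per (host, source file, line number)."""
    counts = Counter()
    for text in lines:
        match = BUG_RE.match(text)
        if match:
            key = (match["host"], match["file"], int(match["line"]))
            counts[key] += 1
    return counts

# Two lines copied from the oops above; only the first is a BUG line.
sample = [
    "Nov 22 18:51:59 mach-0-35 kernel: kernel BUG at page_alloc.c:220!",
    "Nov 22 18:51:59 mach-0-35 kernel: invalid operand: 0000",
]

for key, count in tally_bugs(sample).items():
    print(key, count)
```

In practice the `lines` argument would be the collected syslog files from all nodes; where those files live on this cluster is not stated in the report.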
First of all, this trace looks like it comes from a modified kernel. Can you attach the dmesg, lsmod and lspci output from such a system, captured before it oopses?
Hi, Well, I have used my own modified 2.4.19 kernel, the 2.4.18 xsmp kernel, and a modified 2.4.18 Red Hat source (I removed almost everything and SMP for a test), but still had the same crashes. The only time none of the nodes crashed was when I disabled SMP. I can provide more errors; another one from a different node is at the end. Here is lsmod:

[root@mach-0-35 root]# lsmod
Module                  Size  Used by    Not tainted
[root@mach-0-35 root]#
[root@mach-0-35 root]# lspci
00:00.0 Host bridge: ServerWorks: Unknown device 0012 (rev 13)
00:00.1 Host bridge: ServerWorks: Unknown device 0012
00:00.2 Host bridge: ServerWorks: Unknown device 0000
00:02.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:04.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 0d)
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
00:0f.3 Host bridge: ServerWorks: Unknown device 0225
00:10.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:10.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
01:03.0 Ethernet controller: BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)

Linux version 2.4.18-17.7.x (root@mach-0-0) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #6 Tue Dec 17 16:41:44 PST 2002
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
BIOS-e820: 000000000009f400 - 000000000009f800 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 0000000080000000 (usable)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
1152MB HIGHMEM available.
896MB LOWMEM available.
On node 0 totalpages: 524288
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 294912 pages.
Kernel command line: auto BOOT_IMAGE=bzImage ro root=303 BOOT_FILE=/boot/bzImage console=ttyS0 Initializing CPU#0 Detected 2199.941 MHz processor. Console: colour VGA+ 80x25 Calibrating delay loop... 4364.15 BogoMIPS Memory: 2061592k/2097152k available (1205k kernel code, 30948k reserved, 337k data, 236k init, 1179648k highmem) Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes) Inode cache hash table entries: 131072 (order: 8, 1048576 bytes) Mount cache hash table entries: 32768 (order: 6, 262144 bytes) ramfs: mounted with options: <defaults> ramfs: max_pages=258227 max_file_pages=0 max_inodes=0 max_dentries=258227 Buffer cache hash table entries: 131072 (order: 7, 524288 bytes) Page-cache hash table entries: 524288 (order: 9, 2097152 bytes) CPU: Before vendor init, caps: 3febfbff 00000000 00000000, vendor = 0 CPU: L1 I cache: 0K, L1 D cache: 8K CPU: L2 cache: 512K CPU: After vendor init, caps: 3febfbff 00000000 00000000 00000000 Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0. CPU: After generic, caps: 3febfbff 00000000 00000000 00000000 CPU: Common caps: 3febfbff 00000000 00000000 00000000 CPU: Intel(R) XEON(TM) CPU 2.20GHz stepping 04 Enabling fast FPU save and restore... done. Enabling unmasked SIMD FPU exception support... done. Checking 'hlt' instruction... OK. POSIX conformance testing by UNIFIX mtrr: v1.40 (20010327) Richard Gooch (rgooch.au) mtrr: detected mtrr type: Intel PCI: PCI BIOS revision 2.10 entry at 0xfdba1, last bus=4 PCI: Using configuration type 1 PCI: Probing PCI hardware PCI: Discovered primary peer bus 01 [IRQ] PCI: Discovered primary peer bus 02 [IRQ] PCI: Discovered primary peer bus 03 [IRQ] PCI: Discovered primary peer bus 04 [IRQ] Linux NET4.0 for Linux 2.4 Based upon Swansea University Computer Society NET3.039 Initializing RT netlink socket cpufreq: Intel(R) SpeedStep(TM) support $Revision: 1.34 $ cpufreq: Intel(R) SpeedStep(TM) for this chipset not (yet) available. 
cpufreq: CPU#0 P4/Xeon(TM) CPU On-Demand Clock Modulation available CPU clock: 2199.941 MHz (219.994-2199.941 MHz) Starting kswapd allocated 64 pages and 64 bhs reserved for the highmem bounces Journalled Block Device driver loaded Installing knfsd (copyright (C) 1996 okir.de). pty: 2048 Unix98 ptys configured Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled ttyS0 at 0x03f8 (irq = 4) is a 16550A ttyS1 at 0x02f8 (irq = 3) is a 16550A Real Time Clock Driver v1.10e oprofile: can't get RTC I/O Ports block: 1024 slots per queue, batch=256 Uniform Multi-Platform E-IDE driver Revision: 6.31 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx SvrWks CSB5: IDE controller on PCI bus 00 dev 79 SvrWks CSB5: chipset revision 147 SvrWks CSB5: not 100% native mode: will probe irqs later ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio hda: ST340016A, ATA DISK drive ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 hda: 78165360 sectors (40021 MB) w/2048KiB Cache, CHS=77545/16/63, UDMA(100) Partition check: hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 > RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw.com.sg> and others eth0: OEM i82557/i82558 10/100 Ethernet, 00:30:48:51:7E:7E, IRQ 10. Board assembly 000000-000, Physical connectors present: RJ45 Primary interface chip i82555 PHY #1. General self-test: passed. Serial sub-system self-test: passed. Internal registers self-test: passed. ROM checksum self-test: passed (0xb874c1d3). 
tg3.c:v1.1 (Aug 30, 2002) eth1: Tigon3 [partno(BCM95700A6) rev 0105 PHY(5701)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:30:48:51:7c:8d NET4: Linux TCP/IP 1.0 for NET4.0 IP Protocols: ICMP, UDP, TCP, IGMP IP: routing cache hash table of 16384 buckets, 128Kbytes TCP: Hash tables configured (established 262144 bind 65536) Linux IP multicast router 0.06 plus PIM-SM ip_conntrack (8192 buckets, 65536 max) NET4: Unix domain sockets 1.0/SMP for Linux NET4.0. kjournald starting. Commit interval 5 seconds EXT3-fs: mounted filesystem with ordered data mode. VFS: Mounted root (ext3 filesystem) readonly. Freeing unused kernel memory: 236k freed Adding Swap: 2048216k swap-space (priority -1) Adding Swap: 2048248k swap-space (priority -2) EXT3 FS 2.4-0.9.18, 14 May 2002 on ide0(3,3), internal journal kjournald starting. Commit interval 5 seconds EXT3 FS 2.4-0.9.18, 14 May 2002 on ide0(3,1), internal journal EXT3-fs: mounted filesystem with ordered data mode. kjournald starting. Commit interval 5 seconds EXT3 FS 2.4-0.9.18, 14 May 2002 on ide0(3,5), internal journal EXT3-fs: mounted filesystem with ordered data mode. tg3: eth1: Link is up at 1000 Mbps, full duplex. tg3: eth1: Flow control is off for TX and off for RX. Nov 20 18:57:06 mach-0-51 kernel: kernel BUG at page_alloc.c:220! 
Nov 20 18:57:06 mach-0-51 kernel: invalid operand: 0000 Nov 20 18:57:06 mach-0-51 kernel: CPU: 0 Nov 20 18:57:06 mach-0-51 kernel: EIP: 0010:[<c0130cdd>] Not tainted Nov 20 18:57:06 mach-0-51 kernel: EFLAGS: 00010202 Nov 20 18:57:06 mach-0-51 kernel: eax: 00000040 ebx: c1b39ea0 ecx: 00038000 edx: 0003bdf8 Nov 20 18:57:06 mach-0-51 kernel: esi: c02a9a88 edi: 00048000 ebp: c1000020 esp: f55f3e14 Nov 20 18:57:06 mach-0-51 kernel: ds: 0018 es: 0018 ss: 0018 Nov 20 18:57:06 mach-0-51 kernel: Process mlsl2 (pid: 1415, stackpage=f55f3000) Nov 20 18:57:06 mach-0-51 kernel: Stack: 00038000 00003df8 00000286 00000000 c02a9a88 c02a9b60 000001ff 00000000 Nov 20 18:57:06 mach-0-51 kernel: 00181002 c0130f71 c02a9a88 c02a9b5c 000001d2 00002945 00000000 00000000 Nov 20 18:57:06 mach-0-51 kernel: 0c1ab98c 00181002 c0131674 00000002 00000000 00000008 f60789ac c01260dd Nov 20 18:57:06 mach-0-51 kernel: Call Trace: [<c0130f71>] [<c0131674>] [<c01260dd>] [<c0126121>] [<c0126592>] Nov 20 18:57:06 mach-0-51 kernel: [<c01086ad>] [<c0113502>] [<c011f3c6>] [<c011f619>] [<c011c10b>] [<c011bfc1>] Nov 20 18:57:06 mach-0-51 kernel: [<c011bd4b>] [<c0113350>] [<c0108bfc>] Nov 20 18:57:06 mach-0-51 kernel: Code: 0f 0b dc 00 56 aa 26 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b below is another node: Nov 19 18:02:48 mach-0-55 kernel: kernel BUG at page_alloc.c:220! 
Nov 19 18:02:48 mach-0-55 kernel: invalid operand: 0000 Nov 19 18:02:48 mach-0-55 kernel: CPU: 0 Nov 19 18:02:48 mach-0-55 kernel: EIP: 0010:[<c0130cdd>] Not tainted Nov 19 18:02:48 mach-0-55 kernel: EFLAGS: 00010202 Nov 19 18:02:48 mach-0-55 kernel: eax: 00000040 ebx: c122dea0 ecx: 00001000 edx: 0000b9f8 Nov 19 18:02:48 mach-0-55 kernel: esi: c02a99d4 edi: 00037000 ebp: c1000020 esp: f60dfe24 Nov 19 18:02:48 mach-0-55 kernel: ds: 0018 es: 0018 ss: 0018 Nov 19 18:02:48 mach-0-55 kernel: Process mlsl2 (pid: 1573, stackpage=f60df000) Nov 19 18:02:48 mach-0-55 kernel: Stack: 00001000 0000a9f8 00000286 00000000 c02a99d4 c02a9b64 000003fd 00000000 Nov 19 18:02:48 mach-0-55 kernel: f6ae4180 c0130f71 c02a9a88 c02a9b5c 000001d2 00000018 00000001 f67e4a80 Nov 19 18:02:48 mach-0-55 kernel: 00104025 f6ae4180 c012623b f67e4a80 63052000 f67e4a80 f6ae4180 c0126344 Nov 19 18:02:48 mach-0-55 kernel: Call Trace: [<c0130f71>] [<c012623b>] [<c0126344>] [<c0126582>] [<c0126fe9>] Nov 19 18:02:48 mach-0-55 kernel: [<c0113502>] [<c010ea2e>] [<c011bd4b>] [<c0113350>] [<c0108bfc>] Nov 19 18:02:48 mach-0-55 kernel: Nov 19 18:02:48 mach-0-55 kernel: Code: 0f 0b dc 00 56 aa 26 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b
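The zone sizes in the dmesg output above can be cross-checked against the LOWMEM/HIGHMEM figures with a little arithmetic. This is a minimal Python sketch assuming the usual 4096-byte i386 page; the names `ZONE_PAGES` and `pages_to_mb` are made up for the example. The last two lines compare the ecx/edi register values from the mach-0-51 and mach-0-55 oopses with the zone boundaries. The numbers line up, which is at least consistent with ecx holding the first page index of a zone and edi its size, but that reading is an inference from the numbers and is not confirmed anywhere in this report.

```python
PAGE_SIZE = 4096  # bytes per page on i386 (assumption stated in the text)

# Page counts as printed by dmesg in the comment above.
ZONE_PAGES = {"zone(0)": 4096, "zone(1)": 225280, "zone(2)": 294912}

def pages_to_mb(pages):
    """Convert a page count to whole megabytes (1 MB = 1024 * 1024 bytes)."""
    return pages * PAGE_SIZE // (1024 * 1024)

total_pages = sum(ZONE_PAGES.values())
lowmem_pages = ZONE_PAGES["zone(0)"] + ZONE_PAGES["zone(1)"]
highmem_pages = ZONE_PAGES["zone(2)"]

print(total_pages)                 # 524288, matches "totalpages: 524288"
print(pages_to_mb(lowmem_pages))   # 896, matches "896MB LOWMEM available"
print(pages_to_mb(highmem_pages))  # 1152, matches "1152MB HIGHMEM available"

# ecx/edi from the mach-0-51 oops (00038000 / 00048000):
print(0x38000 == lowmem_pages, 0x48000 == highmem_pages)
# ecx/edi from the mach-0-55 oops (00001000 / 00037000):
print(0x01000 == ZONE_PAGES["zone(0)"], 0x37000 == ZONE_PAGES["zone(1)"])
```

Both comparison lines print `True True`.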
Well, in your modified kernel you have disabled the config option that we enable to get usable backtraces, which makes this hard to investigate.
Hi, Here is some more information. Do any of these help? If not, I can put the Red Hat xsmp kernel back and get more info, or I can enable all the kernel debugging options.

Dec 5 01:06:01 mach-0-7 kernel: kernel BUG at page_alloc.c:220!
Dec 5 01:06:01 mach-0-7 kernel: invalid operand: 0000
Dec 5 01:06:01 mach-0-7 kernel: CPU: 0
Dec 5 01:06:01 mach-0-7 kernel: EIP: 0010:[rmqueue+525/592] Not tainted
Dec 5 01:06:01 mach-0-7 kernel: EIP: 0010:[<c0132c6d>] Not tainted
Dec 5 01:06:01 mach-0-7 kernel: EFLAGS: 00010202
Dec 5 01:06:01 mach-0-7 kernel: eax: 00000040 ebx: c2020d70 ecx: 00038000 edx: 00056047
Dec 5 01:06:01 mach-0-7 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: ef16fdcc
Dec 5 01:06:01 mach-0-7 kernel: ds: 0018 es: 0018 ss: 0018
Dec 5 01:06:01 mach-0-7 kernel: Process mlsl2 (pid: 2775, stackpage=ef16f000)
Dec 5 01:06:01 mach-0-7 kernel: Stack: 00038000 0001e047 00000296 00000000 c028b128 c028b200 000001ff 00000000
Dec 5 01:06:01 mach-0-7 kernel: 00000025 c0132f01 c028b128 c028b1fc 000001d2 00000044 00104025 00000000
Dec 5 01:06:01 mach-0-7 kernel: 00000001 00000025 c0127ded 66e48025 00000000 f6997400 f6a48920 ef25c0d8
Dec 5 01:06:01 mach-0-7 kernel: Call Trace: [__alloc_pages+81/384] [do_anonymous_page+93/368] [do_no_page+71/576] [handle_mm_fault+154/288] [ip_frag_create+16/192]
Dec 5 01:06:01 mach-0-7 kernel: Call Trace: [<c0132f01>] [<c0127ded>] [<c0127f47>] [<c01281da>] [<c0206860>]
Dec 5 01:06:01 mach-0-7 kernel: [ip_frag_queue+516/864] [ip_frag_queue+128/864] [eth_type_trans+115/192] [do_page_fault+442/1401] [ip_frag_queue+128/864] [set_rx_mode+369/1504]
Dec 5 01:06:01 mach-0-7 kernel: [<c0206b14>] [<c0206990>] [<c01fec83>] [<c011472a>] [<c0206990>] [<c01c4b41>]
Dec 5 01:06:01 mach-0-7 kernel: [update_wall_time+38/80] [timer_bh+73/976] [update_process_times+48/160] [do_page_fault+0/1401] [error_code+52/60]
Dec 5 01:06:01 mach-0-7 kernel: [<c0120836>] [<c0120a89>] [<c0120980>] [<c0114570>] [<c0108bfc>]
Dec 5 01:06:01 mach-0-7 kernel:
Dec 5 01:06:01 
mach-0-7 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b Nov 26 17:53:00 mach-0-39 kernel: kernel BUG at page_alloc.c:220! Nov 26 17:53:00 mach-0-39 kernel: invalid operand: 0000 Nov 26 17:53:00 mach-0-39 kernel: CPU: 0 Nov 26 17:53:00 mach-0-39 kernel: EIP: 0010:[rmqueue+525/592] Not tainted Nov 26 17:53:00 mach-0-39 kernel: EIP: 0010:[<c0132c6d>] Not tainted Nov 26 17:53:00 mach-0-39 kernel: EFLAGS: 00010202 Nov 26 17:53:00 mach-0-39 kernel: eax: 00000040 ebx: c25e5cd0 ecx: 00038000 edx: 00074c99 Nov 26 17:53:00 mach-0-39 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: e0507dbc Nov 26 17:53:00 mach-0-39 kernel: ds: 0018 es: 0018 ss: 0018 Nov 26 17:53:00 mach-0-39 kernel: Process mlsl2 (pid: 1441, stackpage=e0507000) Nov 26 17:53:00 mach-0-39 kernel: Stack: 00038000 0003cc99 00000292 00000000 c028b128 c028b200 000001ff 00000000 Nov 26 17:53:00 mach-0-39 kernel: 0005ca02 c0132f01 c028b128 c028b1fc 000001d2 0005c902 00000000 00000000 Nov 26 17:53:00 mach-0-39 kernel: 0c0854f6 0005ca02 c0133604 00000002 00000002 00000008 00000000 c0127bfd Nov 26 17:53:00 mach-0-39 kernel: Call Trace: [__alloc_pages+81/384] [read_swap_cache_async+116/158] [swapin_readahead+77/80] [do_swap_page+70/400] [handle_mm_fault+180/288] Nov 26 17:53:00 mach-0-39 kernel: Call Trace: [<c0132f01>] [<c0133604>] [<c0127bfd>] [<c0127c46>] [<c01281f4>] Nov 26 17:53:00 mach-0-39 kernel: [sys_getsockname+44/128] [sys_sendto+166/240] [ip_route_output_slow+740/1648] [ip_route_output_slow+304/1648] [neigh_proxy_process+243/288] [do_page_fault+442/1401] Nov 26 17:53:00 mach-0-39 kernel: [<c01f483c>] [<c01f49b6>] [<c0206b44>] [<c0206990>] [<c01fec83>] [<c011472a>] Nov 26 17:53:00 mach-0-39 kernel: [process_timeout+0/96] [update_wall_time+38/80] [timer_bh+73/976] [update_process_times+48/160] [smp_apic_timer_interrupt+239/288] [do_page_fault+0/1401] Nov 26 17:53:00 mach-0-39 kernel: [<c01151f0>] [<c0120836>] [<c0120a89>] [<c0120980>] [<c0112b4f>] [<c0114570>] Nov 26 17:53:00 
mach-0-39 kernel: [error_code+52/60] Nov 26 17:53:00 mach-0-39 kernel: [<c0108bfc>] Nov 26 17:53:00 mach-0-39 kernel: Nov 26 17:53:00 mach-0-39 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b Nov 27 12:53:48 mach-0-39 kernel: CPU: 0 Nov 27 12:53:48 mach-0-39 kernel: EIP: 0010:[rmqueue+525/592] Not tainted Nov 27 12:53:48 mach-0-39 kernel: EIP: 0010:[<c0132c6d>] Not tainted Nov 27 12:53:48 mach-0-39 kernel: EFLAGS: 00010202 Nov 27 12:53:48 mach-0-39 kernel: eax: 00000040 ebx: c1fc8a80 ecx: 00038000 edx: 000542e2 Nov 27 12:53:48 mach-0-39 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: ef9e9dcc Nov 27 12:53:48 mach-0-39 kernel: ds: 0018 es: 0018 ss: 0018 Nov 27 12:53:48 mach-0-39 kernel: Process mlsl2 (pid: 2040, stackpage=ef9e9000) Nov 27 12:53:48 mach-0-39 kernel: Stack: 00038000 0001c2e2 00000296 00000000 c028b128 c028b200 000001ff 00000000 Nov 27 12:53:48 mach-0-39 kernel: 00000025 c0132f01 c028b128 c028b1fc 000001d2 00000018 00104025 00000000 Nov 27 12:53:48 mach-0-39 kernel: 00000001 00000025 c0127ded 442b7025 00000000 f65a5f20 f6421b60 c4fc7848 Nov 27 12:53:48 mach-0-39 kernel: Call Trace: [__alloc_pages+81/384] [do_anonymous_page+93/368] [do_no_page+71/576] [handle_mm_fault+154/288] [ip_route_output_slow+0/1648] Nov 27 12:53:48 mach-0-39 kernel: Call Trace: [<c0132f01>] [<c0127ded>] [<c0127f47>] [<c01281da>] [<c0206860>] Nov 27 12:53:48 mach-0-39 kernel: [ip_route_output_slow+692/1648] [ip_route_output_slow+304/1648] [neigh_proxy_process+243/288] [do_page_fault+442/1401] [update_process_times+48/160] [sys_brk+202/240] Nov 27 12:53:48 mach-0-39 kernel: [<c0206b14>] [<c0206990>] [<c01fec83>] [<c011472a>] [<c0120980>] [<c012874a>] Nov 27 12:53:48 mach-0-39 kernel: [do_page_fault+0/1401] [error_code+52/60] Nov 27 12:53:48 mach-0-39 kernel: [<c0114570>] [<c0108bfc>] Nov 27 12:53:48 mach-0-39 kernel: Nov 27 12:53:48 mach-0-39 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b Nov 27 12:53:49 mach-0-39 kernel: 
kernel BUG at page_alloc.c:220! Nov 27 12:53:49 mach-0-39 kernel: invalid operand: 0000 Nov 27 12:53:49 mach-0-39 kernel: CPU: 0 Nov 27 12:53:49 mach-0-39 kernel: EIP: 0010:[rmqueue+525/592] Not tainted Nov 27 12:53:49 mach-0-39 kernel: EIP: 0010:[<c0132c6d>] Not tainted Nov 27 12:53:49 mach-0-39 kernel: EFLAGS: 00010202 Nov 27 12:53:49 mach-0-39 kernel: eax: 00000040 ebx: c1ccbfc0 ecx: 00038000 edx: 000443fe Nov 27 12:53:49 mach-0-39 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: ef773dd0 Nov 27 12:53:49 mach-0-39 kernel: ds: 0018 es: 0018 ss: 0018 Nov 27 12:53:49 mach-0-39 kernel: Process pvmd3 (pid: 1993, stackpage=ef773000) Nov 27 12:53:49 mach-0-39 kernel: Stack: 00038000 0000c3fe 00000282 00000000 c028b128 c028b200 000001ff 00000000 Nov 27 12:53:49 mach-0-39 kernel: 0016f502 c0132f01 c028b128 c028b1fc 000001d2 0016f502 00000000 00000000 Nov 27 12:53:49 mach-0-39 kernel: 0c197ff6 0016f502 c0133604 0016f502 00000000 080876c4 00000000 c0127c4c Nov 27 12:53:49 mach-0-39 kernel: Call Trace: [__alloc_pages+81/384] [read_swap_cache_async+116/158] [do_swap_page+76/400] [handle_mm_fault+180/288] [do_page_fault+442/1401] Nov 27 12:53:49 mach-0-39 kernel: Call Trace: [<c0132f01>] [<c0133604>] [<c0127c4c>] [<c01281f4>] [<c011472a>] Nov 27 12:53:49 mach-0-39 kernel: [copy_page_range+397/624] [build_mmap_rb+84/96] [do_fork+1746/2048] [sys_close+4/112] [do_page_fault+0/1401] [error_code+52/60] Nov 27 12:53:49 mach-0-39 kernel: [<c01267bd>] [<c0129994>] [<c0117e02>] [<c0139784>] [<c0114570>] [<c0108bfc>] Nov 27 12:53:49 mach-0-39 kernel: Nov 27 12:53:49 mach-0-39 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b Nov 27 12:53:49 mach-0-39 kernel: kernel BUG at page_alloc.c:220! 
Nov 27 12:53:49 mach-0-39 kernel: invalid operand: 0000 Nov 27 12:53:49 mach-0-39 kernel: CPU: 0 Nov 27 12:53:49 mach-0-39 kernel: EIP: 0010:[rmqueue+525/592] Not tainted Nov 27 12:53:49 mach-0-39 kernel: EIP: 0010:[<c0132c6d>] Not tainted Nov 27 12:53:49 mach-0-39 kernel: EFLAGS: 00010202 Nov 27 12:53:49 mach-0-39 kernel: eax: 00000040 ebx: c20c4c60 ecx: 00038000 edx: 000596ec Nov 27 12:53:49 mach-0-39 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: e246ddcc Nov 27 12:53:49 mach-0-39 kernel: ds: 0018 es: 0018 ss: 0018 Nov 27 12:53:49 mach-0-39 kernel: Process mlsl2 (pid: 2047, stackpage=e246d000) Nov 27 12:53:49 mach-0-39 kernel: Stack: 00038000 000216ec 00000296 00000000 c028b128 c028b200 000001ff 00000000 Nov 27 12:53:49 mach-0-39 kernel: 00000025 c0132f01 c028b128 c028b1fc 000001d2 0000055e 00104025 00000000 Nov 27 12:53:49 mach-0-39 kernel: 00000001 00000025 c0127ded 00000000 f65a5f20 f65a5f20 f6421aa0 ef7e8008 Nov 27 12:53:49 mach-0-39 kernel: Call Trace: [__alloc_pages+81/384] [do_anonymous_page+93/368] [do_no_page+71/576] [handle_mm_fault+154/288] [svcauth_null+192/240] Nov 27 12:53:49 mach-0-39 kernel: Call Trace: [<c0132f01>] [<c0127ded>] [<c0127f47>] [<c01281da>] [<c0246120>] Nov 27 12:53:49 mach-0-39 kernel: [__vma_link+116/192] [do_page_fault+442/1401] [do_mmap_pgoff+1220/1392] [blk_ioctl+407/1184] [old_mmap+238/304] [do_page_fault+0/1401] Nov 27 12:53:49 mach-0-39 kernel: [<c0128854>] [<c011472a>] [<c0128e74>] [<c01bb4d7>] [<c010ea9e>] [<c0114570>] Nov 27 12:53:49 mach-0-39 kernel: [error_code+52/60] Nov 27 12:53:49 mach-0-39 kernel: [<c0108bfc>] Nov 27 12:53:49 mach-0-39 kernel: Nov 27 12:53:49 mach-0-39 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b Nov 27 13:00:00 mach-0-39 kernel: kernel BUG at page_alloc.c:220! 
Nov 27 13:00:00 mach-0-39 kernel: invalid operand: 0000 Nov 27 13:00:00 mach-0-39 kernel: CPU: 0 Nov 27 13:00:00 mach-0-39 kernel: EIP: 0010:[rmqueue+525/592] Not tainted Nov 27 13:00:00 mach-0-39 kernel: EIP: 0010:[<c0132c6d>] Not tainted Nov 27 13:00:00 mach-0-39 kernel: EFLAGS: 00010202 Nov 27 13:00:00 mach-0-39 kernel: eax: 00000040 ebx: c2210cd0 ecx: 00038000 edx: 00060599 Nov 27 13:00:00 mach-0-39 kernel: esi: c028b128 edi: 00048000 ebp: c1000020 esp: e246ddcc Nov 27 13:00:00 mach-0-39 kernel: ds: 0018 es: 0018 ss: 0018 Nov 27 13:00:00 mach-0-39 kernel: Process sh (pid: 2049, stackpage=e246d000) Nov 27 13:00:00 mach-0-39 kernel: Stack: 00038000 00028599 00000296 00000000 c028b128 c028b200 000001ff 00000000 Nov 27 13:00:00 mach-0-39 kernel: 00000025 c0132f01 c028b128 c028b1fc 000001d2 00000132 00104025 00000000 Nov 27 13:00:00 mach-0-39 kernel: 00000001 00000025 c0127ded 00000000 f4ebe3c0 f4ebe3c0 f63ee920 f53f1af8 Nov 27 13:00:00 mach-0-39 kernel: Call Trace: [__alloc_pages+81/384] [do_anonymous_page+93/368] [do_no_page+71/576] [handle_mm_fault+154/288] [sys_munmap+2/80] Nov 27 13:00:00 mach-0-39 kernel: Call Trace: [<c0132f01>] [<c0127ded>] [<c0127f47>] [<c01281da>] [<c01296c2>] Nov 27 13:00:00 mach-0-39 kernel: [__vma_link+116/192] [do_page_fault+442/1401] [do_mmap_pgoff+1220/1392] [zap_page_range+945/1056] [unmap_fixup+115/352] [sys_munmap+2/80] Nov 27 13:00:00 mach-0-39 kernel: [<c0128854>] [<c011472a>] [<c0128e74>] [<c0126c51>] [<c01292c3>] [<c01296c2>] Nov 27 13:00:00 mach-0-39 kernel: [sys_close+4/112] [sys_munmap+67/80] [do_page_fault+0/1401] [error_code+52/60] Nov 27 13:00:00 mach-0-39 kernel: [<c0139784>] [<c0129703>] [<c0114570>] [<c0108bfc>] Nov 27 13:00:00 mach-0-39 kernel: Nov 27 13:00:00 mach-0-39 kernel: Code: 0f 0b dc 00 81 4b 25 c0 8b 43 18 a9 80 00 00 00 74 08 0f 0b
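Every oops in this comment ends with the same `Code:` bytes, `0f 0b dc 00 81 4b 25 c0 ...`. As a cross-check, those bytes can be decoded by hand. The sketch below assumes the 2.4 i386 verbose `BUG()` layout, i.e. a `ud2` opcode (`0f 0b`) followed by a 16-bit line number and a 32-bit pointer to the file name string, both little-endian. That layout is an assumption about these particular kernel builds, and `decode_bug_bytes` is a hypothetical helper written for this example.

```python
import struct

def decode_bug_bytes(code_hex):
    """Decode 'ud2; .word line; .long file_ptr' from an oops Code: line.

    Assumes the 2.4 i386 verbose BUG() layout; raises ValueError when the
    bytes do not start with the ud2 opcode (0f 0b).
    """
    raw = bytes.fromhex(code_hex.replace(" ", ""))
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("Code bytes do not start with ud2 (0f 0b)")
    # "<HI": little-endian unsigned 16-bit line number, then a 32-bit pointer.
    line, file_ptr = struct.unpack_from("<HI", raw, 2)
    return line, file_ptr

line, file_ptr = decode_bug_bytes("0f 0b dc 00 81 4b 25 c0")
print(line, hex(file_ptr))  # 220 0xc0254b81
```

Under that assumption, the decoded line number 220 agrees with the "kernel BUG at page_alloc.c:220!" message, and the second value would be the kernel address of the file name string.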
Hi, I was just wondering whether the output above helped at all. I went into the kernel configuration and turned on all the debugging options under the kernel debugging section. I then enabled SMP and have now lost 4 separate nodes. I will bring them back up and see what information they can provide.
This bug has been inappropriately marked MODIFIED. Please review the bug life cycle information at http://bugzilla.redhat.com/bugzilla/bug_status.cgi
Hi, I was wondering if anyone has been able to respond??? Thanks, Pauld
Hi, After looking through all the logs, I noticed this message, which is common to every machine:

..MP-BIOS bug: 8254 timer not connected to IO-APIC

[root@mach-0-30 log]# cat /proc/interrupts
           CPU0       CPU1
  0:     288851          0    IO-APIC-edge  timer
  1:          2          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:        207          0    IO-APIC-edge  serial
  8:          1          0    IO-APIC-edge  rtc
 14:       8073          0    IO-APIC-edge  ide0
 30:      53277          0   IO-APIC-level  eth0
 31:     161010          0   IO-APIC-level  eth1
NMI:          0          0
LOC:     288532     288541
ERR:          0
MIS:          0
[root@mach-0-30 log]#
[root@mach-0-30 log]# uname -a
Linux mach-0-30 2.4.19 #7 SMP Thu Dec 12 13:49:51 PST 2002 i686 unknown
[root@mach-0-30 log]#
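One thing that stands out in the /proc/interrupts output above is that every device interrupt was counted on CPU0, while only the LOC row shows counts on both CPUs. The following Python sketch sums the per-CPU counts for the numbered IRQ rows. `per_cpu_irq_totals` and `SAMPLE` are hypothetical names, and the parser only assumes the column layout shown in this comment. Whether this interrupt distribution has anything to do with the page_alloc.c BUG is not established by the report.

```python
def per_cpu_irq_totals(text):
    """Sum interrupt counts per CPU for the numbered IRQ rows only."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row, e.g. ["CPU0", "CPU1"]
    totals = dict.fromkeys(cpus, 0)
    for row in lines[1:]:
        label, _, rest = row.partition(":")
        if not label.strip().isdigit():
            continue  # skip the NMI/LOC/ERR/MIS summary rows
        counts = rest.split()[:len(cpus)]
        for cpu, value in zip(cpus, counts):
            totals[cpu] += int(value)
    return totals

# Values copied from the mach-0-30 output above.
SAMPLE = """\
           CPU0       CPU1
  0:     288851          0    IO-APIC-edge  timer
  1:          2          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:        207          0    IO-APIC-edge  serial
  8:          1          0    IO-APIC-edge  rtc
 14:       8073          0    IO-APIC-edge  ide0
 30:      53277          0   IO-APIC-level  eth0
 31:     161010          0   IO-APIC-level  eth1
NMI:          0          0
LOC:     288532     288541
ERR:          0
MIS:          0
"""

print(per_cpu_irq_totals(SAMPLE))  # {'CPU0': 511421, 'CPU1': 0}
```

Running the same function on the output from a node that has not crashed would show whether the all-on-CPU0 pattern is specific to the failing machines.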
This happened to me over the weekend. Is there anything else I can provide to help? [root@kmc2 log]# ksymoops -k ./ksyms.3 < ./oops.txt ksymoops 2.4.1 on i686 2.4.7-10enterprise. Options used -V (default) -k ./ksyms.3 (specified) -l /proc/modules (default) -o /lib/modules/2.4.7-10enterprise/ (default) -m /boot/System.map-2.4.7-10enterprise (default) Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx ksymoops: No such file or directory Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod ksymoops: No such file or directory Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod ksymoops: No such file or directory Warning (compare_ksyms_lsmod): module 3c59x is in lsmod but not in ksyms, probably no symbols exported Warning (compare_ksyms_lsmod): module appletalk is in lsmod but not in ksyms, probably no symbols exported Warning (compare_ksyms_lsmod): module eepro100 is in lsmod but not in ksyms, probably no symbols exported Warning (compare_ksyms_lsmod): module ipx is in lsmod but not in ksyms, probably no symbols exported Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c01c09e0, System.map says c0160900. Ignoring ksyms_base entry Warning (compare_maps): mismatch on symbol sd , sd_mod says f881cce4, /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/sd_mod.o says f881cba0. Ignoring /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/sd_mod.o entry Warning (compare_maps): mismatch on symbol proc_scsi , scsi_mod says f8818088, /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o says f8816910. Ignoring /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o entry Warning (compare_maps): mismatch on symbol scsi_devicelist , scsi_mod says f88180b4, /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o says f881693c. 
Ignoring /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o entry Warning (compare_maps): mismatch on symbol scsi_hostlist , scsi_mod says f88180b0, /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o says f8816938. Ignoring /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o entry Warning (compare_maps): mismatch on symbol scsi_hosts , scsi_mod says f88180b8, /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o says f8816940. Ignoring /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o entry Warning (compare_maps): mismatch on symbol scsi_logging_level , scsi_mod says f8818084, /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o says f881690c. Ignoring /lib/modules/2.4.7-10enterprise/kernel/drivers/scsi/scsi_mod.o entry kernel BUG at page_alloc.c:220! invalid operand: 0000 CPU: 0 EIP: 0010:[<c013620a>] Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010086 eax: 00000020 ebx: c0263540 ecx: c02616dc edx: 12f58a7a esi: c0263540 edi: 00000000 ebp: 00000000 esp: e6101da8 ds: 0018 es: 0018 ss: 0018 Process sh (pid: 5906, stackpage=e6101000) Stack: c02490a3 000000dc 00000000 00000283 c0263964 00000000 c0263540 c0263540 c0263a24 00000000 000000d2 c01365c4 00000001 000000d2 dcbf5220 00000000 c99bd464 c01367df 000000d2 00000000 c0263a20 d8f030c0 dcbf5220 00104000 Call Trace: [<c02490a3>] [<c01365c4>] [<c01367df>] [<c0129d75>] [<c012a973>] [<c01172c0>] [<c0117466>] [<c0125443>] [<c01172c0>] [<c0107268>] Code: 0f 0b 59 8b 56 08 5b 89 d3 8b 53 04 8b 03 89 50 04 89 02 ff >>EIP; c013620a <rmqueue+7a/300> <===== Trace; c02490a3 <call_spurious_interrupt+1eaca/24d47> Trace; c01365c4 <_wrapped_alloc_pages+74/280> Trace; c01367df <__alloc_pages+f/a0> Trace; c0129d75 <do_wp_page+1b5/410> Trace; c012a973 <handle_mm_fault+103/150> Trace; c01172c0 <do_page_fault+0/540> Trace; c0117466 <do_page_fault+1a6/540> Trace; c0125443 <sys_rt_sigaction+93/f0> Trace; c01172c0 <do_page_fault+0/540> Trace; c0107268 
<error_code+38/40> Code; c013620a <rmqueue+7a/300> 00000000 <_EIP>: Code; c013620a <rmqueue+7a/300> <===== 0: 0f 0b ud2a <===== Code; c013620c <rmqueue+7c/300> 2: 59 pop %ecx Code; c013620d <rmqueue+7d/300> 3: 8b 56 08 mov 0x8(%esi),%edx Code; c0136210 <rmqueue+80/300> 6: 5b pop %ebx Code; c0136211 <rmqueue+81/300> 7: 89 d3 mov %edx,%ebx Code; c0136213 <rmqueue+83/300> 9: 8b 53 04 mov 0x4(%ebx),%edx Code; c0136216 <rmqueue+86/300> c: 8b 03 mov (%ebx),%eax Code; c0136218 <rmqueue+88/300> e: 89 50 04 mov %edx,0x4(%eax) Code; c013621b <rmqueue+8b/300> 11: 89 02 mov %eax,(%edx) Code; c013621d <rmqueue+8d/300> 13: ff 00 incl (%eax) kernel BUG at page_alloc.c:220! invalid operand: 0000 CPU: 0 EIP: 0010:[<c013620a>] EFLAGS: 00010086 Warning (Oops_read): Code line not seen, dumping what data is available >>EIP; c013620a <rmqueue+7a/300> <===== 12 warnings and 3 errors issued. Results may not be reliable. [root@kmc2 log]#
*** Bug 80023 has been marked as a duplicate of this bug. ***
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/