Bug 116134 - Machines die under heavy load
Summary: Machines die under heavy load
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Jim Paradis
QA Contact:
Depends On:
Reported: 2004-02-18 15:40 UTC by matt poepping
Modified: 2013-08-06 01:03 UTC

Clone Of:
Last Closed: 2004-04-15 20:04:14 UTC

Attachments
image of the crash? (75.91 KB, text/plain)
2004-02-26 15:37 UTC, matt poepping
memory debug info (426.12 KB, text/plain)
2004-03-04 00:43 UTC, matt poepping
task info (593.63 KB, text/plain)
2004-03-04 00:43 UTC, matt poepping
dmi output (13.90 KB, text/plain)
2004-04-15 19:33 UTC, matt poepping

Description matt poepping 2004-02-18 15:40:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6)
Gecko/20040207 Firefox/0.8

Description of problem:
We have 9 dual-CPU Opteron machines running 2.4.21-9.ELsmp that
constantly crash when a heavy load is applied to them. They also
crashed with the default kernel that came with RHES3.0.

All of them are registered with the Red Hat Network if you want to get
any information from them on what they are running. They are under the
account xilinx-world, and they are all the machines whose names start
with lx64.

Let me know what you need for this.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Apply a heavy CPU load to the machine.

Additional info:

Comment 1 Jim Paradis 2004-02-19 16:25:17 UTC
At minimum we need you to capture the console output of the crash and
attach it to this bug report.  The easiest way to do this is to boot
with the parameters "console=ttyS0,57600 console=tty0" and set up a
second system to capture the serial output.

If you could also give us some idea of the application mix you were
running at the time of the crash that would help too.
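
For the capturing machine, a minimal sketch of one way to log the
serial output with a timestamp on every line, so that the time of the
last message before a hang is on record. The port name /dev/ttyS0 on
the capturing side, the log file name crash-console.log and the helper
name stamp_lines are assumptions for illustration only; the speed has
to match whatever was given in the console= boot parameter.

```shell
# Prefix every line read from standard input with a UTC timestamp.
# stamp_lines is a hypothetical helper name, not an existing tool.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$line"
    done
}

# Possible use on the capturing machine (assumed port and file names):
#   stty -F /dev/ttyS0 57600 cs8 -parenb -cstopb raw -echo
#   stamp_lines < /dev/ttyS0 >> crash-console.log
```

If the crashing machine prints nothing when it dies, the timestamp of
the last logged line at least narrows down when it stopped.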

Comment 2 matt poepping 2004-02-25 22:51:17 UTC
This is what it printed when it booted up. Unfortunately, it did not
print anything to the console when it died.

Bootdata ok (command line is ro root=LABEL=/ console=ttyS0,9600
Linux version 2.4.21-9.ELsmp (bhcompile@thor.perf.redhat.com) (gcc
version 3.2.3 20030502 (Red Hat Linux 3.2.3-26)) #1 SMP Thu Jan 8
16:52:31 EST 2004
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000fbff0000 (usable)
 BIOS-e820: 00000000fbff0000 - 00000000fbfff000 (ACPI data)
 BIOS-e820: 00000000fbfff000 - 00000000fc000000 (ACPI NVS)
 BIOS-e820: 00000000ff7c0000 - 0000000100000000 (reserved)
kernel direct mapping tables upto 10100000000 @ 8000-d000
Scanning NUMA topology in Northbridge 24
Node 0 MemBase 0000000000000000 Limit 000000007fffffff
Node 1 MemBase 0000000080000000 Limit 00000000fbff0000
Using node hash shift of 24
Bootmem setup node 0 0000000000000000-000000007fffffff
Bootmem setup node 1 0000000080000000-00000000fbff0000
found SMP MP-table at 000ff780
hm, page 000ff000 reserved twice.
hm, page 00100000 reserved twice.
hm, page 000f9000 reserved twice.
hm, page 000fa000 reserved twice.
setting up node 0 0-7ffff
On node 0 totalpages: 524287
zone(0): 4096 pages.
zone(1): 520191 pages.
zone(2): 0 pages.
setting up node 1 80000-fbff0
On node 1 totalpages: 507888
zone(0): 0 pages.
zone(1): 507888 pages.
zone(2): 0 pages.
ACPI: RSDP (v000 ACPIAM                     ) @ 0x00000000000f4500
ACPI: RSDT (v001 A M I  OEMRSDT  01024.00801) @ 0x00000000fbff0000
ACPI: FADT (v001 A M I  OEMFACP  01024.00801) @ 0x00000000fbff0200
ACPI: MADT (v001 A M I  OEMAPIC  01024.00801) @ 0x00000000fbff0380
ACPI: OEMB (v001 A M I  OEMBIOS  01024.00801) @ 0x00000000fbfff040
ACPI: ASF! (v001 AMIASF AMDSTRET 00000.00001) @ 0x00000000fbff3340
ACPI: DSDT (v001  0ABCF 0ABCF007 00000.00007) @ 0x0000000000000000
ACPI: BIOS passes blacklist
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
IOAPIC[0]: Assigned apic_id 2
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, IRQ 0-23
ACPI: IOAPIC (id[0x03] address[0xfebff000] global_irq_base[0x18])
IOAPIC[1]: Assigned apic_id 3
IOAPIC[1]: apic_id 3, version 17, address 0xfebff000, IRQ 24-27
ACPI: IOAPIC (id[0x04] address[0xfebfe000] global_irq_base[0x1c])
IOAPIC[2]: Assigned apic_id 4
IOAPIC[2]: apic_id 4, version 17, address 0xfebfe000, IRQ 28-31
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0]
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0]
Using ACPI (MADT) for SMP configuration information
Kernel command line: ro root=LABEL=/ console=ttyS0,9600 console=tty0
Initializing CPU#0
spurious 8259A interrupt: IRQ7.
time.c: Detected 1.193182 MHz PIT timer.
time.c: Detected 1794.449 MHz TSC timer.
Console: colour VGA+ 80x25
Calibrating delay loop... 3578.26 BogoMIPS
Memory: 4012448k/4128704k available (1888k kernel code, 0k reserved,
1939k data, 224k init)
Dentry cache hash table entries: 262144 (order: 10, 4194304 bytes)
Inode cache hash table entries: 262144 (order: 10, 4194304 bytes)
Mount cache hash table entries: 256 (order: 0, 4096 bytes)
Buffer cache hash table entries: 262144 (order: 9, 2097152 bytes)
Page-cache hash table entries: 524288 (order: 10, 4194304 bytes)
CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64
bytes/line/2 way)
CPU: L2 Cache: 1024K (64 bytes/line/8 way)
Machine Check Reporting enabled for CPU#0
POSIX conformance testing by UNIFIX
mtrr: v2.02 (20020716))
CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64
bytes/line/2 way)
CPU: L2 Cache: 1024K (64 bytes/line/8 way)
CPU0: AMD Opteron(tm) Processor 244 stepping 01
per-CPU timeslice cutoff: 5120.13 usecs.
task migration cache decay timeout: 10 msecs.
Booting processor 1/1 rip 6000 page 000001000443e000
Initializing CPU#1
Calibrating delay loop... 3578.26 BogoMIPS
CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64
bytes/line/2 way)
CPU: L2 Cache: 1024K (64 bytes/line/8 way)
Machine Check Reporting enabled for CPU#1
CPU1: AMD Opteron(tm) Processor 244 stepping 01
Total of 2 processors activated (7156.53 BogoMIPS).
..TIMER: vector=0x31 pin1=2 pin2=0
testing the IO APIC.......................
.................................... done.
Using local APIC timer interrupts.
Detected 12.461 MHz APIC timer.
cpu: 0, clocks: 1993833, slice: 664611
cpu: 1, clocks: 1993833, slice: 664611
checking TSC synchronization across CPUs: passed.
time.c: Using PIT/TSC based timekeeping.
Starting migration thread for cpu 0
Starting migration thread for cpu 1
ACPI: Subsystem revision 20030619
PCI: Using configuration type 1
 tbxface-0117 [03] acpi_load_tables      : ACPI Tables successfully
Parsing all Control
[DSDT](id F004) - 428 Objects with 37 Devices 132 Methods 13 Regions
ACPI Namespace successfully loaded at root ffffffff80564600
evxfevnt-0093 [04] acpi_enable           : Transition to ACPI mode
evgpeblk-0748 [06] ev_create_gpe_block   : GPE 00 to 15 [_GPE] 2 regs
at 0000000000005020 on int 9
evgpeblk-0748 [06] ev_create_gpe_block   : GPE 16 to 47 [_GPE] 4 regs
at 00000000000050B0 on int 9
Completing Region/Field/Buffer/Package
Initialized 12/14 Regions 26/26 Fields 36/36 Buffers 16/16 Packages
(437 nodes)
Executing all Device _STA and_INI
methods:......................................38 Devices found
containing: 38 _STA, 0 _INI methods
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: System [ACPI] (supports S0 S1 S4 S5)
ACPI: PCI Root Bridge [PCI0] (00:00)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 11 12 14 15,
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 5
PCI: Using ACPI for IRQ routing
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 3868M
PCI-DMA: Disabling IOMMU.
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
VFS: Disk quotas vdquot_6.5.1
aio_setup: num_physpages = 258044
aio_setup: sizeof(struct page) = 104
Hugetlbfs mounted.
Total HugeTLB memory allocated, 0
IA32 emulation $Id: sys_ia32.c,v 1.56 2003/04/10 10:45:37 ak Exp $
initialize_kbd: Keyboard reset failed, no ACK
pty: 2048 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
register_serial(): autoconfig failed
register_serial(): autoconfig failed
Real Time Clock Driver v1.10e
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with
AMD8111: IDE controller at PCI slot 00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with
AMD_IDE: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) UDMA100
controller on pci00:07.1
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:DMA
hda: ST380011A, ATA DISK drive
hdd: CD-232E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=9729/255/63,
ide-floppy driver 0.99.newide
Partition check:
 hda: hda1 hda2 hda3
ide-floppy driver 0.99.newide
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 16384 buckets, 256Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
Linux IP multicast router 0.06 plus PIM-SM
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
VFS: Mounted root (ext2 filesystem).
Journalled Block Device driver loaded
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: ide0(3,3): orphan cleanup on readonly fs
EXT3-fs: ide0(3,3): 1 orphan inode deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Freeing unused kernel memory: 224k freed

Comment 3 Jim Paradis 2004-02-26 04:30:43 UTC
Can you characterize exactly how it "died" then?  Did the system
simply stop responding (i.e. hang)?  Did it show any anomalous
symptoms of impending failure prior to the crash?

You can try to get more information when the systems die in the
following way:

Enable the "Magic sysrq key" as follows:

    # echo 1 > /proc/sys/kernel/sysrq

Then issue the following key sequences at the console to get debug info:

    Alt-SysRq-t  (prints running tasks)
    Alt-SysRq-m  (prints memory info)
    Alt-SysRq-p  (prints current state of CPU)

If you get any useful output from these, capture it and attach it to
this bug report.
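
Note that a value written to /proc/sys/kernel/sysrq does not survive a
reboot, so after each hang and reset it has to be set again before the
key sequences will work. A small sketch of a check that could be run
after boot follows; the function name sysrq_is_enabled is made up for
this example, and the optional path argument exists only so the check
can be tried against a scratch file.

```shell
# Succeed if the SysRq switch file holds a nonzero value.
# Returns 2 if the file cannot be read.
sysrq_is_enabled() {
    file="${1:-/proc/sys/kernel/sysrq}"
    [ -r "$file" ] || return 2
    value=$(cat "$file")
    [ -n "$value" ] && [ "$value" != "0" ]
}

# Possible use as root after a reboot:
#   sysrq_is_enabled || echo 1 > /proc/sys/kernel/sysrq
```

On systems that apply /etc/sysctl.conf at boot, a line reading
"kernel.sysrq = 1" in that file is the usual way to make the setting
permanent; whether these machines are set up that way is an assumption.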

Comment 4 matt poepping 2004-02-26 14:34:51 UTC
When the machine dies, it just hangs.  I'll try the Magic SysRq key to
see if I can get any info when the machine dies.

Comment 5 matt poepping 2004-02-26 15:37:25 UTC
Created attachment 98074
image of the crash?

Comment 6 matt poepping 2004-02-26 15:39:30 UTC
The attachment above shows what happened when I requested process
information with SysRq. It just printed PID 4990 over and over, and the
machine eventually crashed. I am not sure whether this is due to my use
of SysRq or whether this is the actual bug that is causing the machines
to crash.

The attachment is a JPG.

Comment 7 Jim Paradis 2004-02-26 16:55:12 UTC
If you have your system set up for serial console capture as outlined
above, the output of SysRq requests should go there as well.

Can you capture the output of Alt-SysRq-t and Alt-SysRq-m in this manner?

Comment 8 Jim Paradis 2004-03-03 20:26:24 UTC
Any more info on this bug?

Comment 9 matt poepping 2004-03-04 00:43:04 UTC
Created attachment 98273
memory debug info

Comment 10 matt poepping 2004-03-04 00:43:34 UTC
Created attachment 98274
task info

Comment 11 matt poepping 2004-03-04 00:43:53 UTC
Let me know if you need any more information.


Comment 12 Jim Paradis 2004-03-04 21:23:44 UTC
Could you do the "task info" dump again?  The one you attached is
garbled in the middle.  Did you hit "Alt-sysrq-t" several times?  If
you hit it while it was still printing out, the second CPU may have
gotten it and started processing it as well (I know, it's a bug in
SysRQ processing, but still...).  Try Alt-sysrq-t *just once* and send
me the results... thanks!

Comment 13 Ernie Petrides 2004-03-05 23:50:28 UTC
I just unmarked Jim's last comment from being private,
which seemed to be a mistake.

Comment 14 matt poepping 2004-03-09 18:45:43 UTC
We might have a solution on our side. The vendor gave us a BIOS
upgrade, and so far the machines that have had the BIOS upgrade have
not crashed.

I would give it a few more days and then close this ticket.


Comment 15 Jim Paradis 2004-03-15 17:58:59 UTC
Are your machines still running okay?

In any case, could you run /usr/sbin/dmidecode and attach its output
to this bug report?  That way we'll know what BIOS rev you're running.

Comment 16 matt poepping 2004-04-15 19:33:39 UTC
Created attachment 99455
dmi output

Comment 17 matt poepping 2004-04-15 19:35:52 UTC
The problem was resolved by the BIOS upgrade. The dmidecode output is
from the machine after the BIOS update was applied to it.

Comment 18 Jim Paradis 2004-04-15 20:04:14 UTC
Closing as NOTABUG since it was a system firmware issue.
