From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040207 Firefox/0.8

Description of problem:
We have 9 dual-CPU Opterons running 2.4.21-9.ELsmp (they also crashed with the default kernel that came with RHES3.0) that constantly crash when a heavy load is applied to them. All of them are registered with the Red Hat Network if you want to get any information from them on what they are running. They are under the account xilinx-world, and they are all the machines whose names start with lx64. Let me know what you need for this.

Version-Release number of selected component (if applicable):
2.4.21-9.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Give the machine a heavy CPU load

Additional info:
At minimum we need you to capture the console output of the crash and attach it to this bug report. The easiest way to do this is to boot with the parameters "console=ttyS0,57600 console=tty0" and set up a second system to capture the serial output. If you could also give us some idea of the application mix you were running at the time of the crash that would help too.
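For the capture side, here is a minimal Python sketch of what the second system could run. It is only an illustration, not something posted in this bug: the device path `/dev/ttyS0` on the capturing machine, the log format, and the helper names `open_serial` and `capture_console` are all assumptions. The speed must match whatever is passed in `console=ttyS0,...` on the crashing machine.

```python
import os
import time


def open_serial(path="/dev/ttyS0", speed_name="B57600"):
    """Open a serial device read-only in raw mode (Unix only).

    The path and speed are assumptions; the speed has to match the
    console= boot parameter of the machine being watched.
    """
    import termios
    import tty

    fd = os.open(path, os.O_RDONLY | os.O_NOCTTY)
    tty.setraw(fd)                      # no echo, no line editing
    attrs = termios.tcgetattr(fd)
    speed = getattr(termios, speed_name)
    attrs[4] = speed                    # input speed
    attrs[5] = speed                    # output speed
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return os.fdopen(fd, "rb", buffering=0)


def capture_console(stream, log, clock=time.time):
    """Copy console lines from a binary stream to a text log.

    Each line is prefixed with a timestamp and flushed immediately, so
    the last messages before a hang are not lost in a buffer.
    Returns the number of lines written.
    """
    count = 0
    for raw in stream:
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"{clock():.3f} {text}\n")
        log.flush()
        count += 1
    return count
```

A typical use would be `with open_serial() as port, open("console.log", "a") as log: capture_console(port, log)`, left running until the monitored machine dies.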
This is what it did when it booted up, unfortunately, it did not say anything to the console when it died.

Bootdata ok (command line is ro root=LABEL=/ console=ttyS0,9600 console=tty0)
Linux version 2.4.21-9.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-26)) #1 SMP Thu Jan 8 16:52:31 EST 2004
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000fbff0000 (usable)
BIOS-e820: 00000000fbff0000 - 00000000fbfff000 (ACPI data)
BIOS-e820: 00000000fbfff000 - 00000000fc000000 (ACPI NVS)
BIOS-e820: 00000000ff7c0000 - 0000000100000000 (reserved)
kernel direct mapping tables upto 10100000000 @ 8000-d000
Scanning NUMA topology in Northbridge 24
Node 0 MemBase 0000000000000000 Limit 000000007fffffff
Node 1 MemBase 0000000080000000 Limit 00000000fbff0000
Using node hash shift of 24
Bootmem setup node 0 0000000000000000-000000007fffffff
Bootmem setup node 1 0000000080000000-00000000fbff0000
found SMP MP-table at 000ff780
hm, page 000ff000 reserved twice.
hm, page 00100000 reserved twice.
hm, page 000f9000 reserved twice.
hm, page 000fa000 reserved twice.
setting up node 0 0-7ffff
On node 0 totalpages: 524287
zone(0): 4096 pages.
zone(1): 520191 pages.
zone(2): 0 pages.
setting up node 1 80000-fbff0
On node 1 totalpages: 507888
zone(0): 0 pages.
zone(1): 507888 pages.
zone(2): 0 pages.
ACPI: RSDP (v000 ACPIAM ) @ 0x00000000000f4500
ACPI: RSDT (v001 A M I OEMRSDT 01024.00801) @ 0x00000000fbff0000
ACPI: FADT (v001 A M I OEMFACP 01024.00801) @ 0x00000000fbff0200
ACPI: MADT (v001 A M I OEMAPIC 01024.00801) @ 0x00000000fbff0380
ACPI: OEMB (v001 A M I OEMBIOS 01024.00801) @ 0x00000000fbfff040
ACPI: ASF! (v001 AMIASF AMDSTRET 00000.00001) @ 0x00000000fbff3340
ACPI: DSDT (v001 0ABCF 0ABCF007 00000.00007) @ 0x0000000000000000
ACPI: BIOS passes blacklist
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
IOAPIC[0]: Assigned apic_id 2
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, IRQ 0-23
ACPI: IOAPIC (id[0x03] address[0xfebff000] global_irq_base[0x18])
IOAPIC[1]: Assigned apic_id 3
IOAPIC[1]: apic_id 3, version 17, address 0xfebff000, IRQ 24-27
ACPI: IOAPIC (id[0x04] address[0xfebfe000] global_irq_base[0x1c])
IOAPIC[2]: Assigned apic_id 4
IOAPIC[2]: apic_id 4, version 17, address 0xfebfe000, IRQ 28-31
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
Using ACPI (MADT) for SMP configuration information
Kernel command line: ro root=LABEL=/ console=ttyS0,9600 console=tty0
Initializing CPU#0
spurious 8259A interrupt: IRQ7.
time.c: Detected 1.193182 MHz PIT timer.
time.c: Detected 1794.449 MHz TSC timer.
Console: colour VGA+ 80x25
Calibrating delay loop... 3578.26 BogoMIPS
Memory: 4012448k/4128704k available (1888k kernel code, 0k reserved, 1939k data, 224k init)
Dentry cache hash table entries: 262144 (order: 10, 4194304 bytes)
Inode cache hash table entries: 262144 (order: 10, 4194304 bytes)
Mount cache hash table entries: 256 (order: 0, 4096 bytes)
Buffer cache hash table entries: 262144 (order: 9, 2097152 bytes)
Page-cache hash table entries: 524288 (order: 10, 4194304 bytes)
CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way)
CPU: L2 Cache: 1024K (64 bytes/line/8 way)
Machine Check Reporting enabled for CPU#0
POSIX conformance testing by UNIFIX
mtrr: v2.02 (20020716))
CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way)
CPU: L2 Cache: 1024K (64 bytes/line/8 way)
CPU0: AMD Opteron(tm) Processor 244 stepping 01
per-CPU timeslice cutoff: 5120.13 usecs.
task migration cache decay timeout: 10 msecs.
Booting processor 1/1 rip 6000 page 000001000443e000
Initializing CPU#1
Calibrating delay loop... 3578.26 BogoMIPS
CPU: L1 I Cache: 64K (64 bytes/line/2 way), D cache 64K (64 bytes/line/2 way)
CPU: L2 Cache: 1024K (64 bytes/line/8 way)
Machine Check Reporting enabled for CPU#1
CPU1: AMD Opteron(tm) Processor 244 stepping 01
Total of 2 processors activated (7156.53 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=0
testing the IO APIC....................... .................................... done.
Using local APIC timer interrupts.
Detected 12.461 MHz APIC timer.
cpu: 0, clocks: 1993833, slice: 664611
CPU0<T0:1993824,T1:1329200,D:13,S:664611,C:1993833>
cpu: 1, clocks: 1993833, slice: 664611
CPU1<T0:1993824,T1:664592,D:10,S:664611,C:1993833>
checking TSC synchronization across CPUs: passed.
time.c: Using PIT/TSC based timekeeping.
Starting migration thread for cpu 0
Starting migration thread for cpu 1
ACPI: Subsystem revision 20030619
PCI: Using configuration type 1
tbxface-0117 [03] acpi_load_tables : ACPI Tables successfully acquired
Parsing all Control Methods:....................................................................................................................................Table [DSDT](id F004) - 428 Objects with 37 Devices 132 Methods 13 Regions
ACPI Namespace successfully loaded at root ffffffff80564600
evxfevnt-0093 [04] acpi_enable : Transition to ACPI mode successful
evgpeblk-0748 [06] ev_create_gpe_block : GPE 00 to 15 [_GPE] 2 regs at 0000000000005020 on int 9
evgpeblk-0748 [06] ev_create_gpe_block : GPE 16 to 47 [_GPE] 4 regs at 00000000000050B0 on int 9
Completing Region/Field/Buffer/Package initialization:..........................................................................................
Initialized 12/14 Regions 26/26 Fields 36/36 Buffers 16/16 Packages (437 nodes)
Executing all Device _STA and_INI methods:......................................
38 Devices found containing: 38 _STA, 0 _INI methods
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: System [ACPI] (supports S0 S1 S4 S5)
ACPI: PCI Root Bridge [PCI0] (00:00)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 11 12 14 15, disabled)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 5
PCI: Using ACPI for IRQ routing
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 3868M
PCI-DMA: Disabling IOMMU.
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
Starting kswapd
VFS: Disk quotas vdquot_6.5.1
aio_setup: num_physpages = 258044
aio_setup: sizeof(struct page) = 104
Hugetlbfs mounted.
Total HugeTLB memory allocated, 0
IA32 emulation $Id: sys_ia32.c,v 1.56 2003/04/10 10:45:37 ak Exp $
initialize_kbd: Keyboard reset failed, no ACK
pty: 2048 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI SERIAL_ACPI enabled
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
register_serial(): autoconfig failed
register_serial(): autoconfig failed
Real Time Clock Driver v1.10e
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD_IDE: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) UDMA100 controller on pci00:07.1
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:DMA
hda: ST380011A, ATA DISK drive
hdd: CD-232E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 156301488 sectors (80026 MB) w/2048KiB Cache, CHS=9729/255/63, UDMA(100)
ide-floppy driver 0.99.newide
Partition check:
hda: hda1 hda2 hda3
ide-floppy driver 0.99.newide
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 16384 buckets, 256Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
Linux IP multicast router 0.06 plus PIM-SM
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
VFS: Mounted root (ext2 filesystem).
Journalled Block Device driver loaded
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: ide0(3,3): orphan cleanup on readonly fs
EXT3-fs: ide0(3,3): 1 orphan inode deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Freeing unused kernel memory: 224k freed
Can you characterize exactly how it "died" then? Did the system simply stop responding (i.e. hang)? Did it show any anomalous symptoms of impending failure prior to the crash?

You can try to get more information when the systems die in the following way:

Enable the "Magic sysrq key" as follows:

# echo 1 > /proc/sys/kernel/sysrq

Then issue the following key sequences at the console to get debug info:

Alt-SysRq-t (prints running tasks)
Alt-SysRq-m (prints memory info)
Alt-SysRq-p (prints current state of CPU)

If you get any useful output from these, capture it and attach it to this bug report.
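The same requests can be scripted while user space is still responsive, for example to take a baseline dump shortly before the load is applied. The sketch below is an illustration under assumptions: it relies on the kernel exposing `/proc/sysrq-trigger`, which is not mentioned in this bug, and the helper names are hypothetical. Once the machine has fully hung, a script cannot run and only the keyboard sequences above will work.

```python
from pathlib import Path

# Letters correspond to the key sequences listed above:
# Alt-SysRq-t (tasks), Alt-SysRq-m (memory), Alt-SysRq-p (CPU state).
SYSRQ_COMMANDS = {"tasks": "t", "memory": "m", "cpu": "p"}


def enable_sysrq(enable_path="/proc/sys/kernel/sysrq"):
    """Equivalent of `echo 1 > /proc/sys/kernel/sysrq` (needs root)."""
    Path(enable_path).write_text("1\n")


def trigger_sysrq(names, trigger_path="/proc/sysrq-trigger"):
    """Request one dump per name, one write per request.

    trigger_path is an assumption: it only exists on kernels that
    provide /proc/sysrq-trigger. Returns the letters that were sent.
    """
    sent = []
    for name in names:
        letter = SYSRQ_COMMANDS[name]
        with open(trigger_path, "w") as trigger:
            trigger.write(letter)
        sent.append(letter)
    return sent
```

Calling `enable_sysrq()` followed by `trigger_sysrq(["tasks", "memory"])` as root would send the dumps to the kernel log and to any configured serial console.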
When the machine dies, it just hangs. I'll try running the Magic sysrq key, to see if I can get any info when the machine dies.
Created attachment 98074 [details] image of the crash?
The above attachment is what I got when I requested process information with SysRq. It just spewed PID 4990 over and over, and the machine eventually crashed. I'm not sure whether this is due to my use of SysRq or whether this is the actual bug that is causing the machines to crash. The attachment is a JPG.
If you have your system set up for serial console capture as outlined above, the output of SysRq requests should go there as well. Can you capture the output of Alt-SysRq-t and Alt-SysRq-m in this manner?
Any more info on this bug?
Created attachment 98273 [details] memory debug info
Created attachment 98274 [details] task info
Let me know if you need any more information. Thanks, matt
Could you do the "task info" dump again? The one you attached is garbled in the middle. Did you hit "Alt-sysrq-t" several times? If you hit it while it was still printing out, the second CPU may have gotten it and started processing it as well (I know, it's a bug in SysRQ processing, but still...). Try Alt-sysrq-t *just once* and send me the results... thanks!
I just unmarked Jim's last comment from being private, which seemed to be a mistake.
We might have a solution on our side. The vendor gave us a BIOS upgrade, and so far the machines that have had the BIOS upgrade have not crashed. I would give it a few more days and then close this ticket. Thanks, matt
Are your machines still running okay? In any case, could you run /usr/sbin/dmidecode and attach its output to this bug report? That way we'll know what BIOS rev you're running. Thanks!
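To compare BIOS revisions across several machines, the relevant fields can be pulled out of the dmidecode output with a short script. This is a rough sketch, not part of the original exchange: it assumes the output contains a "BIOS Information" heading followed by "Key: Value" lines such as Vendor, Version, and Release Date, and the function names are hypothetical.

```python
import subprocess


def parse_bios_info(dmi_text):
    """Return the Key: Value pairs found under 'BIOS Information'.

    Stops at the first blank line or the next 'Handle' record.
    Lines without a value (for example list headings) are skipped.
    """
    info = {}
    in_bios = False
    for line in dmi_text.splitlines():
        stripped = line.strip()
        if stripped == "BIOS Information":
            in_bios = True
            continue
        if in_bios:
            if not stripped or stripped.startswith("Handle"):
                break
            key, sep, value = stripped.partition(":")
            if sep and value.strip():
                info[key.strip()] = value.strip()
    return info


def read_bios_info(cmd="/usr/sbin/dmidecode"):
    """Run dmidecode (normally requires root) and parse its output."""
    result = subprocess.run([cmd], capture_output=True, text=True, check=True)
    return parse_bios_info(result.stdout)
```

Running `read_bios_info()` on a machine before and after the firmware update would make the version change easy to confirm.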
Created attachment 99455 [details] dmi output
The problem was resolved with the BIOS upgrade. The dmidecode output is from the machine after the BIOS update was applied to it.
Closing as NOTABUG since it was a system firmware issue.