Bug 170239

Summary: Kernel BUG at panic:74, invalid operand: 0000 [1] SMP
Product: Red Hat Enterprise Linux 4
Reporter: Mark Williamson <mjw>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED CANTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: jbaron, peterm
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-02-24 21:08:29 UTC

Description Mark Williamson 2005-10-09 23:14:52 UTC
Description of problem:

Kernel BUG at panic:74, invalid operand: 0000 [1] SMP

Version-Release number of selected component (if applicable):

The system runs RHEL AS 4 Update 2. The kernel is a custom build from
kernel-2.6.9-22.EL.src.rpm with two modifications:

1) CONFIG_NR_CPUS raised from 8 to 16 in kernel-2.6.9-x86_64-smp.config

2) A workaround for a reboot problem, applied as the following patch
(see http://www.iwill.net/product_imgs/90/RHEL4_Update1_Dual_core.PDF):

diff -Naur linux-2.6.9/arch/x86_64/kernel/reboot.c linux-2.6.9-mjw/arch/x86_64/kernel/reboot.c
--- linux-2.6.9/arch/x86_64/kernel/reboot.c     2005-10-09 23:27:36.000000000 +0100
+++ linux-2.6.9-mjw/arch/x86_64/kernel/reboot.c 2005-10-10 00:03:43.000000000 +0100
@@ -113,7 +113,7 @@
        smp_stop_cpu();

        /* AP calling this. Just halt */
-       if (cpuid != boot_cpu_id) {
+       if (cpuid != x86_apicid_to_cpu(boot_cpu_id)) {
                for (;;)
                        asm("hlt");
        }
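
The intent of the patch above can be modelled outside the kernel. The sketch below is only an illustration: it assumes, as the patch implies, that boot_cpu_id holds an APIC ID while cpuid is a logical CPU number. The mapping table, the boot APIC ID, and the helper names are hypothetical, not values read from this machine.

```python
# Hypothetical logical-CPU -> APIC-ID table. Real values come from the
# firmware tables; these numbers are made up for illustration.
APICID_OF_CPU = {0: 16, 1: 17, 2: 18, 3: 19}
BOOT_APICID = 16  # hypothetical APIC ID of the boot processor


def apicid_to_cpu(apicid):
    """Toy stand-in for the kernel's x86_apicid_to_cpu() lookup."""
    for cpu, candidate in APICID_OF_CPU.items():
        if candidate == apicid:
            return cpu
    raise KeyError(f"no logical CPU with APIC ID {apicid}")


def halts_unpatched(cpu):
    # Original test: compares a logical CPU number with an APIC ID.
    return cpu != BOOT_APICID


def halts_patched(cpu):
    # Patched test: both sides are logical CPU numbers.
    return cpu != apicid_to_cpu(BOOT_APICID)


for cpu in sorted(APICID_OF_CPU):
    print(cpu, halts_unpatched(cpu), halts_patched(cpu))
```

Under these assumed values, the unpatched comparison treats every CPU, including the boot CPU, as an application processor and halts it, while the patched comparison lets the boot CPU continue.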


Hardware Environment:
8-CPU dual-core Opteron system: http://www.iwill.net/product_2.asp?p_id=90&sp=Y


How reproducible:
Highly reproducible; follow the steps below.

Steps to Reproduce:
1. Run the chemistry application (it appears as l502.exe in the trace below)
2. Wait a few minutes
  
Actual results:
This is captured using netdump:

C and OSHP methods do not exist
usbhid: probe of 2-3:1.0 failed with error -5
ip_tables: (C) 2000-2002 Netfilter core team
ip_tables: (C) 2000-2002 Netfilter core team

CPU 30: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d7e254

CPU 22: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d821fe

CPU 16: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d82f41

CPU 24: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d8482e

CPU 28: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d84b3e

CPU 18: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d864b7

CPU 20: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d871f0

CPU 26: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d85b5c
Kernel panic - not syncing: Machine check
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:74
invalid operand: 0000 [1] SMP
CPU 14
Modules linked in: md5 ipv6 netconsole netdump autofs4 sunrpc ds yenta_socket
pcmcia_core dm_mirror dm_mod button battery ac joydev ohci_hcd hw_random e100
mii e1000 floppy ext3 jbd 3w_xxxx sd_mod scsi_mod
Pid: 3058, comm: l502.exe Tainted: G   M  2.6.9-22mjw.EL.rootsmp
RIP: 0010:[<ffffffff8013691a>] <ffffffff8013691a>{panic+211}
RSP: 0000:000001023ff60d18  EFLAGS: 00010086
RAX: 000000000000002d RBX: ffffffff80317ca1 RCX: 0000000000000046
RDX: 0000000000006c99 RSI: 0000000000000046 RDI: ffffffff803d7f20
RBP: 0000000000000900 R08: 000000000000000d R09: ffffffff80317ca1
R10: 0000000002000000 R11: 0000000000000061 R12: 00000000ffffffff
R13: ffffffff803cf1a0 R14: 0000053843d7ceca R15: ffffffff80317ca1
FS:  0000000040812960(005b) GS:ffffffff804d5c80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002aeff6a0b8 CR3: 000000033ff54000 CR4: 00000000000006e0
Process l502.exe (pid: 3058, threadinfo 000001023bf8e000, task 000001013f1b9030)
Stack: 0000003000000008 000001023ff60df8 000001023ff60d38 0000053843d85b5c
       0000000000006c58 0000000000000046 0000000000006c6c 0000000000000046
       000000000000000d 0000000000000000
Call Trace:<ffffffff801176e4>{print_mce+136} <ffffffff801177bc>{mce_available+0}
       <ffffffff80117b0f>{do_machine_check+825}
<ffffffff801111db>{machine_check+127}


Code: 0f 0b 31 72 31 80 ff ff ff ff 4a 00 31 ff e8 c3 c4 fe ff e8
RIP <ffffffff8013691a>{panic+211} RSP <000001023ff60d18>
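
To compare the machine-check records in the log above, the lines can be parsed into fields. This is a minimal sketch: the regular expression is written against the lines quoted in this report only, and reading the first hex field as MCGSTATUS is an assumption based on the mcelog output in this report ("MCGSTATUS 4"). The sample holds three of the eight records.

```python
import re

# Matches the panic-time MCE lines quoted in this report.
# Groups: CPU number, MCGSTATUS (assumed), bank number, bank status word.
MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check Exception:\s+([0-9a-f]+) Bank (\d+): ([0-9a-f]{16})"
)


def parse_mce_lines(log):
    """Return one dict per 'Machine Check Exception' line found in log."""
    records = []
    for cpu, mcg, bank, status in MCE_LINE.findall(log):
        records.append({
            "cpu": int(cpu),
            "mcgstatus": int(mcg, 16),
            "bank": int(bank),
            "status": int(status, 16),
        })
    return records


SAMPLE = """\
CPU 30: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d7e254

CPU 22: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d821fe

CPU 16: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 53843d82f41
"""

records = parse_mce_lines(SAMPLE)
print(sorted(r["cpu"] for r in records))
print({hex(r["status"]) for r in records})
```

Collecting the status words into a set makes it easy to see whether every CPU reported the same bank 4 error.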


Expected results:

Program does not crash machine

Additional info:

MCE record decoded with mcelog:

[root@f01 kernel-2.6.9]# cat /tmp/mce.txt | mcelog --ascii
CPU 28 4 northbridge TSC 10311caf801c4
  Northbridge Watchdog error
       bit57 = processor context corrupt
       bit61 = error uncorrected
  bus error 'generic participation, request timed out
      generic error mem transaction
      generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4
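
The flags mcelog prints can be cross-checked by decoding the STATUS word by hand. The sketch below uses the architectural x86 machine-check bit positions (bits 63 to 57 of the status register, and the bus-error code form 0000 1PPT RRRR IILL). Treating bits 19:16 as the northbridge extended error code is an assumption about this Opteron generation, and the function names are illustrative.

```python
STATUS = 0xB200000000070F0F   # bank 4 status word from this report
MCGSTATUS = 0x4               # global status value from this report

# Architectural flag bits in a machine-check bank status register.
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Flag bits in the global machine-check status register.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP (machine check in progress)"}


def set_flags(value, table):
    """Names of the flags whose bit is set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bus_error(code):
    """Split a 16-bit bus-error code (0000 1PPT RRRR IILL) into raw fields."""
    if code >> 11 != 1:
        return None  # not in bus-error form
    return {
        "participation": (code >> 9) & 0x3,
        "timeout": (code >> 8) & 0x1,
        "request": (code >> 4) & 0xF,
        "mem_io": (code >> 2) & 0x3,
        "level": code & 0x3,
    }


print(set_flags(STATUS, STATUS_FLAGS))
print(set_flags(MCGSTATUS, MCG_FLAGS))
print(decode_bus_error(STATUS & 0xFFFF))
print("extended error code (assumed bits 19:16):", (STATUS >> 16) & 0xF)
```

The set bits include 61 and 57, which agrees with the "error uncorrected" and "processor context corrupt" lines in the mcelog output, and the timeout field is 1, which agrees with "request timed out".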

Comment 1 Jim Paradis 2006-02-24 21:08:29 UTC
We do not support custom kernels.

As of Update 3, RHEL4 now provides a "largesmp" kernel that supports more than 8
processors.  Please try the RHEL4 U3 beta and report whether this addresses your
problem.

I am closing this issue as CANTFIX because it is reported against an unsupported
kernel.  If you still have problems with the latest kernel, please file a
separate support request.