Bug 101345 - [ACPI] SMP kernels with HT runqueue sharing panic on 16-way x440
Status: CLOSED RAWHIDE
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Blocks: 97942
Reported: 2003-07-30 21:28 EDT by James Cleverdon
Modified: 2007-11-30 17:06 EST

Doc Type: Bug Fix
Last Closed: 2003-08-15 11:24:00 EDT

Attachments
16-way x440 dmesg from 2.4.21-1.1931.2.349.2.2.entsmp kernel (26.58 KB, text/plain)
2003-07-31 16:45 EDT, James Cleverdon
Dmesg from a kernel with acpi_provides_cpus forced to 0. (7.58 KB, text/plain)
2003-07-31 16:56 EDT, James Cleverdon
dmesg from kernel with acpi_provides_cpus forced to 0, without being run through ksymoops (27.07 KB, text/plain)
2003-08-01 15:42 EDT, James Cleverdon
Used very crude kludge to fix sibling map. Still died in sched_map_runqueue. (27.92 KB, text/plain)
2003-08-01 20:45 EDT, James Cleverdon

Description James Cleverdon 2003-07-30 21:28:02 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
On a 16-way x440 none of the installed kernels will run except for the
uni-processor kernel.  All SMP kernels crash in sched_map_runqueue with:

Kernel panic: Attempted to kill init!

The sibling map printed just above it is totally screwed up. This has probably
caused the runqueue merging code to die painfully.

I suspect the acpi_proc_id array contains bad values.
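
One way to make the inconsistency concrete is to check the printed map for symmetry: with two logical CPUs per physical package, if cpu_sibling_map[a] = b then cpu_sibling_map[b] should be a. The following is a minimal sketch, not part of the original report; the function names are hypothetical, the two-siblings-per-package rule is an assumption, and the sample lines are the first seven entries of the boot log quoted under "Additional info".

```python
import re

# First seven sibling-map lines from the boot log in this report.
LOG = """\
WARNING: No sibling found for CPU 0.
cpu_sibling_map[1] = 2
cpu_sibling_map[2] = 1
WARNING: No sibling found for CPU 3.
WARNING: No sibling found for CPU 4.
cpu_sibling_map[5] = 1
cpu_sibling_map[6] = 1
"""

MAP_RE = re.compile(r"cpu_sibling_map\[(\d+)\]\s*=\s*(\d+)")
NONE_RE = re.compile(r"No sibling found for CPU (\d+)")


def parse_sibling_map(text):
    """Return (mapping, orphans) parsed from dmesg-style text."""
    mapping = {int(cpu): int(sib) for cpu, sib in MAP_RE.findall(text)}
    orphans = sorted(int(cpu) for cpu in NONE_RE.findall(text))
    return mapping, orphans


def asymmetric_entries(mapping):
    """CPUs whose claimed sibling does not point back at them."""
    return sorted(cpu for cpu, sib in mapping.items()
                  if mapping.get(sib) != cpu)


mapping, orphans = parse_sibling_map(LOG)
bad = asymmetric_entries(mapping)
print("CPUs with no sibling:", orphans)
print("CPUs with a one-way sibling link:", bad)
```

On this excerpt only CPUs 1 and 2 form a consistent pair; CPUs 5 and 6 both claim CPU 1 as their sibling, and CPUs 0, 3 and 4 have none.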



Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-1.1931.2.349.2.2.ent

How reproducible:
Always

Steps to Reproduce:
1. Boot an SMP kernel on the 16-way x440
2. Crash
3. There is no step 3.
    

Actual Results:  Not a lot, really.

Expected Results:  The kernel should have booted and run properly.

Additional info:

WARNING: No sibling found for CPU 0.
cpu_sibling_map[1] = 2
cpu_sibling_map[2] = 1 
WARNING: No sibling found for CPU 3.
WARNING: No sibling found for CPU 4. 
cpu_sibling_map[5] = 1 
cpu_sibling_map[6] = 1 
WARNING: No sibling found for CPU 7. 
WARNING: No sibling found for CPU 8.
cpu_sibling_map[9] = 1 
cpu_sibling_map[10] = 1
WARNING: No sibling found for CPU 11.
WARNING: No sibling found for CPU 12.
cpu_sibling_map[13] = 1
cpu_sibling_map[14] = 1
WARNING: No sibling found for CPU 15.
WARNING: No sibling found for CPU 16.
cpu_sibling_map[17] = 1
cpu_sibling_map[18] = 1
WARNING: No sibling found for CPU 19.
WARNING: No sibling found for CPU 20.
cpu_sibling_map[21] = 1
cpu_sibling_map[22] = 1
WARNING: No sibling found for CPU 23.
WARNING: No sibling found for CPU 24.
cpu_sibling_map[25] = 1
cpu_sibling_map[26] = 1
WARNING: No sibling found for CPU 27.
WARNING: No sibling found for CPU 28.
cpu_sibling_map[29] = 1
cpu_sibling_map[30] = 1
cpu_sibling_map[31] = 1
mapping CPU#1's runqueue to CPU#2's runqueue.
mapping CPU#2's runqueue to CPU#5's runqueue.
------------[ cut here ]------------
kernel BUG at sched.c:1144!
invalid operand: 0000

CPU:    0
EIP:    0060:[<021229ed>]    Not tainted
EFLAGS: 00010202

EIP is at sched_map_runqueue [kernel] 0xed (2.4.21-1.1931.2.349.2.2.entsmp)
eax: 00000001   ebx: 0242dc80   ecx: 00000001   edx: 00000002
esi: 02430480   edi: 00000005   ebp: 1f201fac   esp: 1f201f90
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 1, stackpage=1f201000)
Stack: 022b0520 00000002 00000005 00000000 00000002 0244c240 00000000 1c7f8000
       023efd1d 00000002 00000005 00000001 00000039 1f200000 00000000 00000000
       00000000 023e7645 0210709b 00000000 02107070 00000000 00000000 00000000
Call Trace:   [<0210709b>] init [kernel] 0x2b (0x1f201fd8)
[<02107070>] init [kernel] 0x0 (0x1f201fe0)
[<0210955d>] kernel_thread_helper [kernel] 0x5 (0x1f201ff0)


Code: 0f 0b 78 04 06 fe 2a 02 e9 78 ff ff ff 8d b6 00 00 00 00 55
 <0>Kernel panic: Attempted to kill init!

----
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000001   ebx: 0242dc80   ecx: 00000001   edx: 00000002
esi: 02430480   edi: 00000005   ebp: 1f201fac   esp: 1f201f90
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 1, stackpage=1f201000)
Stack: 022b0520 00000002 00000005 00000000 00000002 0244c240 00000000 1c7f8000
       023efd1d 00000002 00000005 00000001 00000039 1f200000 00000000 00000000
       00000000 023e7645 0210709b 00000000 02107070 00000000 00000000 00000000
Call Trace:   [<0210709b>] init [kernel] 0x2b (0x1f201fd8)
[<02107070>] init [kernel] 0x0 (0x1f201fe0)
[<0210955d>] kernel_thread_helper [kernel] 0x5 (0x1f201ff0)
Code: 0f 0b 78 04 06 fe 2a 02 e9 78 ff ff ff 8d b6 00 00 00 00 55


>>EIP; 021229ed <sched_map_runqueue+ed/100>   <=====

>>ebx; 0242dc80 <runqueues+a00/14000>
>>esi; 02430480 <runqueues+3200/14000>

Trace; 0210709b <init+2b/190>
Trace; 02107070 <init+0/190>
Trace; 0210955d <kernel_thread_helper+5/18>

Code;  021229ed <sched_map_runqueue+ed/100>
00000000 <_EIP>:
Code;  021229ed <sched_map_runqueue+ed/100>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  021229ef <sched_map_runqueue+ef/100>
   2:   78 04                     js     8 <_EIP+0x8>
Code;  021229f1 <sched_map_runqueue+f1/100>
   4:   06                        push   %es
Code;  021229f2 <sched_map_runqueue+f2/100>
   5:   fe                        (bad)  
Code;  021229f3 <sched_map_runqueue+f3/100>
   6:   2a 02                     sub    (%edx),%al
Code;  021229f5 <sched_map_runqueue+f5/100>
   8:   e9 78 ff ff ff            jmp    ffffff85 <_EIP+0xffffff85>
Code;  021229fa <sched_map_runqueue+fa/100>
   d:   8d b6 00 00 00 00         lea    0x0(%esi),%esi
Code;  02122a00 <rebalance_tick+0/90>
  13:   55                        push   %ebp
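
The instructions that ksymoops shows after ud2a (js, push %es, (bad), sub) are most likely data rather than code. In 2.4-era i386 kernels built with verbose BUG reporting, BUG() conventionally emits ud2 followed by a 16-bit line number and a 32-bit pointer to the file name string. The sketch below decodes the "Code:" line under that assumption; the layout is not confirmed for this exact build, and decode_bug is a hypothetical helper name.

```python
import struct

# "Code:" line copied from the oops above.
CODE = "0f 0b 78 04 06 fe 2a 02 e9 78 ff ff ff 8d b6 00 00 00 00 55"


def decode_bug(code_line):
    """Decode the line number and file-name pointer that follow ud2.

    Assumes the layout: 0f 0b | u16 line (little-endian) | u32 file pointer.
    """
    raw = bytes.fromhex(code_line.replace(" ", ""))
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("Code does not start with ud2 (0f 0b)")
    # "<HI" = little-endian, unpadded: 2-byte unsigned, then 4-byte unsigned.
    line, file_ptr = struct.unpack_from("<HI", raw, 2)
    return line, file_ptr


line, file_ptr = decode_bug(CODE)
print("line:", line)
print("file name pointer:", hex(file_ptr))
```

The decoded line number is 1144, which agrees with the "kernel BUG at sched.c:1144!" message, so the bytes from offset 2 through 7 should be read as BUG() metadata and the real code resumes at the jmp at offset 8.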
Comment 1 Arjan van de Ven 2003-07-31 05:42:42 EDT
Any chance of getting a full dmesg of this boot?
Comment 2 James Cleverdon 2003-07-31 16:45:03 EDT
Created attachment 93306 [details]
16-way x440 dmesg from 2.4.21-1.1931.2.349.2.2.entsmp kernel

Here you go.
Comment 3 James Cleverdon 2003-07-31 16:56:36 EDT
Created attachment 93310 [details]
Dmesg from a kernel with acpi_provides_cpus forced to 0.

The sibling table is still all wrong, but in a different way.
Comment 4 Arjan van de Ven 2003-08-01 05:20:13 EDT
At first sight, it looks like the ACPI CPU enumeration table is incorrect.
Comment 5 Arjan van de Ven 2003-08-01 05:23:07 EDT
Can you provide the full dmesg of the second case without having ksymoops
corrupt it (and with the full boot info intact)?
Comment 6 James Cleverdon 2003-08-01 15:42:16 EDT
Created attachment 93340 [details]
dmesg from kernel with acpi_provides_cpus forced to 0, without being run through ksymoops
Comment 7 James Cleverdon 2003-08-01 20:45:57 EDT
Created attachment 93352 [details]
Used very crude kludge to fix sibling map. Still died in sched_map_runqueue.
Comment 8 Arjan van de Ven 2003-08-04 11:37:36 EDT
OK, I've figured this one out: there were some pretty bad bugs in the ACPI table
parsing code.

This fix will be available with build number 2.4.21-1.1931.2.376, which should
be available from RHN as of tomorrow.

Thank you for the excellent information in this bug report.
Comment 9 Wendy Hung 2003-08-07 16:27:35 EDT
Also saw this on an 8-way x445 with hyperthreading enabled. With HT disabled it
was OK.
An x440 with HT enabled was OK as well.
Will verify the fix with the new kernel.
Comment 10 Jay Turner 2003-08-13 10:40:18 EDT
kernel-2.4.21-1.1931.2.389.ent is now available via RHN.  Does this work a
little better on both the 8-way x445 (with HT) and 16-way x440?
Comment 11 Arjan van de Ven 2003-08-15 11:24:00 EDT
Closing, assuming this is fixed.
