Bug 154620

Summary: kernel has problems with dual core processor boards
Product: [Fedora] Fedora
Reporter: Michal Jaegermann <michal>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3
CC: barryn, jparadis, pfrields, thomas.duffy.99
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-07-15 20:53:43 UTC
Attachments (no flags set on any of them):
  dmesg log from booting with 'noapic'
  An output from dmidecode on the test machine
  results of 'cat /proc/cpuinfo' if a machine sort of booted
  oops caused by rpc.nfsd for 2.6.11-1.1267_FC4smp kernel

Description Michal Jaegermann 2005-04-13 03:02:20 UTC
Description of problem:

Booting Fedora kernels, like 2.6.11-1.14_FC3smp, on an x86_64 SMP board
using dual core processors, is no joy at all.

To begin with, in a default configuration the best outcome is that
networking is dead.  If you get that far you will see something
like "Disabling IRQ #177" (the IRQ number may be affected by other boot
options) and that will be it.

Attempts with 'acpi=off' end in a very quick lockup, and from
what is displayed on the frozen screen it looks as if the problem
is in calibrate_APIC_clock+45.  Various options like 'noapictimer',
'acpi=ht', 'acpi=noirq', 'pci=routeirq', 'numa=noacpi' and so on,
singly or in combination, do not seem to be of any help.
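
When trying many option combinations it is easy to lose track of which ones a given boot actually used. As a minimal sketch (not part of the original report), the helper below splits a kernel command line, such as the contents of /proc/cmdline, into a dictionary. The function name parse_cmdline is hypothetical, and quoted values containing spaces are deliberately not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value}.

    Hypothetical helper for illustration only. Bare flags such as
    'noapic' map to True; 'key=value' options map to the value string.
    Quoted values containing spaces are not handled.
    """
    opts = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        opts[key] = value if sep else True
    return opts


# Example with a made-up command line (the root device is an assumption).
example = parse_cmdline("ro root=/dev/sda1 noapic acpi=off")
print(sorted(example.items()))
```

On a running system the input would be the text read from /proc/cmdline.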

It appears that the only way to boot is with 'noapic' ('nolapic' does
not help), but then there is a catch.  If you can get a machine that
far then you will see:

Initializing CPU#1
spurious 8259A interrupt: IRQ7.
Calibrating delay loop... ...
Kernel BUG at timer:418
invalid operand: 0000 [1] SMP 
CPU 1 
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.11-1.14_FC3smp
RIP: 0010:[<ffffffff80141cae>] <ffffffff80141cae>{cascade+46}
....
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
 Stuck ??
Inquiring remote APIC #1...
... APIC #1 ID: 01000000
... APIC #1 VERSION: 00040010
... APIC #1 SPIV: 000000ff
Only one processor found.

'cat /proc/cpuinfo' does show "cpu cores : 2" but there is only one
CPU.  If you booted with a dead network then there is only one CPU as
well.
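
The mismatch described above (a "cpu cores" value of 2 but only one CPU listed) can be checked mechanically. The following is a rough sketch, not something used in the original report: it counts the "processor" entries in /proc/cpuinfo text and compares that count with the first "cpu cores" value. The function names and the sample text are made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, cpu_cores) parsed from /proc/cpuinfo text.

    logical_cpus is the number of 'processor' entries; cpu_cores is the
    first 'cpu cores' value seen, or None if the field is absent.
    """
    logical = 0
    cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU entries
        key = key.strip()
        if key == "processor":
            logical += 1
        elif key == "cpu cores" and cores is None:
            cores = int(value.strip())
    return logical, cores


def fewer_cpus_than_cores(cpuinfo_text):
    """True when fewer CPUs are listed than 'cpu cores' advertises."""
    logical, cores = count_cpus(cpuinfo_text)
    return cores is not None and logical < cores


# Made-up sample resembling the situation in this report.
SAMPLE = "processor\t: 0\nmodel name\t: example\ncpu cores\t: 2\n"
print(count_cpus(SAMPLE), fewer_cpus_than_cores(SAMPLE))
```

On a live machine the text would come from reading /proc/cpuinfo. Note that on a multi-socket board the expected number of CPUs is higher than a single "cpu cores" value, so this check only catches the case reported here.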

The situation is really the same with the distribution kernel-2.6.9-1.667
as well as with the current rawhide kernel-2.6.11-1.1236_FC4.

A non-smp kernel is bootable with 'acpi=off', and networking is
then operational, but this does not look like a good possibility
in the circumstances either. :-)  There are no oopses with UP kernels.

Attached are the dmesg output from booting an SMP kernel with 'noapic',
the output from 'dmidecode' for the board, and what 'cat /proc/cpuinfo'
shows afterwards.

I am told that on another board of a similar type, namely Tyan 2891,
the results are really the same.

Version-Release number of selected component (if applicable):
2.6.11-1.14_FC3smp

How reproducible:
always

Comment 1 Michal Jaegermann 2005-04-13 03:02:20 UTC
Created attachment 113077 [details]
dmesg log from booting with 'noapic'

Comment 2 Michal Jaegermann 2005-04-13 03:04:06 UTC
Created attachment 113078 [details]
An output from dmidecode on the test machine

Comment 3 Michal Jaegermann 2005-04-13 03:05:31 UTC
Created attachment 113079 [details]
results of 'cat /proc/cpuinfo' if a machine sort of booted

Comment 4 Dave Jones 2005-04-13 03:07:43 UTC
Dual-core support is pretty new upstream, so this isn't going to work properly
until FC3 gets a 2.6.12 kernel, or that support gets backported to a .11 update.


Comment 5 Rik van Riel 2005-04-13 03:12:02 UTC
Michal, if you boot with "numa=off", does the system work?

(or am I mixing up my workarounds?)

Comment 6 Michal Jaegermann 2005-04-13 03:24:23 UTC
> Michal, if you boot with "numa=off", does the system work?

Off the top of my head - no; but the system is now some 25 kilometers away
and I may misremember the various possibilities which I tried.

I should be around the box tomorrow and then I can recheck.

As for 2.6.12 - from what I have seen in sources 2.6.11-1.1236_FC4 seems
to be 2.6.12-rc2 plus other patches so I hoped that at least this one
will behave. :-(


Comment 7 Dave Jones 2005-04-13 03:26:38 UTC
There are some fixes pending in the x86-64 tree that aren't merged in Linus' tree
yet. Due to the current situation with him not merging patches, this is in limbo
for the time being.


Comment 8 Michal Jaegermann 2005-04-13 17:04:00 UTC
In response to comment #5 - booting with 'numa=off' does not seem to make
much of a difference.  With a dual core CPU, leaving out 'noapic' makes the
machine lock up right away.  If 'noapic' is used, the presence of 'numa=off' is
inconsequential AFAICT.

If using single core CPUs it is possible to boot without any extra options, but
then networking is dead and only 'noapic' seems to help with this.

OTOH the whole mobo does not look very healthy, as even with single core CPUs
the second one is still missing although there are no visible oopses during
bootup.  Possibly this is because we are not trying to initialize the second CPU
at all, and the latter appears to be a problem either in hardware or in the BIOS.

Comment 9 Tom Duffy 2005-04-22 20:36:46 UTC
The fix for this went into 2.6.12-rc3.  I have verified that a 2.6.12-rc3 tree
will boot on dual-core x86_64.

Comment 10 Michal Jaegermann 2005-04-26 20:43:47 UTC
Created attachment 113690 [details]
oops caused by rpc.nfsd for 2.6.11-1.1267_FC4smp kernel

Indeed both 2.6.12-rc3 and 2.6.11-1.1267_FC4.x86_64, which is based
on that kernel, will boot on a dual core Opteron board; at least on
this sample I could try.  All CPUs show up in /proc/cpuinfo as
expected.  The catch is that one needs nash from mkinitrd-4.2.8-1
or mounts on initrd fail.  For corresponding UP kernels nash from
mkinitrd-4.1.18-2 (FC3) is good enough.

Still, booting 2.6.11-1.1267_FC4.x86_64 I got the attached oops.
The machine is still operational after that, but this happens consistently
on every "Starting NFS daemon:" and rpc.nfsd does not run.  The version
of nfs-utils does not seem to be relevant here.

Comment 11 Michal Jaegermann 2005-04-26 21:12:20 UTC
Smells like oops from comment #10 is related to bug #1559

Comment 12 Michal Jaegermann 2005-04-26 21:13:31 UTC
Ouch, let's try again:
Smells like oops from comment #10 is related to bug #155999

Comment 13 Dave Jones 2005-04-26 22:51:40 UTC
That'll be fixed in tomorrow's build.


Comment 14 Dave Jones 2005-07-15 20:30:16 UTC
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 15 Michal Jaegermann 2005-07-15 20:53:43 UTC
If nobody else is seeing the problem I think that this is resolved in
the current kernels.  If I hear no protests I will close this bug.