Bug 166437

Summary: Frequent SMP kernel lockups with Athlon X2
Product: [Fedora] Fedora
Component: kernel
Version: 4
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Reporter: josip
Assignee: Dave Jones <davej>
QA Contact: Brian Brock <bbrock>
CC: bstretch, ericdavidbair, pfrields, redhat, wtogami
Doc Type: Bug Fix
Last Closed: 2006-05-05 01:28:14 UTC

Description josip 2005-08-21 09:52:51 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc4 Firefox/1.0.6

Description of problem:
New Athlon X2 system runs stably under uniprocessor kernel but experiences random SMP kernel failures (lockups) after roughly 5-10 minutes of activity.  Reset switch is the only remedy.

Related bugs suspect powernow-k8, asus-acpi, etc. but I don't see a pattern yet.

This ASUS A8N-SLI system uses the nVidia nForce 4 SLI chipset and a 7800GTX PCI-e video card.  Whether SMP or not, the kernel complains that the aperture is too small (it suggests enabling the 64MB IOMMU in BIOS) and reports a BIOS error (no PSB nor ACPI _PSS objects).

Version-Release number of selected component (if applicable):
kernel-smp-2.6.12-1.1398_FC4

How reproducible:
Didn't try


Additional info:

Comment 1 Dave Jones 2005-08-26 23:18:22 UTC
There is one potential fix for powernow-k8 in the latest errata kernel currently
in updates-testing. Give that a try.


Comment 2 josip 2005-08-27 13:41:19 UTC
I'll try, but my current SMP kernel (kernel-smp-2.6.12-1.1398_FC4) locks up even
after disabling the cpuspeed service and rebooting.  Lockups seem related to
system load, but the kernel doesn't provide any information.

Comment 3 Brian Stretch 2005-08-27 15:12:50 UTC
FWIW, I went straight to kernel-smp-2.6.12-1.1435_FC4 in updates-testing when I
swapped in my new ASUS A8N-SLI Premium (BIOS 1005) and X2 3800+ CPU yesterday
and it's stable, but power management doesn't work at all:

powernow-k8: MP systems not supported by PSB BIOS structure
powernow-k8: MP systems not supported by PSB BIOS structure

Interestingly, my APC UPS load meter says that system power consumption while
idle isn't much higher than my previous Athlon 64 3000+ with working power
management.

The only problem is that Sun Java 1.5.0_04 crashes, which it didn't do on the
1398 uniprocessor kernel.  That could be a Sun bug though.  Everything else
seems to work. 

Comment 4 josip 2005-08-27 15:45:02 UTC
Solution found!

The issue on my system is a BIOS setting which moves the memory block usually
obscured by PCI above the 4GB mark, assuming that the OS supports PAE.  The BIOS
does this in SW, and on E0+ revision CPUs, also in HW.  The benefit is that all
4GB can be used despite the PCI hole.

My Linux kernel seems to be OK with this in uniprocessor mode but not in SMP
mode.  There may be a memory setup bug in the SMP kernel.  BTW, is
CONFIG_HIGHMEM64G gone in x86-64 kernels?

More info:
The Stream benchmark uses a large static array.  Arrays of 114 MB and smaller
don't lock up my SMP kernel.  Arrays of 229 MB and larger do.  Turning off memory
remapping in BIOS restores proper operation.
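
The 114 MB and 229 MB figures are consistent with STREAM's three arrays of 8-byte doubles at 5 million and 10 million elements each. That mapping is an inference from the numbers, not something stated in this report, and the helper name below is made up for illustration. A minimal sketch of the arithmetic:

```python
# Rough memory footprint of STREAM's static arrays.
# Assumption: three arrays (a, b, c) of 8-byte doubles, as in the stock stream.c.

BYTES_PER_DOUBLE = 8
NUM_ARRAYS = 3


def stream_footprint_mib(n_elements: int) -> float:
    """Return the combined size of the three STREAM arrays in MiB."""
    return NUM_ARRAYS * BYTES_PER_DOUBLE * n_elements / 2**20


if __name__ == "__main__":
    # Element counts are guesses chosen to match the sizes quoted above.
    for n in (5_000_000, 10_000_000):
        print(f"N={n:>10,}: {stream_footprint_mib(n):6.1f} MiB")
```

Under that assumption, the working set that triggers the lockup lies somewhere between roughly 114 MiB and 229 MiB.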

Comment 5 josip 2005-08-27 15:53:16 UTC
This BIOS change can only be a temporary fix, because instead of 4GB, the OS
now gets only 3GB.  Linux SMP memory handling needs to be fixed to allow BIOS
memory remapping so that full 4GB is usable.

Comment 6 Brian Stretch 2005-08-27 17:52:36 UTC
Interesting.  I only have 1GB RAM and I bet you have more?  I remember seeing
that remapping BIOS setting and I'm pretty sure it defaulted to On.  

Comment 7 Brian Stretch 2005-08-29 22:22:25 UTC
I just installed kernel 1447.  Power management still doesn't work but Java
doesn't blow up anymore. 

Comment 8 josip 2005-09-03 15:06:48 UTC
Kernel 1447 doesn't help -- the SMP kernel still locks up when running a 229MB
stream benchmark with memory remapping enabled in BIOS (to see all 4GB of RAM).
With memory remapping disabled in BIOS, the SMP kernel doesn't lock up, but
only 3GB (out of installed 4GB) are visible.

This affects only the SMP kernel.  The uniprocessor kernel runs fine with memory
remapping enabled in BIOS and uses all 4GB.
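
One way to confirm how much RAM the running kernel actually sees is to read the MemTotal line from /proc/meminfo. The sketch below is illustrative only: the function names are hypothetical, and MemTotal is normally a little below the installed amount because the kernel reserves some memory for itself.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo-style text and convert kB to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / 2**20


def read_mem_total_gib(path: str = "/proc/meminfo") -> float:
    """Convenience wrapper that reads the live value on a Linux system."""
    with open(path) as f:
        return mem_total_gib(f.read())
```

Calling `read_mem_total_gib()` with remapping disabled should report a value near 3 GiB on a system like the one described here, and a value near 4 GiB when all installed memory is visible.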

Comment 9 josip 2005-09-20 04:23:23 UTC
This bug affects only Fedora kernels.  Freshly compiled kernel 2.6.13.1 from
kernel.org runs fine in SMP mode and sees all 4GB of RAM with memory remapping
enabled in BIOS.

Could this be some kind of Fedora-specific SMP kernel configuration bug?

Comment 10 Warren Togami 2005-09-20 04:28:27 UTC
> Freshly compiled kernel 2.6.13.1 from kernel.org

What if you build it using Fedora's /boot/config-VERSION?


Comment 11 josip 2005-09-30 05:01:45 UTC
Kernel 2.6.13.1 from kernel.org built with Fedora's
/boot/config-2.6.12-1.1456_FC4smp runs OK, but of course "make oldconfig" had to
add a number of new configuration options.  I took the default on all of them,
including memory related ones (*_DISCONTIGMEM_*).

Hypothesis: Fedora's 2.6.12-1.1456_FC4smp kernel had a memory handling bug and
should be updated (e.g. to 2.6.13.2 base from kernel.org), but the configuration
appears OK.

Comment 12 Dave Jones 2005-09-30 05:08:24 UTC
You're in luck. 2.6.13-1.1526_FC4 just got pushed out, which is based on 2.6.13.2.
Please give it a try, and let me know if that works.


Comment 13 josip 2005-09-30 08:15:34 UTC
Will do.  Meanwhile, kernel 2.6.14-rc2 from kernel.org *again* has the lockup
problem.  This may be an IOMMU issue.  With 2.6.13.1, I see (note that this
system doesn't have AGP, it's PCI-Express only, and there are no IOMMU options
in its BIOS):

 Linux version 2.6.13.1_FC4smp [...]
 BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
  BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
  BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
  BIOS-e820: 00000000bfff0000 - 00000000bfff3000 (ACPI NVS)
  BIOS-e820: 00000000bfff3000 - 00000000c0000000 (ACPI data)
  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
  BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
[...]
 Checking aperture...
 CPU 0: aperture @ 1a60000000 size 32 MB
 Aperture from northbridge cpu 0 too small (32 MB)
 No AGP bridge found
 Your BIOS doesn't leave a aperture memory hole
 Please enable the IOMMU option in the BIOS setup
 This costs you 64 MB of RAM
 Mapping aperture over 65536 KB of RAM @ 8000000
[...]
 PCI-DMA: Disabling AGP.
 PCI-DMA: aperture base @ 8000000 size 65536 KB
 PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

...and the above works fine.  Switching to kernel 2.6.14-rc2, I get a total
lockup about 30 seconds after boot when PCI memory remapping is enabled in BIOS.
This kernel boots only if BIOS remapping is disabled, but then I see only 3GB
(not 4GB) of RAM.  The IOMMU related messages become:

 Linux version 2.6.14-rc2_smp
[...]
 PCI-DMA: Disabling IOMMU.

Note that CONFIG_GART_IOMMU=y was used in both kernels (the second kernel had
fewer configuration options turned on).

ASUS BIOS 1007 doesn't have any IOMMU options, but the first example used PCI
memory remapping, while the second one crashed unless remapping was off.  I
believe that AMD's IOMMU is needed to see full 4GB, so the fact that it is
disabled in the second example is bad news.  
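
The e820 map quoted in this comment can be summed to see how much RAM the BIOS reports as usable and how much of it sits above the 4 GiB boundary. The following is a rough sketch, assuming the old log format in which the second address of each range is exclusive; the function names are made up for illustration.

```python
import re

# e820 lines copied from the 2.6.13.1 boot log quoted above.
E820_LOG = """\
BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
BIOS-e820: 00000000bfff0000 - 00000000bfff3000 (ACPI NVS)
BIOS-e820: 00000000bfff3000 - 00000000c0000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
"""

LINE_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+?)\)")
FOUR_GIB = 1 << 32


def parse_e820(log):
    """Return a list of (start, end, kind) tuples; end is treated as exclusive."""
    return [(int(s, 16), int(e, 16), kind) for s, e, kind in LINE_RE.findall(log)]


def usable_bytes(regions, above=0):
    """Sum the usable bytes located at or above the given address."""
    return sum(
        max(0, end - max(start, above))
        for start, end, kind in regions
        if kind == "usable"
    )


regions = parse_e820(E820_LOG)
total = usable_bytes(regions)
high = usable_bytes(regions, above=FOUR_GIB)
print(f"usable total:       {total / 2**20:.1f} MiB")
print(f"usable above 4 GiB: {high / 2**20:.1f} MiB")
```

For this map the total comes to about 4095.6 MiB, of which 1024 MiB lies above 4 GiB. That matches the 1 GB that disappears when remapping is turned off.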


Comment 14 Dave Jones 2005-09-30 09:09:45 UTC
2.6.14rc isn't exactly in fantastic shape right now, so it doesn't surprise me
you hit problems with it.  For the sake of this bug though, let's focus on the
Fedora kernel alone.

btw, it'd be great if you could report that bug upstream to http://bugme.osdl.org

let me know how things go with the errata kernel.


Comment 15 josip 2005-09-30 17:12:19 UTC
The latest kernel-smp-2.6.13-1.1526_FC4 can run 2GB+ stream tests and can see
4GB of RAM when hardware memory remapping is enabled in BIOS.

I suggest closing this bug for now.

Comment 16 josip 2005-11-12 14:56:45 UTC
This bug is back: the latest kernel-smp-2.6.14-1.1637_FC4 crashes on boot after
failing to find the IOMMU.

FYI, both kernel-smp-2.6.13-1.1532_FC4 and kernel-smp-2.6.14-1.1637_FC4 worked
fine.  This was broken in 2.6.14 kernels...

Given that dual core Athlons and ASUS motherboards are now popular, and that
many users can afford 4GB of RAM, this "crash-on-boot" phenomenon is a critical
high-priority bug.


Comment 17 josip 2005-11-12 14:59:58 UTC
One correction: The "FYI" line in my previous message needs to be fixed.

Only the 2.6.13-based kernels (e.g. kernel-smp-2.6.13-1.1532_FC4) worked fine. 
All 2.6.14-based kernels (e.g. kernel-smp-2.6.14-1.1637_FC4) are broken and fail
to boot.

Comment 18 josip 2005-11-29 20:02:05 UTC
Hint: http://lkml.org/lkml/2005/11/6/54 suggests booting with "iommu=soft
swiotlb=65536" (which I haven't tried yet).  Also, bug #169115 may be a
duplicate of this bug.

Comment 19 josip 2005-11-30 04:24:08 UTC
While "iommu=soft swiotlb=65536" works, it may be better to use "pci=nommconf"
as recommended by Andi Kleen at http://bugzilla.kernel.org/show_bug.cgi?id=5343
-- the problem he reports is that the MCFG table provided by ACPI BIOS is broken
and needs a fix.  He is developing a workaround.

The MCFG table describes the memory mapped PCI configuration space, which is
required for the MMCONF form of access to devices on the PCI-Express bus
(otherwise, one must address them through BIOS or directly).
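
For background, memory-mapped configuration access in PCI Express gives each function a 4 KiB window at a fixed offset from the base address listed in the MCFG table. The sketch below shows that address calculation. The base value used in the example is an assumption: it is simply the start of the 256 MiB reserved range in the e820 map from comment 13, which happens to be the right size for 256 buses, and nothing in this report confirms it is the MCFG base on this board.

```python
def ecam_address(base: int, bus: int, device: int, function: int, offset: int = 0) -> int:
    """Return the physical address of a config-space register under MMCONFIG.

    Layout: 4 KiB per function, 8 functions per device, 32 devices per bus,
    so each bus occupies 1 MiB and 256 buses occupy 256 MiB.
    """
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= offset < 4096:
        raise ValueError("offset must fit in the 4 KiB per-function window")
    return base + (bus << 20) + (device << 15) + (function << 12) + offset


# Hypothetical base address, chosen only for illustration (see note above).
ASSUMED_MCFG_BASE = 0xE0000000

if __name__ == "__main__":
    addr = ecam_address(ASSUMED_MCFG_BASE, bus=1, device=0, function=0)
    print(hex(addr))
```

If the base in the BIOS table is wrong, every address computed this way points at the wrong place, which would explain why avoiding MMCONF access sidesteps the problem.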

Anyway, using the boot line option "pci=nommconf" works for me, and I'll use
this until Andi's workaround arrives.
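
To check whether a booted system is actually running with one of the workarounds mentioned in this bug, the kernel command line can be inspected. This is a simplified sketch with hypothetical function names. It does not handle quoted values, and a repeated parameter overwrites the earlier one.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workarounds(cmdline: str) -> list:
    """List which of the workarounds discussed in this bug are present."""
    params = parse_cmdline(cmdline)
    found = []
    # pci= and iommu= take comma-separated option lists.
    if params.get("pci") is not None and "nommconf" in params["pci"].split(","):
        found.append("pci=nommconf")
    if params.get("iommu") is not None and "soft" in params["iommu"].split(","):
        found.append("iommu=soft")
    if "swiotlb" in params:
        found.append(f"swiotlb={params['swiotlb']}")
    return found


if __name__ == "__main__":
    example = "ro root=/dev/VolGroup00/LogVol00 rhgb quiet pci=nommconf"
    print(active_workarounds(example))
```

On a live system the input would come from reading /proc/cmdline instead of the example string.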

Comment 20 Dave Jones 2006-02-03 05:34:28 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 21 John Thacker 2006-05-05 01:28:14 UTC
Closing per previous comment.

Comment 22 Eric Bair 2006-06-29 10:13:02 UTC
I believe I'm having this same issue using kernel version
kernel-2.6.17-1.2139_FC5.  I have an Athlon X2 processor and an ASUS A8N-SLI
motherboard, and I'm having the same problems described above.  I found this
blog entry when I googled the problem:

http://www.linuxcompatible.org/4GB_with_Asus_A8N-SLI_and_A8N-SLI_SE_t33994.html

This suggests to me that I'm not the only one who has had this problem since the
bug was closed.  I suggest reopening the bug.

P.S. In spite of what the blog entry above suggests, I continued to have kernel
panics even when only software remapping was enabled.

Comment 23 Eric Bair 2006-06-29 10:50:03 UTC
Incidentally, I tried booting using both the kernel parameters "iommu=soft
swiotlb=65536" and "pci=nommconf", and neither workaround proved to be
satisfactory.  (My system didn't lock up, but it was clearly unstable.  I kept
having applications crash for no apparent reason.)  The only way I can get my
system to boot and operate normally is to disable both software and hardware
remapping in the BIOS (and thereby lose access to 1 GB of memory).

Comment 24 Eric Bair 2006-07-11 22:33:07 UTC
Are there any plans to reopen this bug?  If not, should I file a new bug report?
I hate to knowingly file a duplicate report, but I don't know what else to do.
As I noted above, I'm still having this problem using the latest release of the
kernel, and neither of the workarounds suggested above resolved the problem.

Comment 25 Matt Olson 2006-09-19 22:35:14 UTC
Just thought I'd update.  Still having problems with this on FC5
(2.6.17-1.2174).  Tried stock kernel (2.6.17.13), same problem.

Have tried passing (grub) options:

kernel /vmlinuz-2.6.17.13 ro root=/dev/VolGroup00/LogVol00 rhgb quiet acpi=no swiotlb=65536 pci=nommconf

I have also tried disabling apic support.  No luck.

This is an ABIT AV8 mb, VIA K8T800 Pro/VT8237, Athlon X2 4400+, 4GB
DDR400 (running at 333).  The problem occurs every 1-2 days, typically under high load.

I had the same experience of kernel-smp-2.6.13-1.1532_FC4 running fine.