Bug 166437
Summary: | Frequent SMP kernel lockups with Athlon X2 | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | josip |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Priority: | medium |
Version: | 4 | CC: | bstretch, ericdavidbair, pfrields, redhat, wtogami |
Hardware: | x86_64 | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2006-05-05 01:28:14 UTC |
Description
josip
2005-08-21 09:52:51 UTC
There is one potential fix for powernow-k8 in the latest errata kernel currently in updates-testing. Give that a try.

I'll try, but my current SMP kernel (kernel-smp-2.6.12-1.1398_FC4) locks up even after disabling the cpuspeed service and rebooting. Lockups seem related to system load, but the kernel doesn't provide any information.

FWIW, I went straight to kernel-smp-2.6.12-1.1435_FC4 in updates-testing when I swapped in my new ASUS A8N-SLI Premium (BIOS 1005) and X2 3800+ CPU yesterday and it's stable, but power management doesn't work at all:

    powernow-k8: MP systems not supported by PSB BIOS structure
    powernow-k8: MP systems not supported by PSB BIOS structure

Interestingly, my APC UPS load meter says that system power consumption while idle isn't much higher than my previous Athlon 64 3000+ with working power management. The only problem is that Sun Java 1.5.0_04 crashes, which it didn't do on the 1398 uniprocessor kernel. That could be a Sun bug though. Everything else seems to work.

Solution found! The issue on my system is a BIOS setting which moves the memory block usually obscured by PCI above the 4GB mark, assuming that the OS supports PAE. The BIOS does this in SW, and on E0+ revision CPUs, also in HW. The benefit is that all 4GB can be used despite the PCI hole. My Linux kernel seems to be OK with this in uniprocessor mode but not in SMP mode. There may be a memory setup bug in the SMP kernel. BTW, is CONFIG_HIGHMEM64G gone in x86-64 kernels?

More info: The Stream benchmark uses a large static array. Arrays of 114 MB and smaller don't lock up my SMP kernel. Arrays of 229 MB and larger do. Turning off memory remapping in BIOS restores proper operation. This BIOS change can only be a temporary fix, because instead of 4GB, the OS now gets only 3GB. Linux SMP memory handling needs to be fixed to allow BIOS memory remapping so that the full 4GB is usable.

Interesting. I only have 1GB RAM and I bet you have more?
I remember seeing that remapping BIOS setting and I'm pretty sure it defaulted to On. I just installed kernel 1447. Power management still doesn't work but Java doesn't blow up anymore.

Kernel 1447 doesn't help -- the SMP kernel still locks up when running a 229MB stream benchmark with memory remapping enabled in BIOS (to see all 4GB of RAM). With memory remapping disabled in BIOS, the SMP kernel doesn't lock up, but only 3GB (out of installed 4GB) are visible. This affects only the SMP kernel. The uniprocessor kernel runs fine with memory remapping enabled in BIOS and uses all 4GB.

This bug affects only Fedora kernels. Freshly compiled kernel 2.6.13.1 from kernel.org runs fine in SMP mode and sees all 4GB of RAM with memory remapping enabled in BIOS. Could this be some kind of Fedora-specific SMP kernel configuration bug?

> Freshly compiled kernel 2.6.13.1 from kernel.org
What if you build it using Fedora's /boot/config-VERSION?
Kernel 2.6.13.1 from kernel.org built with Fedora's /boot/config-2.6.12-1.1456_FC4smp runs OK, but of course "make oldconfig" had to add a number of new configuration options. I took the default on all of them, including memory related ones (*_DISCONTIGMEM_*). Hypothesis: Fedora's 2.6.12-1.1456_FC4smp kernel had a memory handling bug and should be updated (e.g. to 2.6.13.2 base from kernel.org), but the configuration appears OK.

You're in luck. 2.6.13-1.1526_FC4 just got pushed out, which is based on 2.6.13.2. Please give it a try, and let me know if that works.

Will do. Meanwhile, kernel 2.6.14-rc2 from kernel.org *again* has the lockup problem. This may be an IOMMU issue. With 2.6.13.1, I see the following (note that this system doesn't have AGP, it's PCI-Express only, and there are no IOMMU options in its BIOS):

    Linux version 2.6.13.1_FC4smp
    [...]
    BIOS-provided physical RAM map:
     BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
     BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
     BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
     BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
     BIOS-e820: 00000000bfff0000 - 00000000bfff3000 (ACPI NVS)
     BIOS-e820: 00000000bfff3000 - 00000000c0000000 (ACPI data)
     BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
     BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
     BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
    [...]
    Checking aperture...
    CPU 0: aperture @ 1a60000000 size 32 MB
    Aperture from northbridge cpu 0 too small (32 MB)
    No AGP bridge found
    Your BIOS doesn't leave a aperture memory hole
    Please enable the IOMMU option in the BIOS setup
    This costs you 64 MB of RAM
    Mapping aperture over 65536 KB of RAM @ 8000000
    [...]
    PCI-DMA: Disabling AGP.
    PCI-DMA: aperture base @ 8000000 size 65536 KB
    PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

...and the above works fine. Switching to kernel 2.6.14-rc2, I get a total lockup about 30 seconds after boot when PCI memory remapping is enabled in BIOS. This kernel boots only if BIOS remapping is disabled, but then I see only 3GB (not 4GB) of RAM. The IOMMU related messages become:

    Linux version 2.6.14-rc2_smp
    [...]
    PCI-DMA: Disabling IOMMU.

Note that CONFIG_GART_IOMMU=y was used in both kernels (the second kernel had fewer configuration options turned on). ASUS BIOS 1007 doesn't have any IOMMU options, but the first example used PCI memory remapping, while the second one crashed unless remapping was off. I believe that AMD's IOMMU is needed to see the full 4GB, so the fact that it is disabled in the second example is bad news.

2.6.14rc isn't exactly in fantastic shape right now, so it doesn't surprise me you hit problems with it. For the sake of this bug though, let's focus on the Fedora kernel alone. BTW, it'd be great if you could report that bug upstream (http://bugme.osdl.org). Let me know how things go with the errata kernel.

The latest kernel-smp-2.6.13-1.1526_FC4 can run 2GB+ stream tests and can see 4GB of RAM when hardware memory remapping is enabled in BIOS. I suggest closing this bug for now.

This bug is back: the latest kernel-smp-2.6.14-1.1637_FC4 crashes on boot after failing to find the IOMMU. FYI, both kernel-smp-2.6.13-1.1532_FC4 and kernel-smp-2.6.14-1.1637_FC4 worked fine. This was broken in 2.6.14 kernels... Given that dual core Athlons and ASUS motherboards are now popular, and that many users can afford 4GB of RAM, this "crash-on-boot" phenomenon is a critical high-priority bug.

One correction: The "FYI" line in my previous message needs to be fixed. Only the 2.6.13-based kernels (e.g. kernel-smp-2.6.13-1.1532_FC4) worked fine. All 2.6.14-based kernels (e.g. kernel-smp-2.6.14-1.1637_FC4) are broken and fail to boot.

Hint: http://lkml.org/lkml/2005/11/6/54 suggests booting with "iommu=soft swiotlb=65536" (which I haven't tried yet). Also, bug #169115 may be a duplicate of this bug.
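For readers checking the numbers, the BIOS-e820 map quoted from the 2.6.13.1 boot log can be summed to see how much RAM is usable and how much of it sits above the 4GB mark. The following is only a rough sketch: the function names and the regular expression are illustrative assumptions, not part of any kernel tool, and the log text is copied from the comment above.

```python
import re

# e820 lines copied from the 2.6.13.1 boot log quoted in this report.
E820_LOG = """
BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
BIOS-e820: 00000000bfff0000 - 00000000bfff3000 (ACPI NVS)
BIOS-e820: 00000000bfff3000 - 00000000c0000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
"""

# Hypothetical pattern for the "start - end (type)" layout shown above.
E820_LINE = re.compile(
    r"BIOS-e820:\s*([0-9a-fA-F]+)\s*-\s*([0-9a-fA-F]+)\s*\((.+?)\)"
)


def parse_e820(log):
    """Return a list of (start, end, kind) tuples, with end exclusive."""
    return [(int(s, 16), int(e, 16), kind) for s, e, kind in E820_LINE.findall(log)]


def usable_bytes(regions, floor=0):
    """Sum the usable bytes that lie at or above the address `floor`."""
    total = 0
    for start, end, kind in regions:
        if kind == "usable":
            total += max(0, end - max(start, floor))
    return total


regions = parse_e820(E820_LOG)
four_gb = 1 << 32
print("usable total (MiB):   ", usable_bytes(regions) / 2**20)
print("usable above 4GB (MiB):", usable_bytes(regions, four_gb) / 2**20)
```

With remapping enabled, the last usable region (0x100000000 to 0x140000000) is the 1GB that the BIOS relocates above the PCI hole, which is consistent with only 3GB being visible when remapping is turned off.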
While "iommu=soft swiotlb=65536" works, it may be better to use "pci=nommconf" as recommended by Andi Kleen at http://bugzilla.kernel.org/show_bug.cgi?id=5343 -- the problem he reports is that the MCFG table provided by ACPI BIOS is broken and needs a fix. He is developing a workaround. The MCFG table describes the memory mapped PCI configuration space, which is required for the MMCONF form of access to devices on the PCI-Express bus (otherwise, one must address them through BIOS or directly). Anyway, using the boot line option "pci=nommconf" works for me, and I'll use this until Andi's workaround arrives.

This is a mass-update to all currently open kernel bugs. A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you.

Closing per previous comment.

I believe I'm having this same issue using kernel version kernel-2.6.17-1.2139_FC5. I have an Athlon X2 processor and an ASUS A8N-SLI motherboard, and I'm having the same problems described above. I found this blog entry when I googled the problem: http://www.linuxcompatible.org/4GB_with_Asus_A8N-SLI_and_A8N-SLI_SE_t33994.html This suggests to me that I'm not the only one who has had this problem since the bug was closed. I suggest reopening the bug.

P.S. In spite of what the blog entry above suggests, I continued to have kernel panics even when only software remapping was enabled. Incidentally, I tried booting using both the kernel parameters "iommu=soft swiotlb=65536" and "pci=nommconf," and neither workaround proved to be satisfactory. (My system didn't lock up, but it was clearly unstable. I kept having applications crash for no apparent reason.) The only way I can get my system to boot and operate normally is to disable both software and hardware remapping in the BIOS (and thereby lose access to 1 GB of memory).

Are there any plans to reopen this bug? If not, should I file a new bug report? I hate to knowingly file a duplicate report, but I don't know what else to do. As I noted above, I'm still having this problem using the latest release of the kernel, and neither of the workarounds suggested above resolved the problem.

Just thought I'd update. Still having problems with this on FC5 (2.6.17-1.2174). Tried stock kernel (2.6.17.13), same problem. Have tried passing (grub) options:

    kernel /vmlinuz-2.6.17.13 ro root=/dev/VolGroup00/LogVol00 rhgb quiet acpi=no swiotlb=65536 pci=nommconf

I have also tried disabling apic support. No luck. This is an ABIT AV8 mb, VIA K8T800 Pro/VT8237, Athlon X2 4400+, 4GB DDR400 (running at 333). Frequency is every 1-2 days, typically under high load. I had the same experience of kernel-smp-2.6.13-1.1532_FC4 running fine.
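Since several comments depend on which boot options were actually in effect, a small check of the kernel command line may help when comparing reports. This is a minimal sketch, not an official tool: the helper names and the WORKAROUNDS table are assumptions made for illustration, and the sample string is the GRUB argument list quoted in the last comment. On a running system the same text can be read from /proc/cmdline.

```python
# Workarounds mentioned in this report; the table itself is illustrative.
WORKAROUNDS = {"iommu": "soft", "swiotlb": "65536", "pci": "nommconf"}

# Arguments from the GRUB "kernel" line quoted in the last comment.
GRUB_ARGS = (
    "ro root=/dev/VolGroup00/LogVol00 rhgb quiet "
    "acpi=no swiotlb=65536 pci=nommconf"
)


def parse_cmdline(cmdline):
    """Split a command line into {name: value}; bare flags map to None.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workarounds(cmdline):
    """List which of the workarounds from this report appear in cmdline."""
    params = parse_cmdline(cmdline)
    found = []
    for name, wanted in WORKAROUNDS.items():
        # Some options take comma-separated lists, e.g. pci=a,b.
        values = (params.get(name) or "").split(",")
        if wanted in values:
            found.append(f"{name}={wanted}")
    return sorted(found)


def read_live_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


print(active_workarounds(GRUB_ARGS))
```

Calling `active_workarounds(read_live_cmdline())` on the affected machine would confirm whether the options typed into GRUB were really passed to the booted kernel.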