Bug 164941 (badswapentry)
Summary: | swap_free: bad swap file entry (x86-64) | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Konstantin Olchanski <olchansk> |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3 | CC: | mattdm, olivier.lelain, pfrields, rad, tmraz, vince.sweeney, wtogami |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
URL: | https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=161059 | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2008-02-12 06:01:55 UTC | Type: | --- |
Description
Konstantin Olchanski
2005-08-02 22:09:29 UTC
Can you make sure you're running the latest BIOS update for this box? There's an erratum on some AMD chips which could explain this issue, and it must be worked around in the BIOS.

This is an MSI MS-9161 mobo (K8D Master 3), BIOS rev "P9161KMS V2.00", with two Opteron 246 CPUs (stepping 10). The MSI download site for "K8D Master 3" does not show any BIOS updates (http://www.msi.com.tw/program/support/download/dld/spt_dld_detail.php?UID=610&kind=3). A slightly newer mobo (K8D Master 3 - 133) shows newer BIOSes up to version 2.50, but I would have to ask MSI if they are compatible before installing them on my machines. The release notes for the newer BIOSes are at http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=560&kind=3

I am curious how a BIOS update fixes a problem that did not exist until after the 2.6.11-1.27_FC3smp kernel and one that smells like kernel memory corruption.
K.O.

The CPU erratum I refer to concerns a problem in the TLB flush filter. 2.6.11+ kernels start using 4-level page tables, so we use completely different memory access patterns on x86-64. It is possible that this is a red herring, but in the absence of any better ideas right now, I'm clutching at straws.

I guess I closed my original bug report #161059 too early. Just got hit with this problem again for the first time since updating to 2.6.12-1.1372_FC3smp. I will update with more details once I get mobo / BIOS versions and find out whether applying any BIOS patches helps.

Davej refers to the Opteron CPU errata on the "tlb flush filter". Here it is:
1) This is errata 122, listed on page 74 in the AMD "revision guide rev 3.51" at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf
2) The suggested solution is "disable the tlb flush filter by setting a bit in an MSR". (Can Linux do this without BIOS help?)
3) The affected CPUs are listed on page 11: all but "cpu revision SH-B3".
4) (This is the confusing bit.) It looks like "cpu revision SH-B3" corresponds to "stepping: 1" in /proc/cpuinfo.

(Personal note: this makes sense. My only unaffected machine is an old Opteron-242 with CPU stepping 1. All the other machines are new Opteron-246es with CPU stepping 10, and all have shown the "ld.so usage dump", the "bad pmd" and the "bad swap file entry" problems.)

Errata 122 says "set bit 6 in MSR 0xC001_0015". How do I do this without obtaining unobtainable BIOS updates?
K.O.

I am also seeing a similar problem, and this bug may be related to the bug I just filed (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165511), as I am now seeing:

Machine check events logged
Machine check events logged
sh[22666]: segfault at 00000000005a8028 rip 000000317a3022a3 rsp 00007fffffa02860 error 5
swap_free: Bad swap file entry 3800000000000000
swap_free: Bad swap offset entry 34365f363878
Machine check events logged
Machine check events logged

My motherboard is an Abit SU-2S, which is very difficult to get BIOS support for because Abit sold their server line to another company. I also have stepping 10 Opteron 246s...

Appended is the program for setting bit 6 in MSR 0xC001_0015 per AMD errata 122. I applied this errata on one problem machine and there were no more "bad pmd" nor "ld.so usage dump" events for 24 hours. All are welcome to give it a try, but if your computers catch fire or turn into blue cheese, do not blame me. Note that you have to "mknod", edit, make and run the program for each one of your CPUs.
K.O.

    // file: errata122.c
    // This is a program to apply the AMD-suggested fix for Errata 122
    // of AMD Opteron processors. Tested on Fedora Core 3.
    // Usage:
    //   mknod msr0 c 202 0   <--- msr on cpu0
    //   mknod msr1 c 202 1   <--- msr on cpu1, etc
    //   (edit errata122.c, change "open" statement to say "msr0" or "msr1")
    //   make errata122
    //   ./errata122
    //   (if you get 0x000000000c000040, notice the bit 0x40, errata is already
    //   applied, stop here).
    //   (if bit 0x40 is not set, replace "#if 0" with "#if 1", recompile, rerun,
    //   repeat for each cpu, editing the "open" statement each time).
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>   // lseek, read, write, close

    int main(int argc, char *argv[])
    {
        unsigned char buf[8];   // unsigned, so bytes >= 0x80 print as two hex digits
        int fd = open("msr0", O_RDWR);
        if (fd < 0) { perror("open msr0"); return 1; }

        // The msr device takes the MSR number as the file offset.
        lseek(fd, 0xC0010015, SEEK_SET);
        if (read(fd, buf, 8) != 8) { perror("read"); return 1; }
        printf("0x%02x%02x%02x%02x%02x%02x%02x%02x\n",
               buf[7], buf[6], buf[5], buf[4],
               buf[3], buf[2], buf[1], buf[0]);
    #if 0
        buf[0] |= (1<<6);   // set bit 6 (0x40) per errata 122
        lseek(fd, 0xC0010015, SEEK_SET);
        if (write(fd, buf, 8) != 8) perror("write");
    #endif
        close(fd);
        return 0;
    }

(In reply to comment #7)
> Appended is the program for setting bit 6 in MSR 0xC001_0015 per AMD errata 122.

Failing to find a BIOS update for my system, I've used this bit of C code and run for a week under increased load. So far not a single oops / random process segfault or bad swap / bad pmd kernel message, so it looks to have fixed all my previous problems (so far). Just for the record:

vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 265
stepping : 2

The current update in updates-testing has a similar workaround. Please give it a try.

FYI, we have been running a couple of machines with my errata 122 fix and all weirdness has disappeared, so the fix seems to be working. I now see the new kernel in updates-testing; I will test it and report results. Any chance the errata 122 workaround will be submitted into the mainline kernel?
K.O.

Latest upstream has a variant of this workaround, as do all current Fedora errata kernels.

FWIW, latest FC3 kernel 2.6.12-2.3.legacy_FC3smp does not apply the suggested fix for AMD errata 122. Was this fix lost from latest FC4 and FC5 kernels, too?
K.O.

Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product.
If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!

Fedora Core 3 is not maintained anymore. Setting status to "INSUFFICIENT_DATA". If you can reproduce this bug in the current Fedora release, please reopen this bug and assign it to the corresponding Fedora version.
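For anyone adapting the errata122.c program posted in this report, the bit handling can be kept separate from the device I/O. What follows is only a minimal sketch, not the program from the thread and not what any Fedora kernel does: the helper names (msr_from_bytes, msr_to_bytes, msr_bit_is_set, msr_with_bit_set) and the two macros are hypothetical, and it assumes the 8 bytes come back from the msr device in little-endian order, as on x86-64. It shows the check for bit 6 (0x40) in MSR 0xC001_0015 and the value that would be written back once that bit is set.

```c
#include <assert.h>
#include <stdint.h>

/* MSR number and bit quoted in this report for errata 122. */
#define ERRATA122_MSR 0xC0010015u
#define ERRATA122_BIT 6u

/* Assemble the 8 bytes read from the msr device (assumed little-endian)
 * into one 64-bit value. Unsigned bytes avoid sign extension. */
uint64_t msr_from_bytes(const unsigned char buf[8])
{
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--)
        value = (value << 8) | buf[i];
    return value;
}

/* Split a 64-bit value back into the 8 bytes that would be written. */
void msr_to_bytes(uint64_t value, unsigned char buf[8])
{
    for (int i = 0; i < 8; i++)
        buf[i] = (unsigned char)(value >> (8 * i));
}

/* Return 1 if the given bit is set in the MSR value, 0 otherwise. */
int msr_bit_is_set(uint64_t value, unsigned bit)
{
    assert(bit < 64);
    return (int)((value >> bit) & 1u);
}

/* Return the MSR value with the given bit set; other bits are unchanged. */
uint64_t msr_with_bit_set(uint64_t value, unsigned bit)
{
    assert(bit < 64);
    return value | ((uint64_t)1 << bit);
}
```

With the value 0x000000000c000040 quoted in the program's comments, msr_bit_is_set(value, ERRATA122_BIT) returns 1, which corresponds to the "errata is already applied, stop here" case. Reading and writing the device itself still has to be done once per CPU, as described in the original usage notes.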