Bug 164941 (badswapentry) - swap_free: bad swap file entry (x86-64)
Summary: swap_free: bad swap file entry (x86-64)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: badswapentry
Product: Fedora
Classification: Fedora
Component: kernel
Version: 3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL: https://bugzilla.redhat.com/bugzilla/...
Whiteboard:
Depends On:
Blocks:
Reported: 2005-08-02 22:09 UTC by Konstantin Olchanski
Modified: 2015-01-04 22:21 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-02-12 06:01:55 UTC
Type: ---
Embargoed:



Description Konstantin Olchanski 2005-08-02 22:09:29 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.10) Gecko/20050719 Fedora/1.7.10-1.3.1

Description of problem:
Bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=161059 is still present in 2.6.12-1.1372_FC3smp. I had two panics while running tests for bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160341.

$ grep -i swap_free /var/log/messages
Aug  2 13:35:45 bench kernel: swap_free: Bad swap file entry 4000000000000000
(panic)
Aug  2 14:20:03 bench kernel: swap_free: Bad swap file entry e000007fffffe02b
Aug  2 14:20:03 bench kernel: swap_free: Bad swap file entry e800007fffffe02b
Aug  2 14:20:03 bench kernel: swap_free: Bad swap file entry f000007fffffe02b
Aug  2 14:20:03 bench kernel: swap_free: Bad swap file entry f800007fffffe02b
(panic)

K.O.


Version-Release number of selected component (if applicable):
kernel-smp-2.6.12-1.1372_FC3

How reproducible:
Always

Steps to Reproduce:
I am running a test perl script that feeds sensors data to ganglia. In a tight loop, I call "sensors" and "gmetric". 2.6.12-1.1372_FC3smp panics within 5 minutes.


Additional info:

Comment 1 Dave Jones 2005-08-03 22:53:08 UTC
can you make sure you're running the latest bios update for this box ?
There's an errata on some AMD chips which could explain this issue, which must
be worked around in the BIOS.


Comment 2 Konstantin Olchanski 2005-08-04 01:05:53 UTC
This is an MSI MS-9161 mobo (K8D Master 3) BIOS rev "P9161KMS V2.00" with two
Opteron 246 CPUs (stepping 10).

The MSI download site for "K8D Master 3" does not show any BIOS updates
(http://www.msi.com.tw/program/support/download/dld/spt_dld_detail.php?UID=610&kind=3).


A slightly newer mobo (K8D Master 3 - 133) shows newer BIOSes up to version
2.50, but I would have to ask MSI if they are compatible before installing it on
my machines. The release notes for the newer BIOSes are at
http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=560&kind=3

I am curious how a BIOS update fixes a problem that did not exist until after
the 2.6.11-1.27_FC3smp kernel and one that smells like kernel memory corruption.

K.O.


Comment 3 Dave Jones 2005-08-04 01:51:44 UTC
the cpu errata I refer to fixes a problem in the tlb flush filter. 2.6.11+
kernels start using 4 level page tables, so we use completely different memory
access patterns on x86-64.

It is possible that this is a red herring, but in the absence of any better ideas
right now, I'm clutching at straws.


Comment 4 Vincent Sweeney 2005-08-07 23:26:31 UTC
I guess I closed my original bug report #161059 too early. Just got hit with
this problem again for the first time since updating to 2.6.12-1.1372_FC3smp. 

I will update with more details once I get mobo / BIOS versions and if applying
any BIOS patches helps.

Comment 5 Konstantin Olchanski 2005-08-10 20:57:01 UTC
Davej refers to the Opteron CPU errata on the "tlb flush filter", here it is:

1) this is errata 122 listed on page 74 in the AMD "revision guide rev 3.51" at
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf
2) the suggested solution is "disable the tlb flush filter by setting a bit in
an MSR". (Can Linux do this without BIOS help?)
3) the affected CPUs are listed on page 11: all but "cpu revision SH-B3".
4) (this is the confusing bit) it looks like "cpu revision SH-B3" corresponds to
"stepping: 1" in /proc/cpuinfo.

(personal note: this makes sense: my only unaffected machine is an old
Opteron-242 with cpu stepping 1. All the other machines are new Opteron-246es
with cpu stepping 10 and all have shown the "ld.so usage dump", the "bad pmd"
and the "bad swap file entry" problems).

Errata 122 says "set bit 6 in MSR 0xC001_0015". How do I do this without
obtaining unobtainable BIOS updates?

K.O.


Comment 6 Brian Rademacher 2005-08-11 19:42:27 UTC
I am also seeing a similar problem, and this bug may be related to the bug I 
just filed (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165511), as 
I am now seeing:

Machine check events logged
Machine check events logged
sh[22666]: segfault at 00000000005a8028 rip 000000317a3022a3 rsp 
00007fffffa02860 error 5
swap_free: Bad swap file entry 3800000000000000
swap_free: Bad swap offset entry 34365f363878
Machine check events logged
Machine check events logged

My motherboard is an Abit SU-2S, which is very difficult to get BIOS support 
for because Abit sold their server line to another company.  I also have 
stepping 10 Opteron 246s...


Comment 7 Konstantin Olchanski 2005-08-12 05:15:52 UTC
Appended is the program for setting bit 6 in MSR 0xC001_0015 per AMD errata 122.
I applied this errata on one problem machine and there were no more "bad pmd"
nor "ld.so usage dump" events for 24 hours. All are welcome to give it a try,
but if your computers catch fire or turn into blue cheese, do not blame me. Note
that you have to "mknod", edit, make and run the program for each one of your
CPUs. K.O.

// file: errata122.c
// This is a program to apply the AMD-suggested fix for Errata 122
// of AMD Opteron processors. Tested on Fedora Core 3.
// Usage:
// mknod msr0 c 202 0   <--- msr on cpu0
// mknod msr1 c 202 1   <--- msr on cpu1, etc
// (edit errata122.c, change "open" statement to say "msr0" or "msr1")
// make errata122
// ./errata122
// (if you get 0x000000000c000040, notice the bit 0x40, errata is already
// applied, stop here).
// (if bit 0x40 is not set, replace "#if 0" with "#if 1", recompile, rerun,
// repeat for each cpu- edit the "open" statement).
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>   // lseek, read, write, close
int main(int argc,char*argv[])
{
  unsigned char buf[8];  // unsigned, so bytes >= 0x80 print as two hex digits
  int fd = open("msr0",O_RDWR);
  if (fd < 0) { perror("open msr0"); return 1; }
  // the file offset selects the MSR number; each access moves 8 bytes
  lseek(fd,0xC0010015,SEEK_SET);
  if (read(fd,buf,8) != 8) { perror("read msr"); return 1; }
  // print most significant byte first (x86 stores the value little-endian)
  printf("0x%02x%02x%02x%02x%02x%02x%02x%02x\n",
         buf[7],buf[6],buf[5],buf[4],
         buf[3],buf[2],buf[1],buf[0]);
#if 0
  buf[0] |= (1<<6);  // bit 6 of the MSR lives in the lowest byte
  lseek(fd,0xC0010015,SEEK_SET);
  if (write(fd,buf,8) != 8) { perror("write msr"); return 1; }
#endif
  close(fd);
  return 0;
}


Comment 8 Vincent Sweeney 2005-08-19 22:08:38 UTC
(In reply to comment #7)
> Appended is the program for setting bit 6 in MSR 0xC001_0015 per AMD errata 122.

Failing to find a BIOS update for my system I've used this bit of C code and run
for a week under increased load. So far not a single oops / random process
segfault or bad swap / bad pmd kernel message so it looks to have fixed all my
previous problems (so far).

Just for the record:

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 265
stepping        : 2



Comment 9 Dave Jones 2005-08-26 07:25:37 UTC
The current update in updates-testing has a similar workaround. Please give it a
try.


Comment 10 Konstantin Olchanski 2005-08-28 00:26:55 UTC
FYI, we have been running a couple of machines with my errata 122 fix and all
weirdness has disappeared, so the fix seems to be working. I now see the new
kernel in updates-testing, I will test it and report results. Any chance the
errata 122 workaround will be submitted into the mainline kernel?
K.O.


Comment 11 Dave Jones 2005-10-25 07:57:15 UTC
latest upstream has a variant of this workaround, as do all current Fedora
errata kernels.


Comment 12 Konstantin Olchanski 2006-06-20 22:18:49 UTC
FWIW, latest FC3 kernel 2.6.12-2.3.legacy_FC3smp does not apply the suggested
fix for AMD errata 122. Was this fix lost from latest FC4 and FC5 kernels, too?
K.O.


Comment 13 Matthew Miller 2006-07-10 21:41:26 UTC
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!


Comment 14 petrosyan 2008-02-12 06:01:55 UTC
Fedora Core 3 is not maintained anymore.

Setting status to "INSUFFICIENT_DATA". If you can reproduce this bug in the
current Fedora release, please reopen this bug and assign it to the
corresponding Fedora version.

