Bug 481506 - SATA-attached drives give spurious errors (Nvidia nForce 630a chipset)
Summary: SATA-attached drives give spurious errors (Nvidia nForce 630a chipset)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 10
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-01-25 19:49 UTC by David Tonhofer
Modified: 2009-01-31 22:08 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-31 22:08:48 UTC
Type: ---
Embargoed:



Description David Tonhofer 2009-01-25 19:49:34 UTC
Preamble:

I have looked for similar bugs in the "kernel" component but have not found anything along those lines. As it seems unlikely that no one would have noticed this yet, it may well be just a hardware problem on my machine. Anyway, here goes:

Description of problem:

An "Asus M2N-CM/DVI nF630a" motherboard (http://www.asus.com/products.aspx?modelmenu=1&model=2078&l1=3&l2=149&l3=642&l4=0) 
with an nForce 630a chipset gives problems with the SATA subsystem (bad reads and probably bad writes). This machine is not overclocked. The memory, though not ECC, passes memtest.

More details:

I experienced major problems installing Fedora 10 from a SATA DVD drive on this machine - even though the Live CD system could be started, installation generally failed with bizarre problems or the installed system was unusable afterwards.

To check this, I did a test consisting of booting to the Live CD OS on the suspect machine, dumping the Live CD to a file, and then comparing this dump against another dump of the same CD taken on an old, well-oiled Fedora 8 machine.
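
A minimal sketch of that dump-and-compare test (the drive name /dev/sr0 and the 2048-byte block size are assumptions; adjust to the actual device):

----------------------------------------
# Dump the inserted Live CD to a file and hash it; the same commands are
# run on the reference Fedora 8 machine and the hashes compared.
dd if=/dev/sr0 of=/tmp/livecd.iso bs=2048
md5sum /tmp/livecd.iso

# Two sequential dumps on the suspect machine can also be compared
# byte by byte to see which blocks differ between reads:
dd if=/dev/sr0 of=/tmp/livecd_2.iso bs=2048
cmp -l /tmp/livecd.iso /tmp/livecd_2.iso | head
----------------------------------------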

This revealed that on the new machine, I got sparse, random block corruption when reading from CD. The set of affected blocks changed between two sequential dumps.

However, I could install Fedora 10 from the same Live CD without any problem using a vintage CD drive attached to the PATA bus of the new system (also available on this motherboard). All seemed well...

Chapter 2:

So I thought I had a problem with the DVD drive. I replaced the DVD drive with a new one. This didn't help.

In addition, the SATA hard disks on this system (2 hard disks in a RAID 1 software mirror config) _also_ yield errors of a spurious nature. A check over all files' md5sums with AIDE revealed varying errors (plus I/O errors on reading the ACLs of some randomly scattered files). rpm --verify shows varying changes in the md5sums, with the errors changing between runs. Bletch.
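
For reference, that kind of verification pass can be approximated as follows (filtering on the third column of rpm's verify output, which flags MD5 digest mismatches, is just one way to isolate the relevant lines):

----------------------------------------
# Verify all installed packages and keep only lines whose MD5 digest
# differs from the package database; running this twice and diffing the
# results shows whether the mismatches are stable or change between runs.
rpm -Va 2>/dev/null | grep '^..5' > /tmp/verify1
rpm -Va 2>/dev/null | grep '^..5' > /tmp/verify2
diff /tmp/verify1 /tmp/verify2
----------------------------------------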

Worse, in one instance, Python yielded segmentation faults, making the system unusable - the next boot solved this. At some point the ext3 filesystem became corrupted and an fsck repair was called for.

Interestingly, the RAID 1 software mirror seems to hold. The SMART disk checks yield no errors.
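
For completeness, the SMART checks can be reproduced roughly like this (smartmontools assumed installed; /dev/sda and /dev/sdb assumed to be the two mirror members):

----------------------------------------
# Health verdict, attribute table and error log for both mirror members
smartctl -H -A -l error /dev/sda
smartctl -H -A -l error /dev/sdb
# Optionally start an extended offline self-test; the result shows up
# later in "smartctl -l selftest /dev/sda"
smartctl -t long /dev/sda
----------------------------------------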

I will be looking for additional information and stress-testing the system. Maybe I will also try Ubuntu to see what happens.

Version-Release number of selected component (if applicable):

kernel 2.6.27.9-159-fc10

Comment 1 David Tonhofer 2009-01-27 10:13:27 UTC
So far, no indication that something is wrong with the hardware.

On second thought, could it be the F10 software RAID itself that gives problems? Will test this on a newly installed system.

Comment 2 Chuck Ebbert 2009-01-27 23:53:07 UTC
Hmm, an upstream bug like this one turned out to be bad memory:

http://bugzilla.kernel.org/show_bug.cgi?id=12084

Comment 3 David Tonhofer 2009-01-28 10:42:08 UTC
Thanks Chuck, I will replace the memory ASAP, with ECC memory too. I hear it's cheap right now...

Anyway, I tested writing/reading a large file both on and off the software RAID with a freshly installed Fedora 10. A big file of zeros or random numbers starts to show errors at around 3'000'000 blocks. Will run memtest for the remainder of the day. Test code:

------ runit.sh ------------------------
#!/bin/bash

# Errors at:
#  2'880'000 blocks with urandom
#  2'870'000 blocks with urandom
#  2'890'000 blocks with zero
#  3'000'000 blocks with zero

BLOCKS=2880000
ERROR=0
#SOURCE=/dev/urandom
SOURCE=/dev/zero
OUTFILE=entropy

while [[ $ERROR == 0 ]]; do
   echo -n "Testing $BLOCKS blocks at "
   date
   # write a file of $BLOCKS 512-byte blocks from the chosen source
   dd if=$SOURCE of=$OUTFILE count=$BLOCKS
   # hash the file chunk-wise twice; on a reliable disk both runs are identical
   perl hasher.pl $OUTFILE > /tmp/HASHES
   perl hasher.pl $OUTFILE > /tmp/HASHES2
   diff /tmp/HASHES /tmp/HASHES2
   if [[ $? == 1 ]]; then
      ERROR=1
      echo "Errors found at $BLOCKS blocks"
   else
      let BLOCKS=$BLOCKS+10000
   fi
done

------- hasher.pl ------------------------

#!/usr/bin/perl
# Print an MD5 digest for every 10 KiB chunk of the given file, together
# with the byte range it covers.
use Digest::MD5 qw(md5_hex);
my $file = $ARGV[0];
open(FILE, $file) or die "Could not open $file: $!\n";
binmode FILE;
my $buffer;
my $size = 1024*10;
my $read;
my $pos = 0;
while (($read = read(FILE, $buffer, $size)) > 0) {
   my $digest = md5_hex($buffer);
   my $end    = $pos + $read - 1;   # the last chunk may be shorter than $size
   print "$digest [$pos,$end] $read\n";
   $pos += $read;
}
close(FILE);

Comment 4 David Tonhofer 2009-01-29 09:21:20 UTC
memtest ran for 32 passes with no errors.

Will see what happens on Ubuntu server next.

Comment 5 David Tonhofer 2009-01-30 11:53:07 UTC
Running the above "dump-and-check" tests on Ubuntu 8.10 server in recovery mode also yields errors, though only with files reaching 2 GB in size (roughly RAM-sized). I tested with swap off and with the RAID mirror reduced to a single device - no improvement.
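
A rough sketch of that configuration, assuming the md device is /dev/md0 and the member being dropped is /dev/sdb1 (both names are assumptions; the actual ones will differ):

----------------------------------------
# Disable all swap so corrupted pages cannot be written to and read back
# from the swap device
swapoff -a

# Degrade the RAID 1 mirror to a single member for the test
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1
cat /proc/mdstat
----------------------------------------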

Several passes of "hexdump" on the same file consistently yield the same result (ditto for md5sum). This seems to indicate that it's writing the file which fails, not reading it. For example, a zeros-only file of ~1.9 GB size yields:

---------------
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
16fcf40 0000 0000 0000 0000 0020 0000 0000 0000
16fcf50 0000 0000 0000 0000 0000 0000 0000 0000
16fcf60 0000 0000 0000 0000 0024 0000 0000 0000
16fcf70 0000 0000 0000 0000 0000 0000 0000 0000
16fcf80 0000 0000 0000 0000 0024 0000 0000 0000
16fcf90 0000 0000 0000 0000 0000 0000 0000 0000
16fcfa0 0000 0000 0000 0000 0024 0000 0000 0000
16fcfb0 0000 0000 0000 0000 0000 0000 0000 0000
16fcfc0 0000 0000 0000 0000 0020 0000 0000 0000
16fcfd0 0000 0000 0000 0000 0000 0000 0000 0000
16fcfe0 0000 0000 0000 0000 0031 0000 0000 0000
16fcff0 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a660 0000 0000 0000 0000 0004 0000 0000 0000
4df0a670 0000 0000 0000 0000 0000 0000 0000 0000
4df0a680 0000 0000 0000 0000 0020 0000 0000 0000
4df0a690 0000 0000 0000 0000 0000 0000 0000 0000
4df0a6a0 0000 0000 0000 0000 0004 0000 0000 0000
4df0a6b0 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a6e0 0000 0000 0000 0000 0021 0000 0000 0000
4df0a6f0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a700 0000 0000 0000 0000 0024 0000 0000 0000
4df0a710 0000 0000 0000 0000 0000 0000 0000 0000
4df0a720 0000 0000 0000 0000 0020 0000 0000 0000
4df0a730 0000 0000 0000 0000 0000 0000 0000 0000
4df0a740 0000 0000 0000 0000 0020 0000 0000 0000
4df0a750 0000 0000 0000 0000 0000 0000 0000 0000
4df0a760 0000 0000 0000 0000 0030 0000 0000 0000
4df0a770 0000 0000 0000 0000 0000 0000 0000 0000
4df0a780 0000 0000 0000 0000 0024 0000 0000 0000
4df0a790 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a7c0 0000 0000 0000 0000 0026 0000 0000 0000
4df0a7d0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a7e0 0000 0000 0000 0000 0037 0000 0000 0000
4df0a7f0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a800 0000 0000 0000 0000 0027 0000 0000 0000
4df0a810 0000 0000 0000 0000 0000 0000 0000 0000
4df0a820 0000 0000 0000 0000 0038 0000 0000 0000
4df0a830 0000 0000 0000 0000 0000 0000 0000 0000
*
73f78000
---------------

This is a bit scary. I wonder whether I should replace the motherboard. Or the CPU, or maybe just the SATA cables? Oh sh*t. 

I will try this with the PATA subsystem, and with just the raw SATA device as a last test.


A fitting comment by "Arno Wagner" in an old thread:

-------------
http://osdir.com/ml/linux.kernel.device-mapper.dm-crypt/2004-05/msg00038.html

Yes. My guess now would be that the mainboard is not well-engineered.
Maybe some parallell signal-lines are not exactly the same length
or some lines are too long. Maybe the chipset is not cooled well.
Maybe they did not use enough decoupling capacitors. Any chance of
getting another CPU and testing with that? Any change of getting
other memory and testing with that?

Of course the fundamental problem is that these systems have gotten
far to complex. I remember owning a 386SX mainboard that had 6
semiconductors on it, including the CPU and one LED. 
-------------

Comment 6 David Tonhofer 2009-01-31 14:45:18 UTC
It's the memory for sure, so this bug is INVALID.

I just tested both of the RAM modules in the system separately, and one of them produces errors while the other does not. memtest86 still finds no problems -- I will try with the latest version.
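
Since memtest86 keeps passing, a userspace test run from within the booted Linux system is another option (this assumes the "memtester" package is installed; the size should stay below the amount of free RAM):

----------------------------------------
# Lock about 1.5 GB of RAM and run 5 passes of memtester's pattern tests;
# a "FAILURE" line points at bad memory even when memtest86 reports none.
memtester 1500M 5
----------------------------------------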

The script to dump a zero-only file and check its contents has been refined/reduced to:

---------------------
#!/bin/bash

BLOCKS=4200000
ERROR=0
SOURCE=/dev/zero
OUTFILE=entropy
BLOCKSSTEP=10000

# Write initial file full of zeros
echo "Writing initial file of $BLOCKS blocks"
dd if=$SOURCE of=$OUTFILE count=$BLOCKS conv=fdatasync
 
while [[ $ERROR == 0 ]]; do
   echo -n "Testing $BLOCKS blocks at "
   date
   # hexdump the file full of zeros 
   HEXDUMP=hexdump_`date +%Y%m%d_%H%M%S`
   hexdump $OUTFILE > $HEXDUMP 
   # extract the data lines from the hexdump; the '*' and final-offset lines
   # contain no spaces and are dropped by --only-delimited, so for an
   # all-zero file exactly one line remains
   LINE=`cat $HEXDUMP | cut --fields=1- --delimiter=" " --only-delimited`
   if [[ $LINE != "0000000 0000 0000 0000 0000 0000 0000 0000 0000" ]]; then
      ERROR=1
      echo "Errors found in $HEXDUMP ... dumping a second time"
      hexdump $OUTFILE > ${HEXDUMP}_repeat
   else
      let BLOCKS=$BLOCKS+$BLOCKSSTEP
      dd if=$SOURCE of=$OUTFILE \
         count=$BLOCKSSTEP \
         conv=fdatasync,notrunc \
         status=noxfer \
         oflag=append
   fi
done
---------------------

Thus, we start with a file containing only zeros and append zeros to its end until "hexdump" reports nonzero content. "hexdump" is then applied a second time for good measure, generally producing output that indicates the file does indeed contain only zeros.

That will teach me to use non-ECC RAM.

