229239 – Data corruption with multiple SATA devices -- believe it to be x86_64 specific! driver problem

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2007-02-19 18:16 UTC by David Bennion
Modified:	2008-05-07 03:58 UTC
CC List:	3 users
Fixed In Version:	FC7
Clone Of:
Environment:
Last Closed:	2008-05-07 01:12:20 UTC
Type:	---
Embargoed:
Dependent Products:

Description David Bennion 2007-02-19 18:16:27 UTC

Description of problem:

System has:
1. 1 ATA hard drive (hosting FC6 with latest updates)
2. 2 internal SATA hard drives connected to sata_nv on the motherboard
3. 1 internal SATA hard drive and 1 external SATA drive (for backups) connected
   to sata_sil in a PCI slot

The 3 internal SATA drives function as a RAID0 array (/dev/md0).  Devices
connected to the sata_sil experience data corruption!  Observe this session:

I have a large file that I created by cat'ing a bunch of .iso's together (from
/dev/hda -> /dev/md0).  The file is called "knownquantity.dat".

Here is the size:
[09:58:39 davidb@marvin raid]$ ls -l knownquantity.dat 
-rw-rw-r-- 1 davidb davidb 13569177600 Feb 19 09:54 knownquantity.dat

Here is the initial checksum, taken on the raid:
[09:56:55 davidb@marvin raid]$ md5sum knownquantity.dat 
4f7ff463f0593902e843e53534e258ab  knownquantity.dat

Now I copy it to the external SATA backup device (/dev/sdc1) via the sata_sil driver.
[09:59:57 davidb@marvin raid]$ sudo cp -a knownquantity.dat /mnt/backup/

Now, I'm done so I check the size and it looks OK:
[10:03:25 davidb@marvin raid]$ ls -l /mnt/backup/knownquantity.dat 
-rw-rw-r-- 1 davidb davidb 13569177600 Feb 19 09:54 /mnt/backup/knownquantity.dat

Now I checksum the copied file to see if it copied correctly:
[10:04:44 davidb@marvin raid]$ md5sum /mnt/backup/knownquantity.dat 
5a5e6ab2b9be30262768fa8b69cf9c0c  /mnt/backup/knownquantity.dat

Uh oh!!  5a5e6ab2b9be30262768fa8b69cf9c0c != 4f7ff463f0593902e843e53534e258ab

How bad is the problem?  Let's try to checksum the file a few more times just on
the raid.
[10:17:45 davidb@marvin raid]$ md5sum knownquantity.dat 
0ce17ed09df818197a8e707d1833177d  knownquantity.dat
[10:19:03 davidb@marvin raid]$ md5sum knownquantity.dat 
5357648fc4600c7e771057170a1fa407  knownquantity.dat
[10:21:11 davidb@marvin raid]$ md5sum knownquantity.dat 
f16ddd9a507d3543afc4b995adf5348d  knownquantity.dat
[10:22:32 davidb@marvin raid]$ md5sum knownquantity.dat 
87ddff0ef0a695c46f718a9463014352  knownquantity.dat

Version-Release number of selected component (if applicable):

Linux marvin 2.6.19-1.2911.fc6 #1 SMP Sat Feb 10 15:16:31 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Very.

Steps to Reproduce:
1. Have a big file on an x86_64 machine with the sata_sil device.
2. Copy it to the drive using sata_sil.
3. Check the checksums.
  
Actual results:
Non-matching checksum.

Expected results:
A matching checksum.
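
For anyone reproducing this, the repeated-checksum check from the session above
can be scripted.  This is only a minimal sketch: the helper name repeat_md5 is
made up for illustration, and it assumes the file is much larger than installed
RAM, since otherwise re-reads may be served from the page cache rather than
from the disks.

```shell
# repeat_md5 FILE [RUNS]: checksum FILE several times, print each sum, and
# return 0 only if every run produced the same checksum.
# (repeat_md5 is a hypothetical helper name, not an existing tool.)
repeat_md5() {
    file=$1
    runs=${2:-5}
    [ -r "$file" ] || return 2      # unreadable file: nothing to compare
    first=""
    status=0
    i=1
    while [ "$i" -le "$runs" ]; do
        sum=$(md5sum "$file" | cut -d' ' -f1)
        echo "run $i: $sum"
        if [ -z "$first" ]; then
            first=$sum              # remember the first result
        elif [ "$sum" != "$first" ]; then
            status=1                # a later read disagreed with the first
        fi
        i=$((i + 1))
    done
    return "$status"
}
```

Usage with the file from this report would be something like
"repeat_md5 knownquantity.dat 5"; a return status of 1 means at least one read
disagreed with the first.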

Additional info:

The funny thing though, even though these checksums are off, the problem is
probably just a few bytes.  I have VMware virtual machines on the raid and they
do function.  No, I'm not going to leave it this way -- but that's why this
problem has been hard to find until now.
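
One way to test the "just a few bytes" guess would be to count the byte
positions that differ between a copy on a known-good device and a copy that
went through sata_sil.  A rough sketch, assuming both copies are readable and
the same length; count_diff_bytes is a hypothetical helper name:

```shell
# count_diff_bytes A B: print how many byte positions differ between A and B.
# cmp -l prints one line per differing byte, so counting lines counts bytes.
# (count_diff_bytes is a hypothetical helper name.)
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l
}

# Example call (the first path is a placeholder for a known-good copy):
#   count_diff_bytes /path/to/good/knownquantity.dat /mnt/backup/knownquantity.dat
```

Because reads on the affected path were themselves unstable here, the count
could well change from one run to the next.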

The RAID0 used to be just two sata drives on the sata_nv and I don't recall
having any trouble.  Then I added the sata_sil and the third drive and that's
when weird things started happening.  Recognizing the data corruption and
pinpointing it is a relatively recent development.

You should know that I have used an identical sata_sil PCI card and the exact
same external hard drive to back up a 32 bit machine and have experienced no
data corruption.  This is verified by the fact that I do my backups using tar
-czf /mnt/backup/file.tgz  and then following the backup I have started doing a
tar -tvzf /mnt/backup/file.tgz to validate the file.  I successfully backed up
and tested a 161G file, so gzip vouches for the data integrity.  There is,
however, only a single SATA controller in that 32 bit machine.  So it could be a
64 bit driver issue (likely) or a SATA stack kernel issue (to me, less-likely).
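
The backup-then-list check described above could look roughly like this.  It is
a sketch only: backup_and_verify is a hypothetical helper name and the paths in
the usage line are examples.  Note the limit of the check: listing the archive
shows that it decompresses and parses cleanly from start to end, not that it
matches the source byte for byte.

```shell
# backup_and_verify ARCHIVE SRC: write a gzip-compressed tar of SRC to ARCHIVE,
# then read the archive back by listing it.  Listing makes tar decompress the
# stream, so a damaged archive normally shows up as a gzip or tar error.
# (backup_and_verify is a hypothetical helper name.)
backup_and_verify() {
    archive=$1
    src=$2
    tar -czf "$archive" "$src" || return 1
    tar -tzf "$archive" > /dev/null || return 1
}

# Example call (paths are placeholders):
#   backup_and_verify /mnt/backup/file.tgz /home
```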

Comment 1 David Bennion 2007-02-19 18:17:48 UTC

Output of lspci on the machine

00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51G [GeForce 6100] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
03:06.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)

Comment 2 David Bennion 2007-02-19 20:45:51 UTC

I wanted to do some additional tests on this problem.  

First, I went and created a copy of knownquantity.dat onto /dev/hda (a known
good, reliable basic ata drive).  Then I checksummed knownquantity.dat on that
drive.  Then I copied it from /dev/hda -> /dev/sdc1 (/mnt/backup in the above
examples; the external SATA drive on sata_sil).  Then checksummed it from the
external drive.  Much to my surprise, it checked out.  So I did it two more
times (copying over top of the same file).  It checked out the other two times.

Then I said: let's sanity check sata_nv.  So I went and dropped /dev/md0 and
re-created it with only the 2 SATA drives (/dev/sda, /dev/sdb).  Then I copied
knownquantity.dat from /dev/hda to the new /dev/md0 and checksummed it, and it
checked out.  Then I checksummed it again a couple of times.  Again, it checked out.

Then I re-created /dev/md0 again with the third internal drive running on
sata_sil, re-copied knownquantity.dat onto it.  Immediately after the copy, the
md5sum doesn't match.

I am willing to consider 3 device raid0 as part of the culprit.  I will try the
test again with only one drive from sata_nv and one drive from sata_sil.  I am
betting I will experience the same problem, however.

It looks at this point like it is related to multiple sata drivers or how the
sata_sil drivers handle being in a raid0.

Comment 3 Chuck Ebbert 2007-02-19 21:00:04 UTC

How much memory is installed on this machine? If it's >4GB or you
have a memory hole there can be problems. You can use the boot
option
    iommu=soft
to work around this (edit /etc/grub.conf and add it after the
"root=" entry for your kernel.)
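
As an illustration of that edit, the commented kernel line below shows where
the option might go, and the small function checks the kernel command line
after a reboot.  This is a sketch: the root= value is a placeholder, the kernel
version is the one quoted in the description, and has_iommu_soft is a
hypothetical helper name.

```shell
# Example kernel line in /etc/grub.conf (root= value is a placeholder):
#   kernel /vmlinuz-2.6.19-1.2911.fc6 ro root=<your-root-device> iommu=soft

# has_iommu_soft CMDLINE: succeed if CMDLINE contains the exact token iommu=soft.
# (has_iommu_soft is a hypothetical helper name.)
has_iommu_soft() {
    case " $1 " in
        *" iommu=soft "*) return 0 ;;
        *) return 1 ;;
    esac
}

# After rebooting, check the running kernel:
#   has_iommu_soft "$(cat /proc/cmdline)" && echo "iommu=soft is active"
```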

Also, corruption like this can be caused by not enough
power getting to the drives. Your power supply may not be
big enough to drive them, especially when they are all
doing I/O at once.

Comment 4 David Bennion 2007-02-19 21:17:00 UTC

Those are some insightful questions.  

cat /proc/meminfo gives this number:
MemTotal:      3972512 kB

I did have to enable "HardWare Memory Hole" in order to see the full 4GB.  So I
will go try your kernel option.
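
For reference, that MemTotal figure converts to about 3.79 GiB
(3972512 / 1048576), slightly under 4 GiB as seen by the kernel.  A small
sketch of the arithmetic, assuming awk is available; memtotal_gib is a
hypothetical helper name that reads /proc/meminfo-style text on standard input:

```shell
# memtotal_gib: read /proc/meminfo-style text on stdin and print MemTotal
# converted from kB to GiB (1 GiB = 1048576 kB).
# (memtotal_gib is a hypothetical helper name.)
memtotal_gib() {
    LC_ALL=C awk '/^MemTotal:/ { printf "%.2f GiB\n", $2 / 1048576 }'
}

# On a live system:
#   memtotal_gib < /proc/meminfo
```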

As for the power supply, I have a 550W unit, which I would have thought would
be enough.  There are 4 internal drives; the external drive has its own power source.

Let me see my mileage on the memory hole option and I will get back with you.

Comment 5 David Bennion 2007-02-19 21:35:10 UTC

iommu=soft 
makes my md5sum's work!  :)

Comment 6 David Bennion 2007-02-19 21:37:08 UTC

Thank you very much for your help!  You are a star!

Comment 7 Chip Coldwell 2007-02-20 16:34:57 UTC

Probably a duplicate of bug 223238.  We have a patch as of yesterday that fixes
the problem, but the root cause appears (at the moment) to be a silicon erratum.

Chip

Comment 8 Chip Coldwell 2007-02-20 16:59:41 UTC

This problem seems to be associated with the Nvidia chipsets.  What was the
exact platform that was having the problem (vendor/model, etc.)?

Chip

Comment 9 Matthew Miller 2007-04-06 16:02:33 UTC

Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer
test releases. We're cleaning up the bug database and making sure important bug
reports filed against these test releases don't get lost. It would be helpful if
you could test this issue with a released version of Fedora or with the latest
development / test release. Thanks for your help and for your patience.

[This is a bulk message for all open FC5/FC6 test release bugs. I'm adding
myself to the CC list for each bug, so I'll see any comments you make after this
and do my best to make sure every issue gets proper attention.]

Comment 10 David Bennion 2007-04-07 19:36:22 UTC

The motherboard is an MSI K8NGM2-L.  I believe the chipset is the nforce 410.

I have another confession, however.  I didn't realize it, but I think I had an
overclocked northbridge to southbridge link.  Running with iommu=soft made
everything run smoothly, except that I would have the occasional "lockup" on
IO and would have to hard-reset the system.  Then when I went into the BIOS and
did "Load Optimized Defaults", it changed my northbridge to southbridge link
from 1000 MHz to 800 MHz.

Since I did that, I have not had a single "lockup" and everything appears to be 
much more stable.  I still have iommu=soft.

I am willing to take iommu=soft off to test, if there have been some fixes.

(In reply to comment #8)
> This problem seems to be associated with the Nvidia chipsets.  What was the
> exact platform that was having the problem (Vendor/model, etc).
> Chip

Comment 12 Matthew Miller 2007-04-09 14:33:25 UTC

Bug 223238 is Top Sekret, so I can't mark this as a dupe of that.

Comment 13 Bug Zapper 2008-04-03 19:11:36 UTC

Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 14 Bug Zapper 2008-05-07 01:12:18 UTC

This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

Comment 15 David Bennion 2008-05-07 03:58:09 UTC

This bug may be closed.  The issue was resolved with a fixed driver that is in
the mainline kernel; I believe it went in with 2.6.3.21 or something like that.
It has been a long time.

The problem was caused by an out-of-spec driver for the nforce chip.  You can
read about it in the kernel changelogs if you care, but this issue really is closed.
