Bug 223216

Summary: Files on RAID experiencing corruption
Product: Fedora                       Reporter: Stuart MacDonald <stuartm>
Component: mdadm                      Assignee: Doug Ledford <dledford>
Status: CLOSED NOTABUG                QA Contact:
Severity: medium                      Docs Contact:
Priority: medium
Version: 6
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-01-27 08:20:55 UTC

Description Stuart MacDonald 2007-01-18 14:37:43 UTC
Description of problem:

I have an Asus A8V-SE with a Via VT6420 RAID controller. There are two HDs:
hda (new) and hdb (old, possibly dying). Two other HDs, sda and sdb, sit on
the RAID controller and are configured as a RAID-1 volume, md0. This is an FC6
install with kernel 2.6.18-1.2798.fc6 #1 SMP x86_64. All filesystems are ext3.

I mounted hdb5 and copied my old partition to md0. After a little while I
found a file that had a strange one-bit error in it. Since the old drive is
probably dying of bad sectors, I assumed the fault was there. How to correct
it? Make a second copy, diff the two, and if the corruption is random, the
diff should show all the errors, which I can then correct by hand. So I made a
second copy to hda2, and then diffed** hda2 and md0. This turned up a large
number of errors. I manually corrected some errors in one subdir that I
needed, and noticed that all the errors were in the md0 copy and none in the
hda2 copy. So I'm now diffing hda2 against hdb5 and, so far, there are no
errors at all. This implies that the errors on md0 were somehow introduced by
the RAID.
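
In outline, the copy-and-compare went roughly like this (mount points are
illustrative, not the exact ones used):

  cp -a /mnt/old/. /mnt/raid/copy1          # first copy, old partition -> md0
  cp -a /mnt/old/. /mnt/hda2/copy2          # second copy, old partition -> hda2
  diff -r /mnt/raid/copy1 /mnt/hda2/copy2   # report files whose contents differ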

Version-Release number of selected component (if applicable):

Stock FC6 install.

How reproducible:

The partition is 30 GB. I've only got this one machine. So, I can't attempt to
reproduce it, but I suspect it's reproducible.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

Files on md0 show one-bit corruption. Somewhere between 200 and 400 such errors
in 30 GB of files.

Expected results:

Files are corruption-free.

Additional info:

Willing to test. I'd desperately like the RAID to work. I haul very large sets
of files around, and early testing has shown the RAID-1 to cut time by 66%.

Comment 1 Stuart MacDonald 2007-01-18 14:44:33 UTC
*** Bug 223217 has been marked as a duplicate of this bug. ***

Comment 2 Stuart MacDonald 2007-01-18 14:44:45 UTC
*** Bug 223218 has been marked as a duplicate of this bug. ***

Comment 3 Stuart MacDonald 2007-01-18 14:51:44 UTC
Sorry about that; both Opera and IE showed an {Invalid product '} error after
I pressed submit, so I thought it hadn't gone through. IE was slower though and
I could see the Bug Processed page before the error message came up.

It's the same problem as bug 181015.

Comment 4 Stuart MacDonald 2007-01-18 16:50:14 UTC
Ha, forgot this. **diffed. Not actually diff(1), but a perl script I wrote to
be diff++. Checks permissions and file types as well. But essentially diff.
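
A plain-shell approximation of the same idea (paths are placeholders) would be:

  diff -r /mnt/a /mnt/b                               # compare file contents
  ( cd /mnt/a && find . -printf '%p %m %y\n' | sort ) > /tmp/a.meta
  ( cd /mnt/b && find . -printf '%p %m %y\n' | sort ) > /tmp/b.meta
  diff /tmp/a.meta /tmp/b.meta                        # compare modes and file types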

Comment 5 Doug Ledford 2007-01-18 17:43:52 UTC
This is almost always a CPU or memory problem.  Single bit errors don't happen
because of the RAID system, although 512 byte block errors (aka an entire block
on the hard disk) could be caused by the raid system.  Please check your CPU,
CPU fan, power supply, and RAM.  Any of those items not functioning properly can
cause single-bit errors.

As far as the errors only showing up on the /dev/sd* devices implying that it's
the raid subsystem, this would be incorrect.  What it more likely implies is
that the disks on the raid subsystem are faster than the disks on the ide
subsystem and therefore the combined DMA and processor load on RAM is higher
when talking to the scsi disks than the ide disks, causing errors that didn't
show up before to show up.  On my personal web site I have a memory checking
script that detects these sorts of memory errors.  You can download that script
and see if your machine passes those tests.  It's at
http://people.redhat.com/dledford

Comment 6 Stuart MacDonald 2007-01-18 19:04:04 UTC
New work machine from May 2006. Case open, no dust build-up. CPU fan is running
fine. Power supply is an Antec 450W, which is new. The old one started dying
and was RMAed in Dec, just before the copy operations but after the old HD
started having bad blocks. RAM is Kingston Hyper-X 2x 1 GB sticks. These have an
anodized-blue heat spreader on them, and are warm to the touch. I memtest86-ed
them when I got the machine and they passed. Case fan is also running well,
although probably not doing much.

Wow! The heat sink (no fan) on the Via K8T890 Northbridge is very hot. I don't
suppose there's an easy way to have
# cp -r <hda2> <md0>
slowed down? A slower copy would generate less heat, and may avoid errors.
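
A bandwidth-limited rsync might be one way to approximate a slowed-down copy;
the limit below is only an example figure:

  rsync -a --bwlimit=10000 /mnt/hda2/ /mnt/raid/   # copy at roughly 10 MB/s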

This would also explain why I haven't seen any random corruption while working
with the files on md0. I had been putting that down to being lucky so far.

I'll give memtest86 another run, but not sure when.

No idea how to check the CPU though.

I'll run your script as well.

Comment 7 Stuart MacDonald 2007-01-24 05:24:16 UTC
Results so far:

memtest86 ran for 24 hours with no errors.

Your memory script ran once with no errors, working strictly from hda3.
Your memory script hasn't finished its first run through on md0, but it is
already reporting errors.

Wouldn't the fact that the IDE drive doesn't show errors but the SATA drives do
mean that the problem is the RAID, and not the RAM? (The IDE drive is using
DMA.)

Also, it seems that in every error it is the LSB that's wrong, and it's always
wrongly flipped to 1: I get 'u's instead of 't's, '!'s instead of ' 's, and
'3's instead of '2's. This implies that it is the RAM, with one particular
faulty bit.
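
For example, comparing one corrupted file byte-by-byte against a good copy
shows exactly which values differ (paths here are placeholders):

  cmp -l /mnt/hda2/somefile /mnt/raid/somefile | \
      awk '{ printf "offset %d: %03o -> %03o\n", $1, $2, $3 }'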

So I'm confused. :-)

Things I'm going to try:

- turn off irqbalance and try the script again
- try each memory stick individually

Any other suggestions?
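
For the first item, the stock FC6 service tools should be enough to stop
irqbalance and keep it from starting at boot:

  service irqbalance stop
  chkconfig irqbalance off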

Comment 8 Doug Ledford 2007-01-24 15:25:56 UTC
memtest86 will oftentimes miss the errors that my script catches, so it passing
doesn't surprise me; it just means that your memory is very borderline OK.

The memory script reporting errors is a sure sign of either RAM or CPU problems.

The fact that the single IDE drive doesn't show problems but dual SATA drives do
does not mean the raid is the culprit.  In any given machine, the architecture
for memory access is like this:

   CPU  <->  Memory Controller  <->  PCI bus
                     ^
                     |
                     v
                    RAM

The CPU and PCI bus can both initiate memory access transactions.  The memory
controller then queues them up and issues them to RAM as soon as it can.  When
the CPU is the only thing accessing RAM (aka, when running memtest86), it only
produces a certain amount/pattern of loading on the memory.  When both CPU and
PCI devices are accessing RAM, the loading is heavier and the pattern different.
 When you have two PCI devices and the CPU accessing RAM, especially when the
two PCI devices individually are faster than the one other PCI device you tried,
it can expose problems that didn't show up with just the one slower device and
the CPU trying to access RAM.

Furthermore, in the simpler test of just copying data from IDE drives to SATA
drives, the copy program never attempts to access the data.  It basically says
to the operating system "here's a 64k chunk of RAM, have hard drive hda copy the
first 64k of data from this file into it" and it happens via DMA.  Then it takes
that same 64k chunk of data and tells the operating system to write it to the
SATA drives (the raid subsystem simply issues the same write twice, once to each
drive, in the case of RAID1 which is what I assume you are using).  The raid
subsystem never "inspects" the data to see what it is, so it never has the
chance to introduce single-bit errors like the ones you are seeing.

Now, with my memory test, it's possible the CPU could be causing the problem
because the CPU has to decompress the compressed tarball when writing the files
out.  The corruption could happen during that decompression.  In the case of
just plain copying files from hda to the SATA drives, the CPU doesn't make any
attempt to modify the data, so it's less likely to cause errors there (it's
still possible for it to happen because the CPU copies the data from DMA buffers
to the cp program's internal buffers sometimes depending on the type of hard
drive controller you have and whether or not it supports DMA directly to user
space).  So, between those two facts, and the fact that you got corruption on
both a plain copy operation and the stress test, I would say you have memory
problems with about a 95% certainty.
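
One way to separate those two cases is to checksum the data on the source and
re-verify it on the destination after a plain copy (paths are placeholders):

  ( cd /mnt/hda2 && find . -type f -print0 | xargs -0 md5sum ) > /tmp/src.md5
  cp -a /mnt/hda2/. /mnt/raid/testcopy
  ( cd /mnt/raid/testcopy && md5sum -c /tmp/src.md5 )  # any FAILED line is corruption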

Now, you can test the memory sticks independently and that may identify the bad
DIMM, but there's also a chance that both will pass by themselves.  In that
case, you may have a motherboard that, due to the placement of the memory DIMM
slots, can not support full speed memory operation with multiple DIMMs.  For
instance, I have a motherboard with 3 DIMM slots.  With any two DIMMs inserted
side by side, I can run at 400MHz DDR.  As soon as I add a third DIMM, I have to
slow it down to 333MHz DDR or else it develops memory errors.  The manual for
the motherboard spells this limitation out in my case.  It also may be possible
that the memory controller on the motherboard is bad, in which case replacing
the motherboard itself is the only option.

Comment 9 Stuart MacDonald 2007-01-24 16:23:00 UTC
> (the raid subsystem simply issues the same write twice, once to each drive,
> in the case of RAID1 which is what I assume you are using)

Damn it. I even said RAID-1 in my initial problem report. It's actually RAID-0.
I found out the hard way that the "hardware" RAID support doesn't work very
well on this motherboard, at least with Linux, so I've got it set up as a software
RAID, /dev/md0 -> /dev/sda + /dev/sdb.
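
For reference, a two-disk software RAID-0 like this is typically created with
something along these lines (not necessarily the exact command I used):

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  cat /proc/mdstat    # confirm the array is assembled and running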

Well, turning irqbalance off made no difference; an error popped up pretty soon
after starting the script.

This is my memory:
http://www.ec.kingston.com/ecom/configurator_new/PartsInfo.asp?root=us&LinkBack=http://www.kingston.com&ktcpartno=KHX3200AK2/2G

One thing I haven't mentioned: the BIOS on this does not autodetect the
Kingston-specified timings, so I had set them manually. Before the shell
script test though I changed it back to the autodetected timings. Didn't make
a difference AFAICS.

My motherboard manual makes no mention of any memory limitation, but I've
already run into one that Asus will not confirm:
http://www.ec.kingston.com/ecom/configurator_new/modelsinfo.asp?SysID=24573&mfr=ASUS&model=A8V&root=us&LinkBack=http%3A%2F%2Fwww.kingston.com&Sys=24573-ASUS-A8V-E+SE+Motherboard&distributor=0&submit1=Search
(only supports 6 ranks).

I'm going to scale back the speed to 333 and try again.

Comment 10 Stuart MacDonald 2007-01-24 21:11:33 UTC
The BIOS not autodetecting the memory turned out to be a feature: Kingston
programs the SPD with more conservative default timings.

I scaled the memory back to 333, 266 and 200, and the md0 corruption still
occurs.

I found a 512 MB stick of DDR 400 and tried that instead of my regular memory,
and the corruption still occurs, so I think that rules out my memory being
bad. I can't test the regular memory in another machine, though; I'm pretty
sure this is the only one we have that has a RAID, and it's almost certainly
the only AMD system.

I think this means that it's a motherboard issue.

Comment 11 Stuart MacDonald 2007-01-27 08:20:55 UTC
I located another machine with SATA connectors, got Linux running and brought
the RAID-0 up. The stress test passed. So it's not the drives.
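
Bringing an existing md array up on another machine is normally just a scan
and assemble, roughly:

  mdadm --assemble --scan   # find and start any existing md arrays
  cat /proc/mdstat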

The motherboard has been RMAed. Thanks for all your help, especially the stress
test script. Very handy.