Bug 81925

Summary: File reads on ext3 filesystems corrupt
Product: [Retired] Red Hat Linux
Reporter: Simon Matter <simon.matter>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 7.2
CC: sct
Hardware: i686
OS: Linux
Last Closed: 2003-02-10 15:41:10 UTC

Description Simon Matter 2003-01-15 08:49:08 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827

Description of problem:
Reading files on an ext3 filesystem delivers corrupt data, roughly 1 wrong byte
per ~100G of data read. Writing files seems to be okay. In my case this happens
on software RAID5. I'm using 4x60G IBM drives on one Promise Ultra100TX2
(PDC20268) IDE controller. The system is a Dell PowerEdge 1400SC.
I'm quite sure it's a problem with ext3 here. I have built a
2.4.18-18SGI_XFS_1.2pre3 kernel, which is RedHat 2.4.18-18 with XFS added, and
there is no corruption when using XFS. With ext3, I get corruption. I have then
installed the latest original RedHat errata kernel, 2.4.18-19.7.x, and tested
again, and the corruption still occurs.

Version-Release number of selected component (if applicable):
kernel-2.4.18-19.7.x

How reproducible:
Always

Steps to Reproduce:
1. Create a 180G RAID5 volume on the 4x60G drives. Wait until the sync finishes.
2. Create an ext3 filesystem on it with 'mkfs -j /dev/md9'.
3. Mount my data via NFS on /mnt/nfs, mount /dev/md9 on /mnt/md9.
4. cp -r /mnt/nfs /mnt/md9/nfs1 ; cp -r /mnt/nfs /mnt/md9/nfs2 ; cp -r /mnt/nfs
/mnt/md9/nfs3 ; cp -r /mnt/nfs /mnt/md9/nfs4 ; cp -r /mnt/nfs /mnt/md9/nfs5
5. Run the five diffs in parallel (see the consolidated script below):
diff -r /mnt/nfs /mnt/md9/nfs1 & diff -r /mnt/md9/nfs1 /mnt/md9/nfs2 &
diff -r /mnt/md9/nfs2 /mnt/md9/nfs3 & diff -r /mnt/md9/nfs3 /mnt/md9/nfs4 &
diff -r /mnt/md9/nfs4 /mnt/md9/nfs5 &
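
The steps above as a single script (a sketch only; it assumes /dev/md9 and the
NFS mount already exist as described, and all paths are the ones from this
report):

  #!/bin/bash
  # Create the filesystem, copy the NFS tree five times, then compare
  # neighbouring copies in parallel; any diff output indicates corruption.
  mkfs -j /dev/md9
  mount /dev/md9 /mnt/md9
  for i in 1 2 3 4 5; do
      cp -r /mnt/nfs /mnt/md9/nfs$i
  done
  diff -r /mnt/nfs      /mnt/md9/nfs1 &
  diff -r /mnt/md9/nfs1 /mnt/md9/nfs2 &
  diff -r /mnt/md9/nfs2 /mnt/md9/nfs3 &
  diff -r /mnt/md9/nfs3 /mnt/md9/nfs4 &
  diff -r /mnt/md9/nfs4 /mnt/md9/nfs5 &
  wait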

Actual Results:  At least one file is reported as corrupt by one of the diffs.
If I repeat the diffs, one or more other files are reported as differing.

Example, first run:
Binary files /nfs4/mp3/jennifer_lopez-on_the_6.track12.mp3 and
/nfs5/mp3/jennifer_lopez-on_the_6.track12.mp3 differ

Example, second run:
Binary files /nfs3/ISO/Beta/phoebe-i386-disc1.iso and
/nfs4/ISO/Beta/phoebe-i386-disc1.iso differ

Expected Results:  There should be no output from any of the diffs because the
files are identical.

Additional info:

This bug first appeared when I was migrating a customer's server, and I didn't
pay much attention to it then. Now I have found the same thing on my own server.
I'm worried.

Comment 1 Stephen Tweedie 2003-01-15 14:18:27 UTC
OK, there are way too many variables in this so far.  First job is to isolate
some of them.  The XFS thing is interesting, but we know for a fact that ext3
and XFS stress both the disks and the VM in different ways, and there have been
plenty of times in the past where only one fs tickled a problem but the fault
turned out to be elsewhere, so that doesn't pin things down yet.

This could be a disk fault, a driver fault, a filesystem fault in ext3, a
filesystem fault in NFS, or a fault in the NFS server itself.

First question --- what does the data corruption look like?  Could you attach an
example of the actual diff between the two files which mis-compare?  (cmp -l
will do that.)
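
For reference, cmp -l prints one line per differing byte: the 1-based byte
offset in decimal and the two byte values in octal. A hypothetical invocation
against two of the copies would be:

  cmp -l /mnt/md9/nfs1/some/file /mnt/md9/nfs2/some/file
  # output: <byte offset> <octal value in file 1> <octal value in file 2>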

If you repeat the diff several times, do you always get the same diffs?  (ie.
can we eliminate read problems during the final diff as the fault here?)

Next, you were copying from NFS to two different filesystems.  If there is an
NFS problem here, then NFS could produce different answers each time, and ext3
would faithfully record that.  What happens if, instead, you copy to local disk
once, and then copy _that_ to the multiple different destinations?

Comment 2 Simon Matter 2003-01-15 15:00:24 UTC
I know about the many variables which already made it difficult for me to
isolate things. The other problem is that it takes me way too much time to run
those tests. Anyway, here we go:

XFS <-> ext3: I know they are very different. The only thing I know for sure is
that XFS never failed (I mean in this scenario), while ext3 fails with the
recent 2.4.18-based RedHat kernels but not with the 2.4.9-34 version.

---
> This could be a disk fault, a driver fault, a filesystem fault in ext3, a
> filesystem fault in NFS, or a fault in the NFS server itself.
---

Disk fault. Could be, but I have pushed them very hard. I have already diffed
terabytes on them with XFS without a single error. So, I guess they are fine.

Driver fault. Looks the same as the 'disk fault' case. At least the driver is
always the same, and I have been using this controller in my main server for
two years.

Filesystem fault in ext3, maybe yes, but as you said, it may be difficult to prove.

Filesystem fault in NFS, or a fault in the NFS server itself: no. I'm copying
the NFS data 5 times onto the local volume, which is 145G altogether. I run
only one diff against the NFS server; the four others are between the copies on
disk.

---
> First question --- what does the data corruption look like?  Could you attach an
> example of the actual diff between the two files which mis-compare?  (cmp -l
> will do that.)
---

I haven't done that yet, see below.

---
> If you repeat the diff several times, do you always get the same diffs?  (ie.
> can we eliminate read problems during the final diff as the fault here?)
---

I get one or more different diff errors with every run. I know they are read
errors because I compare 1 with 2, 2 with 3, 3 with 4 and so on. How can 2 and 3
be different while 1=2, 3=4 and 4=1?

---
> Next, you were copying from NFS to two different filesystems.  If there is an
> NFS problem here, then NFS could produce different answers each time, and ext3
> would faithfully record that.  What happens if, instead, you copy to local disk
> once, and then copy _that_ to the multiple different destinations?
---

Sorry, I wasn't clear.
I was always using the same disks, with the same partitions, with the same raid5
volume. The filesystem was 88% filled because I first did this test to test the
disk drives.

Another diff has just finished and shows this now:
Binary files /mnt/nfs/mp3/scorpions-face_the_heat.track04.mp3 and
/nfs1/mp3/scorpions-face_the_heat.track04.mp3 differ
Binary files /nfs3/ISO/Beta/phoebe-i386-disc1.iso and
/nfs4/ISO/Beta/phoebe-i386-disc1.iso differ

Now I do 'cmp -l /nfs3/ISO/Beta/phoebe-i386-disc1.iso
/nfs4/ISO/Beta/phoebe-i386-disc1.iso'
No error.

I do 'cmp -l /mnt/nfs/mp3/scorpions-face_the_heat.track04.mp3
/nfs1/mp3/scorpions-face_the_heat.track04.mp3'
No error.

What information do you need? I'll keep this box available for testing as long
as possible.

Comment 3 Stephen Tweedie 2003-01-15 21:53:33 UTC
NFS could _easily_ be a problem.  You have copied from NFS to 5 different local
filesystems.  If NFS gave the wrong data one of those times, then afterwards
you'd find that the local filesystems don't match each other.  That's why
copying from local disk to local disk after the first copy from NFS would help
to isolate things.

However, given the other data here, that doesn't look like the problem --- it
does indeed look as if these are read errors.  I've *never* seen that caused by
a filesystem fault --- it is almost always a hardware problem, and occasionally
a software VM problem.

The next thing is definitely to capture a snapshot of just what the corruption
looks like.  For example, one pattern of corruption I've seen (again, only on
reads) on Promise controllers in the past was a missing run of 4 bytes from the
middle of a file.  If this is hardware, it might well show up clearly like that.

One way to catch the corruption might be to repeatedly copy the tree from disk
to disk, so that hopefully at least one iteration will end up reading the
original tree wrongly and recording the corruption pattern permanently during
the copy.
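
A sketch of such a loop (tree path and iteration count are placeholders):

  # Repeatedly copy a known tree and compare; a read error during any copy
  # is recorded permanently in that copy and can later be inspected with
  # cmp -l.
  SRC=/mnt/md9/nfs1
  for i in $(seq 1 20); do
      cp -r "$SRC" /mnt/md9/copy$i
      diff -r "$SRC" /mnt/md9/copy$i || echo "mismatch in copy$i"
  done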

Comment 4 Simon Matter 2003-01-22 13:56:32 UTC
NFS is _not_ the problem here. I can easily reproduce the problem without any
NFS involved.

I was unable to reproduce the problem using cmp -l. No differences showed up,
even when running the diffs in parallel.

I have not been able to reproduce this problem with kernel 2.4.9-34 on ext3 or
XFS, nor with 2.4.18-18 on XFS.

The only way I can reproduce this is with 2.4.18-18 and later on ext3, but as
you correctly stated, that doesn't mean it is the filesystem's fault.

I have then removed the software RAID5 and created 4 independent ext3
filesystems on the drives. Running some diffs in parallel between those 4
filesystems generates the same error, but even quicker and with less data
involved.

I have now copied the 4 directories to the root filesystem, which is on U160
SCSI with software RAID5. I have tried my diff script several times only on the
SCSI disks with no error. I have then started the diff processes on the SCSI and
IDE filesystems in parallel, and voila, the SCSI filesystems reported errors too.

The only thing I can try now is to reproduce the problem on completely
different hardware, right?

Comment 5 Stephen Tweedie 2003-01-22 14:19:55 UTC
This is definitely looking like an IDE-level problem.  So far it's unclear
whether the problem is hardware or a driver fault, though.  

Do you have any indications of IDE errors occurring in the logs?  If you're
still in UDMA-100 mode, then the controller should be detecting CRC failures on
the cable and the driver will retry them, but if the driver has had problems
staying in UDMA mode then it may have backed down to a non-CRC-capable IO mode.

I strongly suspect that the reason 2.4.9-34 worked OK is that it had far earlier
IDE drivers, and simply was not capable of driving your PDC20268 to the fullest
extent, thus leaving the card in a slower but less demanding mode.  As for the
XFS kernel, it's not clear whether the difference is just in the way XFS is
driving the hardware, or whether you built the XFS-enabled kernel with different
IDE config options, so again it could have been that the most advanced driver
options were not enabled there.

Beyond this, we're going to have to look at either booting IDE with an argument
to force the driver into a slower mode, or using a later kernel with updated IDE
drivers from work done upstream since 2.4.18.  Would you be willing to test a
Phoebe (current Red Hat Linux beta) kernel on this hardware?  That is based on a
kernel closer to 2.4.20, and has a number of IDE improvements as a result.
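
For example (a sketch; the hdparm flags are the ones used later in this report,
and ide=nodma is, as far as I know, the 2.4-era boot option for disabling DMA
globally):

  # Disable DMA and drop to PIO mode 4 on the Promise-attached drives:
  hdparm -d0 -X12 /dev/hde /dev/hdf /dev/hdg /dev/hdh

  # Or disable DMA for the whole IDE subsystem at boot time:
  #   linux ide=nodma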

Comment 6 Simon Matter 2003-01-22 15:27:15 UTC
There are no IDE errors showing up in the logs. All four disks look like this:

 Model=CI530L06VARE700-                        , FwRev=REO64AA6,
SerialNo=        S PZZT8T4F34
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40
 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=-66060037, LBA=yes, LBAsects=120103200
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=yes: disabled (255)
 Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-2 ATA-3 ATA-4 ATA-5

 multcount    = 16 (on)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 119150/16/63, sectors = 120103200, start = 0
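
For reference, output like the above is produced by the following commands (my
assumption; the report doesn't show the exact invocations):

  hdparm -i /dev/hde   # identification block: Model, geometry, PIO/DMA modes
  hdparm /dev/hde      # current settings: multcount, using_dma, geometry, ...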

My XFS kernels are compiled from the RedHat source RPM with only the XFS bits
added. Everything else is the same! I have then put a root filesystem onto the
IDE disks and booted the original RedHat kernel to verify that it's not the XFS
kernel's fault. Now I have gone back to my true root filesystem, which is XFS.
I'd like to try the Phoebe kernels but can't do it quickly because Phoebe still
doesn't have XFS included. Let me know if I'm wrong.

Comment 7 Simon Matter 2003-01-22 23:20:38 UTC
To correct myself: I have now been able to produce corruption on XFS too. If I
find the time, I'll try with a newer kernel from RedHat Phoebe.

Comment 8 Simon Matter 2003-01-24 10:33:45 UTC
I have tried 2.4.20-2.2 from Phoebe now and it's all the same. I have also
switched off DMA and used pio4 (hdparm -d0 -X12 /dev/hd[e-h]), without success.
If anything, the corruption got even worse that way.

It really seems to be a problem in the new IDE or PDC code here. Kernel
2.4.9-34 works perfectly.

Comment 9 Simon Matter 2003-01-24 13:23:37 UTC
Finished my final tests right now. A simple bonnie shows significant speed
improvements between RH 2.4.9-34 and RH 2.4.[18,20] of up to 100%.
Unfortunately, something broke, at least on the hardware I use. ext3 and XFS
(on the XFS-enhanced RH kernel) show almost the same performance here.

The really big problem is that people won't notice the corruption until they do
a serious integrity check.


Comment 10 Simon Matter 2003-02-07 12:25:53 UTC
I have just tried with the newest errata kernel 2.4.18-24 and the problem still
exists.

Comment 11 Simon Matter 2003-02-10 07:08:20 UTC
Interesting update: I have moved the Promise controller and the four disks to my
small production server, which is an AMD K6-400 with a VIA MVP3 chipset and
exactly the same RedHat installation. I can't reproduce the problem there. So
I'm starting to think it's a problem between the Serverworks chipset and the
Promise card in the other server.

Comment 12 Stephen Tweedie 2003-02-10 14:13:23 UTC
Did you have any luck trying to catch a snapshot of what the corruption actually
looks like?  If you copy the source tree multiple times off disk, then you're
hopefully going to hit a read error once in a while which will end up being
copied to the target dir, so that when you get a verification error you'll be
able to do a "cmp -l" to find the exact corruption later.

I think we'll ultimately have to close this as NOTABUG, though.

Comment 13 Simon Matter 2003-02-10 14:38:49 UTC
What I said this morning was unfortunately wrong. I am now running my test on my
production server, with another Promise Ultra100TX2, other cables and the same
disks. I've been able to reproduce corruption with kernel 2.4.18-24 _and_
2.4.9-34, so it's not limited to the newer kernels. My next - and I hope last -
steps will be to 1) upgrade the microcode (firmware) on the IBM disks, and if
that doesn't help, 2) connect the disks to the onboard VIA IDE controller. If
that corrects the problem, then it's definitely the Promise controller hardware
or the PDC kernel driver.

I haven't been able to capture the error with "cmp -l" yet but will try again.

Comment 14 Simon Matter 2003-02-10 15:27:14 UTC
Okay, here is a corruption pattern:

[root@crash FreeBSD]# pwd
/home/XXL/x/FreeBSD
[root@crash FreeBSD]# cmp -l 4.7-disc2.iso
/home/XXL/backup/FreeBSD/4.7-disc2.iso
614860720 341 241

I have to run this ~10 times to get the error once. The next error after
several tests was:
[root@crash FreeBSD]# cmp -l 4.7-disc2.iso
/home/XXL/backup/FreeBSD/4.7-disc2.iso
62316588 323 123


Comment 15 Stephen Tweedie 2003-02-10 15:41:10 UTC
In hex, that's:

24A607B0: E1 A1
03B6E02C: D3 53

Single bit-flip errors, in different bits.  This is most definitely looking like
a chipset problem.  Good luck in trying to hunt it down!
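
A quick way to verify the conversion and the single-bit nature of the flips
(shell arithmetic treats a leading 0 as octal):

  printf '%X: %02X %02X\n' 614860720 0341 0241   # -> 24A607B0: E1 A1
  printf '%X: %02X %02X\n'  62316588 0323 0123   # -> 3B6E02C: D3 53
  echo $(( 0341 ^ 0241 ))   # 64  = 0x40, a single bit
  echo $(( 0323 ^ 0123 ))   # 128 = 0x80, a single bit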

Comment 16 Simon Matter 2003-02-10 22:20:53 UTC
To finish this thread: it turned out that the Promise and VIA controllers as
well as the cables are okay. The problem comes only from the hard disks, which
are IBM IC35L060AVER07-0. I can put them in any box I want and am able to
reproduce those single-bit errors.

Comment 17 Stephen Tweedie 2003-02-10 22:31:36 UTC
Are all of the disks behaving badly, or just one?  It could be bad memory on the
disk's internal cache, for example, but that excuse fades if they all show the
same problem (and you wouldn't really expect different bits to flip each time if
it's just bad memory.)

Comment 18 Simon Matter 2003-02-11 23:23:00 UTC
Okay, I ran some more tests. The error rate differs between the drives, but they
all produce errors. And before you ask: they are not from the same batch. Two of
them were produced in Thailand, two of them in the Philippines. The manufacturing
dates differ by some months. The firmware levels are different too. Looks really
bad. Two of the drives had already been replaced by IBM because they started to
report bad sectors within the first month of operation. I'm very sure I'm not
alone, but most people won't find out what their disks are doing.

Comment 19 Simon Matter 2003-02-12 14:44:57 UTC
No, I don't believe it! I have just installed a Maxtor drive and I can reproduce
the same errors. I have also disabled whatever I can think of with hdparm. It
looks like this now:

/dev/hde:
 multcount    =  0 (off)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  0 (off)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  0 (off)
 geometry     = 77504/16/63, sectors = 78125000, start = 0

The errors are still here. It seems that whenever I have four IDE disks
installed and access them all simultaneously, I get wrong data in read
operations, not in write operations. This happens with two different mainboard
chipsets, two different CPUs, two different IDE controllers, two different IDE
cable sets and two different disk types. It never happens with SCSI disks on
the same systems.

I will try the following tonight:
- Doing a fresh install of RedHat 7.2.
- Update everything to the latest errata level.
- Try to reproduce the bug.
- Move the disks to the other server.
- Repeat the tests.

Let's see what comes out. Maybe I have to see my Doctor soon...

Comment 20 Simon Matter 2003-03-04 07:51:56 UTC
To finish this thread, here is what came out:

Corrupt IDE read operations with single-bit errors roughly every ~5 GB were
discovered with the following hardware, independent of whether (U)DMA was
enabled or disabled, when stressing four disks simultaneously:
Serverworks LE / Promise Ultra100TX2 (PDC20268)
VIA MVP3 / Promise Ultra100TX2 (PDC20268)
VIA MVP3 / VIA MVP3 integrated IDE

No corruption exists with:
Intel 440BX / Promise Ultra100TX2 (PDC20268)
Intel 440BX / Intel 440BX integrated IDE