Bug 463508 - data corruption detected during O_DIRECT testcases
Status: CLOSED DUPLICATE of bug 463134
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Red Hat Kernel Manager
QA Contact: Martin Jenner
Keywords: Regression
Depends On:
Blocks:
Reported: 2008-09-23 15:07 EDT by Nate Straz
Modified: 2008-09-24 15:14 EDT (History)
Doc Type: Bug Fix
Last Closed: 2008-09-24 15:14:21 EDT
Description Nate Straz 2008-09-23 15:07:06 EDT
Description of problem:

While running d_io on the latest kmod-gfs I keep hitting data corruption failures where the pattern is shifted by a few bytes.



Version-Release number of selected component (if applicable):
kernel-2.6.18-115.gfs2abhi.001
kmod-gfs-0.1.26-1.el5


How reproducible:
Easily on this cluster.

Steps to Reproduce:
1. run d_io on a GFS file system
  
Actual results:
*** xdoio(pid: 13386) DATA COMPARISON ERROR /mnt/brawl/marathon-02/rwdirectsmall ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 65536
    1st 32 expected bytes:  :13386:writev*I:13386:writev*I:1
    1st 32 actual bytes:    itev*I:13386:writev*I:13386:writ
--
*** xdoio(pid: 24338) DATA COMPARISON ERROR /mnt/brawl/marathon-04/rwrandirectlarge ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 76136448
    1st 32 expected bytes:  writev*I:24338:writev*I:24338:wr
    1st 32 actual bytes:    338:writev*I:24338:writev*I:2433
--
*** xdoio(pid: 24481) DATA COMPARISON ERROR /mnt/brawl/marathon-04/mtfile_rwrevdirect ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 127389696
    1st 32 expected bytes:  1:writev*P:24481:writev*P:24481:
    1st 32 actual bytes:    481:writev*P:24481:writev*P:2448
--
*** xdoio(pid: 26547) DATA COMPARISON ERROR /mnt/brawl/marathon-03/mtfile_rwrandirect ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 76136448
    1st 32 expected bytes:  writev*P:26547:writev*P:26547:wr
    1st 32 actual bytes:    547:writev*P:26547:writev*P:2654
--
*** xdoio(pid: 7649) DATA COMPARISON ERROR /mnt/brawl/marathon-01/rwrevdirectsmall ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 65536
    1st 32 expected bytes:  7649:writev*I:7649:writev*I:7649
    1st 32 actual bytes:    :writev*I:7649:writev*I:7649:wri
--
*** xdoio(pid: 26783) DATA COMPARISON ERROR /mnt/brawl/marathon-03/rwdirectlarge ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 29560832
    1st 32 expected bytes:  3:writev*I:26783:writev*I:26783:
    1st 32 actual bytes:    783:writev*I:26783:writev*I:2678
--
*** xdoio(pid: 16483) DATA COMPARISON ERROR /mnt/brawl/marathon-02/mtfile_rwrandirect ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 76136448
    1st 32 expected bytes:  writev*P:16483:writev*P:16483:wr
    1st 32 actual bytes:    483:writev*P:16483:writev*P:1648
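
The size of the shift in these reports can be estimated from the 32-byte samples alone. The sketch below is not part of xdoio; the helper names `smallest_period` and `pattern_shift` are made up for illustration. It infers the repeat length of the expected pattern and then finds the rotation that makes the actual bytes line up. Because the pattern repeats, the shift is only known modulo the period, so a result of 9 with period 15 could equally mean the data sits 6 bytes the other way.

```python
def smallest_period(s):
    """Smallest p such that s[i] == s[i + p] wherever both indexes are valid."""
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i + p] for i in range(len(s) - p)):
            return p


def pattern_shift(expected, actual):
    """Return (shift, period) so that actual[i] matches the expected
    pattern at position i + shift, or None if no rotation fits."""
    p = smallest_period(expected)
    for k in range(p):
        if all(actual[i] == expected[(i + k) % p] for i in range(len(actual))):
            return k, p
    return None


# Samples copied from the first report above (pid 13386, file offset 65536).
expected = ":13386:writev*I:13386:writev*I:1"
actual = "itev*I:13386:writev*I:13386:writ"
print(pattern_shift(expected, actual))  # (9, 15)
```

A 32-byte sample only covers about two repeats of the pattern, so the inferred period is a best guess rather than a guarantee.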


Expected results:


Additional info:
Comment 1 Nate Straz 2008-09-23 15:57:27 EDT
I was able to reproduce this on a second cluster, also x86_64, running kernel-2.6.18-116.el5 and kmod-gfs-0.1.26-1.el5.

*** xdoio(pid: 18786) DATA COMPARISON ERROR /mnt/brawl/dash-01/rwrandirectsmall ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 360448
    1st 32 expected bytes:  itev*I:18786:writev*I:18786:writ
    1st 32 actual bytes:    writev*I:18786:writev*I:18786:wr
Comment 2 Robert Peterson 2008-09-23 17:13:38 EDT
Can you try the same test with older versions of the kmod-gfs package
to see when it broke?
Comment 3 Nate Straz 2008-09-23 18:44:15 EDT
I reran the tests on kmod-gfs-0.1.23-5.el5_2.2 (RHEL5.2.Z) and kmod-gfs-0.1.19-7.el5 (RHEL5.1) and I hit it in both places.  

I also tried it in /tmp (ext3) and it failed there too.

*** xdoio(pid: 24281) DATA COMPARISON ERROR /tmp/rwrevdirectlarge ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 148197376
    1st 32 expected bytes:  e*Y:24281:write*Y:24281:write*Y:
    1st 32 actual bytes:    Y:24281:write*Y:24281:write*Y:24

The corruption is definitely in the file after the test case finishes:

08d54fe0  77 72 69 74 65 2a 59 3a  32 34 32 38 31 3a 77 72  |write*Y:24281:wr|
08d54ff0  69 74 65 2a 59 3a 32 34  32 38 31 3a 77 72 69 74  |ite*Y:24281:writ|
08d55000  59 3a 32 34 32 38 31 3a  77 72 69 74 65 2a 59 3a  |Y:24281:write*Y:|
08d55010  32 34 32 38 31 3a 77 72  69 74 65 2a 59 3a 32 34  |24281:write*Y:24|
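
One thing worth checking in the dump above and in the offsets reported so far is where the corrupt regions start relative to page boundaries. The sketch below assumes 4096-byte pages; that is the usual x86_64 page size, but it is an assumption here because the report does not state it. The name `page_position` is made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: page size is not stated in this report

# "corrupt bytes starting at file offset" values copied from the reports above.
REPORTED_OFFSETS = [65536, 76136448, 127389696, 29560832, 360448, 148197376]


def page_position(offset, page_size=PAGE_SIZE):
    """Return (page index, byte offset within that page)."""
    return divmod(offset, page_size)


for off in REPORTED_OFFSETS:
    page, rem = page_position(off)
    print(f"offset {off:>10} = page {page:>6} + {rem} bytes")

# The hexdump row where the pattern breaks is the same location as the
# offset xdoio reported for /tmp/rwrevdirectlarge.
print(int("08d55000", 16) == 148197376)
```

Under that assumption every reported offset has a remainder of 0, meaning each corrupt region starts exactly on a page boundary.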

I feel like I'm going insane because the test cases haven't changed; I hit the bug using the sts-rhel5.2 bits on the node.  We've run these tests hundreds of times without hitting this before now.
Comment 4 Robert Peterson 2008-09-23 20:40:45 EDT
Well, perhaps there's a bug in the kernel at the vfs layer that was
introduced recently.  Can you boot on a fairly old kernel to see if
the test is successful there?
Comment 5 Nate Straz 2008-09-24 08:45:15 EDT
I'm moving this to the kernel since I have an easy-to-reproduce test case which fails on ext3.

xiogen -S 6392 -f direct -m random -s read,write,readv,writev -t 1b -F 1000b:rwrandirectsmall | xdoio -v -i 10s
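
For readers without xiogen and xdoio, the following is a rough stand-in for the kind of check involved, not a reimplementation of either tool. It writes a repeating pattern with writev through an O_DIRECT file descriptor, reads it back with readv, and compares. The function names, the file name, the pattern tag (modeled on the log lines, with an invented pid), and the 4096-byte block size are all assumptions. If O_DIRECT is unavailable or rejected by the file system, the sketch retries with buffered I/O.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment unit for O_DIRECT buffers and lengths


def _write_and_verify(path, tag, direct):
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC
    if direct:
        flags |= os.O_DIRECT  # AttributeError where the flag does not exist
    fd = os.open(path, flags, 0o600)
    try:
        # Anonymous mmaps are page-aligned, which O_DIRECT requires.
        out = mmap.mmap(-1, 2 * BLOCK)
        back = mmap.mmap(-1, 2 * BLOCK)
        out[:] = (tag * (2 * BLOCK // len(tag) + 1))[: 2 * BLOCK]
        ov, bv = memoryview(out), memoryview(back)
        # Two aligned segments per call, as a vectored I/O test would use.
        written = os.writev(fd, [ov[:BLOCK], ov[BLOCK:]])
        os.lseek(fd, 0, os.SEEK_SET)
        got = os.readv(fd, [bv[:BLOCK], bv[BLOCK:]])
        return written == got == 2 * BLOCK and back[:] == out[:]
    finally:
        os.close(fd)


def roundtrip_check(directory, tag=b"writev*I:12345:"):
    """Return (used_direct, data_matches) for one write/read round trip."""
    path = os.path.join(directory, "odirect_sketch.dat")  # hypothetical name
    for want_direct in (True, False):
        try:
            return want_direct, _write_and_verify(path, tag, want_direct)
        except (OSError, AttributeError):
            if not want_direct:
                raise
        finally:
            if os.path.exists(path):
                os.remove(path)


with tempfile.TemporaryDirectory() as d:
    used_direct, ok = roundtrip_check(d)
    print("O_DIRECT used:", used_direct, "data matches:", ok)
```

A single 8 KiB round trip like this is far weaker than the random, mixed read/write/readv/writev load generated by the command above, so a pass here says little about whether a given kernel is affected.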

I'm backtracking through kernels now: -110 failed, -104 looks like a pass.
Comment 6 Nate Straz 2008-09-24 09:04:06 EDT
This is definitely a regression in -110.el5, -109.el5 passes my test.
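
The narrowing from "-104 passes, -110 fails" to "-109 passes, -110 fails" is a bisection over kernel build numbers. A minimal sketch follows, assuming every build from the first bad one onward fails. The `fails` callback is hypothetical and stands in for booting that kernel and running the test case from comment #5.

```python
def first_bad_build(good, bad, fails):
    """Return the lowest failing build number between a known-good and a
    known-bad build, assuming all builds after the first bad one also fail."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if fails(mid):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return bad


# Stand-in predicate that encodes the result reported in this comment.
print(first_bad_build(104, 110, lambda build: build >= 110))  # 110
```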
Comment 7 Nate Straz 2008-09-24 11:39:03 EDT
Eric Sandeen built a kernel for me to try, a -110 kernel without the following patch:

- [mm] keep pagefault from happening under page lock (Josef Bacik) [445433]

This kernel passes my test case from comment #5.

I also looked back through my test logs and found that this test ran fine on my i686 nodes with a -115-based kernel.  I think this is an x86_64-only bug.
Comment 8 Nate Straz 2008-09-24 15:14:21 EDT

*** This bug has been marked as a duplicate of bug 463134 ***
