Bug 81978 - NFS data corruption
Summary: NFS data corruption
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-01-15 23:02 UTC by Ben Woodard
Modified: 2007-04-18 16:50 UTC
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-03-01 15:50:22 UTC
Embargoed:


Attachments
program to stress nfs (24.99 KB, text/plain)
2003-01-15 23:03 UTC, Ben Woodard
no flags Details
shell script to run test (635 bytes, text/plain)
2003-01-15 23:04 UTC, Ben Woodard
no flags Details
correct NFS data (24.05 KB, text/plain)
2003-01-15 23:14 UTC, Ben Woodard
no flags Details
corrupted nfs data (24.05 KB, text/plain)
2003-01-15 23:14 UTC, Ben Woodard
no flags Details
diff between the two data sets (1.58 KB, text/plain)
2003-01-15 23:15 UTC, Ben Woodard
no flags Details
list of NFS operations that led to the file getting corrupted (86.48 KB, text/plain)
2003-01-15 23:16 UTC, Ben Woodard
no flags Details
Revised fsx (23.73 KB, text/plain)
2003-01-23 03:19 UTC, Steve Dickson
no flags Details
A patch to stop data corruption when using the fsx test suite (2.87 KB, patch)
2003-02-14 16:14 UTC, Steve Dickson
no flags Details | Diff
an update to the previous patch that works in an SMP env. (4.63 KB, patch)
2003-02-18 00:48 UTC, Steve Dickson
no flags Details | Diff

Description Ben Woodard 2003-01-15 23:02:37 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
Details are still being filled in.

Basically, we are having a problem with NFS here at LLNL. It ends up corrupting data.
We are still trying to figure out exactly under what circumstances the problem
arises.

However, we have been able to come up with at least two artificial tests where
the NFS client cache falls out of sync with the server. We have yet to be able
to actually reproduce the problem that the user is seeing.



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1.compile the fsx program
2.edit runfsx to point to some nfs directory somewhere
3.run the runfsx shell script
    

Actual Results:  It turns up bugs in NFS

Expected Results:  run should complete with no errors

Additional info:

Comment 1 Ben Woodard 2003-01-15 23:03:57 UTC
Created attachment 89387 [details]
program to stress nfs

Here is the C code to a program that we picked up off the net which is designed
to trigger errors in NFS. We selected this program because we felt that it was
likely to be able to reproduce the problem we are seeing.

Comment 2 Ben Woodard 2003-01-15 23:04:52 UTC
Created attachment 89388 [details]
shell script to run test

shell script that runs the fsx program

Comment 3 Ben Woodard 2003-01-15 23:05:59 UTC
Comment on attachment 89388 [details]
shell script to run test

 Here is a script that we used to reproduce the problem. You will have to
change some paths in this script to run it, but you will get the sense of
what it is doing.

The first test which is commented out is the first bug that we saw. Basically,
this bug shows up from time to time as lseek and lstat returning incorrect
values after a truncate.

The rest of the tests which are uncommented actually produce data corruption
from the client's point of view. This is a very serious issue for us.

Comment 4 Ben Woodard 2003-01-15 23:13:03 UTC
 The kernel is based upon 2.4.18-17 or 2.4.18-18. We see the problem in both of
them. The changes are as follows:
1) Quadrics high-speed interconnect device driver added
2) several unneeded config options turned off (e.g. sound cards)
3) newer MTD driver
4) newer ECC module
5) Lustre file system added
6) mcore crash dump support added

Comment 5 Ben Woodard 2003-01-15 23:14:20 UTC
Created attachment 89389 [details]
correct NFS data

Comment 6 Ben Woodard 2003-01-15 23:14:46 UTC
Created attachment 89390 [details]
corrupted nfs data

Comment 7 Ben Woodard 2003-01-15 23:15:19 UTC
Created attachment 89391 [details]
diff between the two data sets

Comment 8 Ben Woodard 2003-01-15 23:16:21 UTC
Created attachment 89392 [details]
list of NFS operations that led to the file getting corrupted

Comment 9 Ben Woodard 2003-01-15 23:16:55 UTC
 We have yet to verify if the bug that this reproduces is the exact same one
that is seen by the user. Our user's problem is that the client cache seems
corrupted. i.e. if you look at the file from another machine it is fine.
However, on the original node the file has some bad data in it.

Touching the file on the server fixes the problem because it invalidates the
client's cache of the file. The user's usage behavior as best as we can
determine is that she has 200,000 different files on the NFS servers, which are
BlueArcs. Then she reads these (no writes) with about 1000 machines almost
simultaneously and on a small percentage of hosts she runs into corruption of
the client's cache. We sort of think that this may be a server problem because
we can't think of any other way that the erroneous data can get into the page
cache. The files are not terribly big, about 50MB each.

Comment 10 Norm Murray 2003-01-16 15:02:31 UTC
Adding issue tracker to the cc list. 

Comment 11 Steve Dickson 2003-01-17 02:30:59 UTC
Does this corruption happen with v2, v3 or both?
What's the transport, UDP or TCP?

What (if any) are the mounting options that are being used?

Does this corruption happen with only a BlueArc server?
Meaning, do you see this corruption with a RH 7.3 or RH 8.0 server?

Comment 12 Ben Woodard 2003-01-17 20:13:44 UTC
V3 in production with UDP. It looks like one of the things that we didn't
properly control for is the fact that on the linux server where we saw the
problem we were running V2.

For two of the servers that we are seeing problems with:

Bluearc:
ba33:/vol0 on /mnt/ba2 type nfs
(rw,rsize=8192,wsize=8192,intr,nfsvers=3,noac,addr=134.9.39.177)

Linux:
microsoft:/exports/linux.home on /home type nfs
(rw,rsize=16384,wsize=16384,intr,addr=134.9.36.5)

I'm not sure which version of linux the nfs server microsoft is running. I'm in
the process of finding that out.

Comment 13 Ben Woodard 2003-01-17 21:58:09 UTC
Just double checked it with NFS v3 between two 7.3 based linux nodes:

mdev22:/tmp/ben on /mnt/ben type nfs
(rw,rsize=16384,wsize=16384,intr,nfsvers=3,addr=134.9.98.153)

This is the test that seems to be causing the most problems.

for i in `seq -w 1 100`
do
    ./fsx -q -n -c10 -l16234 -N100000 -p1000 -S1 /mnt/ben/nfstest/nfstest3$i \
        > /home/ben/nfstest/out3.$i 2>&1 &
done

Tell me if you would like the logs for these runs.

Comment 14 Ben Woodard 2003-01-22 15:56:51 UTC
Reproduced exactly the same problem on a stock 2.4.18-19.7.x kernel on UP
machines. This indicates that it is not a race condition and it is not related
to any kernel changes we have made locally.

Comment 15 Steve Dickson 2003-01-22 17:16:10 UTC
How long does it take before you see the corruption? I have
let these tests run for over 12 hours and not seen any problems.
I was using 2.4.18-17.7.x kernel on the client and a stock
8.0 (2.4.18-14) as the server.

Comment 16 Ben Woodard 2003-01-22 17:57:34 UTC
Just minutes. We just tried it and we had one failure that popped up in about 5
minutes. The faster the connection the more failures and the faster we see the
failures. When we first tried to reproduce it we did it between two quadrics
connected nodes.  That gave us on the order of 180MB/s (not Mb/s) bandwidth.



Comment 17 Ben Woodard 2003-01-22 18:09:00 UTC
Correction. The problem is seen with 2.4.18-18 and 2.4.18-19 not 2.4.18-17 and
2.4.18-19. That was a thinko on my part.

Comment 18 Ben Woodard 2003-01-22 18:38:07 UTC
> Ben,
> 
> How do you tell when there is corruption? Do the tests stop?

Yes, individual tests stop.
I'll put the description in the bug report.
The way we check it is to look in the directory with the output files.

I usually do a:

watch "ls -lS | head" 

The files which have the problem are much longer than the others. 

1003 [ben@xenophanes nfstest.out]$ ls -lS | head
total 2784
-rw-rw-r--    1 ben      ben         66768 Jan 22 10:22 out3.003
-rw-rw-r--    1 ben      ben         66609 Jan 22 10:20 out3.052
-rw-rw-r--    1 ben      ben         66145 Jan 22 10:18 out3.024
-rw-rw-r--    1 ben      ben         64376 Jan 22 10:18 out3.074
-rw-rw-r--    1 ben      ben         35988 Jan 22 10:15 out2.021
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.024
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.037
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.061
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.069

See how the first 5 files are much longer than the others.


1005 [ben@xenophanes nfstest.out]$ tail out3.003
7414(246 mod 256): MAPWRITE 0xd87 thru 0x3f69   (0x31e3 bytes)  ******WWWW
7415(247 mod 256): MAPWRITE 0x2a0a thru 0x3f69  (0x1560 bytes)  ******WWWW
7416(248 mod 256): READ 0x27e0 thru 0x3f69      (0x178a bytes)  ***RRRR***
7417(249 mod 256): WRITE        0x27f5 thru 0x3f69      (0x1775 bytes)  ***WWWW
7418(250 mod 256): WRITE        0x157b thru 0x3f69      (0x29ef bytes)  ***WWWW
7419(251 mod 256): TRUNCATE DOWN        from 0x3f6a to 0x2fa7   ******WWWW
7420(252 mod 256): WRITE        0x37fe thru 0x3f69      (0x76c bytes) HOLE      ***WWWW
7421(253 mod 256): MAPREAD      0x2238 thru 0x3f69      (0x1d32 bytes)  ***RRRR***
Correct content saved for comparison
(maybe hexdump "/mnt/nfstest/nfstest3003" vs "/mnt/nfstest/nfstest3003.fsxgood")

I just reproduced this problem over a loopback mount. That may speed up
the time it takes to demonstrate the problem.

Comment 19 Steve Dickson 2003-01-22 19:18:06 UTC
Do you always need to start up 100 processes for 
it to occur? 

Comment 20 Ben Woodard 2003-01-22 21:04:35 UTC
Tried it with tcp mount option and the problem still occurs.

Comment 21 Ben Woodard 2003-01-22 21:18:37 UTC
I am able to reproduce the problem faster with this much smaller script. This is
essentially an excerpt from the original test script which only executes the 3rd
stanza. In testing, I discovered that the 3rd stanza is the one that fails most
frequently. However, stanzas 2 and 4 both fail as well, just much less often.

#!/bin/bash -x

for i in `seq -w 1 100`
do
    ./fsx -q -n -c10 -l16234 -N100000 -p1000 -S1 /mnt/nfstest/nfstest3$i \
        > /tmp/test/nfstest.out/out3.$i 2>&1 &
done

Comment 22 Ben Woodard 2003-01-22 21:37:43 UTC
The problem doesn't seem to happen when I run the fsx's sequentially -- only
when I run them simultaneously.

Comment 23 Steve Dickson 2003-01-23 03:19:47 UTC
Created attachment 89538 [details]
Revised fsx 

Please try this revised fsx program to see if the corruption
still occurs. I have eliminated some of the system calls to
try and isolate the problem.

Comment 24 Ben Woodard 2003-01-23 19:10:08 UTC
The new version of fsx still causes the problem.

Comment 25 Steve Dickson 2003-01-23 19:17:15 UTC
Please send me the list of operations that cause the problem.

Comment 26 Ben Woodard 2003-01-23 23:02:46 UTC
Discovered a dumb typo in my reproducer script that meant I was not actually
running the new fsx.

After I fixed this problem and was actually running the new fsx, I didn't
have any problems.

I can also get this same behavior by running the old fsx with -W -R

However, if the problem we are seeing here is limited to only mmap operations, as
it seems, then here at LLNL we may not have come up with a reproducer which
recreates the problem that the user is seeing. I.e., we found this problem when we
were trying to reproduce the user's problem; however, it may be a separate
problem. I do not think that the user is using mmap for NFS files but I
have to do more work to prove that. Multi-machine MPI jobs written in Fortran
can do some strange things.

Comment 27 Steve Dickson 2003-01-23 23:15:20 UTC
Please try the kernel in http://people.redhat.com/steved/.bug81978

Comment 28 Ben Woodard 2003-01-23 23:25:11 UTC
Here is a set of operations that causes the problem.

504(248 mod 256): MAPWRITE 0x3725 thru 0x376f	(0x4b bytes)	******WWWW
		CLOSE/OPEN
<snip>
512(0 mod 256): TRUNCATE DOWN	from 0x3770 to 0x2bac	******WWWW
		CLOSE/OPEN
<snip>
514(2 mod 256): TRUNCATE UP	from 0x2bac to 0x3967	******WWWW
		CLOSE/OPEN
515(3 mod 256): READ	0x370e thru 0x3732	(0x25 bytes)	***RRRR***
		CLOSE/OPEN

In this case what you see is that the data from the mapwrite appears in the file even
after the file has been truncated down, when it should have been removed.
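
What correct behavior would look like: after the truncate down to 0x2bac everything
above that offset is gone, so when the file is later extended the regrown region must
read back as zeros. Below is a minimal user-space sketch of this exact op sequence;
the path is made up, the offsets are taken from the log above, and error handling is
kept to a minimum.

/* Sketch of the MAPWRITE / TRUNCATE DOWN / TRUNCATE UP / READ sequence
 * above. The path is made up for illustration. On a correct NFS client
 * the final read must return zeros; the reported corruption shows the
 * old MAPWRITE data surviving the truncate instead. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfstest/demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    ftruncate(fd, 0x3770);                      /* give the file its initial size */

    char *map = mmap(NULL, 0x3770, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    memset(map + 0x3725, 0xAA, 0x4b);           /* MAPWRITE 0x3725 thru 0x376f    */
    munmap(map, 0x3770);

    ftruncate(fd, 0x2bac);                      /* TRUNCATE DOWN to 0x2bac        */
    ftruncate(fd, 0x3967);                      /* TRUNCATE UP   to 0x3967        */

    char buf[0x25];
    pread(fd, buf, sizeof(buf), 0x370e);        /* READ 0x370e thru 0x3732        */
    for (size_t i = 0; i < sizeof(buf); i++)
        if (buf[i] != 0)                        /* regrown region must be zeros   */
            printf("stale byte 0x%02x at offset 0x%zx\n",
                   (unsigned char)buf[i], (size_t)(0x370e + i));

    close(fd);
    return 0;
}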


Comment 29 Ben Woodard 2003-01-24 00:59:15 UTC
Here is another operation summary:

7418(250 mod 256): WRITE	0x157b thru 0x3f69	(0x29ef bytes)	***WWWW
7419(251 mod 256): TRUNCATE DOWN	from 0x3f6a to 0x2fa7	******WWWW
7420(252 mod 256): WRITE	0x37fe thru 0x3f69	(0x76c bytes) HOLE	***WWWW
7421(253 mod 256): MAPREAD	0x2238 thru 0x3f69	(0x1d32 bytes)	***RRRR***

This one is interesting in that the corruption is between 0x3000 and 0x3f7e; it does
not start right at the truncation point of 0x2fa7. It appears that the data between
0x2fa7 and 0x3000 is correct. So the corruption seems to be limited to the data above
a page boundary.

Comment 30 Ben Woodard 2003-01-24 02:26:23 UTC
I'm having some trouble getting that new kernel to boot.

First of all it needed a new version of mkinitrd and modutils. We are on 7.3
here and even the ones from 8.0 were not new enough and so I rebuilt the source
RPMs from Phoebe and installed them. This uncovered a bug in mkinitrd. I fixed
that and submitted the patch.

In the end I needed to install:
modutils-2.4.22-3.i386.rpm
mkinitrd-3.4.35-1.i386.rpm
dietlibc-0.21-2.i386.rpm

Then I was able to get the kernel RPM to install. Unfortunately it will not
boot. The error message is:

Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 156k freed
VFS: Mounted root (ext2 filesystem).
Red Hat nash version 3.4.35 starting
(the kernel message "ide: no cache flush required." repeats throughout the nash output below)
Loading jbd.o.gz module
ERROR: failed in exec of /bin/insmod
Loading ext3.o.gz module
ERROR: failed in exec of /bin/insmod
Mounting /proc filesystem
Creating block devices
Creating root device
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
ERROR: /bin/insmod exited abnormally!
Mounting /proc filesystem
mount: error 16 mounting proc
Creating block devices
Creating root device
mount: cannot create device /dev/root (3,5)
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
ERROR: /bin/insmod exited abnormally!
Loading ext3.o.gz module
ERROR: failed in exec of /bin/insmod
Mounting /proc filesystem
mount: error 16 mounting proc
Creating block devices
Creating root device
mount: cannot create device /dev/root (3,5)
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
ERROR: /bin/insmod exited abnormally!
Mounting /proc filesystem
mount: error 16 mounting proc
Creating block devices
Creating root device
mount: cannot create device /dev/root (3,5)
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
Freeing unused kernel memory: 196k freed
Kernel panic: No init found.  Try passing init= option to kernel.
 
I'm not sure what is causing this. It looks like it cannot move over from the
ram disk to the main kernel. I'm going to try to rebuild the kernel RPM to fix
the problem. I suspect that the problem may be one of binary compatibility
between the items that are in the initrd and the ones on disk.

Comment 31 Ben Woodard 2003-01-24 03:01:40 UTC
It seems like the src rpm has disappeared and so that aborts my plan of trying to
rebuild it.

Comment 32 Steve Dickson 2003-01-24 12:57:17 UTC
src rpm is back, but now that we know it has something
to do with mmap io it's not clear how fruitful this exercise
will be. Plus I'm *thinking* you'll need to install rh8.0 to
get this kernel up and running....

Comment 33 Steve Dickson 2003-01-24 13:15:58 UTC
Now that we know the corruption has something
to do with mmap io I would like to take it
a step further and find out if it has to do
with *just* mmap io or mmap io interaction
with other filesystem ops.

So I would like the following tests run (and
will be running them) to try and isolate
the problem further.

1) Run the tests with *just* mmap io. This
should tell us if it is a straight mmap io
issue.

2) Run the tests with mmap io and *only* truncation
tests.

3) run the tests with mmap io and *only* normal
reads and writes.

I suspect that tests 1 and 3 will run just fine
and test 2 will show the corruption.


Comment 34 Steve Dickson 2003-01-28 11:09:54 UTC
Over the weekend I was finally able to consistently
reproduce this corruption on my machine at home.
This allowed me to (I believe) figure out what is
happening although I don't have a fix at this point.

It seems the corruption occurs when the file has
been extended but not written to. The following scenario
seems to be prevalent throughout most of the test runs:

    create a file.
    write data to the file.
    ftruncate the file to some random size.
    mmap the file to extend it beyond its current size.
    mmapread (i.e. memcpy) from the new extended part of the file.

The corruption seems to occur with the reading of the unwritten
part of the file. The scenario can deviate somewhat, for example:

    ftruncate the file down to a random size
    ftruncate the file up to a large size
    mmapread (i.e. memcpy) from the new extended part of file.

but the corruption seems to always occur when the process reads
the unwritten part of the file.
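
Below is a minimal user-space sketch of that second variation; the path and sizes are
made up, and on a correct client every byte of the regrown region should read back as
zero through the mapping.

/* Sketch of the write / truncate-down / truncate-up / mmapread scenario
 * described above. Path and sizes are made up for illustration. The
 * reported corruption shows stale data above the page boundary where
 * zeros are expected. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfstest/demo2", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char data[0x3f6a];
    memset(data, 0x55, sizeof(data));
    if (write(fd, data, sizeof(data)) != (ssize_t)sizeof(data)) {
        perror("write"); return 1;          /* fill the file with non-zero data */
    }

    ftruncate(fd, 0x2fa7);                  /* ftruncate down to a random size  */
    ftruncate(fd, 0x3f6a);                  /* ftruncate up to a larger size    */

    char *map = mmap(NULL, 0x3f6a, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t off = 0x2fa7; off < 0x3f6a; off++)   /* mmapread of regrown part */
        if (map[off] != 0)
            printf("stale byte 0x%02x at offset 0x%zx\n",
                   (unsigned char)map[off], off);

    munmap(map, 0x3f6a);
    close(fd);
    return 0;
}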


Comment 35 Steve Dickson 2003-02-14 16:14:29 UTC
Created attachment 90089 [details]
A patch to stop data corruption when using the fsx test suite

The Cause: memory-mapped pages were not being flushed
out in a timely manner. When the size of the file was about to change,
nfs_writepage() was called by filemap_fdatasync() to flush
out dirty pages. This was done asynchronously, which meant
nfs_writepage() would indirectly call nfs_strategy(). nfs_strategy()
tries to send a group of pages (in this case 4 pages at a time), so it
did *not* flush out the page (a bad strategy in this case). The page
would eventually be flushed by kupdate, but by that time it was too late.

The Solution: When a file is going to be truncated down, synchronously
flush out the mmapped page. I used a (surprisingly) unused NFS_INO_FLUSH
nfs_inode flag to tell nfs_writepage() to synchronously write out the page.
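
A compilable toy sketch of that control flow follows. It is illustrative only, not the
attached patch and not real 2.4 kernel code: only nfs_writepage(), filemap_fdatasync(),
nfs_strategy(), kupdate, and the NFS_INO_FLUSH flag come from the description above;
every other name and value is a stand-in.

/* Toy sketch of the fix's control flow -- NOT the attached patch.
 * NFS_INO_FLUSH value and all helper names here are placeholders. */
#include <stdio.h>

#define NFS_INO_FLUSH 0x0008                 /* placeholder value            */

struct inode_sketch { unsigned long flags; };

static int write_page_synchronously(void)
{
    puts("synchronous WRITE sent to the server");   /* new behavior          */
    return 0;
}

static int queue_for_nfs_strategy(void)
{
    puts("batched (~4 pages); flushed later, possibly by kupdate"); /* old    */
    return 0;
}

/* The decision the patch changes inside nfs_writepage(): with the flag
 * set, flush the page now instead of handing it to nfs_strategy(). */
static int writepage_sketch(struct inode_sketch *ni)
{
    if (ni->flags & NFS_INO_FLUSH)
        return write_page_synchronously();
    return queue_for_nfs_strategy();
}

int main(void)
{
    struct inode_sketch ni = { 0 };

    /* Truncate-down path: mark the inode before the filemap_fdatasync()
     * pass pushes the dirty mmapped pages, so each writepage is sync. */
    ni.flags |= NFS_INO_FLUSH;
    writepage_sketch(&ni);
    return 0;
}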

Comment 36 Steve Dickson 2003-02-18 00:48:17 UTC
Created attachment 90140 [details]
an update to the previous patch that works in an SMP env.

Comment 37 Dave Maley 2004-03-01 15:50:22 UTC
LLNL has reported that this issue has been resolved
- steved has verified 

closing BZ

