From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Description of problem:
Details are still being filled in. Basically we are having a problem with NFS here at LLNL: it ends up corrupting data. We are still trying to figure out exactly under what circumstances the problem arises. However, we have been able to come up with at least two artificial tests where the NFS client cache falls out of sync with the server. We have not yet been able to reproduce the problem that the user is seeing.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Compile the fsx program.
2. Edit runfsx to point to some NFS directory.
3. Run the runfsx shell script.

Actual Results:  The run turns up bugs in NFS.

Expected Results:  The run should complete with no errors.

Additional info:
Created attachment 89387 [details]
program to stress nfs

Here is the C code for a program that we picked up off the net which is designed to trigger errors in NFS. We selected this program because we felt it was likely to reproduce the problem we are seeing.
Created attachment 89388 [details]
shell script to run test

Shell script that runs the fsx program.
Comment on attachment 89388 [details]
shell script to run test

Here is the script that we used to reproduce the problem. You will have to change some paths in it before running it, but you will get the sense of what it is doing. The first test, which is commented out, reproduces the first bug that we saw: from time to time, lseek and lstat return incorrect values after a truncate. The remaining tests, which are uncommented, actually produce data corruption from the client's point of view. This is a very serious issue for us.
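To make that first symptom concrete, here is a minimal sketch (mine, not the attached script; the path and sizes are made up) of the kind of check that fails when lseek/lstat report a stale size after a truncate:

/* Minimal sketch of the lseek/lstat symptom: after ftruncate(),
 * lseek(SEEK_END) and lstat() on an NFS file occasionally report
 * a stale size.  Path and sizes are hypothetical. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/mnt/nfstest/trunc-check";  /* adjust for your mount */
    const off_t newsize = 12345;
    char buf[65536] = {0};
    struct stat st;
    off_t end;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) { perror("write"); return 1; }
    if (ftruncate(fd, newsize) != 0) { perror("ftruncate"); return 1; }

    end = lseek(fd, 0, SEEK_END);            /* should report newsize */
    if (lstat(path, &st) != 0) { perror("lstat"); return 1; }

    if (end != newsize || st.st_size != newsize)
        fprintf(stderr, "stale size: lseek=%ld lstat=%ld expected=%ld\n",
                (long)end, (long)st.st_size, (long)newsize);
    close(fd);
    return 0;
}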
The kernel is based upon 2.4.18-17 or 2.4.18-18; we see the problem in both of them. The changes are as follows:
1) Quadrics driver added (high-speed interconnect device driver)
2) several unneeded config options turned off (e.g. sound cards)
3) newer MTD driver
4) newer ECC module
5) Lustre file system added
6) mcore crash dump support added
Created attachment 89389 [details] correct NFS data
Created attachment 89390 [details] corrupted nfs data
Created attachment 89391 [details] diff between the two data sets
Created attachment 89392 [details] list of NFS operations that led to the file getting corrupted
We have yet to verify that the bug this reproduces is the exact same one seen by the user. Our user's problem is that the client cache appears corrupted; i.e., if you look at the file from another machine it is fine, but on the original node the file has some bad data in it. Touching the file on the server fixes the problem because it invalidates the client's cache of the file. The user's usage pattern, as best we can determine, is that she has 200,000 different files on the NFS servers, which are BlueArcs. She then reads these files (no writes) from about 1000 machines almost simultaneously, and on a small percentage of hosts she runs into corruption of the client's cache. We suspect this may be a server problem because we can't think of any other way the erroneous data could get into the page cache. The files are not terribly big, about 50MB each.
Adding issue tracker to the cc list.
Does this corruption happen with v2, v3, or both? What's the transport, UDP or TCP? What (if any) mount options are being used? Does this corruption happen only with a BlueArc server? Meaning, do you see this corruption with a RH7.3 or RH8.0 server?
V3 in production with UDP. It looks like one of the things that we didn't properly control for is the fact that on the Linux server where we saw the problem we were running V2.

For two of the servers where we are seeing problems:

BlueArc:
ba33:/vol0 on /mnt/ba2 type nfs (rw,rsize=8192,wsize=8192,intr,nfsvers=3,noac,addr=134.9.39.177)

Linux:
microsoft:/exports/linux.home on /home type nfs (rw,rsize=16384,wsize=16384,intr,addr=134.9.36.5)

I'm not sure which version of Linux the NFS server microsoft is running. I'm in the process of finding that out.
Just double-checked it with NFS v3 between two 7.3-based Linux nodes:

mdev22:/tmp/ben on /mnt/ben type nfs (rw,rsize=16384,wsize=16384,intr,nfsvers=3,addr=134.9.98.153)

This is the test that seems to be causing the most problems:

for i in `seq -w 1 100`
do
  ./fsx -q -n -c10 -l16234 -N100000 -p1000 -S1 /mnt/ben/nfstest/nfstest3$i > /home/ben/nfstest/out3.$i 2>&1 &
done

Tell me if you would like the logs for these runs.
Reproduced exactly the same problem on a stock 2.4.18-19.7.x kernel on UP machines. This indicates that it is not an SMP race condition and that it is not related to any kernel changes we have made locally.
How long does it take before you see the corruption? I have let these tests run for over 12 hours and not seen any problems. I was using 2.4.18-17.7.x kernel on the client and a stock 8.0 (2.4.18-14) as the server.
Just minutes. We just tried it and had one failure pop up in about 5 minutes. The faster the connection, the more failures and the faster we see them. When we first tried to reproduce it, we did it between two Quadrics-connected nodes, which gave us on the order of 180MB/s (not Mb/s) of bandwidth.
Correction: the problem is seen with 2.4.18-18 and 2.4.18-19, not 2.4.18-17 and 2.4.18-18. That was a thinko on my part.
> Ben,
>
> How do you tell when there is corruption? Do the tests stop?

Yes, individual tests stop. I'll put the description in the bug report.

The way we check is to look in the directory with the output files. I usually do a:

watch "ls -lS | head"

The files which have the problem are much longer than the others.

1003 [ben@xenophanes nfstest.out]$ ls -lS | head
total 2784
-rw-rw-r--    1 ben      ben         66768 Jan 22 10:22 out3.003
-rw-rw-r--    1 ben      ben         66609 Jan 22 10:20 out3.052
-rw-rw-r--    1 ben      ben         66145 Jan 22 10:18 out3.024
-rw-rw-r--    1 ben      ben         64376 Jan 22 10:18 out3.074
-rw-rw-r--    1 ben      ben         35988 Jan 22 10:15 out2.021
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.024
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.037
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.061
-rw-rw-r--    1 ben      ben           402 Jan 22 10:24 out4.069

See how the first 5 files are much longer than the others.

1005 [ben@xenophanes nfstest.out]$ tail out3.003
7414(246 mod 256): MAPWRITE 0xd87 thru 0x3f69 (0x31e3 bytes) ******WWWW
7415(247 mod 256): MAPWRITE 0x2a0a thru 0x3f69 (0x1560 bytes) ******WWWW
7416(248 mod 256): READ 0x27e0 thru 0x3f69 (0x178a bytes) ***RRRR***
7417(249 mod 256): WRITE 0x27f5 thru 0x3f69 (0x1775 bytes) ***WWWW
7418(250 mod 256): WRITE 0x157b thru 0x3f69 (0x29ef bytes) ***WWWW
7419(251 mod 256): TRUNCATE DOWN from 0x3f6a to 0x2fa7 ******WWWW
7420(252 mod 256): WRITE 0x37fe thru 0x3f69 (0x76c bytes) HOLE ***WWWW
7421(253 mod 256): MAPREAD 0x2238 thru 0x3f69 (0x1d32 bytes) ***RRRR***
Correct content saved for comparison
(maybe hexdump "/mnt/nfstest/nfstest3003" vs "/mnt/nfstest/nfstest3003.fsxgood")

I just reproduced this problem over a loopback mount. That may speed up the time it takes to demonstrate the problem.
Do you always need to start up 100 processes for it to occur?
Tried it with tcp mount option and the problem still occurs.
I am able to reproduce the problem faster with this much smaller script. It is essentially an excerpt from the original test script which only executes the 3rd stanza. In testing, I discovered that the 3rd stanza is the one that fails most frequently. However, stanza 2 and stanza 4 both fail as well, just much less often.

#!/bin/bash -x
for i in `seq -w 1 100`
do
  ./fsx -q -n -c10 -l16234 -N100000 -p1000 -S1 /mnt/nfstest/nfstest3$i > /tmp/test/nfstest.out/out3.$i 2>&1 &
done
The problem doesn't seem to happen when I run the fsx processes sequentially -- only when I run them simultaneously.
Created attachment 89538 [details]
Revised fsx

Please try this revised fsx program to see if the corruption still occurs. I have eliminated some of the system calls to try to isolate the problem.
The new version of fsx still causes the problem.
Please send me the list of ops that cause the problem.
Discovered a dumb typo in the script I used to reproduce the failure: it meant I was not actually running the new fsx. After I fixed the typo and was actually running the new fsx, I didn't see any problems. I can get the same (clean) behavior by running the old fsx with -W -R, which disable the mapped write and mapped read operations.

However, if the problem we are seeing here is limited to mmap operations, as it seems to be, then we here at LLNL may not have come up with a reproducer for the problem the user is seeing; i.e., we found this problem while trying to reproduce the user's problem, but it may be a separate problem. I do not think that the user is using mmap on NFS files, but I have to do more work to prove that. Multi-machine MPI jobs written in Fortran can do some strange things.
Please try the kernel in http://people.redhat.com/steved/.bug81978
Here is a set of operations that causes the problem:

504(248 mod 256): MAPWRITE 0x3725 thru 0x376f (0x4b bytes) ******WWWW
        CLOSE/OPEN
<snip>
512(0 mod 256): TRUNCATE DOWN from 0x3770 to 0x2bac ******WWWW
        CLOSE/OPEN
<snip>
514(2 mod 256): TRUNCATE UP from 0x2bac to 0x3967 ******WWWW
        CLOSE/OPEN
515(3 mod 256): READ 0x370e thru 0x3732 (0x25 bytes) ***RRRR***
        CLOSE/OPEN

In this case what you see is that the data from the MAPWRITE still appears in the file even after the file has been truncated down past it, which should have removed it; the subsequent TRUNCATE UP should then have zero-filled that range, so the READ at 0x370e-0x3732 should have seen zeros.
Here is another operation summary:

7418(250 mod 256): WRITE 0x157b thru 0x3f69 (0x29ef bytes) ***WWWW
7419(251 mod 256): TRUNCATE DOWN from 0x3f6a to 0x2fa7 ******WWWW
7420(252 mod 256): WRITE 0x37fe thru 0x3f69 (0x76c bytes) HOLE ***WWWW
7421(253 mod 256): MAPREAD 0x2238 thru 0x3f69 (0x1d32 bytes) ***RRRR***

This one is interesting in that the corruption is between 0x3000 and 0x3f7e; it does not start at the truncate point, 0x2fa7. The data between 0x2fa7 and 0x3000 appears to be correct, so the corruption seems to be limited to the data above a page boundary (0x3000 is the first 4KB page boundary above the truncate point).
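A quick check of that boundary arithmetic (a trivial sketch; 4KB pages assumed, as on i686):

/* With 4KB pages, the first page boundary at or above the truncate
 * point 0x2fa7 is 0x3000 -- exactly where the corruption starts. */
#include <stdio.h>

int main(void)
{
    const unsigned long page_size = 0x1000;   /* 4KB pages, assumed */
    const unsigned long trunc_to  = 0x2fa7;   /* from the log above */
    unsigned long boundary = (trunc_to + page_size - 1) & ~(page_size - 1);

    printf("first page boundary above 0x%lx is 0x%lx\n", trunc_to, boundary);
    return 0;   /* prints 0x3000 */
}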
I'm having some trouble getting that new kernel to boot. First of all, it needed newer versions of mkinitrd and modutils. We are on 7.3 here and even the ones from 8.0 were not new enough, so I rebuilt the source RPMs from Phoebe and installed them. This uncovered a bug in mkinitrd; I fixed that and submitted the patch. In the end I needed to install:

modutils-2.4.22-3.i386.rpm
mkinitrd-3.4.35-1.i386.rpm
dietlibc-0.21-2.i386.rpm

Then I was able to get the kernel RPM to install. Unfortunately it will not boot. The error output follows; note that "ide: no cache flush required." was also being printed repeatedly, interleaved with the boot messages:

Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 156k freed
VFS: Mounted root (ext2 filesystem).
Red Hat nash version 3.4.35 starting
Loading jbd.o.gz module
ERROR: failed in exec of /bin/insmod
Loading ext3.o.gz module
ERROR: failed in exec of /bin/insmod
Mounting /proc filesystem
mount: error 16 mounting proc
Creating block devices
Creating root device
mount: cannot create device /dev/root (3,5)
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
ERROR: /bin/insmod exited abnormally!
Loading ext3.o.gz module
ERROR: failed in exec of /bin/insmod
Mounting /proc filesystem
mount: error 16 mounting proc
Creating block devices
Creating root device
mount: cannot create device /dev/root (3,5)
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
ERROR: /bin/insmod exited abnormally!
Mounting /proc filesystem
mount: error 16 mounting proc
Creating block devices
Creating root device
mount: cannot create device /dev/root (3,5)
Mounting root filesystem
mount: error 19 mounting ext3
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
umount /initrd/proc failed: 2
Freeing unused kernel memory: 196k freed
Kernel panic: No init found. Try passing init= option to kernel.

I'm not sure what is causing this. It looks like it cannot move over from the ram disk to the real root filesystem. I'm going to try to rebuild the kernel RPM to fix the problem. I suspect the problem may be one of binary compatibility between the items in the initrd and the ones on disk.
It seems the src rpm has disappeared, which aborts my plan of trying to rebuild it.
The src rpm is back, but now that we know it has something to do with mmap I/O it's not clear how fruitful this exercise will be. Plus I'm *thinking* you'll need to install RH8.0 to get this kernel up and running....
Now that we know the corruption has something to do with mmap I/O, I would like to take it a step further and find out whether it is *just* mmap I/O or mmap I/O interacting with other filesystem ops. So I would like (and will be running) the following tests to try to isolate the problem further:

1) Run the tests with *just* mmap I/O. This should tell us if it is a straight mmap I/O issue.
2) Run the tests with mmap I/O and *only* truncation operations.
3) Run the tests with mmap I/O and *only* normal reads and writes.

I suspect that tests 1 and 3 will run just fine and test 2 will show the corruption.
Over the weekend I was finally able to consistently reproduce this corruption on my machine at home. This allowed me (I believe) to figure out what is happening, although I don't have a fix at this point.

The corruption occurs when the file has been extended but not written to. The following scenario is prevalent throughout most of the test runs:

create a file
write data to the file
ftruncate the file down to some random size
mmap the file, extending it beyond its current size
mmapread (i.e. memcpy) from the newly extended part of the file

The corruption occurs on the read of the unwritten part of the file. The scenario can deviate somewhat, e.g.:

ftruncate the file down to a random size
ftruncate the file up to a larger size
mmapread (i.e. memcpy) from the newly extended part of the file

but the corruption always seems to occur when the process reads the unwritten part of the file.
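Here is a userspace sketch of that second scenario (my own minimal distillation, not fsx itself; the path is hypothetical and the offsets are borrowed from the logs above). After the truncate down and back up, every byte in the extended region should read back as zero; the bug shows up as stale data there:

/* Minimal distillation of the scenario above (not fsx itself).
 * POSIX says a region extended by ftruncate() reads back as zeros;
 * on the affected NFS client, stale data appears there instead. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

int main(void)
{
    const char *path = "/mnt/nfstest/scenario";   /* adjust for your mount */
    char buf[0x4000];
    memset(buf, 0xaa, sizeof(buf));               /* recognizable pattern */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))  /* 1: write data */
        { perror("write"); return 1; }
    if (ftruncate(fd, 0x2fa7) != 0) { perror("ftruncate"); return 1; } /* 2: down */
    if (ftruncate(fd, 0x3f6a) != 0) { perror("ftruncate"); return 1; } /* 3: up   */

    /* 4: mmapread the extended (never-written) region */
    char *map = mmap(NULL, 0x3f6a, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    for (off_t i = 0x2fa7; i < 0x3f6a; i++)
        if (map[i] != 0)                          /* stale data => the bug */
            printf("stale byte 0x%02x at offset 0x%lx\n",
                   (unsigned char)map[i], (long)i);

    munmap(map, 0x3f6a);
    close(fd);
    return 0;
}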
Created attachment 90089 [details]
A patch to stop data corruption when using the fsx test suite

The Cause: memory-mapped pages were not being flushed out in a timely manner. When the size of the file was about to change, nfs_writepage() was called by filemap_fdatasync() to flush out dirty pages. This was done asynchronously, which meant nfs_writepage() would indirectly call nfs_strategy(). nfs_strategy() tries to send a group of pages (in this case 4 pages at a time), so it did *not* flush out the page (a bad strategy in this case). The page would eventually be flushed by kupdate, but by that time it was too late.

The Solution: when a file is going to be truncated down, synchronously flush out the mmapped page. I used a (surprisingly) unused NFS_INO_FLUSH nfs_inode flag to tell nfs_writepage() to synchronously write out the page.
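For what it's worth, the idea behind the fix can be illustrated from userspace (a hedged sketch of the same principle, NOT the kernel patch; path and sizes are made up): flush the dirty mapped pages synchronously before the file shrinks, instead of leaving them to kupdate, which is what the patch makes the NFS client do internally.

/* Userspace illustration of the fix's principle (not the patch):
 * msync(MS_SYNC) writes the dirty mapped pages out synchronously
 * *before* the truncate-down, closing the window in which a stale
 * page could outlive the size change. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    const char *path = "/mnt/nfstest/msync-demo";  /* hypothetical path */
    const size_t len = 0x4000;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    memset(map, 0xaa, len);                        /* dirty the mapped pages */

    /* The analogue of the patch: synchronous flush before the shrink. */
    if (msync(map, len, MS_SYNC) != 0) { perror("msync"); return 1; }

    if (ftruncate(fd, len / 2) != 0) { perror("ftruncate"); return 1; }

    munmap(map, len);
    close(fd);
    return 0;
}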
Created attachment 90140 [details]
an update to the previous patch that works in an SMP env.
LLNL has reported that this issue has been resolved, and steved has verified. Closing BZ.