Description of problem:
We have a Network Appliance serving as our NFS server. The clients are Red Hat Enterprise Linux ES 2.1 boxes alongside Solaris machines [5.7 Generic_106541-32, though I don't think that is really important]. We run iPlanet 6.0 as the web server on top of them. Sometimes I see corrupted files on the Linux boxes, containing a lot of \0 [ASCII NUL] bytes, while the same files look valid and uncorrupted on the Solaris machines. I'm afraid this is an NFS implementation bug. I found one similar issue, #142849, but no solution for it. I have no information about which filesystem is used on the NFS server; I only administer the Linux NFS clients.

My NFS mount options are currently:

<servername>:/vol/v1/q1 /share nfs rw,nosuid,retrans=5,rsize=8192,wsize=8192,timeo=11,noac,tcp 0 0

Originally the tcp option was not there. I added it because I found a UDP fragmentation issue reported around this kernel version, but I still experience the same error.

Version-Release number of selected component (if applicable):
[root@intlweb23 root]# uname -a
Linux intlweb23.starwave.com 2.4.9-e.59smp #1 SMP Mon Jan 17 07:07:22 EST 2005 i686 unknown

How reproducible:
It only occurs sometimes.

Steps to Reproduce:
1.
2.
3.

Actual results:
No idea how to work around this.

Expected results:
At minimum, to close the affected tickets [including this one] with some kind of workaround, and eventually a real solution, since, as I mentioned, ticket #142849 was opened in 12/2004.

Additional info:
Does the corruption seen on the linux NFS client go away if you update the file's timestamp (via touch) from one of the solaris machines?
Also, can you describe how file updates are done on the nfs server (append-only/ seeking around and replacing contents/overwriting)?
Answering the first question:

root@<solaris nfs client>:/$ ssh <linux nfs client> "cksum messages.js"
97125888 15093 messages.js
root@<solaris nfs client>:/$ touch messages.js
root@<solaris nfs client>:/$ ssh <linux nfs client> "cksum messages.js"
679637199 15093 messages.js

Yes, touching the file from Solaris cures the corrupted file.

Q2: mostly it is appended files that get corrupted. But sometimes, when a file was overwritten with a shorter new one, we still saw the old file's content; and when the new file is longer than the old one it replaced, we see ASCII NUL characters in it.

Hope that helps,
Tamas
I also have a file that is updated, overwritten, seeked in and modified in place. But the file I mentioned above is only appended to.
This sounds like BZ 113905. The main problem there is that NFS clients writing to a file can race with NFS clients reading the same file. You can work around this by writing the updated file to some private location on the NFS server and then using an atomic operation (like mv) to make it public to the readers, or by using file locking between the writer and the readers (or by just touching the file after it has been written).

*** This bug has been marked as a duplicate of 113905 ***
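For anyone wanting to try the "write privately, then mv" workaround from a script, here is a minimal sketch in Python. It is only an illustration, not something taken from the reporter's setup: the function name publish_atomically and the 0644 mode are assumptions, and it presumes the temporary file sits in the same directory (hence the same NFS export) as the target so that the rename stays a single operation on the server.

```python
import os
import tempfile


def publish_atomically(path, data, mode=0o644):
    """Write data to a temporary file next to path, then rename it into place.

    Hypothetical helper: the name and the default mode are illustrative.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target's own directory so that the final
    # rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".publish-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush the data out before it becomes visible
        os.chmod(tmp_path, mode)    # mkstemp creates the file as 0600
        os.replace(tmp_path, path)  # the equivalent of `mv tmp_path path`
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a reader opening the public name gets either the complete old file or the complete new one. Whether that is enough to avoid the stale data seen here is untested.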
Undoing dup, because bug 113905 is against RHEL3 kernel.
Is this bug fixed in a later kernel for this system? If yes, which one?
Hello,

We have now found several files in the same environment that can't be cured by touching them. We resolved it this way:

cp -p file file.tmp && \
mv file file.bak && \
mv file.tmp file && \
cmp file file.bak && \
rm file.bak

But doing that AND figuring out which files are corrupted consumes a huge amount of resources, so this resolution is pretty expensive. I submitted this ticket a pretty LONG TIME ago, and I haven't had any further reply on how we could get rid of this really annoying error, especially since we run these servers in a production environment and we can lose revenue if this isn't solved soon.

From now on, please switch off the security-sensitive flag on this ticket; I'd like to inform my colleagues about the progress.

Thanks,
Tamas Szerb
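Since the visible symptom is runs of NUL bytes, one way to narrow down which files need the copy-and-rename treatment might be to scan for such runs. The sketch below is a heuristic only, under loud assumptions: the affected files are text (like messages.js) and never legitimately contain NULs, the 16-byte threshold is arbitrary, and the names has_nul_run and find_suspect_files are made up for illustration. It would not catch the other symptom, where old content shows through after a shorter overwrite.

```python
import os


def has_nul_run(path, min_run=16, chunk_size=65536):
    """Return True if the file contains at least min_run consecutive NUL bytes."""
    needle = b"\0" * min_run
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous chunk so that a run straddling
            # a chunk boundary is still found.
            window = tail + chunk
            if needle in window:
                return True
            tail = window[-(min_run - 1):] if min_run > 1 else b""


def find_suspect_files(root, min_run=16):
    """Walk root and list files that contain a long NUL run."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if has_nul_run(path, min_run):
                    suspects.append(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
    return sorted(suspects)
```

It would have to run on the affected Linux client, for example as find_suspect_files("/share"), because the Solaris clients see the same files as valid.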
I've seen a small number of reports of this problem. I've not been able to reproduce it in-house, nor have any reporters been able to reproduce it at will. In each case, the corruption is limited to specific files (it's not system-wide corruption) and is precipitated by a lack of synchronization between NFS readers and writers. In the reports that I've looked at, adding some means of reader-writer synchronization has corrected the problem. Short of that, a means to reliably reproduce this problem would be most helpful to speed a resolution.
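As one possible shape for such reader-writer synchronization, here is a rough sketch using POSIX advisory locks via Python's fcntl.lockf. It assumes every writer and reader of the file cooperates by taking the lock, and that the NFS server and all clients have working lock support. The helper names append_locked and read_locked are hypothetical, and this has not been tested against the setup in this report.

```python
import fcntl
import os


def append_locked(path, data):
    """Append data while holding an exclusive advisory lock on the file."""
    with open(path, "ab") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until no other lock is held
        try:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # flush the data out before unlocking
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def read_locked(path):
    """Read the whole file while holding a shared advisory lock."""
    with open(path, "rb") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)  # waits while a writer holds LOCK_EX
        try:
            return f.read()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

The locks are advisory, so a process that reads the file without taking the lock (a web server serving it directly, for instance) gets no protection from this.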
Hi Tamas,

Have you come across any way to reliably reproduce the stale file problem? I'll need some way to tickle this bug in order to find the source of the problem.

For now, I am going to close this ticket. Please re-open it if you have any more info (esp. a way to reproduce the bug!).