156307 – NFS file corruption

Summary: NFS file corruption

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Don Howard
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-04-28 19:55 UTC by Tamas SZERB
Modified:	2007-11-30 22:06 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-11-07 21:59:39 UTC
Target Upstream Version:
Embargoed:

Description Tamas SZERB 2005-04-28 19:55:32 UTC

Description of problem:

We have a Network Appliance system acting as the NFS server. The clients are
Red Hat ES 2.1 machines alongside Solaris machines [5.7 Generic_106541-32, though
I don't think that is really important]. We run iPlanet 6.0 as the web server on
top of them, and sometimes I see corrupted files on the Linux boxes, containing a
lot of \0 bytes [ASCII NULs], while the same files appear valid and uncorrupted
on the Solaris machines. I'm afraid this is an NFS implementation bug. I found
one similar issue, #142849, but no solution for it. I have no information about
what file system is used on the NFS server; I only administer the Linux NFS
clients. My NFS mount options are currently:
<servername>:/vol/v1/q1   /share   nfs   rw,nosuid,retrans=5,rsize=8192,wsize=8192,timeo=11,noac,tcp   0 0

Originally the tcp option wasn't there, but I tried it because I found a UDP
fragmentation issue around this kernel version. I still see the same error.


Version-Release number of selected component (if applicable):
[root@intlweb23 root]# uname -a
Linux intlweb23.starwave.com 2.4.9-e.59smp #1 SMP Mon Jan 17 07:07:22 EST 2005
i686 unknown


How reproducible: It only occurs sometimes.


Steps to Reproduce:
1.
2.
3.
  
Actual results:
No idea how to work around this.

Expected results:
At a minimum, to close the affected tickets [including this one] with some kind
of workaround, and ideally a real solution, since, as I mentioned, ticket
#142849 was opened in 12/2004...

Additional info:

Comment 1 Don Howard 2005-04-28 20:18:03 UTC

Does the corruption seen on the linux NFS client go away if you update the
file's timestamp (via touch) from one of the solaris machines?

Comment 2 Don Howard 2005-04-28 20:49:38 UTC

Also, can you describe how file updates are done on the nfs server (append-only/
seeking around and replacing contents/overwriting)?

Comment 3 Tamas SZERB 2005-04-29 11:11:21 UTC

Answering the first question, 

root@<solaris nfs client>:/$ ssh <linux nfs client> "cksum messages.js"
97125888 15093 messages.js
root@<solaris nfs client>:/$ touch messages.js
root@<solaris nfs client>:/$ ssh <linux nfs client> "cksum messages.js"
679637199 15093 messages.js

Yes, touching the file from Solaris cures the corrupted file.

Q2: mostly it is appended files that get corrupted. We have also found
overwritten files: when the new file is shorter than the old one, we see the old
file's content, and when the new file is longer than the old one, we see ASCII
NUL characters in it.

Hope that helps,

Tamas

Comment 4 Tamas SZERB 2005-04-29 11:59:53 UTC

We also have a file that is updated, overwritten, seeked in, and modified. But
the file I mentioned above is appended to.

Comment 5 Don Howard 2005-04-29 17:20:37 UTC

This sounds like BZ 113905.

The main problem there is that nfs clients writing to a file can race with nfs
clients reading the same file.  You can work around this by writing the updated
file to some private location on the nfs server and then using an atomic
operation (like mv) to make it public to the readers, or use file locking
between the writer and the readers. (or just touch the file after it's been
written).
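
The first workaround above (write privately, then rename into place) could be sketched roughly as follows. This is only an illustration, not something tested on the affected systems: the function name publish_atomically and the temporary-name convention are hypothetical, and it assumes the temporary file and the destination are in the same directory on the same NFS export, so that mv is a plain rename, and that readers reopen the file by name.

```shell
#!/bin/sh
# publish_atomically SRC DEST  (hypothetical helper name)
# Copy SRC to a temporary name next to DEST, then rename it over DEST.
# A rename within one directory replaces DEST in a single step, so a reader
# that opens DEST by name gets either the complete old file or the complete
# new one, never a half-written file.
publish_atomically() {
    src=$1
    dest=$2
    # Assumed temp-name convention: hidden file beside DEST, tagged with our PID.
    tmp="$(dirname "$dest")/.$(basename "$dest").tmp.$$"

    # Write the full new content under the private name first.
    cp -p "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    # Make it public; clean up the temp file if the rename fails.
    mv -f "$tmp" "$dest" || { rm -f "$tmp"; return 1; }
}
```

For example, `publish_atomically /tmp/messages.js.new /share/messages.js` would replace the shared file in one step (both paths are placeholders).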

*** This bug has been marked as a duplicate of 113905 ***

Comment 6 Ernie Petrides 2005-04-29 19:06:33 UTC

Undoing dup, because bug 113905 is against RHEL3 kernel.

Comment 7 Tamas SZERB 2005-04-29 21:28:58 UTC

Is this bug fixed in a later kernel for this system? If yes, which one?

Comment 8 Tamas SZERB 2005-05-05 10:48:34 UTC

Hello, we have now found several files in the same environment that can't be
cured by touching them. We resolved it this way:

cp -p file file.tmp && \
mv file file.bak && \
mv file.tmp file && \
cmp file file.bak && \
rm file.bak

But doing that AND figuring out which files are corrupted takes a huge amount
of resources, so this resolution is pretty expensive.
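
One possible way to narrow down which files are affected is to look for NUL bytes, since that is how the corruption shows up in this report. The sketch below is only a rough idea: the function names are made up for illustration, it is only meaningful for text files (binary files legitimately contain NULs), and it assumes file names without embedded newlines. It would have to run on a Linux client, since that is where the NULs are visible.

```shell
#!/bin/sh
# count_nul_bytes FILE  (hypothetical helper name)
# Print how many NUL (\0) bytes FILE contains, by comparing its size
# with its size after all NUL bytes have been stripped.
count_nul_bytes() {
    total=$(wc -c < "$1") || return 1
    without_nul=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - without_nul))
}

# list_files_with_nuls DIR  (hypothetical helper name)
# Print "COUNT PATH" for every regular file under DIR that contains
# at least one NUL byte.
list_files_with_nuls() {
    find "$1" -type f -print | while IFS= read -r f; do
        n=$(count_nul_bytes "$f") || continue
        if [ "$n" -gt 0 ]; then
            echo "$n $f"
        fi
    done
}
```

For example, `list_files_with_nuls /share` would list suspect files under the mount point used in this report. It still reads every file once, so it does not remove the cost, only automates the search.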

I submitted this ticket a pretty LONG TIME ago, and I haven't had any further
reply on how we could get rid of this really annoying error, especially since we
run these servers in a production environment and we could lose revenue if this
isn't solved soon.

From now on, please switch off the security-sensitive flag on this ticket; I'd
like to inform my colleagues of the progress.

Thanks,

Tamas Szerb

Comment 9 Don Howard 2005-05-05 18:02:20 UTC

I've seen a small number of reports of this problem.
I've not been able to reproduce it in-house, nor have any reporters been able to
reproduce it at will.

In each case, the corruption is limited to specific files (it's not system-wide
corruption) and is precipitated by a lack of synchronization between nfs
readers and writers.  In the reports that I've looked at, adding some means of
reader-writer synchronization has corrected the problem.
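
As a rough illustration of one form such synchronization could take, the sketch below uses a lock directory rather than fcntl-style file locks: mkdir either creates the directory or fails because it already exists, so only one process at a time gets past it. This is a swapped-in technique, not the one used in the reports mentioned above. The name with_lock and its arguments are hypothetical, every writer and reader would have to go through it, and a process that dies while holding the lock leaves a stale directory that must be removed by hand.

```shell
#!/bin/sh
# with_lock LOCKDIR MAX_TRIES COMMAND [ARGS...]  (hypothetical helper name)
# Run COMMAND while holding an advisory lock represented by LOCKDIR.
# Returns 1 if the lock could not be taken after MAX_TRIES attempts,
# otherwise returns COMMAND's exit status.
with_lock() {
    lockdir=$1
    max_tries=$2
    shift 2
    tries=0
    # mkdir succeeds for exactly one caller; everyone else retries.
    until mkdir "$lockdir" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1
        fi
        sleep 1
    done
    "$@"
    status=$?
    # Release the lock whether or not COMMAND succeeded.
    rmdir "$lockdir"
    return "$status"
}
```

A writer might call `with_lock /share/.messages.js.lock 30 cp messages.js.new /share/messages.js`, and a reader `with_lock /share/.messages.js.lock 30 cat /share/messages.js` (paths are placeholders).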

Short of that, a means to reliably reproduce this problem would be most helpful
to speed a resolution.

Comment 10 Don Howard 2005-11-07 21:59:39 UTC

Hi Tamas

Have you come across any way to reliably reproduce the stale file problem?
I'll need some way to tickle this bug in order to find the source of the problem.

For now, I am going to close this ticket.  Please re-open it if you have any
more info (esp a way to reproduce the bug!).
