We have an intermittent NFS write bug on Red Hat 6.0 with all the latest updates, on both a single-processor machine (with de4x5 ethernet) and a 4-processor SMP i686 (with 3c59x). During CVS access, and sometimes during assembly or linking, sections of files are written out shifted up by 1-3 characters, with 0's inserted on one end.

We're using a Sun as a file server (SunOS 5.5), and haven't observed the problem with an SGI Irix file server (Irix 5.3). However, I've used tcpdump to watch the packets and found that the data is sent already corrupted, so I'm pretty sure the fault is on the Linux end, and the server is just affecting the packet ordering in a way that produces the bug.

An example which often exhibits the problem is running

    /usr/bin/as -V -Qy -o tmp.o formatted-local.s

with the input file, which I've put at http://suif.stanford.edu/~brm/formatted-local.s for your perusal. The output of od -x on tmp.o has, when correct, the line

    > 0110000 642e 5f2e 765f 5f74 3174 7332 6d69 6c70

and, when incorrect, the line

    < 0110000 2e00 2e64 5f5f 7476 745f 3231 6973 706d

(corruption begins at byte 36865, with a 0 byte inserted and 211 bytes shifted down by 1; the 212th byte is missing).

I spent as much time as I could afford on this, using tcpdump, strace, and feeble debugging of a 2.2.10 kernel (which also has the problem, perhaps more consistently). It appears that the problem results from the following write [from strace]:

    _llseek(0x3, 0, 0x8ca1, 0xbfffef58, 0) = 0
    write(3, "\0.symtab\0.strtab\0.shstrtab\0."..., 1075) = 1075

This makes it down to the kernel intact, to the routine generic_file_write [in /usr/src/linux/mm/filemap.c], which writes the first part on one page (863 bytes), then copies the rest (212 bytes) and schedules it as an asynchronous write.
Immediately following is a 52-byte write to a different block:

    _llseek(0x3, 0, 0, 0xbfffeef0, 0) = 0
    write(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\1"..., 52) = 52

This gets handled as a synchronous write and goes out either before or after the scheduled 212-byte write. Apparently, if it gets sent before the async write, the 212-byte block is corrupted. Why, I can't figure out.

I have also recently experienced very similar symptoms in some CVS repository file corruption, with three 0 bytes inserted and a section of the file shifted over by 3 bytes.

I've been looking around, but can't find a mailing list or a contact for the NFS driver. Where are they located? Thanks.

I feel a bit presumptuous giving this a "high" priority, since apparently no one else has noticed it, but we have $40K to spend on new computers, and unless I fix this the pro-Sun camp will get to spend it on Sparcs... :-)

------- Additional Comments From 06/27/99 16:40 -------

Our NFS filesystem is mounted (from a Sun Ultra) like this:

    failaka:/sow/1/1 on /sow/1/1 type nfs (rw,nosuid,nodev,bg,addr=171.64.73.157)

I experimented today and was not able to reproduce the problem with a Linux file server, so I suspect this may be hard for anyone else to reproduce. :-( Perhaps there's some regression test of the NFS filesystem that someone could run that might show the problem. :-)
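The numbers in the report above fit together if the 1075-byte write is split at a 4096-byte page boundary. The following is only a sketch of that arithmetic in Python, not anything taken from the kernel: it assumes a 4096-byte page (the usual i686 value) and little-endian od -x words, and the helper names split_at_page_boundary and od_x_bytes are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: page size on i686


def split_at_page_boundary(offset, length, page_size=PAGE_SIZE):
    """Return (bytes that fit in the current page, bytes spilling into the next)."""
    room = page_size - offset % page_size
    head = min(length, room)
    return head, length - head


def od_x_bytes(words):
    """Decode a row of `od -x` 16-bit words as stored on a little-endian machine."""
    return b"".join(int(w, 16).to_bytes(2, "little") for w in words.split())


# The write from the strace: seek to 0x8ca1, then write 1075 bytes.
head, tail = split_at_page_boundary(0x8CA1, 1075)
print(head, tail)               # 863 212
print(0x8CA1 + head, 0o110000)  # 36864 36864 (the od offset of the bad line)

# The two od -x rows quoted in the report.
good = od_x_bytes("642e 5f2e 765f 5f74 3174 7332 6d69 6c70")
bad = od_x_bytes("2e00 2e64 5f5f 7476 745f 3231 6973 706d")
print(good)                         # b'.d.__vt_t12simpl'
print(bad)                          # b'\x00.d.__vt_t12simp'
print(bad == b"\x00" + good[:-1])   # True: one zero inserted, data shifted by 1
```

So the 212-byte tail starts exactly at file offset 36864 (byte 36865 counting from 1), which is where od shows the inserted zero byte.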
This is a known bug in Solaris. You need to get the Solaris Errata or move to Solaris 2.6 or higher. The Sun patches should be on SunSolve. What you report is the classic offset-by-1-3-bytes problem that is clear evidence of the bug.
Discarding this bug, as it is actually a known problem with Solaris and not a problem with Red Hat Linux.
*** Bug 3807 has been marked as a duplicate of this bug. ***

When using Red Hat Linux 6.0 as an NFS client to a host running Solaris 2.5.1, I've seen file corruption. The specific trigger is running ld to link executables -- the output file is messed up with some regularity. I do not know which side of the NFS connection is to blame. It could be a bug in the Solaris server. I have not yet seen the problem with servers running Solaris 2.6.

When I look at the bytes that are wrong it looks like some byte or word swapping has occurred. Here is some output of "cmp -l" showing some of the differences between a good file and a corrupted one:

    36525  33   0
    36527   0  33
    36529   1   0
    36531   0   1
    36533   6   0
    36535   0   6
    36541 100   0
    36543   0 100
    36545 140   0
    36546  42   0
    36547   0 140
    36548   0  42
    36557  20   0
    36559   0  20
    36565  41   0
    36567   0  41
    36569  11   0
    36571   0  11
    36581 314   0
    36582 224   0
    36583   0 314
    36584   0 224
    36585 260   0
    36586  23   0
    36587   0 260
    36588   0  23
    36589  14   0
    36591   0  14
    36593   1   0
    36595   0   1
    36597   4   0
    36599   0   4
    36601  10   0
    36603   0  10
    36605  53   0
    36607   0  53

------- Additional Comments From 06/29/99 12:36 -------

We have a large amount of experience with Red Hat 5.1 and the bug was definitely not present in that release. It only appeared after the 6.0 upgrade.
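As a rough cross-check, the cmp -l listing in this comment can be tested against the "data shifted by a few bytes" pattern described earlier in the thread. The Python sketch below only looks at the listed differences, not the actual files, and it assumes the first value column belongs to the good file and the second to the corrupted one (the comment does not say which is which). The names parse_cmp_l and consistent_with_shift are hypothetical.

```python
CMP_L = """
36525 33 0   36527 0 33   36529 1 0    36531 0 1    36533 6 0    36535 0 6
36541 100 0  36543 0 100  36545 140 0  36546 42 0   36547 0 140  36548 0 42
36557 20 0   36559 0 20   36565 41 0   36567 0 41   36569 11 0   36571 0 11
36581 314 0  36582 224 0  36583 0 314  36584 0 224  36585 260 0  36586 23 0
36587 0 260  36588 0 23   36589 14 0   36591 0 14   36593 1 0    36595 0 1
36597 4 0    36599 0 4    36601 10 0   36603 0 10   36605 53 0   36607 0 53
"""


def parse_cmp_l(text):
    """Parse `cmp -l` output: decimal byte number, then two octal byte values."""
    t = text.split()
    return [(int(t[i]), int(t[i + 1], 8), int(t[i + 2], 8))
            for i in range(0, len(t), 3)]


def consistent_with_shift(diffs, k):
    """True if every listed non-zero byte of the first file reappears
    k bytes later in the second file (judging only from listed differences)."""
    second = {n: b for n, a, b in diffs}
    return all(second.get(n + k) == a for n, a, b in diffs if a != 0)


diffs = parse_cmp_l(CMP_L)
for k in (1, 2, 3):
    print(k, consistent_with_shift(diffs, k))
```

On this listing only k = 2 passes. That fits a 2-byte shift, though the same differences would also fit 16-bit halves being swapped inside 32-bit words, so the listing alone does not settle which of the two happened.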
I'm using the Red Hat 6.1 kernel and I have exactly the same kind of NFS bug, with file systems mounted from a Solaris 2.6 PC. I've had it only occasionally on gcc compiles (as I usually compile to the local disk), but I also have it in some of my own programs that write through NFS.

After some head scratching I discovered it was linked to using fseek on an NFS-mounted file: when seeking to, say, offset 8000 from the beginning of the file and writing 2000 bytes (in a single fwrite), I get one zero (but it may be up to three zeroes) inserted at 1000, then the first 191 bytes I wrote (see the pattern: up to byte 8191), then the 192nd byte is omitted and the rest of my write comes out correctly.

Suspecting a possible problem with stdio, I added a flush, then an fseek to 0 followed by a small fread there, then the correct fseek to 8000. If I then write fewer than 193 bytes everything is OK, but if I write 2000 bytes, the byte that should be written at offset 8192 is missing (IIRC) and the others are shifted :-<

To work around this I finally had to fclose() the file and then fopen() it again, still in binary mode but for update. Everything has seemed to work OK for more than a month now (although gcc compiles still fail occasionally).

The main point here is that I get these problems with several kernels (stock 6.0, stock 6.1, stock 6.1smp and 2.2.14-8smp) mounting an NFS file system from a Solaris 2.6 PC, which works perfectly with Red Hat 4.0, Red Hat 5.2, HP-UX 9.07, HP-UX 10.20, Solaris 2.5 (Sparc) and AIX 4.2... Even if it's a problem with Solaris, it's a pity that all the proprietary OSes are able to work OK but not Linux!...
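For anyone who wants to try the seek-then-write pattern from this comment on their own mount, here is a minimal sketch in Python (not the reporter's C stdio code). It writes 2000 non-zero bytes at offset 8000, so the write crosses the 8192-byte boundary after 192 bytes, then reads them back. The function name seek_write_check and its parameters are made up for illustration. Reading back on the same client may be served from the client's cache, so a check from another host may be more telling.

```python
import os
import tempfile


def seek_write_check(path, offset=8000, length=2000, reopen=False):
    """Write a recognisable pattern at `offset`; return the index of the first
    byte that reads back wrong, or None if the data is intact."""
    payload = bytes((i % 251) + 1 for i in range(length))  # contains no zero bytes
    with open(path, "wb") as f:
        f.write(b"\xff" * (offset + length))  # pre-fill the file

    f = open(path, "r+b")  # binary mode, open for update
    try:
        f.seek(0)
        f.read(16)  # small read at the start, as in the report
        if reopen:
            # workaround described in the report: close and open again
            f.close()
            f = open(path, "r+b")
        f.seek(offset)
        f.write(payload)
    finally:
        f.close()

    with open(path, "rb") as f:
        f.seek(offset)
        got = f.read(length)
    for i, (g, p) in enumerate(zip(got, payload)):
        if g != p:
            return i
    return None if len(got) == length else len(got)


if __name__ == "__main__":
    # Replace the temporary directory with a directory on the NFS mount to test it.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "seektest.bin")
        print(seek_write_check(target))
        print(seek_write_check(target, reopen=True))
```

On a healthy filesystem both calls should report no mismatch. Because the payload contains no zero bytes, any inserted zero would show up as the first mismatching index.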