Our application creates large binary files. To create one, we first create a zero-length file
and lock it (fcntl(... SETLK)). Then we write a second file containing the data. When the write is
complete, we fsync the data file and then close both file descriptors, which should release the lock.
When a large amount of data (about 100 MB is enough) is written quickly, the lock is not released.
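For readers without the test case, the sequence described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's code: the helper name write_locked, the file paths, the buffer size, and the choice of a whole-file write lock taken with F_SETLK are all assumptions.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the report): lock one file, write another,
 * fsync, then close both descriptors. Returns 0 on success, -1 on error. */
static int write_locked(const char *lock_path, const char *data_path, size_t total)
{
    static char buf[65536];
    struct flock fl;
    size_t done = 0;
    int lock_fd, data_fd, rc = -1;

    /* Step 1: create the lock file (empty if it did not already exist). */
    lock_fd = open(lock_path, O_RDWR | O_CREAT, 0644);
    if (lock_fd < 0)
        return -1;

    /* Step 2: take a non-blocking write lock on the whole file (assumed). */
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                       /* 0 means "to end of file" */
    if (fcntl(lock_fd, F_SETLK, &fl) < 0) {
        close(lock_fd);
        return -1;
    }

    /* Step 3: write the data file as fast as possible. */
    data_fd = open(data_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (data_fd < 0) {
        close(lock_fd);
        return -1;
    }
    memset(buf, 'x', sizeof buf);
    while (done < total) {
        size_t want = total - done < sizeof buf ? total - done : sizeof buf;
        ssize_t n = write(data_fd, buf, want);
        if (n <= 0)
            goto out;
        done += (size_t)n;
    }

    /* Step 4: flush the data, then close both descriptors. */
    if (fsync(data_fd) < 0)
        goto out;
    rc = 0;
out:
    if (close(data_fd) < 0)
        rc = -1;
    if (close(lock_fd) < 0)             /* closing should release the lock */
        rc = -1;
    return rc;
}
```

Calling write_locked() with both paths on the remote directory and a total of roughly 100 MB would approximate the scenario in this report; whether it triggers the problem depends on the client and server involved.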
I've reproduced this bug on a Linux box running kernels 2.2.12-20 and 2.2.12-32, with data servers
running Solaris 2.5.1, Solaris 2.7 and HP-UX 11.00.
By the way, our Solaris boxes do have patch 105299-02 installed (sunsolve ID 4071076), so this is
not a repeat of that problem. (Thank you for referencing that patch in the other bug reports; it
solved another problem I was having.)
Please contact me at firstname.lastname@example.org and I will provide a full testcase via ftp.
Created attachment 1814
Tar file of the test case. Please read the README file, modify the Makefile macro, and type make.
Further testing has been done. If the remote directory is on HP-UX 10.20, it also fails.
If the remote directory is on another Linux 2.2.12 machine or a Network Appliance
file server, the problem does not occur. The NetApp might be an anomaly because it
is on a gigabit network and may be consuming the data VERY fast coming off my
100TX Linux machines.
assigned to johnsonm
I tried to append this previously but it somehow disappeared....
A customer of mine had his remote file systems mounted with
mount -o rw,nolock ....
In this configuration the problem did NOT occur.
In my opinion this is a crazy way to mount a file system, since it bypasses NFS locking altogether
and could lead to corruption problems, but it did serve to isolate the bug to NFS locking. With 'nolock'
the lock is maintained locally on the client and the problem did not occur.
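One way to check whether a lock is still outstanding after the writer has closed its descriptors is to probe the file with F_GETLK from a separate process. The sketch below is only an illustration; the helper name lock_is_held and its return convention are hypothetical and are not part of the attached test case. With 'nolock' such a probe would only see locks held on the same client; on a mount with NFS locking enabled it is expected to consult the server's lock manager.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe (not from the report). Returns 1 if another process
 * holds a lock that would conflict with a whole-file write lock, 0 if no
 * such lock exists, and -1 on error. If holder is non-NULL and a lock is
 * found, *holder receives the PID reported for the lock owner.
 *
 * Run this from a separate process: F_GETLK does not report locks held by
 * the caller, and closing any descriptor for a file drops the caller's own
 * POSIX locks on that file. */
static int lock_is_held(const char *lock_path, pid_t *holder)
{
    struct flock fl;
    int fd, rc;

    fd = open(lock_path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Describe the lock we would like to take; F_GETLK only tests it. */
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                       /* whole file */

    rc = fcntl(fd, F_GETLK, &fl);
    close(fd);
    if (rc < 0)
        return -1;

    if (fl.l_type == F_UNLCK)           /* nothing would block us */
        return 0;
    if (holder != NULL)
        *holder = fl.l_pid;             /* owner of the blocking lock */
    return 1;
}
```

If lock_is_held() keeps returning 1 for the zero-length file after the writing process has exited, that would be consistent with the stale lock described in this report.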
Has there been any progress on this issue?
Could you try our 2.2.19 errata kernel? It has substantially revamped NFS code,
including NFSv3 support.