From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; de-DE; rv:1.7.6) Gecko/20050321 Firefox/1.0.2

Description of problem:
Our email server application uses a file-based storage approach and relies heavily on file locking. One of our larger customers is currently deciding on the storage architecture for their upcoming Scalix project, and one option is a NetApp NAS device.

When running the app on NFSv3 storage with RHEL4 as the app server, we see large-scale performance degradation; e.g. application startup time goes up from 5 seconds on local storage to 30 seconds on NFS.

We isolated the problem to two processes accessing the same file and trying to lock it using a system call similar to the one giving the following strace output:

fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0})

When the first process, which is holding the lock, terminates and thereby releases it, the second process hangs for anywhere between 5 and 45 seconds before it sees the release, acquires the lock and continues.

Using the "nolock" mount option on NFSv3 makes the situation even worse: there is then no locking at all, and the two processes access the file at the same time, resulting in data corruption.

Kernel 2.6.10 resolves the problem, as it introduces a number of fixes around NFS locking. This is described in
http://www.linux-nfs.org/Linux-2.6.x/2.6.10-rc1/linux-2.6.9-17-fix_nfs_nolock.dif
The patch was modified again in Kernel 2.6.11. The fix is currently not available in RHEL4 stable and beta kernels.

We re-tested our application on Fedora Core 4 (based on 2.6.11) and the problem seems to disappear completely: without nolock, the lock release is detected by the second process without any delay; with nolock, the kernel code still provides local locking (on the NFS client only, as opposed to on the NFS server). The latter is what we want, because only one client will be accessing the NAS device at a time and local locking will give better performance.

So... ;-) it would be good to include this fix in a RHEL4 kernel update somehow.

Version-Release number of selected component (if applicable):
kernel-2.6.9-11

How reproducible:
Always

Steps to Reproduce:
see description.

Actual Results:
see description.

Expected Results:
see description.

Additional info:
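For anyone who wants to reproduce the lock hand-off delay described above without the full application, here is a minimal sketch in Python. It is not the application's code: the function name measure_lock_handoff and the hold_seconds parameter are made up for illustration. It assumes that fcntl.lockf(fd, LOCK_EX) results in a blocking whole-file write lock via fcntl(F_SETLKW), which is my understanding of CPython on Linux and can be confirmed with strace.

```python
import fcntl
import os
import time


def measure_lock_handoff(path, hold_seconds=1.0):
    """Return how many seconds a second process blocks on a whole-file write lock.

    A child process takes the lock, holds it for hold_seconds and then
    terminates, which releases the lock. The parent measures how long its
    own blocking lock request takes to return.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the "first process" that holds the lock until it exits.
        status = 1
        try:
            os.close(ready_r)
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
            fcntl.lockf(fd, fcntl.LOCK_EX)   # exclusive lock on the whole file
            os.write(ready_w, b"x")          # tell the parent the lock is held
            time.sleep(hold_seconds)
            status = 0
        finally:
            os._exit(status)                 # process exit drops the lock

    # Parent: the "second process" that waits for the lock.
    os.close(ready_w)
    try:
        if os.read(ready_r, 1) != b"x":
            raise RuntimeError("child did not acquire the lock")
        fd = os.open(path, os.O_RDWR)
        try:
            start = time.monotonic()
            fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the child is gone
            waited = time.monotonic() - start
            fcntl.lockf(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)
    finally:
        os.close(ready_r)
        os.waitpid(pid, 0)                   # reap the child
    return waited
```

Run it against a file on the NFS mount and compare the returned value with hold_seconds. On local storage the two should be nearly equal; a large excess on NFS would correspond to the delay reported here.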
In addition, while testing on FC4 we are seeing a throughput of 15-16 MB/s to the NetApp box, as opposed to 5-6 MB/s on RHEL4, with all other parameters unchanged.
When testing with FC4 and Kernel 2.6.11-1, we still saw lock-related lockups of processes under load. We weren't able to gather exact data, but our working assumption is that the NFS locking code in 2.6.11 is not fully stable either.
A preliminary test suggests that this is a regression from RHEL3. The "nolock" option seems to do what Florian requires -- it simply causes a fall-back to local locking instead of NFS locking. Although the man page sort-of implies that "nolock" completely disables locking, that would make it a completely useless option: it would convert a slow locking mechanism for an exclusively mounted NFS directory (as from a NAS) into something that is unusable! I've used "nolock" with RHEL3 to greatly speed up applications that do a lot of file locking -- and, of course, made sure that no one else was using the NFS-exported file system. There's still time to get this into RHEL4 U2, isn't there? It's quite important for people who want to use NetApp servers and the like.
It's probably too late to get it into U2, but I will try to get it in as early as possible for U3, and if need be you can request a Hot Fix kernel.
*** Bug 170545 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html