Bug 59992
Summary: | Unable to flock file on "busy" Sendmail server
---|---
Product: | [Retired] Red Hat Linux
Reporter: | Michael Brock <michael_brock>
Component: | kernel
Assignee: | Arjan van de Ven <arjanv>
Status: | CLOSED ERRATA
QA Contact: | Brian Brock <bbrock>
Severity: | medium
Priority: | medium
Version: | 7.1
CC: | chrismcc, chris.ricker, developer.redhat.com, dtong, gdh, joe.simmons, jon, shishz
Hardware: | i686
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2003-12-17 01:13:39 UTC
Description
Michael Brock
2002-02-18 17:28:34 UTC
Please validate and quantify this bug using standard Red Hat kernels. Is NFS involved?

This error has occurred under every standard Red Hat kernel we have used, including the standard, SMP, and enterprise kernels. We have also compiled the most recent kernel from kernel.org using the Red Hat Enterprise config script, and the same problem occurs. This only happens with Sendmail Switch, not open source sendmail. No NFS is used on the box.

We tried setting RLIMIT_LOCKS to RLIM_INFINITY in the sendmail binary, but it did not help. Compiling sendmail to use fcntl() instead of flock() seems to work around the problem. The same sendmail code (using flock()) is used on all other Unix platforms without showing this behavior. We suspect that this is a bug in the implementation of flock() on Linux. We discussed the issue with other open source developers at conferences, and some of them saw similar problems with flock() in their applications on Linux as well.

An earlier analysis done on the problem (for informational purposes only):

The problem being investigated is why flock() fails, returning ENOLCK. My investigation was against glibc-2.2.4-19.3 and kernel-2.4.9-21. However, after completing this writeup, I noticed that even though I was told the kernel in use was the above version, the dmesg output provided by the customer states:

    Linux version 2.4.17 (root.com) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-85)) #2 SMP Tue Feb 5 12:51:09 PST 2002

After looking at the glibc source, it appears that flock() is a direct call into the kernel. There are two ways to get ENOLCK from flock():

1. A failure in the flock() system call.
2. An attempt to lock a file over NFS.

Since NFS is not in use, I'll ignore the NFS case and concentrate on the system call case. However, I note that rpc.statd is running on the machine. This process is used by rpc.lockd for NFS file locking services. It should not be running if NFS is not in use.
flock() is a system call implemented in linux/fs/locks.c:sys_flock(). It calls flock_lock_file(), which does:

    if (!unlock) {
        error = -ENOLCK;
        new_fl = flock_make_lock(filp, lock_type);
        if (!new_fl)
            return error;
    }

So the only way to get ENOLCK is if flock_make_lock() fails. That function only fails if locks_alloc_lock() fails:

    struct file_lock *fl = locks_alloc_lock(1);
    if (fl == NULL)
        return NULL;

locks_alloc_lock() is:

    /* Allocate an empty lock structure. */
    static struct file_lock *locks_alloc_lock(int account)
    {
        struct file_lock *fl;
        if (account && current->locks >= current->rlim[RLIMIT_LOCKS].rlim_cur)
            return NULL;
        fl = kmem_cache_alloc(filelock_cache, SLAB_KERNEL);
        if (fl)
            current->locks++;
        return fl;
    }

It can fail for two reasons. First, the number of locks held by the process reaches the RLIMIT_LOCKS soft (rlim_cur) resource limit. There may be a system tunable that sets this limit at boot time; I'm not familiar enough with Linux to know how to tune it.

This program will print the RLIMIT_LOCKS resource limit. To get a true picture, though, it needs to be run from the same process that starts sendmail at boot time. This probably means changing the sendmail startup script. Running the program in your shell will only show the limits of your particular process, and these limits differ per user and may vary depending on system state.

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sysexits.h>

    int main(int argc, char **argv)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_LOCKS, &rl) < 0) {
            fprintf(stderr, "getrlimit(RLIMIT_LOCKS): %s\n", strerror(errno));
            exit(EX_OSERR);
        }
        /* rlim_t is unsigned; cast to long so RLIM_INFINITY prints as -1 */
        printf("RLIMIT_LOCKS rlim_cur = %ld\n", (long) rl.rlim_cur);
        printf("RLIMIT_LOCKS rlim_max = %ld\n", (long) rl.rlim_max);
        printf("NOTE: -1 == infinite\n");
        exit(EX_OK);
    }

The second way in which locks_alloc_lock() can fail is if kmem_cache_alloc() fails. This function is found in linux/mm/slab.c.
Note that this code is different on single processor and multiprocessor machines. This code is a bit more complex and would require more research to understand fully. Using the huge assumption (given my limited knowledge of the code) that all attempts to use an available cached slab will succeed, the only place left for failure is if kmem_cache_grow() fails to grow the cache. That function can fail if:

1. The SLAB_NO_GROW flag is set. This isn't the case on the call from locks_alloc_lock().
2. kmem_cache_slabmgmt() fails. This can be a recursive call into kmem_cache_alloc(), so its only failure mode is if kmem_getpages() fails.
3. kmem_getpages() fails. This fails if __get_free_pages() returns NULL.

linux/mm/page_alloc.c:__get_free_pages() returns NULL if alloc_pages() fails. alloc_pages() is aliased to _alloc_pages() in linux/include/linux/mm.h. linux/mm/numa.c:_alloc_pages() can fail if alloc_pages_pgdat() fails to give a new page. That function is simply a call to linux/mm/page_alloc.c:__alloc_pages().

Again, we have hit some necessarily complex code. This code tries to allocate memory using __alloc_pages_limit() in multiple ways. If it gets desperate, it wakes up kswapd to start swapping and tries more allocations using __alloc_pages_limit().

I note that the ps list from the customer machine shows kswapd has a lot of CPU time (224:02), which seems to indicate that the machine has been swapping. In fact, kswapd has more CPU time than any other process on the machine. That's pretty odd given the amount of memory these machines have (3G if I read the dmesg output correctly). This may be a red herring.

Going back to __alloc_pages_limit(), I see that it obeys per-zone memory limits and will fail if it can't find available memory within the zone limits. It is possible that the "zone" used for allocating kernel memory for locks is simply filling up given the number of locks in use.
Once again, I've reached the limit of my knowledge of Linux kernel internals and don't know how to tune these zone limits or increase the amount of kernel memory reserved at boot time.

In summary, ENOLCK is returned by flock() for one of three reasons:

1. NFS file locks in use.
2. The RLIMIT_LOCKS resource limit is reached.
3. The kernel memory for locks is exhausted.

The first is not an issue on this system (except for the running rpc.statd). The second is a tunable parameter which should be investigated. The third is probably also tunable and needs to be checked into. Perhaps some of these dmesg output values will help someone knowledgeable in kernel tuning to pinpoint the problem:

    2815MB HIGHMEM available.
    On node 0 totalpages: 950270
    zone(0): 4096 pages.
    zone(1): 225280 pages.
    zone(2): 720894 pages.
    Processors: 4
    Memory: 3738380k/3801080k available (1162k kernel code, 62316k reserved, 403k data, 256k init, 2883576k highmem)
    Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
    Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
    Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
    Buffer-cache hash table entries: 262144 (order: 8, 1048576 bytes)
    Page-cache hash table entries: 524288 (order: 9, 2097152 bytes)

It's possible the cache sizes shown are too low. On a side note, it would be interesting to see the output of `cat /proc/locks` when a machine is in trouble.

sendmail uses flock() to do file locking on plain files so different processes can share those files without a problem. For example, the access DB alluded to in the bug report is used for policy information. Every sendmail process takes a shared lock on that file so it can't be written to while it is in use. makemap, an associated utility which rebuilds the map from a plain text file, locks the file with an exclusive lock before updating it. On a heavily loaded machine, there can be hundreds of sendmail processes taking shared locks before reading data from the maps.
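The shared-reader, exclusive-writer pattern described above can be sketched in C roughly as follows. This is only an illustration of the flock() calls involved, not sendmail's or makemap's actual code: the helper names `map_open_shared`, `map_try_lock_exclusive`, and `map_unlock` are hypothetical, and the writer uses a non-blocking attempt here purely so that a conflict shows up as EWOULDBLOCK instead of a hang.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical reader helper: open a map file and take a shared lock,
 * retrying if a signal interrupts the call. Returns the descriptor on
 * success, or -1 with errno set (ENOLCK is the failure this report is about).
 */
int map_open_shared(const char *path)
{
    int fd, rc, saved;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    do {
        rc = flock(fd, LOCK_SH);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        saved = errno;          /* keep flock()'s errno across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Hypothetical writer helper: try to take the exclusive lock without
 * blocking. Returns 0 on success, -1 with errno == EWOULDBLOCK while any
 * other open file description still holds a lock on the file.
 */
int map_try_lock_exclusive(int fd)
{
    return flock(fd, LOCK_EX | LOCK_NB);
}

/* Release whatever flock() lock is held through this descriptor. */
int map_unlock(int fd)
{
    return flock(fd, LOCK_UN);
}
```

A reader would call `map_open_shared()` before each lookup and close the descriptor afterwards; a rebuild tool would loop on (or block instead of using) `map_try_lock_exclusive()` until all readers have released their locks.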
It is important to note that this isn't a problem in the way sendmail uses locks. sendmail doesn't do anything out of the ordinary; somewhere along the line, the Linux version of flock() broke. While switching to fcntl() works around the problem, the semantics of fcntl() are not attractive. Specifically, fcntl() locks are owned by the process rather than by the file descriptor, which matters for a program that forks children to do work. This requires sendmail to write its state to disk and reload it in the child process on fcntl() systems, causing a performance loss.

Also, sendmail isn't the only program affected by this breakage. From Apache 1.3's CHANGES file:

    *) PORT: Switch back to using fcntl() locking on Linux -- instabilities have been reported with flock() locking (probably related to kernel version). [Dean Gaudet] PR#2723, 3531

According to the author of Cyrus IMAP and posts on the mailing list, Cyrus IMAP will also be changing, as they have had similar problems with flock() on Linux.

I sent this bugzilla report to Matthew Wilcox <willy> and this is what he wrote back to me:

--
Nah, I know what it is. I just don't know how to fix it properly. Here's how to reproduce it:

    fd = open(); flock(fd); fork(); flock(fd, F_UNLCK)

now child's count goes to -1. The file lock accounting code is horribly broken (and I wrote it, I should know). I think the best solution to 2.4 is simply to delete it, at least for BSD-style flocks. Note that 2.5 has the same issue, but I'll fix it differently there.
--

Maybe the appropriate people from Red Hat/Sendmail might want to get further zen from Matthew about this.

Created attachment 63365
Patch from Matthew Wilcox to remove file-lock accounting
I have attached a patch, sent to me by Matthew Wilcox, which removes the file-lock accounting code from 2.4. Please ask him if you have issues with it. I am just the guy relaying messages.

So, as someone who's just come across this problem: is that patch going to be merged into a new kernel errata? Or has it already been? I can't see anything in the changelog about it.

It's applied in the current errata at least; I'm not sure when it got committed (it was before I took over).
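For anyone wanting to try the sequence Matthew Wilcox describes (open, flock, fork, then unlock in the child), a rough C sketch follows. Several things here are assumptions rather than part of the report: the function name `flock_fork_unlock` is hypothetical, LOCK_UN is used because that is the unlock operation flock() accepts (the quoted F_UNLCK is the fcntl() constant), and the probe through a second descriptor is added only to make the outcome observable. The sketch cannot show the kernel's per-process counter itself, and on a kernel with the accounting removed it is expected to complete without any ENOLCK.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical reproduction helper. The parent takes a shared flock() lock,
 * forks, and the child releases that lock through the inherited descriptor.
 * On success returns the child's exit status (0 if its LOCK_UN succeeded,
 * 1 if it failed) and sets *released to 1 if an independent descriptor can
 * then take an exclusive lock. Returns -1 if setup fails in the parent.
 */
int flock_fork_unlock(const char *path, int *released)
{
    int fd, probe, status;
    pid_t pid;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_SH) < 0) {      /* the parent acquires the lock */
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: unlocks a lock that it never acquired itself. */
        _exit(flock(fd, LOCK_UN) == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }

    /* Probe with a separate open file description while fd is still open. */
    *released = 0;
    probe = open(path, O_RDWR);
    if (probe >= 0) {
        if (flock(probe, LOCK_EX | LOCK_NB) == 0)
            *released = 1;
        close(probe);
    }
    close(fd);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because a flock() lock belongs to the open file description that parent and child share after fork(), the child's unlock also drops the parent's lock. That is the situation in which a per-process lock count, incremented in one process and decremented in another, can drift out of step.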