Bug 59992

| Summary: | Unable to flock file on "busy" Sendmail server | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Linux | Reporter: | Michael Brock <michael_brock> |
| Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.1 | CC: | chrismcc, chris.ricker, developer.redhat.com, dtong, gdh, joe.simmons, jon, shishz |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2003-12-17 01:13:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Patch from Matthew Wilcox to remove file-lock accounting (attachment 63365) | | |
Description
Michael Brock
2002-02-18 17:28:34 UTC
Please validate and quantify this bug using standard Red Hat kernels. Is NFS involved?

This error has occurred under every standard Red Hat kernel we have used, including the standard, SMP, and enterprise kernels. We have also compiled the most recent kernel from kernel.org using the Red Hat enterprise config script, and the same problem occurs. This only happens with Sendmail Switch, not open source sendmail. No NFS is used on the box. We tried setting RLIMIT_LOCKS to RLIM_INFINITY in the sendmail binary, but it did not help. Compiling sendmail to use fcntl() instead of flock() seems to work around the problem. The same sendmail code (using flock()) is used on all other Unix platforms without showing this behavior. We suspect that this is a bug in the implementation of flock() on Linux. We discussed the issue with other open source developers at conferences, and some of them have seen similar problems with flock() in their applications on Linux as well.

An earlier analysis done on the problem (for informational purposes only):
The problem being investigated is why flock() fails, returning ENOLCK.
My investigation was against glibc-2.2.4-19.3 and kernel-2.4.9-21. However,
after completing this writeup, I noticed that even though I was told the
kernel in use was the above version, the dmesg output provided by the customer
states:
Linux version 2.4.17 (root.com) (gcc version 2.96 20000731
(Red Hat Linux 7.1 2.96-85)) #2 SMP Tue Feb 5 12:51:09 PST 2002
After looking at the glibc source, it appears that flock is a direct call
into the kernel. There are two ways to get ENOLCK from flock():
1. A failure in the flock() system call.
2. An attempt to lock a file over NFS.
Since NFS is not in use, I'll ignore the NFS case and concentrate on
the system call case. However, I note rpc.statd is running on the
machine. This process is used by rpc.lockd for NFS file locking services.
It should not be running if NFS is not in use.
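For context, this is roughly how the failure surfaces to an application: flock() returns -1 with errno set to ENOLCK. A minimal sketch (the path is purely illustrative, not taken from sendmail):

    #include <sys/file.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Illustrative file; sendmail would be locking one of its map files. */
        int fd = open("/tmp/somemap.db", O_RDONLY | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (flock(fd, LOCK_SH) < 0) {
            if (errno == ENOLCK)
                fprintf(stderr, "flock: no locks available (ENOLCK)\n");
            else
                fprintf(stderr, "flock: %s\n", strerror(errno));
            close(fd);
            return 1;
        }
        /* ... read the map while holding the shared lock ... */
        flock(fd, LOCK_UN);
        close(fd);
        return 0;
    }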
flock() is a system call implemented in linux/fs/locks.c:sys_flock().
It calls flock_lock_file() which does:
    if (!unlock) {
        error = -ENOLCK;
        new_fl = flock_make_lock(filp, lock_type);
        if (!new_fl)
            return error;
    }
So the only way to get ENOLCK is if flock_make_lock() fails. That function
only fails if locks_alloc_lock() fails:
    struct file_lock *fl = locks_alloc_lock(1);
    if (fl == NULL)
        return NULL;
locks_alloc_lock() is:
    /* Allocate an empty lock structure. */
    static struct file_lock *locks_alloc_lock(int account)
    {
        struct file_lock *fl;
        if (account && current->locks >= current->rlim[RLIMIT_LOCKS].rlim_cur)
            return NULL;
        fl = kmem_cache_alloc(filelock_cache, SLAB_KERNEL);
        if (fl)
            current->locks++;
        return fl;
    }
It can fail for two reasons. First, the number of locks reaches the
RLIMIT_LOCKS soft (rlim_cur) resource limit. There may be a system-wide
tunable for setting this limit at boot time; I'm not familiar enough with
Linux to know how to tune it.
This program will print the RLIMIT_LOCKS resource limit. To get a true
picture though, it needs to be run in the same process that starts sendmail
at boot time. This probably means changing the sendmail startup script.
Running the program in your shell will only show the limits of your
particular process and these limits differ per-user and may vary depending
on system state.
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sysexits.h>

    int
    main(int argc, char **argv)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_LOCKS, &rl) < 0)
        {
            fprintf(stderr, "getrlimit(RLIMIT_LOCKS): %s\n", strerror(errno));
            exit(EX_OSERR);
        }
        printf("RLIMIT_LOCKS rlim_cur = %ld\n", (long) rl.rlim_cur);
        printf("RLIMIT_LOCKS rlim_max = %ld\n", (long) rl.rlim_max);
        printf("NOTE: -1 == infinite\n");
        exit(EX_OK);
    }
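If the soft limit turns out to be the culprit, one way to adjust it in the same process that starts sendmail is a small wrapper that raises the limit and then execs the daemon. A minimal sketch, assuming the usual /usr/sbin/sendmail path (the path and the choice of raising the soft limit to the hard limit are illustrative, not something taken from this report):

    /* Hypothetical wrapper: check (and optionally raise) RLIMIT_LOCKS in the
     * process that will become sendmail, then exec the real binary.
     * Resource limits are preserved across exec, so the daemon inherits them. */
    #include <sys/resource.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_LOCKS, &rl) == 0) {
            fprintf(stderr, "RLIMIT_LOCKS cur=%ld max=%ld\n",
                    (long) rl.rlim_cur, (long) rl.rlim_max);
            rl.rlim_cur = rl.rlim_max;   /* raise the soft limit to the hard limit */
            if (setrlimit(RLIMIT_LOCKS, &rl) < 0)
                perror("setrlimit(RLIMIT_LOCKS)");
        }
        execv("/usr/sbin/sendmail", argv);   /* illustrative path */
        perror("execv");
        return 1;
    }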
The second way in which locks_alloc_lock() can fail is if kmem_cache_alloc()
fails. This function is found in linux/mm/slab.c. Note that this code is
different on single processor and multiprocessor machines. This code is a
bit more complex and would require more research to understand fully.
Using the huge assumption (given my limited knowledge of the code) that all
attempts to use an available cached slab will succeed, the only place left
for failure is if kmem_cache_grow() fails to grow the cache. That function
can fail if:
1. The SLAB_NO_GROW flag is set. This isn't the case on the call from
locks_alloc_lock().
2. kmem_cache_slabmgmt() fails. This can be a recursive call into
kmem_cache_alloc(), so its only failure mode is if kmem_getpages()
fails.
3. kmem_getpages() fails. This fails if __get_free_pages() returns NULL.
linux/mm/page_alloc.c:__get_free_pages() returns NULL if alloc_pages()
fails. alloc_pages() is aliased to _alloc_pages() in
linux/include/linux/mm.h. linux/mm/numa.c:_alloc_pages() can fail if
alloc_pages_pgdat() fails to give a new page. That function is simply a
call to linux/mm/page_alloc.c:__alloc_pages(). Again, we have hit some
necessarily complex code. This code tries to allocate memory using
__alloc_pages_limit() in multiple ways. If it gets desperate, it wakes up
kswapd to start swapping and tries more allocations using
__alloc_pages_limit().
I note that the ps list from the customer machine shows kswapd has a lot of
CPU time (224:02) which seems to indicate that the machine has been
swapping. In fact, kswapd has more CPU time than any other process on the
machine. That's pretty odd given the amount of memory these machines have
(3G if I read the dmesg output correctly). This may be a red herring.
Going back to __alloc_pages_limit(), I see that it obeys per-zone memory
limits and will fail if it can't find available memory within the zone
limits. It is possible that the "zone" used for allocating kernel memory
for locks is simply filling up given the number of locks in use. Once again,
I've reached the limit of my knowledge of Linux kernel internals and don't
know how to tune these zone limits and/or increase the amount of kernel
memory reserved at boot time.
In summary, ENOLCK is returned by flock() for one of three reasons:
1. NFS file locks in use.
2. The RLIMIT_LOCKS resource limit is reached.
3. The kernel memory for locks is exhausted.
The first is not an issue on this system (except for the running
rpc.statd). The second is a tunable parameter which should be
investigated. The third is probably also tunable and should be looked into.
Perhaps some of these dmesg output values will help someone
knowledgeable in kernel tuning to pinpoint the problem:
2815MB HIGHMEM available.
On node 0 totalpages: 950270
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 720894 pages.
Processors: 4
Memory: 3738380k/3801080k available (1162k kernel code, 62316k reserved, 403k
data, 256k init, 2883576k highmem)
Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
Buffer-cache hash table entries: 262144 (order: 8, 1048576 bytes)
Page-cache hash table entries: 524288 (order: 9, 2097152 bytes)
It's possible the cache sizes shown are too low.
On a side note, it would be interesting to see the output of
`cat /proc/locks` when a machine is in trouble.
sendmail uses flock to do file locking on plain files so different processes
can share those files without a problem. For example, the access DB alluded
to in the bug report is used for policy information. Every sendmail process
uses a shared lock on that file so it can't be written to while it is in use.
makemap, an associated utility which rebuilds the map from a plain text file,
locks the file with an exclusive lock before updating it. On a heavily loaded
machine, there can be hundreds of sendmail processes taking shared locks before
reading data from the maps.
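A minimal sketch of that pattern, assuming an illustrative map path (this is not the actual sendmail or makemap code):

    #include <sys/file.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAP_PATH "/tmp/access.db"   /* illustrative path only */

    /* Reader: each sendmail process takes a shared lock before consulting the map. */
    static void read_map(void)
    {
        int fd = open(MAP_PATH, O_RDONLY);

        if (fd >= 0 && flock(fd, LOCK_SH) == 0) {
            /* ... look up keys while the shared lock is held ... */
            flock(fd, LOCK_UN);
        }
        if (fd >= 0)
            close(fd);
    }

    /* Updater: a makemap-style rebuild takes an exclusive lock before rewriting. */
    static void rebuild_map(void)
    {
        int fd = open(MAP_PATH, O_RDWR);

        if (fd >= 0 && flock(fd, LOCK_EX) == 0) {
            /* ... rewrite the map while the exclusive lock is held ... */
            flock(fd, LOCK_UN);
        }
        if (fd >= 0)
            close(fd);
    }

    int main(void)
    {
        read_map();
        rebuild_map();
        return 0;
    }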
It is important to note that this isn't a problem in the way sendmail uses
locks. sendmail doesn't do anything out of the ordinary; somewhere along the
line, the Linux version of flock() broke. While switching to fcntl() works
around the problem, the semantics of fcntl() are not attractive.
Specifically, fcntl() locks are owned by the process rather than by the file
descriptor, which matters for a program that forks children to do the work.
This requires sendmail to write its state to disk and reload it in the child
process on fcntl() systems, causing a performance loss.
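To illustrate that semantic difference, a minimal sketch (the temporary file path is only an illustration): the child inherits the descriptor, but F_GETLK reports the fcntl() lock as owned by the parent's pid, whereas a flock() lock taken on the same descriptor would be shared with the child.

    /* Demonstrates that a fcntl() lock is owned by the process, not the
     * descriptor: after fork() the child sees the region as locked by the
     * parent's pid, even though it inherited the same file descriptor. */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || fcntl(fd, F_SETLK, &fl) < 0) {
            perror("setup");
            exit(1);
        }
        if (fork() == 0) {
            struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            fcntl(fd, F_GETLK, &probe);   /* who owns the lock on this fd? */
            printf("child: lock %s, owner pid %ld (parent is %ld)\n",
                   probe.l_type == F_UNLCK ? "free" : "held",
                   (long) probe.l_pid, (long) getppid());
            _exit(0);
        }
        wait(NULL);
        return 0;
    }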
Also, sendmail isn't the only program affected by this breakage:
From Apache 1.3's CHANGES file:
*) PORT: Switch back to using fcntl() locking on Linux -- instabilities
have been reported with flock() locking (probably related to kernel
version). [Dean Gaudet] PR#2723, 3531
According to the author of Cyrus IMAP and posts on the mailing list, Cyrus
IMAP will also be changing, as they have had similar problems with flock()
on Linux.
I sent this bugzilla report to Matthew Wilcox <willy> and this is what he wrote back to me:

"Nah, I know what it is. I just don't know how to fix it properly. Here's how to reproduce it: fd = open(); flock(fd); fork(); flock(fd, F_UNLCK); now the child's count goes to -1. The file lock accounting code is horribly broken (and I wrote it, I should know). I think the best solution for 2.4 is simply to delete it, at least for BSD-style flocks. Note that 2.5 has the same issue, but I'll fix it differently there."

Maybe the appropriate people from Red Hat/Sendmail might want to get further zen from Matthew about this.
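A userspace sketch of the sequence Matthew describes, for reference; the file path is only an illustration, and the comments about the 2.4 accounting restate his explanation above rather than anything verified here:

    /* Sketch of the sequence described above: take a flock() lock in the
     * parent, fork(), and release it in the child.  The flock lock itself
     * is attached to the shared open file description, so the unlock is
     * legal; per Matthew's explanation, the 2.4 per-process accounting
     * decrements the child's counter, which never counted the lock. */
    #include <sys/file.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/flock-repro", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || flock(fd, LOCK_EX) < 0) {
            perror("setup");
            exit(1);
        }
        if (fork() == 0) {
            /* Child inherits the descriptor and releases the lock. */
            if (flock(fd, LOCK_UN) < 0)
                perror("child flock(LOCK_UN)");
            _exit(0);
        }
        wait(NULL);
        close(fd);
        return 0;
    }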
Created attachment 63365 [details]
Patch from Matthew Wilcox to remove file-lock accounting

I have attached a patch which was sent to me by Matthew Wilcox; it removes the file-lock accounting code from 2.4. Please ask him if you have issues with it. I am just the guy relaying messages.

So, as someone who has just come across this problem: is that patch going to be merged into a new kernel errata? Or has it already been? I can't see anything in the changelog about it.

It's applied in the current errata at least; not sure when it got committed (before I took over).