Bug 59992

| Summary: | Unable to flock file on "busy" Sendmail server | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Linux | Reporter: | Michael Brock <michael_brock> |
| Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.1 | CC: | chrismcc, chris.ricker, developer.redhat.com, dtong, gdh, joe.simmons, jon, shishz |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2003-12-17 01:13:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Patch from Matthew Wilcox to remove file-lock accounting (attachment 63365) | | |
Description
Michael Brock
2002-02-18 17:28:34 UTC
Please validate and quantify this bug using standard Red Hat kernels. Is NFS involved?

This error has occurred under every standard Red Hat kernel we have used, including the standard, SMP, and enterprise kernels. We have also compiled the most recent kernel from kernel.org using the Red Hat enterprise config script, and the same problem occurs. This only happens with Sendmail Switch, not open source sendmail. No NFS is used on the box. We tried setting RLIMIT_LOCKS to RLIM_INFINITY in the sendmail binary, but it did not help. Compiling sendmail to use fcntl() instead of flock() seems to work around the problem. The same sendmail code (using flock()) is used on all other Unix platforms without showing this behavior. We suspect that this is a bug in the implementation of flock() on Linux. We discussed the issue with other open source developers at conferences, and some of them have seen similar problems with flock() in their applications on Linux as well.

An earlier analysis done on the problem (for informational purposes only):
The problem being investigated is why flock() fails, returning ENOLCK.
My investigation was against glibc-2.2.4-19.3 and kernel-2.4.9-21. However,
after completing this writeup, I noticed that even though I was told the
kernel in use was the above version, the dmesg output provided by the customer
states:
Linux version 2.4.17 (root.com) (gcc version 2.96 20000731
(Red Hat Linux 7.1 2.96-85)) #2 SMP Tue Feb 5 12:51:09 PST 2002
After looking at the glibc source, it appears that flock is a direct call
into the kernel. There are two ways to get ENOLCK from flock():
1. A failure in the flock() system call.
2. An attempt to lock a file over NFS.
Since NFS is not in use, I'll ignore the NFS case and concentrate on
the system call case. However, I note rpc.statd is running on the
machine. This process is used by rpc.lockd for NFS file locking services.
It should not be running if NFS is not in use.
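For context, this is roughly how the failure surfaces to an application: flock() returns -1 with errno set to ENOLCK. A minimal sketch (the path is purely illustrative, not taken from sendmail):

    #include <sys/file.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Illustrative file; sendmail would be locking one of its map files. */
        int fd = open("/tmp/somemap.db", O_RDONLY | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (flock(fd, LOCK_SH) < 0) {
            if (errno == ENOLCK)
                fprintf(stderr, "flock: no locks available (ENOLCK)\n");
            else
                fprintf(stderr, "flock: %s\n", strerror(errno));
            close(fd);
            return 1;
        }
        /* ... read the map while holding the shared lock ... */
        flock(fd, LOCK_UN);
        close(fd);
        return 0;
    }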
flock() is a system call implemented in linux/fs/locks.c:sys_flock().
It calls flock_lock_file() which does:
    if (!unlock) {
        error = -ENOLCK;
        new_fl = flock_make_lock(filp, lock_type);
        if (!new_fl)
            return error;
    }
So the only way to get ENOLCK is if flock_make_lock() fails. That function
only fails if locks_alloc_lock() fails:
    struct file_lock *fl = locks_alloc_lock(1);
    if (fl == NULL)
        return NULL;
locks_alloc_lock() is:
    /* Allocate an empty lock structure. */
    static struct file_lock *locks_alloc_lock(int account)
    {
        struct file_lock *fl;
        if (account && current->locks >= current->rlim[RLIMIT_LOCKS].rlim_cur)
            return NULL;
        fl = kmem_cache_alloc(filelock_cache, SLAB_KERNEL);
        if (fl)
            current->locks++;
        return fl;
    }
It can fail for two reasons. First, the number of locks reaches the
RLIMIT_LOCKS soft (rlim_cur) resource limit. There may be a system-wide
tunable for setting this limit at boot time; I'm not familiar enough with
Linux to know how to tune it.
This program will print the RLIMIT_LOCKS resource limit. To get a true
picture though, it needs to be run in the same process that starts sendmail
at boot time. This probably means changing the sendmail startup script.
Running the program in your shell will only show the limits of your
particular process and these limits differ per-user and may vary depending
on system state.
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sysexits.h>

    int
    main(int argc, char **argv)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_LOCKS, &rl) < 0)
        {
            fprintf(stderr, "getrlimit(RLIMIT_LOCKS): %s\n", strerror(errno));
            exit(EX_OSERR);
        }
        printf("RLIMIT_LOCKS rlim_cur = %ld\n", (long) rl.rlim_cur);
        printf("RLIMIT_LOCKS rlim_max = %ld\n", (long) rl.rlim_max);
        printf("NOTE: -1 == infinite\n");
        exit(EX_OK);
    }
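If the soft limit turns out to be the culprit, one way to adjust it in the same process that starts sendmail is a small wrapper that raises the limit and then execs the daemon. A minimal sketch, assuming the usual /usr/sbin/sendmail path (the path and the choice of raising the soft limit to the hard limit are illustrative, not something taken from this report):

    /* Hypothetical wrapper: check (and optionally raise) RLIMIT_LOCKS in the
     * process that will become sendmail, then exec the real binary.
     * Resource limits are preserved across exec, so the daemon inherits them. */
    #include <sys/resource.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_LOCKS, &rl) == 0) {
            fprintf(stderr, "RLIMIT_LOCKS cur=%ld max=%ld\n",
                    (long) rl.rlim_cur, (long) rl.rlim_max);
            rl.rlim_cur = rl.rlim_max;   /* raise the soft limit to the hard limit */
            if (setrlimit(RLIMIT_LOCKS, &rl) < 0)
                perror("setrlimit(RLIMIT_LOCKS)");
        }
        execv("/usr/sbin/sendmail", argv);   /* illustrative path */
        perror("execv");
        return 1;
    }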
The second way in which locks_alloc_lock() can fail is if kmem_cache_alloc()
fails. This function is found in linux/mm/slab.c. Note that this code is
different on single processor and multiprocessor machines. This code is a
bit more complex and would require more research to understand fully.
Using the huge assumption (given my limited knowledge of the code) that all
attempts to use an available cached slab will succeed, the only place left
for failure is if kmem_cache_grow() fails to grow the cache. That function
can fail if:
1. The SLAB_NO_GROW flag is set. This isn't the case on the call from
locks_alloc_lock().
2. kmem_cache_slabmgmt() fails. This can be a recursive call into
kmem_cache_alloc(), so its only failure mode is if kmem_getpages()
fails.
3. kmem_getpages() fails. This fails if __get_free_pages() returns NULL.
linux/mm/page_alloc.c:__get_free_pages() returns NULL if alloc_pages()
fails. alloc_pages() is aliased to _alloc_pages() in
linux/include/linux/mm.h. linux/mm/numa.c:_alloc_pages() can fail if
alloc_pages_pgdat() fails to give a new page. That function is simply a
call to linux/mm/page_alloc.c:__alloc_pages(). Again, we have hit some
necessarily complex code. This code tries to allocate memory using
__alloc_pages_limit() in multiple ways. If it gets desperate, it wakes up
kswapd to start swapping and tries more allocations using
__alloc_pages_limit().
I note that the ps list from the customer machine shows kswapd has a lot of
CPU time (224:02) which seems to indicate that the machine has been
swapping. In fact, kswapd has more CPU time than any other process on the
machine. That's pretty odd given the amount of memory these machines have
(3G if I read the dmesg output correctly). This may be a red herring.
Going back to __alloc_pages_limit(), I see that it obeys per-zone memory
limits and will fail if it can't find available memory within the zone
limits. It is possible that the "zone" used for allocating kernel memory
for locks is simply filling up given the number of locks in use. Once again,
I've reached the limit of my knowledge of Linux kernel internals and don't
know how to tune these zone limits and/or increase the amount of kernel
memory reserved at boot time.
In summary, ENOLCK is returned by flock() for one of three reasons:
1. NFS file locks in use.
2. The RLIMIT_LOCKS resource limit is reached.
3. The kernel memory for locks is exhausted.
The first is not an issue on this system (except for the running
rpc.statd). The second is a tunable parameter which should be
investigated. The third is probably also tunable and should be looked into.
Perhaps some of these dmesg output values will help someone
knowledgeable in kernel tuning to pinpoint the problem:
2815MB HIGHMEM available.
On node 0 totalpages: 950270
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 720894 pages.
Processors: 4
Memory: 3738380k/3801080k available (1162k kernel code, 62316k reserved, 403k
data, 256k init, 2883576k highmem)
Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
Buffer-cache hash table entries: 262144 (order: 8, 1048576 bytes)
Page-cache hash table entries: 524288 (order: 9, 2097152 bytes)
It's possible the cache sizes shown are too low.
On a side note, it would be interesting to see the output of
`cat /proc/locks` when a machine is in trouble.
sendmail uses flock to do file locking on plain files so different processes
can share those files without a problem. For example, the access DB alluded
to in the bug report is used for policy information. Every sendmail process
uses a shared lock on that file so it can't be written to while it is in use.
makemap, an associated utility which rebuilds the map from a plain text file,
locks the file with an exclusive lock before updating it. On a heavily loaded
machine, there can be hundreds of sendmail processes taking shared locks before
reading data from the maps.
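A minimal sketch of that pattern, assuming an illustrative map path (this is not the actual sendmail or makemap code):

    #include <sys/file.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAP_PATH "/tmp/access.db"   /* illustrative path only */

    /* Reader: each sendmail process takes a shared lock before consulting the map. */
    static void read_map(void)
    {
        int fd = open(MAP_PATH, O_RDONLY);

        if (fd >= 0 && flock(fd, LOCK_SH) == 0) {
            /* ... look up keys while the shared lock is held ... */
            flock(fd, LOCK_UN);
        }
        if (fd >= 0)
            close(fd);
    }

    /* Updater: a makemap-style rebuild takes an exclusive lock before rewriting. */
    static void rebuild_map(void)
    {
        int fd = open(MAP_PATH, O_RDWR);

        if (fd >= 0 && flock(fd, LOCK_EX) == 0) {
            /* ... rewrite the map while the exclusive lock is held ... */
            flock(fd, LOCK_UN);
        }
        if (fd >= 0)
            close(fd);
    }

    int main(void)
    {
        read_map();
        rebuild_map();
        return 0;
    }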
It is important to note that this isn't a problem in the way sendmail uses
locks. sendmail doesn't do anything out of the ordinary; somewhere along the
line, the Linux version of flock() broke. While switching to fcntl() works
around the problem, the semantics of fcntl() are not attractive.
Specifically, fcntl() locks are owned by the process rather than by the file
descriptor, which matters for a program that forks children to do the work.
This requires sendmail to write its state to disk and reload it in the child
process on fcntl() systems, causing a performance loss.
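To illustrate that semantic difference, a minimal sketch (the temporary file path is only an illustration): the child inherits the descriptor, but F_GETLK reports the fcntl() lock as owned by the parent's pid, whereas a flock() lock taken on the same descriptor would be shared with the child.

    /* Demonstrates that a fcntl() lock is owned by the process, not the
     * descriptor: after fork() the child sees the region as locked by the
     * parent's pid, even though it inherited the same file descriptor. */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || fcntl(fd, F_SETLK, &fl) < 0) {
            perror("setup");
            exit(1);
        }
        if (fork() == 0) {
            struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            fcntl(fd, F_GETLK, &probe);   /* who owns the lock on this fd? */
            printf("child: lock %s, owner pid %ld (parent is %ld)\n",
                   probe.l_type == F_UNLCK ? "free" : "held",
                   (long) probe.l_pid, (long) getppid());
            _exit(0);
        }
        wait(NULL);
        return 0;
    }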
Also, sendmail isn't the only program affected by this breakage:
From Apache 1.3's CHANGES file:
*) PORT: Switch back to using fcntl() locking on Linux -- instabilities
have been reported with flock() locking (probably related to kernel
version). [Dean Gaudet] PR#2723, 3531
According to the author of Cyrus IMAP and posts on the mailing list, Cyrus
IMAP will also be changing, as they have had similar problems with flock()
on Linux.
I sent this bugzilla report to Matthew Wilcox <willy> and this is what he wrote back to me:

"Nah, I know what it is. I just don't know how to fix it properly. Here's how to reproduce it: fd = open(); flock(fd); fork(); flock(fd, F_UNLCK); now the child's count goes to -1. The file lock accounting code is horribly broken (and I wrote it, I should know). I think the best solution for 2.4 is simply to delete it, at least for BSD-style flocks. Note that 2.5 has the same issue, but I'll fix it differently there."

Maybe the appropriate people from Red Hat/Sendmail might want to get further zen from Matthew about this.
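A userspace sketch of the sequence Matthew describes, for reference; the file path is only an illustration, and the comments about the 2.4 accounting restate his explanation above rather than anything verified here:

    /* Sketch of the sequence described above: take a flock() lock in the
     * parent, fork(), and release it in the child.  The flock lock itself
     * is attached to the shared open file description, so the unlock is
     * legal; per Matthew's explanation, the 2.4 per-process accounting
     * decrements the child's counter, which never counted the lock. */
    #include <sys/file.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/flock-repro", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || flock(fd, LOCK_EX) < 0) {
            perror("setup");
            exit(1);
        }
        if (fork() == 0) {
            /* Child inherits the descriptor and releases the lock. */
            if (flock(fd, LOCK_UN) < 0)
                perror("child flock(LOCK_UN)");
            _exit(0);
        }
        wait(NULL);
        close(fd);
        return 0;
    }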
Created attachment 63365 [details]
Patch from Matthew Wilcox to remove file-lock accounting

I have attached a patch which was sent to me by Matthew Wilcox; it removes the file-lock accounting code from 2.4. Please ask him if you have issues with it. I am just the guy relaying messages.

So, as someone who has just come across this problem: is that patch going to be merged into a new kernel errata? Or has it already been? I can't see anything in the changelog about it.

It's applied in the current errata at least; not sure when it got committed (before I took over).