Bug 154603

Summary: flock() over NFS causes very sluggish performance and "system" cpu consumption on SMP machines
Product: [Fedora] Fedora
Reporter: Konstantin Olchanski <olchansk>
Component: kernel
Assignee: Steve Dickson <steved>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4
CC: davej, riel
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-05-05 21:18:49 UTC

Description Konstantin Olchanski 2005-04-12 23:05:57 UTC
Description of problem:

Using flock() over NFS under the 2.6.x SMP kernels causes bizarre system-wide
performance degradation. Non-SMP kernels and 2.4.x kernels do not exhibit this
problem. This is what I see:

I was debugging a performance degradation of our compute-intensive application
after upgrading from RHL8.0 to FC3. On our dual-CPU (Athlon and P4) machines,
when two of these applications run at the same time, the machine feels very
sluggish (interactive response is slow, NFS mounts sometimes time out). "top"
reports 50% "user" and 50% "system" CPU usage (oprofile, vmstat and
"/usr/bin/time" show the same), yet "strace" shows that neither application
makes any system calls (they just compute). What?!? No system calls but 50%
"system" CPU usage?!?

I traced the problem to the use of flock() over NFS early in the application. As
we know, flock() over NFS "does not work", but this application worked fine
until the 2.6.x kernels in FC2 and FC3. (The locking code in this application
probably predates Linux. It is useless, and I am presently removing it.)

The problem is with closing the lockfile without first releasing the lock.
"man flock" says that closing the file should "just release the lock", and I
guess that is how it works in the 2.4 and the non-SMP 2.6 kernels.
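
The documented behaviour (closing the descriptor drops the lock) can be checked on a local file. The following is only a sketch: the helper name close_releases_lock is made up for illustration, it is not part of the original application, and it assumes the file already exists and is on a local file system where flock() behaves as the man page describes.

```cpp
#include <fcntl.h>     // open(), O_RDWR
#include <sys/file.h>  // flock(), LOCK_*
#include <unistd.h>    // close()

// Hypothetical helper: returns true if a lock taken through one descriptor
// is gone once that descriptor is closed, i.e. a second, independent open()
// of the same file can then take the exclusive lock without blocking.
bool close_releases_lock(const char *path)
{
  int first = open(path, O_RDWR);
  if (first == -1) return false;
  int second = open(path, O_RDWR);
  if (second == -1) { close(first); return false; }

  // The first descriptor takes the lock; the second must be refused
  // while the lock is still held.
  bool ok = flock(first, LOCK_EX | LOCK_NB) == 0
         && flock(second, LOCK_EX | LOCK_NB) == -1;

  close(first);  // no LOCK_UN, same pattern as in the application

  // If close() released the lock, the second descriptor can take it now.
  ok = ok && flock(second, LOCK_EX | LOCK_NB) == 0;

  close(second);
  return ok;
}
```

This only checks the locking semantics. It says nothing about the CPU usage problem, which shows up only with an NFS-mounted file on the SMP kernels.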

I reduced our application to this example that exhibits the performance problem:

// flock.cc
#include <stdio.h>
#include <fcntl.h>     // open(), O_RDWR
#include <unistd.h>    // close()
#include <sys/file.h>  // flock(), LOCK_*

int main(int argc, char *argv[])
{
  if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
  const char *file = argv[1];

  int fd = open(file, O_RDWR);
  if (fd == -1) { perror("open"); return 1; }
  if (flock(fd, LOCK_EX | LOCK_NB)) { perror("flock"); return 1; }

  //flock(fd, LOCK_UN); // add this line to fix the problem
  close(fd); // or remove this line to fix the problem

  while (1) { /* compute stuff */ }
}
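
The first of the two fixes named in the comments above (unlock explicitly, then close) could be wrapped in a small helper. This is a sketch only: the name lock_and_release is hypothetical, and whether this ordering avoids the problem on every affected kernel is taken from the observation in this report, not verified independently.

```cpp
#include <fcntl.h>     // open(), O_RDWR
#include <sys/file.h>  // flock(), LOCK_*
#include <unistd.h>    // close()
#include <cstdio>      // perror()

// Hypothetical helper: takes an exclusive, non-blocking flock() on `path`,
// then releases it explicitly before closing the descriptor.
// Returns 0 on success and -1 on any error.
int lock_and_release(const char *path)
{
  int fd = open(path, O_RDWR);
  if (fd == -1) { perror("open"); return -1; }

  if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
    perror("flock");
    close(fd);
    return -1;
  }

  /* ... work that needs the lock goes here ... */

  int rc = 0;
  // Unlock first, so close() never has to drop a lock that is still held.
  if (flock(fd, LOCK_UN) == -1) { perror("flock(LOCK_UN)"); rc = -1; }
  if (close(fd) == -1) { perror("close"); rc = -1; }
  return rc;
}
```

In flock.cc this would replace the open/flock/close sequence, e.g. "if (lock_and_release(argv[1]) != 0) return 1;" before the compute loop.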

Compile and run on an SMP machine:
[olchansk@tw04 flock]$ g++     flock.cc   -o flock
[olchansk@tw04 flock]$ touch /tmp/aaa
[olchansk@tw04 flock]$ ./flock /tmp/aaa <---- use local file
In "vmstat" observe 50% "user", 50% idle and 0% "system" cpu usage:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0 1074880 1261436   6868  46624    0    0     0     0 1121   178 50  0 50  0


Now run:
[olchansk@tw04 flock]$ ./flock /some/nfs/mounted/file <--- NFS mounted
In "vmstat" observe high "system" cpu usage:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 4  0 1074880 1261364   6932  46628    0    0     0     0 1217   162 50 37 13  0
 1  0 1074880 1261364   6940  46620    0    0     0    16 1158   279 50 26 24  0
Also try typing shell commands, observe sluggish interactive response.
Inspect the example source code: after the initial open/flock/close there are
no system calls, nothing to cause "system" cpu usage or slow down the machine.

Remove the "close" or add the "flock-unlock" statement, recompile and rerun.
Observe that "system" cpu usage goes back to zero and interactive response
returns to the same as for "local file locking".
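
The user/system split can also be read from inside the program instead of from vmstat. The following is a sketch using getrusage(2); the names CpuTimes, cpu_times and spin are made up for illustration. It reports only the CPU time charged to the calling process, so it is comparable to what "/usr/bin/time" shows, not to the system-wide vmstat numbers.

```cpp
#include <sys/resource.h>  // getrusage(), struct rusage
#include <sys/time.h>      // struct timeval
#include <cstdio>          // perror()

// Hypothetical type: CPU seconds charged to this process so far.
struct CpuTimes { double user_s; double system_s; };

// Returns the user/system CPU time of the calling process,
// or {-1, -1} if getrusage() fails.
CpuTimes cpu_times()
{
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) != 0) { perror("getrusage"); return {-1.0, -1.0}; }
  auto secs = [](const struct timeval &tv) { return tv.tv_sec + tv.tv_usec / 1e6; };
  return { secs(ru.ru_utime), secs(ru.ru_stime) };
}

// Stand-in for the compute loop: burns CPU in user space, no system calls.
void spin(unsigned long iterations)
{
  volatile unsigned long sink = 0;
  for (unsigned long i = 0; i < iterations; ++i) sink = sink + i;
}
```

Calling cpu_times() before and after a bounded spin() and comparing the two system_s values would show whether "system" time is being charged to the process while it only computes.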

Version-Release number of selected component (if applicable):

SMP FC2 and FC3 are affected. Anything older, and non-SMP FC2 and FC3, are fine
(no performance degradation). AMD64, Athlon and Pentium4 machines all behave the
same way.

[olchansk@bench flock]$ uname -a
Linux bench.triumf.ca 2.6.10-1.770_FC3smp #1 SMP Thu Feb 24 18:36:43 EST 2005
x86_64 x86_64 x86_64 GNU/Linux
[olchansk@bench flock]$ rpm -q glibc
glibc-2.3.5-0.fc3.1
[olchansk@bench flock]$ rpm -q gcc
gcc-3.4.2-6.fc3

K.O.

Comment 1 Dave Jones 2005-07-15 19:12:54 UTC
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 2 Dave Jones 2006-01-16 22:33:50 UTC
This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project.
Due to the limited resources of this project, typically only
updates for new security issues are released.

As this bug isn't security related, it has been migrated to a
Fedora Core 4 bug.  Please upgrade to this newer release, and
test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

Thank you.


Comment 3 Dave Jones 2006-02-03 06:16:41 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 4 John Thacker 2006-05-05 21:18:49 UTC
Closing per previous comment.