Red Hat Bugzilla – Bug 185618
gfs 6.1 performance issue (directory lock contention?)
Last modified: 2010-10-22 00:39:41 EDT
Escalated to Bugzilla from IssueTracker
Going ahead and escalating this to engineering, though I'm not sure what can
really be done with such a pessimal GFS case...
They apparently have this benchmark that operates on a directory with around
10000 files. When they run this benchmark and then do an 'ls' in that directory,
they get long delays before the ls returns.
The problem seems to be slow getdents64() calls, and my guess is that the
problem is ultimately contention for the directory lock.
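One way to check that guess is to see where the time actually goes. The sketch below uses a scratch directory as a stand-in for the affected GFS directory; the strace invocation (commented out, since strace may not be installed everywhere) would confirm whether getdents64() dominates:

```shell
# Build a scratch directory standing in for the customer's GFS directory.
dir=$(mktemp -d)
for i in $(seq 1 100); do : > "$dir/file_$i"; done

# A plain "ls" only needs getdents64() to read the directory; "ls -l"
# adds a stat() per entry. To see per-syscall timing, one could run:
#   strace -c -e trace=getdents64 ls "$dir" > /dev/null
# and compare the total getdents64 time against wall-clock time.
count=$(ls "$dir" | wc -l)
echo "$count"
rm -rf "$dir"
```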
I've been able to reproduce something like what they are seeing on a 2 node GFS
cluster. On one machine I run the following shell loop in a directory on a GFS
filesystem:
# while true; do for ((i=1;i<=10000;i++)); do rm -f file_$i; touch file_$i; done; done
and then on the other machine I do an 'ls' in that directory. Timing it gives
numbers roughly like these:
Again, not sure what we can do to tune for this, other than telling them "don't
do that", and recommending that they architect things to reduce contention for
the directory lock (more directories and spread the files into multiple dirs).
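That "spread the files into multiple dirs" recommendation can be sketched as follows. This is an illustrative layout only (scaled down, with made-up names), not a supported GFS feature; the point is that each subdirectory has its own directory lock, so contention is divided by the number of subdirectories:

```shell
# Spread nfiles across ndirs subdirectories so each directory lock
# covers nfiles/ndirs entries instead of all of them.
base=$(mktemp -d)   # stands in for a directory on the GFS mount
nfiles=1000         # scaled down from the customer's 10000 for the example
ndirs=10

for ((d = 0; d < ndirs; d++)); do
    mkdir -p "$base/dir_$d"
done

for ((i = 1; i <= nfiles; i++)); do
    # Bucket by file number here; a real application might hash the name.
    : > "$base/dir_$((i % ndirs))/file_$i"
done

# An "ls" (or a writer) in one subdirectory now only contends for that
# subdirectory's lock, not a single lock covering all nfiles entries.
total=$(find "$base" -type f | wc -l)
echo "$total"
rm -rf "$base"
```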
They also complained that removing files from the directory during this test
also takes a very long time, but my guess is that it is due to the same
problem (directory lock contention), so anything we do to help the first issue
will probably help the second.
We will take a look at it, but this sounds very much like the postmark benchmark
performance issue. Doing stats of the filesystem requires accessing every
resource group in the filesystem to collect the information. This can take a
long time.
This is a design issue with GFS that we are addressing in GFS2. If this is for
a mail server solution, a hierarchical directory structure to avoid directory
contention and mounting with noatime are two options that have been implemented
by other customers with some success.
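The noatime option mentioned above keeps reads from dirtying inode metadata, so an "ls" or a file read no longer generates atime updates that other nodes must later flush. A sketch of the two ways to set it, with placeholder device and mount-point names:

```shell
# /etc/fstab entry (device and mount point are placeholders):
#   /dev/vg0/gfslv  /mnt/gfs  gfs  defaults,noatime  0 0
#
# Or at mount time:
#   mount -t gfs -o noatime /dev/vg0/gfslv /mnt/gfs
```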
Moving off the fix list for U4. This one may not be addressable in the RHEL4
version of gfs.
Need to open another bugzilla to address the particular customer issue as
described in comment #12. The following is a short description of *this*
bugzilla that hopefully can help people understand the issue better:
The problem here is *much* more than lock contention - it hits several
design and architecture limitations. We have been hoping GFS2 could address
it. For GFS1, it is better to educate people about the ramifications so
they can find the proper workaround for their particular setup. Though I'm
trying to do something about this, the work is not a short-term effort.
It is also important to point out that these problems do not exist in GFS
alone. These are issues that may well challenge other cluster
filesystems, with differing degrees of severity and/or symptoms.
Users (and support engineers) must understand GFS is a journaling filesystem.
That is, we have to ensure filesystem consistency without fsck if at all
possible. This implies each meta-data change (transaction) is logged into a
journal (file) and there are rules about the "sequence" of these changes.
Say, for example, we create two files on the same (SMP) node at the same
time. Since their meta-data could be written to the journal interleaved,
syncing one file to disk normally requires syncing the other files as well.
The performance hit increases as more files have meta-data interleaved with
each other in the journal.
At the same time, GFS is also a cluster filesystem. That is, it must
guarantee cache coherency between different nodes. When an "ls" command
(which is a "read" that requires a shared lock) is issued, if the previous
lock holder did some form of "write" (under an exclusive lock), the data
needs to be flushed to disk before the shared lock is granted. This
ensures the other nodes have a consistent view of the data across the cluster.
Now, if you have heavy write activity on *lots of* files within one
single directory across the cluster, then when an "ls" is issued, all the
files are required to be "sync"ed to disk before the directory's shared lock
can be granted. Think about the performance hits generated - what else could
impact a computer system's performance more heavily than these types of
operations? Lock contention, disk I/O, VM memory requirements, together
with GFS's and DLM's own overheads.
As we have repeatedly said in the past, separate the directories and cut
down the file count within each directory if at all possible. Structure your
application and configuration to allow "parallel" processing. GFS allows
concurrent access to the filesystem across the cluster - however, that doesn't
mean users can use it arbitrarily without understanding "parallel" principles.
I do understand people's need for "load-balancing". However, you must
understand the overheads and internals before load-balancing your workload.
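One way to get the "locality" the next comment talks about is to give each cluster node its own working directory, so writers on different nodes never share a directory lock. A sketch, using the hostname as an assumed per-node key:

```shell
# Per-node working directories: writers on different nodes never touch
# the same directory lock, instead of blindly load-balancing one
# directory's files across the cluster.
base=$(mktemp -d)          # stands in for the GFS mount point
nodedir="$base/work_$(hostname)"
mkdir -p "$nodedir"

# All of this node's file churn stays under its own directory's lock:
for ((i = 1; i <= 50; i++)); do
    : > "$nodedir/file_$i"
done
made=$(ls "$nodedir" | wc -l)
echo "$made"
rm -rf "$base"
```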
Actually I should have said "locality", instead of parallel processing.
While working on another bugzilla, I happened to catch a thread back trace
and realized GFS pre-fetches two glocks for *every* file within a directory
in its readdir system call. It looks to me like an optimization that
assumed whenever "ls" was issued, file accesses would follow. However,
on a 10000-file directory as this bugzilla describes, this could be very
expensive. Will remove that pre-fetch logic to see how much we can improve
this issue.
OK, I was wrong - it doesn't grab the file locks, but the directory lock.
So it refreshes the directory lock for each disk read - that makes sense.
False alarm!
However, GFS does invalidate *all* the pages of this directory file each
and every time the directory lock is refreshed. So there may be
some room for improvement there.
I was wrong about "I was wrong" in comment #33. It is prefetching the file
locks (not the directory lock). Unfortunately, we can't see much improvement
using the test case mentioned in comment #1 (since the files are created via
"touch" and have zero length). However, in the general case, turning off this
file lock prefetching should help. Will describe the issue in detail
before sending it out for team code review.
Another thing I plan to turn off is the readahead call. Note that if the
directory lock is ping-ponging between two nodes with write activity, each
directory shared lock acquisition will involve page invalidation. So readahead
would be useless and may even harm performance.
Will continue to see what else I can squeeze out of this readdir code path.
All the changes will most likely be included behind a tunable flag .. maybe
named "fast path ls" or something.
We would like to close this out as "won't fix". We will put the development
effort into RHEL5 and work with upstream to push a "statlite" implementation
to alleviate this known cluster filesystem performance problem.
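Until something like statlite exists, the cheapest workaround on the client side is to avoid per-entry stat() where possible. A bare "ls" (or "ls -1") only needs getdents64(), while "ls -l" stats every entry, each of which can require a cross-node lock. A small illustration against a scratch directory:

```shell
# Listing names only avoids the per-file stat() that "ls -l" performs.
dir=$(mktemp -d)
for i in $(seq 1 20); do : > "$dir/f_$i"; done

names_only=$(ls -1 "$dir" | wc -l)   # getdents64() only, no stat() per file
echo "$names_only"
rm -rf "$dir"
```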