Bug 185618

Summary: gfs 6.1 performance issue (directory lock contention?)
Product: [Retired] Red Hat Cluster Suite
Reporter: Issue Tracker <tao>
Component: gfs
Assignee: Wendy Cheng <nobody+wcheng>
Status: CLOSED WONTFIX
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium
Version: 4
CC: djoo, mkearey, rkenna, rohara, tao
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-09-18 16:02:37 UTC
Attachments:
  GFS module oprofile data
  GFS customer oprofile data

Description Issue Tracker 2006-03-16 13:31:10 UTC
Escalated to Bugzilla from IssueTracker

Comment 5 Jeff Layton 2006-03-16 13:44:47 UTC
Going ahead and escalating this to engineering, though I'm not sure what can
really be done with such a pessimal GFS case...

They apparently have this benchmark that operates on a directory with around
10000 files. When they run this benchmark and then do an 'ls' in that directory,
they get long delays before the ls returns.

The problem seems to be slow getdents64() calls, and my guess is that the
problem is ultimately contention for the directory lock.
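
One way to confirm that the time really goes into getdents64() rather than
per-file stat() calls is to trace just that syscall during an unsorted,
no-stat listing (the mount point below is only an example):

# -T prints the time spent in each syscall; -e limits the trace to getdents64.
# "ls -f" skips sorting and stat()ing, so getdents64 dominates the run.
strace -T -e trace=getdents64 ls -f /mnt/gfs/testdir > /dev/null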

I've been able to reproduce something like what they are seeing on a 2-node GFS
cluster. On one machine I run the following shell loop in a directory on a GFS
filesystem:

# while true; do for ((i=1;i<=10000;i++)); do rm -f file_$i; touch file_$i; done; done

and then on the other machine do an 'ls' in that directory. Timing it gives
numbers roughly like these:

real    0m32.235s
user    0m0.307s
sys     0m1.656s

Again, I'm not sure what we can do to tune for this, other than telling them
"don't do that" and recommending that they architect things to reduce contention
for the directory lock (use more directories and spread the files across them).
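
For illustration, a minimal sketch of that workaround (the 16-way split and the
paths are just examples):

# Pre-create 16 bucket directories, then spread the files across them
# instead of putting all 10000 into one flat directory.
mkdir -p /mnt/gfs/data/dir_{0..15}
for ((i=1; i<=10000; i++)); do
    touch /mnt/gfs/data/dir_$((i % 16))/file_$i
done

An 'ls' in any one bucket then contends with only a fraction of the writers.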

They also complained that removing files from the directory during this test
takes a very long time, but my guess is that it is due to the same problem
(directory lock contention), so anything we do to help the first issue will
probably help the second.


Comment 6 Kiersten (Kerri) Anderson 2006-03-16 15:16:36 UTC
We will take a look at it, but this sounds very much like the postmark benchmark
performance issue.  Doing stats of the filesystem requires accessing every
resource group in the filesystem to collect the information. This can take a
long time.

This is a design issue with GFS that we are addressing in GFS2.  If this is for
a mail server solution, a hierarchical directory structure to avoid directory
contention and mounting with noatime are two options that have been implemented
by other customers with some success.
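
For reference, the noatime suggestion is a mount option; the device and mount
point below are placeholders:

# One-off mount with atime updates disabled:
mount -t gfs -o noatime /dev/vg0/lv_gfs /mnt/gfs

# or persistently via /etc/fstab:
# /dev/vg0/lv_gfs  /mnt/gfs  gfs  noatime  0 0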


Comment 7 Kiersten (Kerri) Anderson 2006-05-16 21:49:03 UTC
Moving off the fix list for U4. This one may not be addressable in the RHEL4
version of gfs.

Comment 29 Wendy Cheng 2006-11-03 18:06:29 UTC
Need to open another bugzilla to address the particular customer issue as
described in comment #12. The following is a short description of *this*
bugzilla that hopefully can help people understand the issue better:

The problem here is about *much* more than lock contention - it runs into
several design and architecture limitations. We have been hoping GFS2 will
address them. For GFS1, it is better to educate people about the ramifications
so they can find the proper workaround for their particular setup. I'm trying
to do something about this, but the work is not a short-term project.

It is also important to point out that these problems are not unique to GFS.
They are issues that may well challenge other cluster filesystems too, with
varying degrees of severity and different symptoms.

Users (and support engineers) must understand that GFS is a journaling
filesystem. That is, we have to ensure filesystem consistency without
requiring fsck whenever possible. This implies that each meta-data change
(transaction) is logged into the journal and that there are rules about the
"sequence" of these changes. Say, for example, two files are created on the
same (SMP) node at the same time. Since their meta-data can be written to the
journal interleaved, syncing one file to disk normally requires syncing the
other as well. The performance hit grows as more files have their meta-data
interleaved with each other in the journal.
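
As a rough illustration of that interleaving (the paths are made up; the point
is the journal ordering, not the shell commands themselves):

# Two writers creating files concurrently on the same node: their meta-data
# transactions end up interleaved in the journal, so flushing one stream's
# meta-data tends to drag the other stream's meta-data out with it.
( for i in $(seq 1 1000); do touch /mnt/gfs/dir/a_$i; done ) &
( for i in $(seq 1 1000); do touch /mnt/gfs/dir/b_$i; done ) &
wait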

At the same time, GFS is also a cluster filesystem. That is, it must guarantee
cache coherency between different nodes. When an "ls" command (a "read" that
requires a shared lock) is issued, and the previous lock holder did some form
of "write" (under an exclusive lock), the data needs to be flushed to disk
before the shared lock is granted. This ensures every node sees a consistent
view of the data across the cluster.

Now, if you have heavy write activity on *lots of* files within one single
directory across the cluster, then when an "ls" is issued, all of those files
have to be "sync"ed to disk before the directory's shared lock can be granted.
Think about the performance hit that generates - lock contention, disk IO, VM
memory pressure, plus GFS's and the DLM's own overheads; few things impact a
system's performance more heavily than this type of operation.
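
To make the sequence concrete, this is essentially the two-node reproducer from
comment #5, annotated with what each side is doing to the directory lock (run
each half on a different node, in the same GFS directory):

# On the writer node: constant create/delete keeps dirtying the directory and
# its files under an exclusive lock.
while true; do for ((i=1; i<=10000; i++)); do rm -f file_$i; touch file_$i; done; done

# On another node: the shared-lock request behind this "ls" has to wait for
# the writer's dirty data to be synced, which is where the long wall-clock
# time in comment #5 comes from.
time ls > /dev/null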

As we have repeatedly said in the past, separate the directories and cut down
the file count within each directory if at all possible. Structure your
application and configuration to allow "parallel" processing. GFS allows
concurrent access to the filesystem across the cluster - however, that doesn't
mean it can be used arbitrarily without understanding "parallel" principles.

I do understand people's need for "load-balancing". However, you must 
understand the overheads and internals before load-balancing your workload. 

Comment 30 Wendy Cheng 2006-11-03 19:49:01 UTC
Actually I should have said "locality", instead of parallel processing. 
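
A minimal sketch of a locality-friendly layout (hostnames and paths are only
examples): each node writes under its own subdirectory, so directory locks
rarely have to bounce between nodes.

# Give each cluster node its own subdirectory on the shared filesystem and
# have it write only there; the directory locks then stay mostly local.
nodedir=/mnt/gfs/spool/$(hostname -s)
mkdir -p "$nodedir"
echo "job data" > "$nodedir/job_$$.dat"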

Comment 32 Wendy Cheng 2007-02-28 20:48:15 UTC
While working on another bugzilla, I happened to catch a thread back trace
and realized GFS pre-fetches two glocks for *every* file within a directory
in its readdir code path. It looks to me like an optimization that assumed
that whenever an "ls" was issued, accesses to the files would follow. However,
on a directory with thousands of files, as this bugzilla describes, that can
be overkill.

Will remove that pre-fetch logic to see how much we can improve this issue.

Comment 33 Wendy Cheng 2007-02-28 23:09:51 UTC
OK, I was wrong - it doesn't grab the file locks but the directory lock.
So it refreshes the directory lock for each disk read - that makes sense.
False alarm!

Comment 34 Wendy Cheng 2007-02-28 23:39:58 UTC
However, GFS does invalidate *all* the pages of the directory file each and
every time the directory lock is refreshed. So there may be some room for
improvement there.


Comment 35 Wendy Cheng 2007-03-01 15:02:48 UTC
I was wrong about "I was wrong" in comment #33. It is prefetching the file
locks (not the directory lock). Unfortunately, we don't see much improvement
with the test case mentioned in comment #1 (since the files are created via
"touch" and have zero length). However, in the general case, turning off this
file lock prefetching should help. Will describe the issue in detail before
sending it out for team code review.

Another thing I plan to turn off is the readahead call. Note that if the
directory lock is ping-ponging between two nodes with write activity, each
directory shared lock acquisition will involve page invalidation, so readahead
would be useless and could even harm performance.

Will continue to see what else I can squeeze out of this readdir code path.
All the changes will most likely be put behind a tunable flag, maybe named
"fast path ls" or something similar.
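
If it does land behind a tunable, enabling it would presumably look like the
existing settune interface; the tunable name below is purely a made-up
placeholder, since none has been chosen yet.

# Hypothetical: toggle the (not yet named) readdir fast-path tunable on a
# mounted GFS filesystem. "readdir_fastpath" is not an existing tunable.
gfs_tool settune /mnt/gfs readdir_fastpath 1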
   

Comment 38 Wendy Cheng 2007-09-18 16:02:37 UTC
We would like to close this out as "won't fix". Will put the development
effort into RHEL5 and work with upstream to push a "statlite" implementation
to alleviate this known cluster filesystem performance problem.