Description of problem:
The customer is running a distributed application on a GFS cluster. The processes
running on different nodes write and read small portions of data from files
located on GFS volumes. The general system statistics do not show any system
bottleneck, but the application experiences significant latencies.
Version-Release number of selected component (if applicable):
* Kernel 2.4.21-32.0.1.ELhugemem
* EMC powerpath modules
1. How does the customer measure application latency ?
The application receives a stream of requests. It logs when the request was
received and when the response was sent. The monitor process goes through the
logs and sends alerts when the processing time is more than a threshold.
20050913131539083: Pkt Count: 357 Total Bytes: 52784 Avg delay: 0.347448
Min delay: 0.018990 Max delay: 2.169373
The max delay is about 100 times longer than the minimum (and roughly 6 times
the average). Each packet is about 150 bytes. This is also a typical file
record size.
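To make the spread in the sample line concrete, the ratios can be computed directly from it. This is only a sketch: the regular expression assumes every monitor line has exactly the field layout of the one sample quoted above, and the function names are made up for illustration. The thread does not state the unit of the delay fields, so none is assumed here.

```python
import re

# The sample monitor line from this report, joined onto one line.
LINE = ("20050913131539083: Pkt Count: 357 Total Bytes: 52784 "
        "Avg delay: 0.347448 Min delay: 0.018990 Max delay: 2.169373")

# Assumed layout, inferred from the single sample above.
PATTERN = re.compile(
    r"(?P<ts>\d+): Pkt Count: (?P<pkts>\d+) Total Bytes: (?P<bytes>\d+) "
    r"Avg delay: (?P<avg>[\d.]+) Min delay: (?P<min>[\d.]+) "
    r"Max delay: (?P<max>[\d.]+)"
)


def parse_stats(line):
    """Extract the numeric fields from one monitor line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("unrecognized monitor line")
    return {
        "pkts": int(m["pkts"]),
        "bytes": int(m["bytes"]),
        "avg": float(m["avg"]),
        "min": float(m["min"]),
        "max": float(m["max"]),
    }


def summarize(stats):
    """Derive the average packet size and the delay ratios."""
    return {
        "bytes_per_pkt": stats["bytes"] / stats["pkts"],
        "max_over_avg": stats["max"] / stats["avg"],
        "max_over_min": stats["max"] / stats["min"],
    }


if __name__ == "__main__":
    for name, value in summarize(parse_stats(LINE)).items():
        print(f"{name}: {value:.2f}")
```

The same two functions could be run over a whole log to see how often the maximum delay departs from the average, which may help when comparing runs before and after a patch.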
2. Is bandwidth a problem too ?
No. Gigabit interfaces are not busy. No packet drops.
3. A general description of the system workload would also be helpful. I would
assume the workload consists of many small READs/WRITEs. However, do these
reads/writes mostly go to a few files or to a large number of small files ?
A dozen files in different directories on the same GFS file system. One
process writes, another reads. I tried to do "ls" in some of these directories
and in some of them I saw about a 2 sec delay.
The customer is currently testing out Jon's latency patch. Waiting for result.
Created attachment 120130
a short write-up about get_tune tunables.
This is a write-up about the get_tune tunables (detailed background was posted
to cluster-list some time ago - search for wcheng and/or dct). More
specifically, we would like to tune the following:
1. demote_secs (defaults to 300 seconds) - let's start from 200.
2. reclaim_limit (defaults to 5000) - I think this is way too small for the
number of locks this cluster has. Let's try 1000000.
Note these are all mount options. If they prove to be useful, we can look into
making them run-time tunables.
Created attachment 120141
The patch to fix the negative lock count when cat'ing the lockspace proc file.
Status of this issue:
This latency issue has been identified (mostly by Jon) as excessive lock
caching. The machines operate nominally until the nightly backup occurs. The
backup does a file-tree-walk from one node, causing it to acquire shared locks
on all the files in the file systems. The locks are cached indefinitely, which
ultimately requires more work by the other nodes in the cluster if they wish to
perform file system operations.
The experiments done internally show that the locks are held by the GFS inode
during file open (lookup). Since inodes are cached by the base kernel vfs layer,
they hardly ever get purged (unless there is memory pressure), so the GFS inode
locks hang around indefinitely. This issue is further exacerbated by the hugemem
kernel (the customer has 16GB of memory), which has been specifically designed
to reduce memory pressure.
The tentative solution right now is to offer a new GFS tunable
(inode_purge_percent) via settune that piggybacks on the gfs_inode
daemon, purging inode_purge_percent % of the cached GFS inodes each time the
daemon wakes up. The default value of inode_purge_percent is 0.
The real mechanism behind this new inode purge function is actually forced
purging of vfs layer's dcache entries created by the specified gfs mount point.
We may require a new function exported from base kernel vfs layer.
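As a rough illustration of the percentage-based purge described above, the following toy model shows how the cached inode count would shrink over successive daemon wake-ups. It is not the GFS daemon code: the function names, the integer rounding, and the assumption that no new inodes are cached between wake-ups are all made up for the sketch.

```python
def purge_count(cached, percent):
    """Number of cached inodes dropped on one wake-up (rounded down)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return cached * percent // 100


def simulate(cached, percent, wakeups):
    """Cached-inode count after each wake-up, assuming nothing new is cached."""
    history = []
    for _ in range(wakeups):
        cached -= purge_count(cached, percent)
        history.append(cached)
    return history


if __name__ == "__main__":
    # With the default of 0 nothing is ever purged.
    print(simulate(1000000, 0, 3))
    # At 50 percent the count halves on every wake-up.
    print(simulate(1000000, 50, 3))
```

In this model a setting of 0 leaves the cache untouched, which matches the stated default, while any non-zero setting shrinks the count geometrically until new lookups repopulate it.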
The prototype implementation is scheduled to be finished by Thursday (Oct 27).
One thing of particular note here is that the inode locks are in the "unlocked" state. It is the iopen lock
that is held in the shared state. This is why the customer has 2+M total locks with 1+M locks in the
It is very likely that purging inodes will take care of the excessive lock count, and thus eliminate the
latency. However, no-one has adequately explained _where_ the latency is coming from. Is it coming
from the lock server because it has to handle such a large number of locks? I doubt it's coming from
having to obtain a lock on the file; because that lock is in the unlocked state and should be instantly
obtained. It could be happening if files are being removed, as this would force the 'iopen' lock to be
revoked and any cache associated with it purged.
Exactly what are these machines doing when they start up again in the morning and see the latency?
Are they removing old files and creating new ones? Are they truncating the current files or appending
to them? Does the latency go away over time? Answering these questions may lead to a solution that
doesn't involve code changes.
Jon has touched on the thing I'm most curious about -- why isn't
this a transient effect that goes away soon after the machines
start running again? Maybe there's something elsewhere, like
in gulm, that's affected by the backup process, e.g. gulm's
list of locks grows huge causing all gulm requests to slow down.
1) The RPMs are in the perf5.lab.boston.redhat.com:/usr/src/redhat/RPMS/i686/IPUT
directory. TAM (vanhoof) will send the RPM set out (to the customer). We may
have to walk them through the rpm install.
* kernel version: 2.4.21-37.EL.bz171043
* gfs version: 126.96.36.199-0.bz171043
2) There are two new kernel calls (also export them for GFS to call) - still
looking for a way of NOT touching base kernel. So this may not be the final
3) The new tunable is called inoded_purge, defaults to 0, and is set as:
"gfs_tool settune <mount point> inoded_purge <percentage>"
For example, to purge 100% of the unused GFS inodes from the system, do:
shell> gfs_tool settune /mnt/gfs inoded_purge 100
4) Didn't include Jon's rgrp.patch in this set of RPMs. Please test the above
rpm out - if it proves to be favorable, will pull in Jon's patch as well. We
need to do some problem isolation first.
Created attachment 120468
base kernel patch (kernel_gfs_iput.patch)
Created attachment 120469
gfs patch (gfs_iput.patch)
A short note on the test:
shell> tar zxvf linux-2.4.30.tar.gz
shell> find . -name "*.[cyh]" -print
shell> watch cat /proc/gulm/lockspace
Without the patch, the shared lock count would stay around 26701 forever.
With the patch and inoded_purge set to 100, the shared lock count drops to 38
in around 5 minutes (due to the various daemons' cleanup requirements).
Again, nothing compared to umount, which is fast, clean, and worry free.
A new set of GFS RPMs is in the
perf5.lab.boston.redhat.com:/usr/src/redhat/RPMS/i686/IPUT.v3 directory. The
RPMs should (still) run on top of the 2.4.21-37.EL.bz171043hugemem kernel. The set includes:
1. the negative lock count fix
2. the inode purge code Cidatel is currently running (tune via inoded_purge in
3. the latency reduction fix based on Jon's rgrp patch (tune via
We're still working on:
1. Jon's new prototype patch, with the goal of eliminating the need for a new base kernel.
2. Understanding (and reducing) our latency limits based on various configurations.
However, 10 ms would be a pretty difficult goal to achieve.
Fixing dependency on kernel bug 173280.
I just tried this out - it works fine on my cluster. Both the outputs of
/proc/gulm/lockspace and "gulm_tool getstats <lock_server>:<lock_partition>" are
roughly identical. Did the customer do
"gfs_tool settune <mount point> inoded_purge <percentage>"
on all the nodes within the cluster to trim the locks ?
It is probably a good idea to have a gfs command that can set the tunable
cluster-wide. Unfortunately, we don't have that yet. One thing the
customer can do at this moment is to download an open source tool such as
parallel rsh to ensure the "settune" is applied cluster-wide. What the
(gfs) "settune inoded_purge" does is trim the GFS inode count by the percentage
set by this tunable *on the node* that dispatches this command. This will
eventually free idle GFS locks.
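Until a cluster-wide command exists, the per-node settune could be scripted. The sketch below only builds the commands by default and runs them when dry_run is turned off. It assumes password-less ssh to each node (the suggestion above was parallel rsh; ssh is swapped in here), and the node names in the example are hypothetical. Only the gfs_tool settune syntax is taken from this report.

```python
import shlex
import subprocess


def settune_command(node, mount_point, percent):
    """Build the ssh command that sets inoded_purge on one node."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    remote = "gfs_tool settune {} inoded_purge {}".format(
        shlex.quote(mount_point), percent)
    return ["ssh", node, remote]


def apply_to_cluster(nodes, mount_point, percent, dry_run=True):
    """Return the per-node commands; execute them unless dry_run is set."""
    commands = [settune_command(n, mount_point, percent) for n in nodes]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first node where the command fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical node names; replace with the real cluster members.
    for cmd in apply_to_cluster(["node1", "node2", "node3"], "/mnt/gfs", 100):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to confirm that every node in the cluster is covered before anything is changed.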
On the latency side, with the Linux write-back design, it is really hard to
ensure low latency since data is accumulated in the cache and then written out
at certain time intervals. When data is being flushed (to disk), a latency jump
is expected. One way to (sort of) guarantee the latency is to use Direct IO (but
it has restrictions such as buffer alignment) or to mount with the sync option
(but it has a nontrivial performance hit). We're looking into the issues further
to see what we can do to improve it. No firm date and/or plan for this latency
issue yet.
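One way to get a feel for the cost of synchronous writes is to time them per record. The sketch below opens the file with O_SYNC, which is a per-file analogue of the sync mount option rather than the mount option itself, and it does not cover Direct IO or its alignment rules. The function name and the 150-byte record are illustrative assumptions (the record size echoes the typical size mentioned earlier in this report).

```python
import os
import time


def timed_sync_append(path, record):
    """Append one record with O_SYNC and return the write time in seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        start = time.monotonic()
        # With O_SYNC the call returns only after the data reaches storage.
        written = os.write(fd, record)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    if written != len(record):
        raise OSError("short write")
    return elapsed


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "records.dat")
        delays = [timed_sync_append(target, b"x" * 150) for _ in range(10)]
        print("max delay: %.6f s" % max(delays))
```

Running the same loop without O_SYNC on the same file system would show the trade-off: lower typical write times, but with occasional jumps when the cache is flushed.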
A further question about the workload.... I wonder what the mapping is between
a query entering the system and the files it touches. Is it a case of query X
always looks exactly at file Y or do some queries require multiple files to
be looked at?
If there is a 1:1 mapping, or close to it, one possibility might be to suggest
that the application should try to ensure that queries which look at a certain
file or set of files are always (assuming no node failure) processed by the same
node. This would reduce lock (and cache) bounce between the nodes. Depending on
what the application is, this might not be possible of course.
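If the mapping really is close to 1:1, the routing idea above could be as simple as hashing the file path to pick a node. This is a minimal sketch under that assumption. The function name, node names, and file path are hypothetical, and nothing here reflects how the customer's application actually dispatches requests.

```python
import hashlib


def node_for_file(path, nodes):
    """Pick the same node for the same file path on every call."""
    if not nodes:
        raise ValueError("at least one node is required")
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    # Hypothetical node names and file path.
    cluster = ["node1", "node2", "node3"]
    print(node_for_file("/mnt/gfs/dir_a/records.dat", cluster))
```

A plain modulo mapping reshuffles most files when the node list changes, so a node failure would still cause a burst of lock movement; a consistent-hashing scheme would limit that, at the cost of more code.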
Steven, for a "query", I assume you mean "lookup" of a file - that normally
involves more than the file itself. The directories above the file need to get
I believe this particular customer knows about locality issues. This issue is
about doing a backup, where all the file locks accumulate on one single node.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.