Bug 171043 - Critical custom application experiences internal latencies maybe ralated to GFS
Summary: Critical custom application experiences internal latencies maybe ralated to GFS
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 3
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Wendy Cheng
QA Contact: GFS Bugs
Depends On: 173280
TreeView+ depends on / blocked
Reported: 2005-10-17 17:08 UTC by Wendy Cheng
Modified: 2010-01-12 03:08 UTC (History)
7 users (show)

Clone Of:
Last Closed: 2006-03-27 18:12:02 UTC

Attachments (Terms of Use)
a short write-up about get_tune tunables. (5.03 KB, text/plain)
2005-10-18 18:08 UTC, Wendy Cheng
no flags Details
The patch to fix the negative lock count when cating lockspace proc file. (745 bytes, patch)
2005-10-18 20:15 UTC, Wendy Cheng
no flags Details | Diff
base kernel patch (kernel_gfs_iput.patch) (1.99 KB, patch)
2005-10-27 15:39 UTC, Wendy Cheng
no flags Details | Diff
gfs patch (gfs_iput.patch) (2.98 KB, patch)
2005-10-27 15:40 UTC, Wendy Cheng
no flags Details | Diff

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2006:0269 normal SHIPPED_LIVE GFS bug fix update 2006-03-27 05:00:00 UTC

Description Wendy Cheng 2005-10-17 17:08:39 UTC
Description of problem:
The customer is running a distributed application on GFS cluster. The processes
running on different nodes write and read small portions of data from files
located on GFS volumes. The general system statistics do not show any system
bottleneck but the application experiencies significant latencies. 

Version-Release number of selected component (if applicable):
* Kernel 2.4.21-32.0.1.ELhugemem 
* EMC powerpath modules
* GFS-

Additional info:
1. How does the customer measure application latency ? 

The application receives a stream of requests. It logs when the request was
received and when the responce was send. The monitor process goes through the
logs and sends alerts when the processing time is more then a threshold.

20050913131539083: Pkt Count:    357 Total Bytes:     52784 Avg delay: 0.347448
Min delay: 0.018990 Max delay: 2.169373

The max delay is about 100 times longer then average. Each packet is about 150
bytes. This is also a typical file record size.

2. Is bandwidth a problem too ?

No. Gigabit interfaces are not busy. No packet drops.

3. A general description of the system workload would be also helpful. I would
assume the workload consists of many small READs/WRITEs. However, are these
read/write mostly with few files or large amount of small files ?

A dozen of files in different directories on the same GFS file system. One 
process writes, another reads. I tried to do "ls" in some of these directories
and in some of them I saw about 2 sec delay.

Comment 1 Wendy Cheng 2005-10-17 17:12:39 UTC
The customer is currently testing out Jon's latency patch. Waiting for result. 

Comment 7 Wendy Cheng 2005-10-18 18:08:14 UTC
Created attachment 120130 [details]
a short write-up about get_tune tunables. 

This is a write-up about the get_tune tunables (detailed background was in
cluster-list sometime ago - searchng for either wcheng and/or dct). More
specifically, we would like to tune the following:

1. demote_secs (default to 300 seconds) - let's start from 200. 
2. reclaim_limit (default to 5000) - I think this is way too small with the
amount of locks this cluster has. Let's try 1000000.

Note these are all mount options. If proved to be useful, we can look into
making them as run time tunables.

Comment 8 Wendy Cheng 2005-10-18 20:15:53 UTC
Created attachment 120141 [details]
The patch to fix the negative lock count when cating lockspace proc file.

Comment 9 Wendy Cheng 2005-10-25 15:02:56 UTC
Status of this issue:

This latency issue has been identified (mostly by Jon) as excessive lock
caching. The machines operate nominally until the nightly backup occurs. The
backup does a file-tree-walk from one node, causing it to acquire shared locks
on all the files in the file systems. The locks are cached indefinitely, which
ultimately requires more work by the other nodes in the cluster if they wish to
perform file system operations. 

The experiments done internally show the locks are held by GFS inode during file
open (lookup). Since inodes are cached by base kernel vfs layer, it hardly gets
purged (unless there are memory pressure) that results GFS inode locks are
hanging around indefiniately. This issue is further exaggerated by hugemem
kernel (the customer has 16GB of memory) that has been particularly designed to
reduce memory pressure. 

The tentative solution right now is offering a new GFS tunables
(inode_purge_percent) via settune that would piggy back the logic into gfs_inode
daemon that purges inode_purge_percent % of cached GFS inode each time the
daemon is waked up. The default of inode_purge_percent to set to 0.

The real mechanism behind this new inode purge function is actually forced
purging of vfs layer's dcache entries created by the specified gfs mount point.
We may require a new function exported from base kernel vfs layer. 

The prototype implementation is scheduled to get finished by Thursday (oct 27).

Comment 10 Jonathan Earl Brassow 2005-10-25 15:37:46 UTC
One thing of particular note here is that the inode locks are in the "unlocked" state.  It is the iopen lock 
that is held in the shared state.  This is why the customer has 2+M total locks with 1+M locks in the 
shared state.

It is very likely that purging inodes will take care of the excessive lock count, and thus eliminate the 
latency.  However, no-one has adequately explained _where_ the latency is coming from.  Is it coming 
from the lock server because it has to handle such a large number of locks?  I doubt it's coming from 
having to obtain a lock on the file; because that lock is in the unlocked state and should be instantly 
obtained.  It could be happening if files are being removed, as this would force the 'iopen' lock to be 
revoked and any cache associated with it purged.

Exactly what are these machines doing when they start up again in the morning and see the latency?  
Are they removing old files and creating new ones?  Are they truncating the current files or appending 
to them?  Does the latency go away over time?  Answering these questions may lead to a solution that 
doesn't involve code changes.

Comment 11 David Teigland 2005-10-25 15:52:23 UTC
Jon has touched on the thing I'm most curious about -- why isn't
this a transient effect that goes away soon after the machines
start running again?  Maybe there's something elsewhere, like
in gulm, that's effected by the backup process, e.g. gulm's
list of locks grows huge causing all gulm requests to slow down.

Comment 12 Wendy Cheng 2005-10-27 15:31:49 UTC
1) The RPMs are in perf5.lab.boston.redhat.com:/usr/src/redhat/RPMS/i686/IPUT
directory. TAM (vanhoof) will send the RPM set out (to the customer). We may
have to walk them thru the rpm install.

 * kernel version: 2.4.21-37.EL.bz171043 
 * gfs version:

2) There are two new kernel calls (also export them for GFS to call) - still
looking for a way of NOT touching base kernel. So this may not be the final

3) The new tunable is called inoded_purge, default to 0, set as:

 "gfs_tool settune <mount point> inoded_purge <percentage>"

For example, to purge 100% of unused gfs inode from system, do:

 shell> gfs_tool settune /mnt/gfs inoded_purge 100   

4) Didn't include Jon's rgrp.patch in this set of RPMs. Please test the above
rpm out - if it proves to be favorable, will pull in Jon's patch as well. We
need to do some problem isolation first.

Comment 13 Wendy Cheng 2005-10-27 15:39:18 UTC
Created attachment 120468 [details]
base kernel patch (kernel_gfs_iput.patch)

Comment 14 Wendy Cheng 2005-10-27 15:40:27 UTC
Created attachment 120469 [details]
gfs patch (gfs_iput.patch)

Comment 15 Wendy Cheng 2005-10-27 15:48:03 UTC
A short note on the test:

shell> tar zxvf linux-2.4.30.tar.gz
shell> find . -name "*.[cyh]' -print
shell> watch cat /proc/gulm/lockspace

Without the patch, the shared lock count would stay around 26701 forever.
With the patch and inoded_purge set to 100, the shared lock will drop down to 38
around 5 minutes (due to various daemons clearning requirement).

Again, nothing compared to umount which is fast, clean, and worry free.

Comment 16 Wendy Cheng 2005-11-07 15:49:37 UTC
A new set of GFS RPMs is in:
perf5.lab.boston.redhat.com:/usr/src/redhat/RPMS/i686/IPUT.v3 directory. The
RPMs should (still) run on top of 2.4.21-37.EL.bz171043hugemem kernel. It includes: 

1. the negative lock count fix
2. the inode purge code Cidatel is currently running (tune via inoded_purge in
3. the latency reduction fix based on Jon's rgrp patch (tune via

We're still working on 
1. Jon's new prototype patch with a goal to eliminate the need of a new base kernel.
2. Understand (and reduce) our latency limits based on various configurations.

However, 10 ms would be a pretty difficult goal to achieve. 

Comment 22 Ernie Petrides 2005-11-15 22:07:59 UTC
Fixing dependency on kernel bug 173280.

Comment 23 Wendy Cheng 2005-11-16 16:49:58 UTC
I just tried this out - it works fine on my cluster. Both the outputs of
/proc/gulm/lockspace and "gulm_tool getstats <lock_server>:<lock_partition>" are
roughly identical. Did the customer do 

  "gfs_tool settune <mount point> inoded_purge <percentage>" 

on all the nodes within the cluster to trim the locks ? 

Comment 24 Wendy Cheng 2005-11-16 18:51:38 UTC
It is probably a good idea to have a gfs command that can set the tunable
recognized cluster-wise. Unfortunately, we don't have that yet. One way the
customer can do at this moment is to download some open source tools such as
parallel rsh to ensure the "settune" is set clusterwise. What we do in this
(gfs) "settune inoded_purge" is to trim GFS inode count to the percentage set by
this tunable *on the node* that dispatches this command. This will eventually
free idle GFS locks.  

On the latency side, with Linux write-back design, it is really hard to ensure
low latency since data is accumulated in the cache and then write out upon
certain time intervals. When data is flushing (into the disk), latency jump is
expected. One way to (sort of) quarantee the latency is using Direct IO (but it
has restrictions such as buffer alignment) or mount with sync option (but it has
nontrivial performance hits). We're looking into the issues further to see what
we can do to improve it. No firm date and/or plan for this latency issue yet.

Comment 25 Steve Whitehouse 2005-12-01 17:29:36 UTC
A further question about the workload.... I wonder what the mapping is between
a query entering the system and the files it touches. Is it a case of query X
always looks exactly at file Y or do some queries require multiple files to
be looked at?

If there is a 1:1 mapping, or close to it, one possibility might be to suggest
that the application should try to ensure that queries which look at a certain
file or set of files are always (assuming no node failure) processed by the same
node. This would reduce lock (and cache) bounce between the nodes. Depending on
what the application is, this might not be possible of course.

Comment 26 Wendy Cheng 2005-12-02 20:00:41 UTC
Steven, for a "query", I assume you mean "lookup" a file - that normally
involves more than file itself. The directories above the file need to get
involved too.

I believe this particular customer knows about locality issues. This issue is
about doing backup where all the file locks get accumulated into one single node. 

Comment 31 Red Hat Bugzilla 2006-03-27 18:12:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.