Description of problem: The customer is running a distributed application on GFS cluster. The processes running on different nodes write and read small portions of data from files located on GFS volumes. The general system statistics do not show any system bottleneck but the application experiencies significant latencies. Version-Release number of selected component (if applicable): * Kernel 2.4.21-32.0.1.ELhugemem * EMC powerpath modules * GFS-6.0.2.20-2 Additional info: 1. How does the customer measure application latency ? The application receives a stream of requests. It logs when the request was received and when the responce was send. The monitor process goes through the logs and sends alerts when the processing time is more then a threshold. 20050913131539083: Pkt Count: 357 Total Bytes: 52784 Avg delay: 0.347448 Min delay: 0.018990 Max delay: 2.169373 The max delay is about 100 times longer then average. Each packet is about 150 bytes. This is also a typical file record size. 2. Is bandwidth a problem too ? No. Gigabit interfaces are not busy. No packet drops. 3. A general description of the system workload would be also helpful. I would assume the workload consists of many small READs/WRITEs. However, are these read/write mostly with few files or large amount of small files ? A dozen of files in different directories on the same GFS file system. One process writes, another reads. I tried to do "ls" in some of these directories and in some of them I saw about 2 sec delay.
The customer is currently testing out Jon's latency patch. Waiting for result.
Created attachment 120130 [details] a short write-up about get_tune tunables. This is a write-up about the get_tune tunables (detailed background was in cluster-list sometime ago - searchng for either wcheng and/or dct). More specifically, we would like to tune the following: 1. demote_secs (default to 300 seconds) - let's start from 200. 2. reclaim_limit (default to 5000) - I think this is way too small with the amount of locks this cluster has. Let's try 1000000. Note these are all mount options. If proved to be useful, we can look into making them as run time tunables.
Created attachment 120141 [details] The patch to fix the negative lock count when cating lockspace proc file.
Status of this issue: This latency issue has been identified (mostly by Jon) as excessive lock caching. The machines operate nominally until the nightly backup occurs. The backup does a file-tree-walk from one node, causing it to acquire shared locks on all the files in the file systems. The locks are cached indefinitely, which ultimately requires more work by the other nodes in the cluster if they wish to perform file system operations. The experiments done internally show the locks are held by GFS inode during file open (lookup). Since inodes are cached by base kernel vfs layer, it hardly gets purged (unless there are memory pressure) that results GFS inode locks are hanging around indefiniately. This issue is further exaggerated by hugemem kernel (the customer has 16GB of memory) that has been particularly designed to reduce memory pressure. The tentative solution right now is offering a new GFS tunables (inode_purge_percent) via settune that would piggy back the logic into gfs_inode daemon that purges inode_purge_percent % of cached GFS inode each time the daemon is waked up. The default of inode_purge_percent to set to 0. The real mechanism behind this new inode purge function is actually forced purging of vfs layer's dcache entries created by the specified gfs mount point. We may require a new function exported from base kernel vfs layer. The prototype implementation is scheduled to get finished by Thursday (oct 27).
One thing of particular note here is that the inode locks are in the "unlocked" state. It is the iopen lock that is held in the shared state. This is why the customer has 2+M total locks with 1+M locks in the shared state. It is very likely that purging inodes will take care of the excessive lock count, and thus eliminate the latency. However, no-one has adequately explained _where_ the latency is coming from. Is it coming from the lock server because it has to handle such a large number of locks? I doubt it's coming from having to obtain a lock on the file; because that lock is in the unlocked state and should be instantly obtained. It could be happening if files are being removed, as this would force the 'iopen' lock to be revoked and any cache associated with it purged. Exactly what are these machines doing when they start up again in the morning and see the latency? Are they removing old files and creating new ones? Are they truncating the current files or appending to them? Does the latency go away over time? Answering these questions may lead to a solution that doesn't involve code changes.
Jon has touched on the thing I'm most curious about -- why isn't this a transient effect that goes away soon after the machines start running again? Maybe there's something elsewhere, like in gulm, that's effected by the backup process, e.g. gulm's list of locks grows huge causing all gulm requests to slow down.
1) The RPMs are in perf5.lab.boston.redhat.com:/usr/src/redhat/RPMS/i686/IPUT directory. TAM (vanhoof) will send the RPM set out (to the customer). We may have to walk them thru the rpm install. * kernel version: 2.4.21-37.EL.bz171043 * gfs version: 6.0.2.27-0.bz171043 2) There are two new kernel calls (also export them for GFS to call) - still looking for a way of NOT touching base kernel. So this may not be the final solution. 3) The new tunable is called inoded_purge, default to 0, set as: "gfs_tool settune <mount point> inoded_purge <percentage>" For example, to purge 100% of unused gfs inode from system, do: shell> gfs_tool settune /mnt/gfs inoded_purge 100 4) Didn't include Jon's rgrp.patch in this set of RPMs. Please test the above rpm out - if it proves to be favorable, will pull in Jon's patch as well. We need to do some problem isolation first.
Created attachment 120468 [details] base kernel patch (kernel_gfs_iput.patch)
Created attachment 120469 [details] gfs patch (gfs_iput.patch)
A short note on the test: shell> tar zxvf linux-2.4.30.tar.gz shell> find . -name "*.[cyh]' -print shell> watch cat /proc/gulm/lockspace Without the patch, the shared lock count would stay around 26701 forever. With the patch and inoded_purge set to 100, the shared lock will drop down to 38 around 5 minutes (due to various daemons clearning requirement). Again, nothing compared to umount which is fast, clean, and worry free.
A new set of GFS RPMs is in: perf5.lab.boston.redhat.com:/usr/src/redhat/RPMS/i686/IPUT.v3 directory. The RPMs should (still) run on top of 2.4.21-37.EL.bz171043hugemem kernel. It includes: 1. the negative lock count fix 2. the inode purge code Cidatel is currently running (tune via inoded_purge in seconds) 3. the latency reduction fix based on Jon's rgrp patch (tune via rgrp_try_threshold). We're still working on 1. Jon's new prototype patch with a goal to eliminate the need of a new base kernel. 2. Understand (and reduce) our latency limits based on various configurations. However, 10 ms would be a pretty difficult goal to achieve.
Fixing dependency on kernel bug 173280.
I just tried this out - it works fine on my cluster. Both the outputs of /proc/gulm/lockspace and "gulm_tool getstats <lock_server>:<lock_partition>" are roughly identical. Did the customer do "gfs_tool settune <mount point> inoded_purge <percentage>" on all the nodes within the cluster to trim the locks ?
It is probably a good idea to have a gfs command that can set the tunable recognized cluster-wise. Unfortunately, we don't have that yet. One way the customer can do at this moment is to download some open source tools such as parallel rsh to ensure the "settune" is set clusterwise. What we do in this (gfs) "settune inoded_purge" is to trim GFS inode count to the percentage set by this tunable *on the node* that dispatches this command. This will eventually free idle GFS locks. On the latency side, with Linux write-back design, it is really hard to ensure low latency since data is accumulated in the cache and then write out upon certain time intervals. When data is flushing (into the disk), latency jump is expected. One way to (sort of) quarantee the latency is using Direct IO (but it has restrictions such as buffer alignment) or mount with sync option (but it has nontrivial performance hits). We're looking into the issues further to see what we can do to improve it. No firm date and/or plan for this latency issue yet.
A further question about the workload.... I wonder what the mapping is between a query entering the system and the files it touches. Is it a case of query X always looks exactly at file Y or do some queries require multiple files to be looked at? If there is a 1:1 mapping, or close to it, one possibility might be to suggest that the application should try to ensure that queries which look at a certain file or set of files are always (assuming no node failure) processed by the same node. This would reduce lock (and cache) bounce between the nodes. Depending on what the application is, this might not be possible of course.
Steven, for a "query", I assume you mean "lookup" a file - that normally involves more than file itself. The directories above the file need to get involved too. I believe this particular customer knows about locality issues. This issue is about doing backup where all the file locks get accumulated into one single node.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0269.html