Bug 226736

Summary:          The DLM's droplocks callback clobbers GFS2 performance
Product:          Fedora
Component:        dlm-kernel
Version:          rawhide
Hardware:         All
OS:               Linux
Status:           CLOSED CURRENTRELEASE
Severity:         medium
Priority:         medium
Reporter:         Steve Whitehouse <swhiteho>
Assignee:         David Teigland <teigland>
CC:               ccaulfie
Fixed In Version: 2.6.20-1.2944
Doc Type:         Bug Fix
Last Closed:      2007-04-11 20:20:07 UTC

Description Steve Whitehouse 2007-02-01 11:23:48 UTC
I've been a little worried for a while about the DLM's droplocks callback and
its effect on GFS2's caching of locks & inodes. While doing some testing
recently (on what is, I'll admit, a rather unusual setup) I realised that it
was causing a big performance penalty.

My setup was as follows: one node running gfs2 and dlm, with one filesystem
mounted and running postmark (set number 100000, set transactions 100000). On
my test machine, which has 4G of RAM, the dlm exceeded the lock limit and
started to call droplocks. After a one-line bug fix to this code (it was
trying to drop locks which were still attached to transactions) I noticed that
the performance was very poor compared with when there was no limit on locks
(tested by commenting out the droplocks callback in the gfs2 code).

Now I could raise the limit so that these callbacks don't happen, but I don't
think that really gets to the root of the problem.

If I comment out the callback, then for my single-node setup, everything works
fine. The reason is that RAM is a finite resource and the amount of RAM
determines the maximum number of locks that it's reasonable to cache. As RAM
fills up, the VM pushes inodes out of cache and as a result the glocks get
demoted, so there is already a feedback mechanism in place that works
perfectly well in the single-node case.
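
To make that feedback loop concrete, here is a rough sketch in plain C (the
type and function names are made up for illustration; this is not the actual
gfs2 glock code):

struct glock_sketch {
        int holders;            /* active holders on this node */
        int demote_pending;     /* set once the VM has evicted the inode */
        struct glock_sketch *next;
};

/* Hypothetical hook: called when the inode attached to a lock is evicted. */
static void glock_inode_evicted(struct glock_sketch *gl)
{
        gl->demote_pending = 1;
}

/* Periodic scan: demote any cached lock whose inode is gone and which has
 * no remaining holders.  This is the feedback that keeps the number of
 * cached locks proportional to RAM in the single-node case. */
static int demote_unused_glocks(struct glock_sketch *list)
{
        int demoted = 0;
        struct glock_sketch *gl;

        for (gl = list; gl; gl = gl->next) {
                if (gl->demote_pending && gl->holders == 0)
                        demoted++;      /* real code would drop the DLM lock here */
        }
        return demoted;
}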

Obviously in the cluster case the same mechanism isn't going to work, as the
lock manager will often be mastering locks for other nodes. If we assume for a
moment that the locks are roughly evenly distributed across the cluster, then
problems are only likely to occur when there are machines with differing
amounts of RAM available (which may mean differing amounts of physical RAM, or
just that one node is virtually idle whilst another has its RAM more or less
full of page cache pages) or where the distribution of locks across the
cluster has become uneven (I think that's then a failure of the hashing).

It seems to me that one solution to this problem would therefore be to migrate
locks away from nodes which have "too many" locks on them, by moving locks
that are held by other nodes onto the nodes that hold them. The problem in
this case is in defining "too many", and that probably needs either an input
from the VM or some way for a node to determine whether it's holding an
excessive number of locks compared to the other nodes with which it's in
contact (probably the better method overall).

I discussed this idea briefly with Patrick and he suggested that it would be
possible, but that it would take a fair amount of work.

The above might not be the right answer, but I'd like to kick off some
discussion on this point as I'd like to have a solution which keeps a reasonable
balance of lock numbers without needing any user adjustments if at all possible.

Comment 1 David Teigland 2007-02-01 15:15:05 UTC
There are probably a couple of more basic things we could do to
minimize the problem:

First is the way the dlm implements the resource directory, which is
very inefficient.  Back when I was seeing the memory outages during
recovery, they would often occur during the initial rebuilding of the
resource directory.  The directory is a hash table of structs
(name/nodeid pairs) that's completely independent of the rsb hash
table, which already contains much of the same info.  It has been on my
todo list forever to integrate the directory into the rsb hash table
(which is the way the VMS docs imply it's done).
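
Roughly what I have in mind, as a made-up sketch (not the real dlm structures
or field names): the rsb entry itself would record what the directory
currently stores for that name, so a single hash lookup serves both purposes.

struct rsb_dir_sketch {
        struct rsb_dir_sketch *hash_next;  /* one hash table for both roles */
        int master_nodeid;                 /* current master of the resource */
        int dir_nodeid;                    /* what the separate directory holds today */
        unsigned int name_len;
        char *name;                        /* resource name */
};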

Second is probably minor, but I've also wanted to have an amount
of space in the rsb itself for the name such that most names will
fit inside and not require a separate malloc for just the name,
i.e. copy how struct dentry does it.
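
Again just a sketch of the dentry-style trick (made-up names, and the inline
size is a guess):

#include <stdlib.h>
#include <string.h>

#define RSB_INLINE_NAME_LEN 32  /* assumed size; would be tuned to typical names */

struct rsb_name_sketch {
        unsigned int name_len;
        char *name;             /* points at inline_name or at a malloc'd buffer */
        char inline_name[RSB_INLINE_NAME_LEN];
};

static int rsb_set_name(struct rsb_name_sketch *r, const char *name,
                        unsigned int len)
{
        if (len <= RSB_INLINE_NAME_LEN) {
                r->name = r->inline_name;  /* common case: no separate malloc */
        } else {
                r->name = malloc(len);     /* rare case: long resource name */
                if (!r->name)
                        return -1;
        }
        memcpy(r->name, name, len);
        r->name_len = len;
        return 0;
}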

Third, during remastering, we could have a node be more aggressive about
becoming the new master of resources if it had more local memory
available, and less aggressive (or refuse) to become the new master of
a resource if it was low on memory.

The first two things would improve the dlm's memory usage in general
and also reduce the likelihood of problems during recovery.  The third
would be a way of avoiding a real OOM condition when the situation is
ripe for it.
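
For the third idea, something along these lines (purely illustrative; the
thresholds and names are invented, and the memory figures would come from
wherever the dlm can best get them):

enum master_willingness {
        MASTER_REFUSE,          /* low on memory: don't take on new resources */
        MASTER_RELUCTANT,       /* take them only if nobody else will */
        MASTER_EAGER,           /* plenty of memory: volunteer */
};

static enum master_willingness remaster_willingness(unsigned long free_kb,
                                                    unsigned long total_kb)
{
        if (free_kb < total_kb / 10)
                return MASTER_REFUSE;
        if (free_kb < total_kb / 4)
                return MASTER_RELUCTANT;
        return MASTER_EAGER;
}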

Migrating or remastering locks during normal operation is something
we may also want to do sometime; that also sounds a bit like what
you're describing.


Comment 2 Steve Whitehouse 2007-02-01 16:44:04 UTC
Yes, that's pretty much what I'm suggesting. Really it's an algorithm to decide
when locks should be remastered (or in fact two possible algorithms, as I can't
decide which of them is better at the moment) which look like this in pseudo code:

Algo 1.
-------

Remaster locks when there is memory pressure (can we get the VM to tell us?) and
select the locks to be remastered on the basis of:
 1. Locks which are held by a node other than the current node
 2. Locks which have been locked for a long time and don't change state very much
Locks would be "pushed" from the loaded DLM towards the node where the lock is
currently being held (see the selection sketch below).
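
Something like this for the selection test (a plain C sketch; the types,
fields and thresholds are all invented for illustration):

struct mastered_lock_sketch {
        int holder_nodeid;      /* node that currently holds the lock */
        long held_seconds;      /* how long it has been granted */
        long recent_converts;   /* state changes seen in the last scan interval */
};

static int should_push_to_holder(const struct mastered_lock_sketch *lk,
                                 int this_nodeid)
{
        if (lk->holder_nodeid == this_nodeid)
                return 0;                       /* criterion 1: remote holders only */
        return lk->held_seconds > 300 &&        /* criterion 2: long-lived ... */
               lk->recent_converts < 2;         /* ... and mostly idle */
}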

Algo 2.
--------

Assume that each time one DLM sends a message to another DLM it includes with
the message the total number of locks it's currently mastering. This allows
any particular DLM instance to know the approximate load on its neighbours.
It's then possible to create an algorithm along the lines of:

Migrate locks when the number of locally mastered locks is more than X% above
the average number of locks mastered per node; the selection would be the same
as in Algo 1 (see the trigger sketch below). The X in this case could be chosen
so as to produce a fairly even distribution over time, but without causing too
much migration.
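
The trigger test might look something like this (again an illustrative sketch
with invented names; the peer counts would come from the piggybacked
messages):

static int over_migration_threshold(unsigned long local_count,
                                    const unsigned long *peer_counts,
                                    int num_peers,
                                    unsigned int x_percent)
{
        unsigned long total = local_count;
        unsigned long avg;
        int i;

        for (i = 0; i < num_peers; i++)
                total += peer_counts[i];

        avg = total / (num_peers + 1);
        /* Migrate when local_count > average * (100 + X) / 100. */
        return local_count * 100 > avg * (100 + x_percent);
}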



The big question is whether it's worth pushing locally held locks to other nodes
when the other nodes are lightly loaded and the local node is under pressure.
I'm not sure this makes any sense unless we are certain that mastering a lock
takes a lot more memory than is used to keep track of the local state of the
lock. I don't know whether that's true or not, and I dare say you have a better
idea than I do. In other words, we'd need to compare the memory usage on the
lock-holding node in the case where it mastered the lock and the case where it
was mastered remotely, and see how much difference there was.

In addition, since the latency of the locking operations will increase when the
lock is remote, it would seem that a certain amount of the potential advantage
of migrating a lock away from the local node would be lost if activity on the
lock were above a certain threshold.

So I guess in summary I'm not that keen on trying to remaster locks which have a
local holder at all. Which of Algo 1 or 2 above is better is left as an
exercise for the reader :-)




Comment 3 Steve Whitehouse 2007-02-28 14:37:23 UTC
Something else I noticed recently is that the two lock_dlm threads spend a lot
of time waiting for I/O. It might be an idea to try to push the drop-locks work
into the existing code which scans locks, so that other lock requests aren't
held up behind the drop-locks request.

There is also another reason that the lock_dlm threads wait for I/O but I'll
save that for bz 221152.

Comment 4 David Teigland 2007-02-28 14:49:47 UTC
You shouldn't be seeing drop-locks callbacks regularly; it should only
be a rare condition when gfs gets that callback.  Drop-locks was never
intended to be a part of the normal operation of the fs; it was intended
to be an "emergency" button that the dlm could press when in danger of
running out of resources and crashing.  If you're getting drop-locks
callbacks during your tests, then you should probably disable them
altogether by echoing 0 into the /sys/fs/gfs/... file.  I've also
mentioned before that we might disable drop-locks by default.


Comment 5 Steve Whitehouse 2007-02-28 15:15:59 UTC
I'd be happy to take a patch to disable the drop-locks stuff by default. With
postmark it seems to cause roughly a 10x reduction in performance once it
triggers. With drop-locks off, postmark runs at pretty much the same speed as
it does with lock_nolock (this is a single-node dlm setup).



Comment 6 David Teigland 2007-02-28 15:25:49 UTC
or just echo 0 >> /sys/fs/gfs2/locking/drop_count

An alternative to setting it to 0 by default is to set it high
enough that most people/workloads don't hit it.  That's what I'd
hoped to do recently when I increased the default, but maybe I
didn't increase it enough.  It doesn't matter to me which we do.


Comment 7 Steve Whitehouse 2007-04-11 20:20:07 UTC
Fixed upstream by making 0 the default value of drop_count.