Red Hat Bugzilla – Bug 226736
The DLM's droplocks callback clobbers GFS2 performance
Last modified: 2007-11-30 17:11:56 EST
I've been a little worried for a while about the DLM's droplocks callback and its effect on GFS2's caching of locks & inodes. While doing some recent testing (on what is, I'll admit, a rather unusual setup) I realised that it was causing a big performance penalty.
My setup was as follows: one node running gfs2 and dlm, with one filesystem mounted, running postmark (set number 100000, set transactions 100000). On my test machine, which has 4G of RAM, the dlm exceeded the lock limit and started to call droplocks. After a one-line bug fix to this code (it was trying to drop locks which were still attached to transactions) I noticed that performance was very poor compared with having no limit on locks (tested by commenting out the droplocks callback in the gfs2 code).
Now I could raise the limit so that these callbacks don't happen, but I don't
think that really gets to the root of the problem.
If I comment out the callback, then for my single node set up, all works fine.
The reason is that RAM is a finite resource, and the amount of RAM determines the maximum number of locks it's reasonable to cache. As RAM fills up, the VM system pushes inodes out of cache and as a result the glocks get demoted, so there is already a feedback mechanism in place that works perfectly well in the single node case.
Obviously in the cluster case the same mechanism isn't going to work, as the lock manager will often be mastering locks for other nodes. If we assume for a moment that the locks are roughly evenly distributed across the cluster, then problems are only likely to occur when machines have differing amounts of RAM available (which may mean differing amounts of physical RAM, or just that one node is virtually idle whilst another has its RAM more or less full of page cache pages), or where the distribution of locks across the cluster has become uneven (I think that's then a failure of the hashing).
It seems to me that one solution to this problem would therefore be to migrate locks away from nodes which have "too many" locks on them, by moving the locks which are held by other nodes to the nodes in question. The problem in this case is in defining "too many": that probably needs either input from the VM, or some way for a node to determine whether it's holding an excessive number of locks compared to the other nodes it's in contact with (probably the better option).
I discussed this idea briefly with Patrick and he suggested that it would be possible, but that it would take a fair amount of work.
The above might not be the right answer, but I'd like to kick off some
discussion on this point as I'd like to have a solution which keeps a reasonable
balance of lock numbers without needing any user adjustments if at all possible.
There are probably a couple of more basic things we could do to
minimize the problem:
First is the way the dlm implements the resource directory, which is very inefficient. Back when I was seeing the memory outages during recovery, they would often occur during the initial rebuilding of the resource directory. The directory is a hash table of structs (name/nodeid pairs) that's completely independent of the rsb hash table, which already contains much of the same info. Integrating the directory into the rsb hash table (which is the way the VMS docs imply it's done) has been on my todo list forever.
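The idea above could be sketched roughly as follows: a single hash table whose entries are either full local rsbs or directory-only stubs recording which node masters a name, so one lookup serves both purposes. All names, sizes and fields here are illustrative, not the real dlm structures.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE    1024
#define RSB_NAME_MAX 64

/* One entry serves both roles: directory record (who masters this
 * name) and, when is_local_rsb is set, the local rsb itself. */
struct rsb_entry {
    struct rsb_entry *next;      /* hash chain */
    uint32_t master_nodeid;      /* directory info: node mastering this name */
    int is_local_rsb;            /* 1: full local rsb, 0: directory-only stub */
    unsigned int namelen;
    char name[RSB_NAME_MAX];
};

static struct rsb_entry *table[HASH_SIZE];

static unsigned int hash_name(const char *name, unsigned int len)
{
    unsigned int h = 5381;
    while (len--)
        h = h * 33 + (unsigned char)*name++;
    return h % HASH_SIZE;
}

/* One lookup answers both "do we have this rsb locally?" and
 * "which node masters this name?", instead of two parallel tables. */
struct rsb_entry *lookup(const char *name, unsigned int len)
{
    struct rsb_entry *e;

    for (e = table[hash_name(name, len)]; e; e = e->next)
        if (e->namelen == len && !memcmp(e->name, name, len))
            return e;
    return NULL;
}

struct rsb_entry *insert(const char *name, unsigned int len,
                         uint32_t master, int local)
{
    unsigned int b = hash_name(name, len);
    struct rsb_entry *e;

    if (len > RSB_NAME_MAX)
        return NULL;
    e = calloc(1, sizeof(*e));
    if (!e)
        return NULL;
    memcpy(e->name, name, len);
    e->namelen = len;
    e->master_nodeid = master;
    e->is_local_rsb = local;
    e->next = table[b];
    table[b] = e;
    return e;
}
```

The point of the sketch is simply that the name/nodeid pair lives in one place, so the directory stops duplicating what the rsb table already stores.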
Second is probably minor, but I've also wanted to have an amount of space in the rsb itself for the name, such that most names will fit inside and not require a separate malloc just for the name, i.e. copy how struct dentry does it.
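A minimal sketch of that inline-name pattern, copying the struct dentry approach (its small d_iname buffer): short names live inside the rsb and only long ones fall back to a separate allocation. The struct layout and the 32-byte threshold are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define RSB_INLINE_NAME 32

struct rsb {
    char *name;                  /* points at iname, or at a malloc'd buffer */
    unsigned int namelen;
    char iname[RSB_INLINE_NAME]; /* avoids a second malloc for most names */
};

struct rsb *rsb_create(const char *name, unsigned int len)
{
    struct rsb *r = calloc(1, sizeof(*r));

    if (!r)
        return NULL;
    if (len <= RSB_INLINE_NAME) {
        r->name = r->iname;      /* common case: no extra allocation */
    } else {
        r->name = malloc(len);   /* rare case: name too long for iname */
        if (!r->name) {
            free(r);
            return NULL;
        }
    }
    memcpy(r->name, name, len);
    r->namelen = len;
    return r;
}

void rsb_free(struct rsb *r)
{
    if (r->name != r->iname)     /* only free the out-of-line case */
        free(r->name);
    free(r);
}
```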
Third, during remastering, we could have a node be more aggressive about becoming the new master of resources if it had more local memory available, and less aggressive (or refuse outright) if it was low on memory.
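One way to express that third idea is a willingness score weighted by free memory, with a floor below which a node refuses to take on new masters. The 5% floor and the function name are assumptions for the sketch, not anything the dlm actually implements.

```c
/* Returns a score used during remastering: higher means more willing
 * to become the new master of a resource. A node below a low-memory
 * floor refuses outright (score 0). Thresholds are illustrative. */
static unsigned int mastering_score(unsigned long free_kb,
                                    unsigned long total_kb)
{
    unsigned long pct = total_kb ? free_kb * 100 / total_kb : 0;

    if (pct < 5)        /* nearly OOM: refuse to master anything new */
        return 0;
    return pct;         /* otherwise, more free memory = more willing */
}
```

During recovery each node would offer its score, and resources would preferentially land on the highest scorer, avoiding the node that is about to run out of memory.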
The first two things would improve the dlm's memory usage in general and also reduce the likelihood of problems during recovery. The third would be a way to avoid a real OOM condition when the situation is ripe for it.
Migrating or remastering locks during normal operation is something we may also want to do sometime; that also sounds a bit like what you're describing above.
Yes, that's pretty much what I'm suggesting. Really it's an algorithm to decide when locks should be remastered (or in fact two possible algorithms, as I can't decide which is better at the moment), which look like this in pseudocode:
Algorithm 1: remaster locks when there is memory pressure (can we get the VM to tell us?) and select the locks to be remastered on the basis of:
1. Locks which are held by a node other than the current node
2. Locks which have been locked for a long time and don't change state very much
Locks would be "pushed" from the loaded DLM towards the node where the lock is currently being held.
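The selection rule above could be sketched as a priority function: locally held locks are never pushed (their destination would be ourselves), remotely held locks are candidates, and long-held, rarely-changing remote locks are the best candidates. The struct fields and thresholds are invented for illustration.

```c
#include <stdint.h>

struct lkb {
    uint32_t holder_nodeid;   /* node currently holding the lock */
    uint64_t held_for_secs;   /* how long it has been held */
    uint64_t state_changes;   /* state transitions seen on this lock */
};

#define LONG_HELD_SECS 300    /* made-up thresholds */
#define QUIET_CHANGES  4

/* Under memory pressure: 0 = don't remaster, higher = better candidate
 * for pushing mastery towards holder_nodeid. */
int remaster_priority(const struct lkb *l, uint32_t our_nodeid)
{
    if (l->holder_nodeid == our_nodeid)
        return 0;   /* locally held: nowhere sensible to push it */
    if (l->held_for_secs > LONG_HELD_SECS &&
        l->state_changes < QUIET_CHANGES)
        return 2;   /* criterion 2: long-held and quiet as well */
    return 1;       /* criterion 1: remote holder */
}
```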
Algorithm 2: assume that each time one DLM sends a message to another DLM, it includes in the message the total number of locks it's currently mastering. This allows any particular DLM instance to know the approximate load on its neighbours. It's then possible to create an algorithm along the lines of:
Migrate locks when the number of local locks is more than X% above the average number of locks mastered per node; the selection would be the same as in algorithm 1. X could be chosen so as to produce a fairly even distribution over time without causing too much migration.
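The trigger condition for algorithm 2 is just integer arithmetic on the piggy-backed counts; a sketch, where X and the parameter names are assumptions:

```c
#define MIGRATE_PCT 25  /* X: slack above the average before migrating */

/* counts[] holds the mastered-lock totals learned from peer messages,
 * including our own entry. Returns 1 if we should start migrating. */
int should_migrate(const unsigned long *counts, int nr_nodes,
                   unsigned long local_count)
{
    unsigned long sum = 0, avg;
    int i;

    if (nr_nodes <= 0)
        return 0;
    for (i = 0; i < nr_nodes; i++)
        sum += counts[i];
    avg = sum / nr_nodes;

    /* local_count > avg * (1 + X/100), kept in integer maths */
    return local_count * 100 > avg * (100 + MIGRATE_PCT);
}
```

Because the counts arrive as a side effect of normal message traffic, no extra polling protocol is needed; the estimate is stale but should be good enough for load balancing.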
The big question is whether it's worth pushing locally held locks to other nodes when the other nodes are lightly loaded and the local node is under pressure. I'm not sure this makes any sense unless we are certain that mastering a lock takes a lot more memory than is used to keep track of its local state. I don't know whether that's true or not, and I dare say you have a better idea than I do. In other words we'd need to compare the memory usage on the lock-holding node in the case where it mastered the lock and the case where it was mastered remotely, and see how much difference there was.
In addition, since the latency of locking operations will increase once the lock is remote, a certain amount of the potential advantage of migrating it away from the local node would be lost if activity on the lock exceeded a certain threshold.
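That trade-off can be written down as a toy cost model: migrating mastery away only pays while the lock's local operation rate, multiplied by the added per-operation latency, stays under some budget. All the numbers and names are assumptions, not measurements.

```c
/* ops_per_sec: rate of local operations on the lock
 * extra_us:    added latency per operation once the master is remote
 * budget_us:   extra latency per second we are willing to pay in
 *              exchange for the memory freed on the local node
 * Returns 1 if migrating the lock away still looks worthwhile. */
int migration_worthwhile(unsigned long ops_per_sec,
                         unsigned long extra_us,
                         unsigned long budget_us)
{
    return ops_per_sec * extra_us <= budget_us;
}
```

For example, a quiet lock touched 10 times a second with a 200us remote penalty costs 2ms of added latency per second, while a busy lock touched 100 times a second costs 20ms, which is where migration stops paying off.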
So I guess in summary I'm not that keen on trying to remaster locks which have a local holder at all. Which of algorithms 1 and 2 above is better is left as an exercise for the reader :-)
Something else I noticed recently is that the two lock_dlm threads spend a lot of time waiting for I/O. It might be an idea to push the drop-locks work into the existing code which scans locks, so as not to hold up other lock requests behind the drop-locks request.
There is also another reason that the lock_dlm threads wait for I/O but I'll
save that for bz 221152.
You shouldn't be seeing drop-locks callbacks regularly, it should only
be a rare condition when gfs gets that callback. Drop-locks was never
intended to be a part of the normal operation of the fs, it was intended
to be an "emergency" button that the dlm could press when in danger of
running out of resources and crashing. If you're getting drop-locks
callbacks during your tests, then you should probably disable them
altogether by echoing 0 into the /sys/fs/gfs/... file. I've also
mentioned before that we might disable drop-locks by default.
I'd be happy to take a patch to disable the drop-locks stuff by default. With postmark it seems to cause roughly a 10x reduction in performance once it triggers. With drop locks off, postmark runs at a pretty similar speed to lock_nolock (this is a single node dlm setup).
or just echo 0 >> /sys/fs/gfs2/locking/drop_count
An alternative to setting it to 0 by default is to set it high
enough that most people/workloads don't hit it. That's what I'd
hoped to do recently when I increased the default, but maybe I
didn't increase it enough. It doesn't matter to me which we do.
Fixed in upstream by making 0 the default value of drop_count.