Bug 226736
| Summary: | The DLM's droplocks callback clobbers GFS2 performance | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Steve Whitehouse <swhiteho> |
| Component: | dlm-kernel | Assignee: | David Teigland <teigland> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | rawhide | CC: | ccaulfie |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | 2.6.20-1.2944 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-04-11 20:20:07 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Steve Whitehouse, 2007-02-01 11:23:48 UTC)
There are probably a couple more basic things we could do to minimize the problem.

First is the way the DLM implements the resource directory, which is very inefficient. Back when I was seeing the memory outages during recovery, they would often occur during the initial rebuilding of the resource directory. The directory is a hash table of structs (name/nodeid pairs) that's completely independent of the rsb hash table, which already contains much of the same info. On my todo list forever has been to integrate the directory into the rsb hash table (which is the way the VMS docs imply it's done).

Second is probably minor, but I've also wanted to have an amount of space in the rsb itself for the name, such that most names will fit inside and not require a separate malloc just for the name, i.e. copy how struct dentry is done.

Third, during remastering, we could have a node be more aggressive about becoming the new master of resources if it had more local memory available, and less aggressive (or refuse) to become the new master of a resource if it was low on memory.

The first two things would improve the DLM's memory usage in general and also reduce the likelihood of problems during recovery. The third would be a way to avoid a real OOM condition when the situation is ripe for it. Migrating or remastering locks during normal operation is something we may also want to do sometime; that also sounds a bit like what you're describing.

Yes, that's pretty much what I'm suggesting. Really it's an algorithm to decide when locks should be remastered (or in fact, two possible algorithms, as I can't decide which of them is better at the moment), which look like this in pseudocode:

Algo 1.
-------
Remaster locks when there is memory pressure (can we get the VM to tell us?) and select the locks to be remastered on the basis of:

1. Locks which are held by a node other than the current node
2.
Locks which have been locked for a long time and don't change state very much.

Locks would be "pushed" from the loaded DLM towards the node where the lock was currently being held.

Algo 2.
-------
Assume that each time one DLM sends a message to another DLM, it includes with the message the total number of locks it is currently mastering. This allows any particular DLM instance to know the approximate load on its neighbours. It's then possible to create an algorithm along the lines of: migrate locks when the number of local locks is more than X% above the average number of locks held on a node; the selection would be the same as in Algo 1. The X in this case could be chosen so as to produce a fairly even distribution over time, but without causing too much migration.

The big question is whether it's worth pushing locally held locks to other nodes when the other nodes are lightly loaded and the local node is under pressure. I'm not sure this makes any sense unless we are certain that mastering a lock takes a lot more memory than keeping track of the local state of the lock. I don't know whether that's true or not, and I dare say you have a better idea than I do. In other words, we'd need to compare the memory usage on the lock-holding node in the case where it mastered the lock and the case where it was mastered remotely, and see how much difference there was.

In addition, since the latency of the locking operations will increase with the lock being remote, a certain amount of the potential advantage of migrating away from a local node would be lost if activity on the lock was greater than a certain threshold. So I guess in summary I'm not that keen on trying to remaster locks which have a local holder at all. As to which of algo 1 or 2 above is better, that is left as an exercise for the reader :-)

Something else I noticed recently is that the two lock_dlm threads spend a lot of time waiting for I/O.
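The two heuristics above can be sketched as follows. This is an illustrative model only, not DLM code: the data shapes, field names (`holder`, `master`, `held_secs`, `state_changes`), the piggybacked per-node lock counts, and the concrete thresholds are all assumptions made for the sketch.

```python
def should_migrate(local_count, peer_counts, threshold_pct=25):
    """Algo 2: migrate locks when this node masters more than
    threshold_pct percent above the cluster-wide average.

    peer_counts holds the lock totals piggybacked on recent DLM
    messages from the other nodes (an assumption of this sketch)."""
    counts = [local_count] + list(peer_counts)
    average = sum(counts) / len(counts)
    return local_count > average * (1 + threshold_pct / 100)

def pick_candidates(locks):
    """Select locks to push, per the criteria in Algo 1: prefer locks
    held by a node other than the master, held for a long time, and
    rarely changing state.  Each lock is a dict; the keys and the
    60s / 5-transition cutoffs are assumptions."""
    return [l for l in locks
            if l["holder"] != l["master"]      # held remotely
            and l["held_secs"] > 60            # long-lived
            and l["state_changes"] < 5]        # mostly idle
```

The same `pick_candidates` selection serves both algorithms; only the trigger differs (memory pressure in Algo 1, the load imbalance test in Algo 2).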
It might be an idea to try to push the drop-locks stuff into the existing code which scans locks, in order not to hold up other lock requests behind the drop-locks request. There is also another reason that the lock_dlm threads wait for I/O, but I'll save that for bz 221152.

You shouldn't be seeing drop-locks callbacks regularly; it should only be a rare condition when gfs gets that callback. Drop-locks was never intended to be part of the normal operation of the fs; it was intended to be an "emergency" button that the dlm could press when in danger of running out of resources and crashing. If you're getting drop-locks callbacks during your tests, then you should probably disable them altogether by echoing 0 into the /sys/fs/gfs/... file. I've also mentioned before that we might disable drop-locks by default.

I'd be happy to take a patch to disable the drop-locks stuff by default. With postmark it seems to cause an approximately 10x reduction in performance once it triggers. With drop-locks off, postmark runs at a pretty similar speed to lock_nolock (this is a single-node dlm setup).

Or just echo 0 >> /sys/fs/gfs2/locking/drop_count

An alternative to setting it to 0 by default is to set it high enough that most people/workloads don't hit it. That's what I'd hoped to do recently when I increased the default, but maybe I didn't increase it enough. It doesn't matter to me which we do.

Fixed in upstream by making 0 the default value of drop_count.
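The drop_count behaviour discussed above amounts to a simple threshold check: the callback fires only when drop_count is non-zero and the number of cached locks exceeds it, so setting it to 0 (the upstream fix's new default) disables the callback entirely. A minimal sketch of that trigger, with assumed names:

```python
def should_send_drop_locks(lock_count, drop_count):
    """Model of the drop-locks trigger (names are assumptions, not
    kernel identifiers): the DLM asks the fs to shed cached locks
    only when drop_count is non-zero and the number of held locks
    exceeds it.  drop_count == 0 disables the callback, which is
    what the fix makes the default."""
    return drop_count != 0 and lock_count > drop_count
```

This also shows why raising the default is only a partial fix: any workload that eventually caches more locks than the threshold still hits the roughly 10x postmark slowdown, whereas 0 rules it out.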