Bug 678102
| Summary: | dlm: increase default hash table sizes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | David Teigland <teigland> |
| Component: | kernel | Assignee: | David Teigland <teigland> |
| Status: | CLOSED ERRATA | QA Contact: | Boris Ranto <branto> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.2 | CC: | ajb2, bmr, branto, grimme, hlawatschek, kzhang, michael.hagmann, slords, swhiteho |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-156.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 707974 715603 719357 (view as bug list) | Environment: | |
| Last Closed: | 2011-12-06 12:43:50 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 707974, 719357 | | |
| Attachments: | patch (untested) to use vmalloc instead of kmalloc for DLM tables (attachment 486266) | | |
Description
David Teigland
2011-02-16 18:07:12 UTC
Did someone try larger values and find that they helped? What settings were they?

This is scooter's response: http://www.spinics.net/lists/cluster/msg19521.html

Very good, it sounds like he set all three hash tables to 1024. Currently, rsbtbl=256, lkbtbl=1024, dirtbl=512, so I'll create a patch to increase the rsbtbl and dirtbl defaults to 1024.

I would also suggest using vmalloc to allocate the hash tables rather than kmalloc; otherwise in RHEL 5 the default will also be almost the maximum size (RHEL 6 shouldn't have that limitation, since you can kmalloc larger amounts of memory there). Should the defaults be changed in RHEL 5 also?

I'm using 1024/4096/4096 but there's no apparent change in performance. (Perhaps I need to assign larger values - we currently have 5-7 million glocks in use on each box in a 3-node cluster, and 3+ million on the main box in a 2-node cluster set up as failover for mail.)

Alan, with that number of glocks, using larger values would be a good thing to try (when it is possible), but I'm not at all convinced that it will assist in resolving the unlink issue that we were just speaking about. We are currently setting up some tests to try to reproduce that issue as a separate line of investigation. Dave, I don't think there is any harm in having the same default values in RHEL 5; at least then it will be less confusing than having different values in different versions.

FWIW, gfs2_inode and dlm_lkb show similar (slightly smaller) numbers. Slabtop says (trimmed):

       OBJS  ACTIVE  USE OBJ SIZE   SLABS OBJ/SLAB CACHE SIZE NAME
    6090246 6090188  99%    0.41K  676694        9  2706776K gfs2_glock
    6089975 6089948  99%    0.78K 1217995        5  4871980K gfs2_inode
    5101122 5070883  99%    0.22K  300066       17  1200264K dlm_lkb
    3692934 3691089  99%    0.21K  205163       18   820652K dentry_cache
    1441560 1340946  93%    0.09K   36039       40   144156K buffer_head
    1276191 1017931  79%    0.52K  182313        7   729252K radix_tree_node
     875080  809098  92%    0.09K   21877       40    87508K gfs2_bufdata

I'll increase the numbers as discussed and see if it helps.

Result: the values for lkbtbl_size and dirtbl_size both max out at 4096. Attempting to increase beyond that on a running system gives "cannot allocate memory" when subsequent gfs2 mounts are tried. As previously discovered, rsbtbl_size maxes out at 1024.

Created attachment 486266 [details]
patch (untested) to use vmalloc instead of kmalloc for DLM tables
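
The attachment itself is not reproduced in this report. For readers following along, here is a minimal sketch of the allocation pattern under discussion: use kmalloc for small tables and fall back to vmalloc when the requested table no longer fits in a single kmalloc allocation. The bucket type and helper names below are assumptions made for illustration; they are not the contents of attachment 486266.

```c
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/string.h>

/*
 * Sketch only: 'struct table_bucket', table_alloc() and table_free()
 * are placeholder names, not the real DLM types or functions.
 */
struct table_bucket {
	struct list_head chain;
	rwlock_t lock;
};

/*
 * Allocate a zeroed hash table of 'size' buckets.  Small tables come
 * from kmalloc as before; tables too large for a single kmalloc fall
 * back to vmalloc, which is the point of the proof-of-concept patch.
 */
static struct table_bucket *table_alloc(unsigned int size)
{
	size_t bytes = (size_t)size * sizeof(struct table_bucket);
	struct table_bucket *tbl;

	if (bytes <= KMALLOC_MAX_SIZE)
		tbl = kmalloc(bytes, GFP_KERNEL);
	else
		tbl = vmalloc(bytes);
	if (tbl)
		memset(tbl, 0, bytes);
	return tbl;
}

static void table_free(struct table_bucket *tbl)
{
	/* is_vmalloc_addr() tells us which allocator produced the pointer */
	if (is_vmalloc_addr(tbl))
		vfree(tbl);
	else
		kfree(tbl);
}
```

Later mainline kernels wrapped this exact fallback up as kvmalloc()/kvfree(), but nothing equivalent existed in the 2.6.18/2.6.32 kernels discussed here, so a distribution patch would have to open-code it.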
Since RHEL 6.1 External Beta has begun and this bug remains unresolved, it has been rejected, as it was not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

In addition to increasing the hash table sizes, the VFS dentry and inode cache hard-limit calculations in the kernel need to be addressed. They only allow a maximum of 10% of memory to be allocated for dentry hashes, which doesn't scale to large-memory fileservers. I believe this is a hangover from sub-4GB memory days.

Dave, would you fork a bugzilla for RHEL 5, please?

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

I've tried Bryn's patch against 2.6.18-262 and was able to increase the hash sizes: rsbtbl_size = 4096, lkbtbl_size/dirtbl_size = 16384. Trying values larger than this didn't work - clvmd wouldn't start. Things are definitely faster (by a factor of at least 30) when under load (load = 5 concurrent incremental backups - these would run at 1-3 files/sec each and are now running at 30-100 files/sec each; the backups stat() every file in the filesystem). I suspect that larger hash values would help more.

Note: I'm pretty sure that using vmalloc won't pass muster with the kernel devs - there are notes indicating it's strongly discouraged in the kernel and modules. How about using page allocations?

You can check the default settings by loading the dlm module and verifying they are 1024:

    [root@bull-01 ~]# cat /sys/kernel/config/dlm/cluster/dirtbl_size
    1024
    [root@bull-01 ~]# cat /sys/kernel/config/dlm/cluster/lkbtbl_size
    1024
    [root@bull-01 ~]# cat /sys/kernel/config/dlm/cluster/rsbtbl_size
    1024

Are we thinking of taking the vmalloc patch as well? We have anecdotal reports from an EMEA gfs2 customer that increasing the hash table size beyond the kmalloc limit has a significant impact on performance for their use case. Should I open a separate bug for this?

The performance hit is no surprise - it's documented in vmalloc tutorials online, and vmalloc is limited to 1GB (by default on 64-bit systems) in any case. Kernel patches using vmalloc are strongly deprecated in favour of page allocations - any final distribution patch should use the latter or it won't be accepted upstream.

My understanding was that the vmalloc patch was just a quick proof-of-concept hack to see whether the idea worked in general for enlarging the hash sizes (which it did). In our case the slight performance hit incurred by using vmalloc was outweighed by the overall performance boost under load. Given that this is required to enlarge the hash tables beyond a trivial multiplier, I think it should remain within this BZ, or we'll just end up with two related BZs touching the same sections of code - and the confusion which comes with such things.

There is a one-to-one correspondence between patches and BZs in the RH process. This BZ has already been spent on increasing the defaults, so another BZ would need to be created to adopt other changes. For upstream, I think we should look at copying the hash table code from fs/ocfs2/dlm/. I suspect that may be too large a change for RHEL, so I wouldn't mind using vmalloc with the current hash tables in RHEL. (One thing to keep in mind is that the maximum number of hash buckets the lkb table will support is 2^16, because the bucket is kept in the top 16 bits of the lkid.)
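
To make that 2^16 ceiling concrete: if the bucket index is packed into the upper half of a 32-bit lock id, only 65,536 buckets can ever be addressed, however large the table is otherwise allowed to grow. The fragment below is only a schematic illustration of that packing, with hypothetical helper names; it is not a copy of the fs/dlm code.

```c
#include <stdint.h>

/*
 * Schematic: the hash bucket index lives in the top 16 bits of the
 * 32-bit lkid, so the lkb table can never usefully have more than
 * 2^16 = 65536 buckets.  make_lkid()/lkid_bucket() are hypothetical
 * names used for illustration, not fs/dlm API.
 */
static inline uint32_t make_lkid(uint16_t bucket, uint16_t seq)
{
	return ((uint32_t)bucket << 16) | seq;
}

static inline uint16_t lkid_bucket(uint32_t lkid)
{
	return (uint16_t)(lkid >> 16);	/* at most 65536 distinct buckets */
}
```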
Ok, fair enough. Let's fork a new BZ. Given the number of objects I'm seeing (2-10 million lkbs and glocks), are hash tables the way to go in future, though? Perhaps a tree would be better?

Yeah, it's probably worth checking whether another data structure would work better. The vmalloc approach was Steve's suggestion for RHEL 5 (which we'd discussed back in March), where we are more constrained in making changes than upstream. Note that this bug is for RHEL 6 - I've cloned it for RHEL 5 as bug 715603 and added the request for the vmalloc change there.

I'm not sure where you arrive at the idea of a 1GB vmalloc limit for 64-bit systems; this has never been the case on any arch that I am aware of. See mm/vmalloc.c for details (the definition of VMALLOC_SIZE, line 726 in current git). All architectures with BITS_PER_LONG > 32 default to a 128GB vmalloc window. Perhaps you are thinking of 32-bit x86, which is limited to 128MB due to the need to fit the 896MB physical-memory identity mapping and the vmalloc window into the top 1GB of memory reserved for the kernel in the standard address-space layout (the 4g4g aka hugemem patches alter this restriction, but they have never been merged upstream and are only supported up to RHEL 4).

Patch(es) available on kernel-2.6.32-156.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html