Bug 678102 - dlm: increase default hash table sizes
Summary: dlm: increase default hash table sizes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Boris Ranto
URL:
Whiteboard:
Depends On:
Blocks: 707974 719357
 
Reported: 2011-02-16 18:07 UTC by David Teigland
Modified: 2011-12-06 12:43 UTC
CC List: 9 users

Fixed In Version: kernel-2.6.32-156.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 707974 715603 719357
Environment:
Last Closed: 2011-12-06 12:43:50 UTC
Target Upstream Version:
Embargoed:


Attachments
patch (untested) to use vmalloc instead of kmalloc for DLM tables (2.42 KB, patch)
2011-03-18 15:18 UTC, Bryn M. Reeves


Links
System ID: Red Hat Product Errata RHSA-2011:1530
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update
Last Updated: 2011-12-06 01:45:35 UTC

Description David Teigland 2011-02-16 18:07:12 UTC
Description of problem:

Increasing the hash table sizes may improve dlm/gfs performance when there are many locks being held (gfs files being used).

The default dlm hash table sizes have not been raised in many years, but the total number of locks has grown: the larger amounts of system RAM in recent years allow many more gfs inodes to be cached.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 3 David Teigland 2011-03-03 15:18:24 UTC
Did someone try larger values and find that they helped?  What settings did they use?

Comment 4 Steve Whitehouse 2011-03-03 15:24:21 UTC
This is scooter's response:

http://www.spinics.net/lists/cluster/msg19521.html

Comment 5 David Teigland 2011-03-03 15:36:41 UTC
Very good, it sounds like he set all three hash tables to 1024.
Currently, rsbtbl=256, lkbtbl=1024, dirtbl=512, so I'll create a
patch to increase rsbtbl and dirtbl defaults to 1024.
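
(For illustration only: the change David describes is a bump of the compile-time defaults. A minimal sketch of what that looks like, assuming the defaults are kept in macros of this general shape; the macro names below are illustrative and not taken from the eventual patch.)

/* Illustrative defaults only -- the real values are defined in the
 * dlm configuration code and are exposed via configfs (see comment 21). */
#define DLM_DEFAULT_RSBTBL_SIZE  1024   /* was 256 */
#define DLM_DEFAULT_LKBTBL_SIZE  1024   /* unchanged */
#define DLM_DEFAULT_DIRTBL_SIZE  1024   /* was 512 */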

Comment 6 Steve Whitehouse 2011-03-03 15:42:19 UTC
I would also suggest using vmalloc to allocate the hash tables, rather than kmalloc; otherwise in rhel5 the default will also be almost the maximum size (rhel6 shouldn't have this limitation, since it can kmalloc larger amounts of memory).
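
(For illustration only: a minimal sketch of the kmalloc-with-vmalloc-fallback idea Steve is suggesting, assuming a kernel new enough to provide is_vmalloc_addr(); the function names are hypothetical and this is not the attached patch.)

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

/* Try the physically contiguous allocator first; fall back to vmalloc
 * when the requested table is larger than kmalloc can satisfy. */
static void *dlm_alloc_table(size_t nbuckets, size_t bucket_size)
{
	size_t bytes = nbuckets * bucket_size;
	void *table = kmalloc(bytes, GFP_KERNEL | __GFP_NOWARN);

	if (!table)
		table = vmalloc(bytes);
	return table;
}

static void dlm_free_table(void *table)
{
	if (is_vmalloc_addr(table))
		vfree(table);
	else
		kfree(table);
}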

Comment 7 David Teigland 2011-03-03 15:48:03 UTC
Should the defaults be changed in RHEL5 also?

Comment 8 Alan Brown 2011-03-03 15:53:01 UTC
I'm using 1024/4096/4096 but there's no apparent change in performance.

(Perhaps I need to assign larger values - we currently have 5-7 million glocks
in use on each box in a 3-node cluster, and 3+ million on the main box in a
2-node cluster set up as failover for mail.)

Comment 9 Steve Whitehouse 2011-03-03 15:59:51 UTC
Alan, with that number of glocks, using larger values would be a good thing to try (when it is possible) but I'm not at all convinced that it will assist in resolving the unlink issue that we were just speaking about. We are currently setting up some tests to try and reproduce that issue as a separate line of investigation.

Dave, I don't think there is any harm in having the same default values in rhel5; at least then we avoid the confusion of having different values in different versions.

Comment 10 Alan Brown 2011-03-03 16:08:07 UTC
FWIW gfs2_inodes and dlm_lkb show similar (slightly smaller) numbers.

Slabtop says (trimmed)

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
6090246 6090188  99%    0.41K 676694        9   2706776K gfs2_glock
6089975 6089948  99%    0.78K 1217995        5   4871980K gfs2_inode
5101122 5070883  99%    0.22K 300066       17   1200264K dlm_lkb
3692934 3691089  99%    0.21K 205163       18    820652K dentry_cache
1441560 1340946  93%    0.09K  36039       40    144156K buffer_head
1276191 1017931  79%    0.52K 182313        7    729252K radix_tree_node
875080 809098  92%    0.09K  21877       40     87508K gfs2_bufdata


I'll increase the numbers as discussed and see if it helps.

Comment 11 Alan Brown 2011-03-03 16:24:18 UTC
Result: the values for lkbtbl_size and dirtbl_size both max out at 4096.

Attempting to increase beyond that on a running system gives "cannot allocate memory" when subsequent gfs2 mounts are tried.

As previously discovered, rsbtbl_size maxes out at 1024.

Comment 12 Bryn M. Reeves 2011-03-18 15:18:42 UTC
Created attachment 486266 [details]
patch (untested) to use vmalloc instead of kmalloc for DLM tables

Comment 13 RHEL Program Management 2011-04-04 02:44:28 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as an
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 15 Alan Brown 2011-05-18 18:12:26 UTC
In addition to increasing the hash table sizes, the vfs dentry and inode cache hard limit calculations in the kernel need to be addressed.

They only allow a maximum of 10% of memory to be allocated for dentry hashes, which doesn't scale to large-memory fileservers. I believe this is a hangover from sub-4GB memory days.

Comment 16 Alan Brown 2011-05-26 01:13:25 UTC
Dave, would you fork a bugzilla for rhel5 please?

Comment 18 RHEL Program Management 2011-05-26 13:29:45 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 19 Alan Brown 2011-05-31 09:46:39 UTC
I've tried Bryn's patch against 2.6.18-262 and was able to increase hash sizes:

rsbtbl_size = 4096, lkbtbl_size/dirtbl_size = 16384

Trying values larger than this didn't work: clvmd wouldn't start.

Things are definitely faster (a factor of at least 30) under load.

(Load here is five concurrent incremental backups; these used to run at 1-3 files/sec each and are now running at 30-100 files/sec each. The backups stat() every file in the filesystem.)

I suspect that larger hash values would help more.

Note: I'm pretty sure that using vmalloc won't pass muster with kernel devs - there are notes indicating it's strongly discouraged in the kernel and in modules.

How about using page allocations?
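
(For illustration only: a minimal sketch of the page-allocation alternative Alan raises, using the stock page allocator; the function names are hypothetical. Note that high-order page allocations can themselves fail on a fragmented system, which is one reason very large tables often end up using vmalloc or per-bucket allocation anyway.)

#include <linux/gfp.h>
#include <linux/mm.h>

/* Round the table up to a whole number of pages and take them
 * directly from the page allocator. */
static void *dlm_alloc_table_pages(size_t bytes, unsigned int *order)
{
	*order = get_order(bytes);
	return (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN, *order);
}

static void dlm_free_table_pages(void *table, unsigned int order)
{
	free_pages((unsigned long)table, order);
}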

Comment 21 David Teigland 2011-06-09 14:35:39 UTC
You can check the default settings by loading the dlm module and
verifying they are 1024:

[root@bull-01 ~]# cat /sys/kernel/config/dlm/cluster/dirtbl_size 
1024
[root@bull-01 ~]# cat /sys/kernel/config/dlm/cluster/lkbtbl_size 
1024
[root@bull-01 ~]# cat /sys/kernel/config/dlm/cluster/rsbtbl_size 
1024

Comment 22 Bryn M. Reeves 2011-06-17 12:54:55 UTC
Are we thinking of taking the vmalloc patch as well? We have anecdotal reports from an EMEA gfs2 customer that increasing the hash table size beyond the kmalloc limit has a significant impact on performance for their use case.

Should I open a separate bug for this?

Comment 23 Alan Brown 2011-06-17 13:26:20 UTC
The performance hit is no surprise - it's documented in vmalloc tutorials online and vmalloc is limited to 1Gb (by default on 64bit systems) in any case.

Kernel patches using vmalloc are strongly deprecated in favour of page allocations - any final distribution patch should use the latter or it won't be accepted upstream.

My understanding was that the vmalloc patch was just a quick proof-of-concept hack to see if the idea worked in general for enlarging hash sizes (which it did). In our case the slight performance hit incurred by using vmalloc was outweighed by the overall performance boost under load.

Given this is required to enlarge the hash tables beyond a trivial multiplier, I think it should remain within this BZ, or we'll just end up with two BZs covering the same sections of code - and the confusion that comes with such things.

Comment 24 David Teigland 2011-06-17 15:40:29 UTC
There is a one-to-one correspondence between patches and bzs in the RH process.  This bz has already been spent on increasing the defaults, so another bz would need to be created to adopt other changes.

For upstream, I think we should look at copying the hash table code from
fs/ocfs2/dlm/.  I suspect that may be too large a change for RHEL, so I wouldn't mind using vmalloc with the current hash tables in RHEL.

(One thing to keep in mind is that the maximum number of hash buckets the lkb table will support is 2^16, because the bucket is kept in the top 16 bits of the lkid.)
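
(For illustration only: the 2^16 ceiling David mentions follows directly from the ID layout; a standalone sketch with hypothetical helper names.)

#include <stdint.h>

/* If the bucket index lives in the top 16 bits of a 32-bit lock ID,
 * at most 2^16 = 65536 buckets can ever be addressed, regardless of
 * what the table size is set to. */
static inline uint32_t lkid_make(uint16_t bucket, uint16_t seq)
{
	return ((uint32_t)bucket << 16) | seq;
}

static inline uint16_t lkid_bucket(uint32_t lkid)
{
	return (uint16_t)(lkid >> 16);
}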

Comment 25 Alan Brown 2011-06-17 15:49:08 UTC
Ok, fair enough. Let's fork a new BZ.

Given the number of objects I'm seeing (2-10 million lkbs and glocks), are hash tables the way to go in future though? Perhaps a tree would be better?

Comment 26 David Teigland 2011-06-17 16:00:03 UTC
Yeah, it's probably worth checking if another data structure would work better.
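
(For illustration only: a minimal sketch of what a tree-based lookup might look like using the kernel's red-black tree API, keyed on the resource name; the node layout and function names are hypothetical and not from any proposed patch.)

#include <linux/rbtree.h>
#include <linux/string.h>
#include <linux/kernel.h>

/* Hypothetical node type for an rbtree keyed by resource name. */
struct rsb_node {
	struct rb_node node;
	int            name_len;
	char           name[64];
};

/* Walk the tree comparing names; returns NULL if the resource is absent. */
static struct rsb_node *rsb_tree_search(struct rb_root *root,
					const char *name, int len)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct rsb_node *r = rb_entry(n, struct rsb_node, node);
		int cmp = memcmp(name, r->name, min(len, r->name_len));

		if (!cmp)
			cmp = len - r->name_len;
		if (cmp < 0)
			n = n->rb_left;
		else if (cmp > 0)
			n = n->rb_right;
		else
			return r;
	}
	return NULL;
}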

Comment 27 Bryn M. Reeves 2011-06-23 14:23:13 UTC
The vmalloc approach was Steve's suggestion for RHEL5 (that we'd discussed back in March) where we are more constrained in making changes than upstream. Note that this bug is for RHEL6 - I've cloned it for RHEL5 as bug 715603 and added the request for the vmalloc change.

I'm not sure where you arrive at the idea of a 1G vmalloc limit for 64-bit systems; this has never been the case on any arch that I am aware of. See mm/vmalloc.c for details (definition of VMALLOC_SIZE, line 726 in current git). All architectures with BITS_PER_LONG > 32 default to 128G of vmalloc window.

Perhaps you are thinking of 32-bit x86 which is limited to 128M due to the need to fit the 896M physical memory identity mapping and vmalloc window into the top 1G of memory that is reserved for the kernel in the standard address space layout (the 4g4g aka hugemem patches alter this restriction but have never been merged upstream and are only supported up to RHEL4).

Comment 29 Aristeu Rozanski 2011-08-11 20:09:09 UTC
Patch(es) available on kernel-2.6.32-156.el6

Comment 33 errata-xmlrpc 2011-12-06 12:43:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html

