Bug 1247924 - RFE: MariaDB max_connections must be based on controller count * core count
Summary: RFE: MariaDB max_connections must be based on controller count * core count
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Damien Ciabrini
QA Contact: Udi Shkalim
URL:
Whiteboard:
Duplicates: 1273557
Depends On:
Blocks:
 
Reported: 2015-07-29 09:13 UTC by Giulio Fidente
Modified: 2023-02-22 23:02 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1240824
Environment:
Last Closed: 2019-07-04 12:13:55 UTC
Target Upstream Version:
Embargoed:



Comment 2 Ofer Blaut 2015-08-04 11:59:42 UTC
We need an automated solution for https://bugzilla.redhat.com/show_bug.cgi?id=1240824, not just one manually configured parameter.

Comment 3 Giulio Fidente 2015-08-06 11:21:45 UTC
Some investigation reveals that the processes which fork according to the number of cores are the following (a rough tally is sketched after the list):

neutron-metadata-agent
heat-engine
glance-registry
cinder-api
keystone-all (x2 per core)
glance-api
proxy-server
neutron-server (x2 per core)
nova-conductor
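
As a rough back-of-the-envelope illustration (the per-core multipliers are taken from the list above; the core count and helper name are made up for the example), the per-controller process count could be tallied like this:

    # Rough illustration only: estimate forked worker processes on one controller.
    # Per-core multipliers come from the list above; the core count is an example
    # input and the helper name is made up for this sketch.
    FORKS_PER_CORE = {
        "neutron-metadata-agent": 1,
        "heat-engine": 1,
        "glance-registry": 1,
        "cinder-api": 1,
        "keystone-all": 2,      # x2 per core
        "glance-api": 1,
        "proxy-server": 1,
        "neutron-server": 2,    # x2 per core
        "nova-conductor": 1,
    }

    def worker_processes_per_controller(cores):
        """Total forked worker processes on one controller for the services above."""
        return sum(mult * cores for mult in FORKS_PER_CORE.values())

    print(worker_processes_per_controller(12))   # e.g. a 12-core controller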

Comment 4 Giulio Fidente 2015-08-06 11:23:01 UTC
There are also some more recommendations in https://access.redhat.com/articles/1432053 from https://bugzilla.redhat.com/show_bug.cgi?id=1195292

Comment 6 Jiri Stransky 2015-09-16 11:20:48 UTC
The performance tuning document which Giulio linked suggests setting max_connections = 15360. Should we just use this as the default? Details on why I'd like to avoid doing this dynamically follow.

-----

I'm not convinced we should go with a dynamically generated default, at least not at the Puppet level. If the controllers are deployed on hardware that is not fully homogeneous (e.g. PoC environments), each Galera cluster member could end up with a different max_connections value. If that caused an issue with Galera, it could be hard to find the cause (one doesn't expect this setting to differ between cluster members).

Furthermore, computing the value means the user doesn't know it before triggering the deployment, and it wouldn't be visible in config files or manifests (grepping the Puppet manifests or Hiera files for the particular number would yield no results), so it wouldn't be immediately obvious where the value came from. I think dynamic defaults computed at the Puppet level like this add unpredictability and obscurity to the deployment, and we'd better avoid them if possible.

Another way to do this dynamically would be to shift the responsibility into the phase before the Heat stack creation is triggered. That means the code would reside in the CLI (or better yet tripleo-common), it would work with Ironic introspection data, and it would generate a parameter for the Heat stack. This would mean there is always a single value for the whole cluster, and that the user can review the value prior to kicking off the deployment. However, that's probably quite a major RFE, and the cost/benefit ratio would need to be evaluated (I'm not convinced it's worth it at this point).

Comment 7 Jiri Stransky 2015-09-16 11:24:24 UTC
(In reply to Jiri Stransky from comment #6)
> it wouldn't be visible from config files

Just to clarify -- I meant it wouldn't be visible from *Puppet* config files (the Hiera files under /etc/puppet/hieradata).

Comment 8 Michael Bayer 2015-09-16 14:49:23 UTC
> The performance tuning document which Giulio linked suggests setting max_connections = 15360. Should we just use this as default?

That's not a suggestion, that's an *example*.   This document has no guidance at all on how to set this number, and this is a question I'm working on right now.    

Suffice it to say, 15360 is not a number that is achievable on all hardware, because max_connections is directly tied to the number of threads that can feasibly be run per process on the target server, e.g. the value in /proc/sys/kernel/threads-max, which the kernel determines at boot time based on the available memory pages. However, the number in threads-max is still twice the size of the default ulimit on the server, so the real ceiling on connections without changing ulimits is that of "ulimit -u".

In my own testing, I've experimented with raising not only the ulimits but also the threads-max kernel setting, and so far it seems that while you can create as many threads as you want by manipulating these numbers, those connections/threads are unusable for doing any work even at very low levels of activity, so the "ulimit -u" default is already a pretty good baseline to work from. There is also the question of using MariaDB's thread-pooling feature, which again makes many more *idle* connections possible but doesn't increase the amount of concurrent SQL work that is feasible.

tldr; we don't have a non-arbitrary number for max_connections right now, nor does anyone else.
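
For reference, the two ceilings discussed above can be read on a Linux host with a few lines of Python (a minimal sketch; it only inspects the limits and is not tied to any OpenStack tooling):

    # Minimal sketch: read the thread ceilings discussed above on a Linux host.
    # /proc/sys/kernel/threads-max is set by the kernel at boot from available
    # memory pages; RLIMIT_NPROC corresponds to "ulimit -u".
    import resource

    with open("/proc/sys/kernel/threads-max") as f:
        threads_max = int(f.read().strip())

    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)

    print("kernel threads-max:", threads_max)
    print("ulimit -u (soft/hard):", soft_nproc, hard_nproc)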

Comment 9 Mike Burns 2015-10-20 17:12:10 UTC
*** Bug 1273557 has been marked as a duplicate of this bug. ***

Comment 10 Sadique Puthen 2015-10-20 17:30:38 UTC
I have worked with Michael Bayer to develop an article that explains the formula for arriving at the number. It's here: https://access.redhat.com/solutions/1990433

Can we apply this logic and decide this dynamically for each deployment?

Comment 11 Jiri Stransky 2015-10-21 09:10:30 UTC
As I wrote in comments 6 and 7, my opinion is that doing this dynamically with the current capabilities (at the Heat/Puppet level) would open up the possibility of uncertainty and a new class of deployment bugs. Setting this value automatically in the GUI/CLI phase, before a deployment is triggered, could be better, but that is probably a non-trivial RFE and should be weighed in priority against other planned features.

However, if a short-term improvement is needed, we can raise the current default of 4096 to a higher value, if there's a recommendation for a better one. (I don't know the numbers, but my thinking here is that if we see we need to raise the value for, say, half of the deployments, we could raise the default so it works well for a higher percentage of deployments.)

Comment 12 Mike Burns 2016-04-07 20:47:27 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 14 Michael Bayer 2016-11-17 18:56:30 UTC
Looking at http://lists.openstack.org/pipermail/openstack-dev/2016-September/104819.html, this subject area is changing.

Per the above thread, processorcount is no longer the only determining factor in worker configuration, and I think it means the Puppet installer is going to write a non-blank value into the service .conf files that caps the process count at 8. Additionally, most OpenStack services are moving off the eventlet server and onto Apache, which has a much more mature process model.

At the very least, max_connections should be derived from these new values if they are actually being pushed into the .conf files and/or Apache service files.
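
Assuming the capping behavior described above (an assumption based on the referenced thread, not the actual Puppet logic), the per-service worker count would end up looking something like:

    # Illustrative sketch only: worker count per service if the installer caps
    # workers at 8 instead of using the raw processor count, as described above.
    # This is not the actual Puppet implementation.
    import multiprocessing

    WORKER_CAP = 8  # cap discussed in the referenced thread

    def capped_workers(cores=None):
        cores = cores or multiprocessing.cpu_count()
        return min(cores, WORKER_CAP)

    print(capped_workers(56))  # a dense 56-core server would still get 8 workers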

Comment 16 Red Hat Bugzilla Rules Engine 2017-06-04 02:23:21 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 17 Chris Jones 2019-07-04 12:13:55 UTC
AFAIK, OpenStack service workers are no longer scaled linearly with CPU core counts, given the dramatic increase in server core density. I'd be interested to see if there are improvements we can make in keeping max_connections in line with whatever determination Director is making on how many service workers to run, but I believe tying it to core count no longer makes any sense.

Comment 18 Michael Bayer 2019-07-04 13:51:53 UTC
It would have to aggregate the settings being made by all the service-level puppet-xyz packages, e.g. connection pool settings * number of worker processes * number of controllers for each one, then add all of that up. It might be best if each puppet-xyz package could report this number individually, in case there are idiosyncratic behaviors specific to a certain sub-package. Whatever the number is, though, it should probably be doubled and floored at 4096 in any case.
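
A sketch of that aggregation (the service names and per-service figures below are placeholders for illustration, not real Director values) could look like:

    # Sketch of the aggregation described above, not an actual Director implementation:
    # per service, pool size * worker processes * controllers, summed across
    # services, then doubled and floored at 4096.
    def recommended_max_connections(services, controller_count, floor=4096):
        total = sum(pool_size * workers * controller_count
                    for pool_size, workers in services.values())
        return max(2 * total, floor)

    # Placeholder (pool_size, workers) figures per service, for illustration only.
    example_services = {
        "nova":     (10, 8),
        "neutron":  (10, 8),
        "keystone": (10, 8),
        "glance":   (5, 4),
        "heat":     (5, 4),
    }
    print(recommended_max_connections(example_services, controller_count=3))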

