Bug 1030965 - Number of registered contexts negatively affects mod_cluster performance
Status: CLOSED CURRENTRELEASE
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: mod_cluster
6.2.0
Unspecified Unspecified
high Severity high
: ER3
: EAP 6.3.0
Assigned To: Jean-frederic Clere
Michal Karm Babacek
Russell Dickenson
:
Depends On:
Blocks: 1084882 1164327 1079156
 
Reported: 2013-11-15 07:26 EST by Michal Karm Babacek
Modified: 2014-11-14 11:48 EST
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A performance issue has been identified in the Apache HTTP Server with mod_cluster configured as a load balancer. httpd shared memory operations on the `workers->nodes` table negatively affect the performance of the load balancer. As a result, performance of the httpd load balancer decreases as the number of registered contexts increases. To work around this issue, reduce the number of registered contexts. To fix this bug, httpd has been modified to use local memory rather than shared memory.
Story Points: ---
Clone Of:
: 1079156 1164327
Environment:
Last Closed: 2014-06-28 11:44:30 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker MODCLUSTER-372 Major Closed Number of registered contexts negatively affects mod_cluster performance 2017-06-08 09:58 EDT

Description Michal Karm Babacek 2013-11-15 07:26:05 EST
Follow the https://issues.jboss.org/browse/MODCLUSTER-372
Comment 1 JBoss JIRA Server 2013-11-15 12:11:10 EST
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-372

I can see I forgot to add {{ServerLimit    40}}, but it would only allow for MaxClients to be higher than 1920, and actually shouldn't have any bearing on the case of mod_cluster and registered contexts.
Comment 2 JBoss JIRA Server 2013-11-16 04:50:01 EST
Jean-Frederic Clere <jfclere@jboss.org> made a comment on jira MODCLUSTER-372

Looking at find_node_context_host() in mod_proxy_cluster.c, I have spotted something very bad:
+++
        /* keep only the contexts corresponding to our balancer */
        if (balancer != NULL) {
            nodeinfo_t *node;
            if (node_storage->read_node(context->node, &node) != APR_SUCCESS)
                continue;
+++
We are reading the shared memory for each context, which is _very_ bad.
Comment 3 JBoss JIRA Server 2013-11-18 04:50:37 EST
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-372

I've identified these bottlenecks.
The three most expensive functions and their callers:

 # Function {{ap_slotmem_mem}} from mod_slotmem/sharedmem_util.c called by:
  * {{get_context}} from mod_manager/context.c
  * {{get_node}} from mod_manager/node.c
  * {{loc_read_node}} from mod_manager/mod_manager.c
  * {{find_node_context_host}} from mod_proxy_cluster/mod_proxy_cluster.c
  * {{loc_read_context}} from mod_manager/context.c
  * {{read_context_table}} from mod_proxy_cluster/mod_proxy_cluster.c
  * {{manager_info}} from mod_manager/mod_manager.c

 # Function {{ap_slotmem_do}} from mod_slotmem/sharedmem_util.c called mostly by
  * httpd_request and httpd_core functions from httpd sources.

 # Function {{find_node_context_host}} from mod_proxy_cluster/mod_proxy_cluster.c called by
  * httpd sources, mostly on request_process and request_connection.

So, IMHO, {{find_node_context_host}} is indeed the trouble :)

Attaching the profiling valgrind logs [^callgrind.zip], created as: 
{noformat}
valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes --compress-strings=no --compress-pos=no --collect-systime=yes ./httpd -f /tmp/hudson/httpd/conf/httpd.conf -E /tmp/hudson/httpd/logs/httpd.log
{noformat}
with this debug setting:
{code}
<IfModule worker.c>
ThreadLimit         50
StartServers        1
ServerLimit         1
MinSpareThreads     50
MaxSpareThreads     50
MaxClients          50
ThreadsPerChild     50
MaxRequestsPerChild 0
</IfModule>
{code}
and 4 worker nodes, 65 contexts each, and several dozen concurrent client sessions.
Comment 4 JBoss JIRA Server 2013-11-18 04:57:34 EST
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-372

Same bottlenecks, callers, valgrind invocation, debug settings, and topology (4 worker nodes, 65 contexts each, several dozen concurrent client sessions) as in comment 3, with one addition: the three most expensive functions keep this order regardless of whether the event metric is Instruction fetch, Data read, or Data write.
Comment 5 JBoss JIRA Server 2013-11-18 16:43:41 EST
Michal Babacek <mbabacek@redhat.com> made a comment on jira MODCLUSTER-372

Jean-Frederic is experimenting with reading from local memory instead of shared memory; the preliminary results of my unified tests look very promising:

||test||balancer CPU usage peak||
|1.2.x branch, 4 workers, 61 contexts each|70%|
|1.2.x branch, 4 workers, 1 context each|17%|
|MODCLUSTER-372 branch, 4 workers, 61 contexts each|28%|
|MODCLUSTER-372 branch, 4 workers, 1 context each|17%|
Comment 7 JBoss JIRA Server 2014-02-06 11:18:58 EST
Jean-Frederic Clere <jfclere@jboss.org> updated the status of jira MODCLUSTER-372 to Resolved
Comment 8 JBoss JIRA Server 2014-02-20 11:07:38 EST
Michal Babacek <mbabacek@redhat.com> updated the status of jira MODCLUSTER-372 to Reopened
Comment 9 JBoss JIRA Server 2014-03-03 07:15:30 EST
Michal Babacek <mbabacek@redhat.com> updated the status of jira MODCLUSTER-372 to Resolved
Comment 10 Michal Karm Babacek 2014-03-03 07:42:00 EST
Fix ported to 1.2.x branch [1], it's ready for EAP6. Follow the linked Jira for more details. It will be verified as soon as the productized bits are ready.

[1] https://github.com/modcluster/mod_cluster/pull/64
Comment 11 Michal Karm Babacek 2014-03-06 05:17:43 EST
Failed. Please note that DR2 contains the old mod_cluster 1.2.6. A component update is needed. I'm setting DR3 as the new milestone.
Comment 12 Michal Karm Babacek 2014-03-12 06:38:38 EDT
Fail.

mod_cluster was not upgraded. According to [BZ 1050223], it was supposed to be version 1.2.8, but it is still the old 1.2.6.
Comment 14 Michal Karm Babacek 2014-05-14 10:18:44 EDT
Assigning to myself for verification...
Comment 15 Michal Karm Babacek 2014-05-14 10:20:54 EDT
Verified :-)

Note: Don't expect this fix in EAP 6.3.0 Beta, because native libraries are out of the Beta release scope.
Comment 16 Nichola Moore 2014-05-15 00:33:13 EDT
Changed bug type from Known Issue to Bug Fix as part of bz 1097719, taking into account the comment above.
