Bug 1394012

Summary: Tomcat hangs when number of StartServers increased
Product: Red Hat Satellite 5
Reporter: Neal Kim <nkim>
Component: Server
Assignee: Tomáš Kašpárek <tkasparek>
Status: CLOSED NOTABUG
QA Contact: Red Hat Satellite QA List <satqe-list>
Severity: medium
Priority: unspecified
Version: 570
CC: shughes, tlestach, wpinheir
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2016-11-10 21:01:09 UTC
Type: Bug
Attachments:
  javacore from Tomcat while "hung" (flags: none)

Description Neal Kim 2016-11-10 20:09:20 UTC
Created attachment 1219525
javacore from Tomcat while "hung"

Description of problem:

In an attempt to increase the number of systems that can provision from Satellite at any one time, we increased the value of StartServers in:

/etc/httpd/conf.d/zz-spacewalk-server.conf

From:

<IfModule prefork.c>
  StartServers         8
  
To a modest:

<IfModule prefork.c>
  StartServers         20
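
For illustration, the change above could be scripted roughly as follows. This is only a sketch: set_start_servers is a hypothetical helper name (not a Satellite tool), and it assumes the file contains a properly closed <IfModule prefork.c> ... </IfModule> block with StartServers on its own line.

```shell
#!/bin/bash
# Sketch only: rewrite the StartServers value inside <IfModule prefork.c>.
# set_start_servers is a hypothetical helper, not part of Satellite.
set_start_servers() {
    local conf="$1" value="$2" tmp
    tmp="$(mktemp)" || return 1
    # Track whether we are inside the prefork block so that a StartServers
    # line in any other block (e.g. worker.c) is left untouched.
    awk -v n="$value" '
        /<IfModule prefork\.c>/ { in_prefork = 1 }
        /<\/IfModule>/          { in_prefork = 0 }
        in_prefork && $1 == "StartServers" { sub(/[0-9]+/, n) }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    local rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

It would be called as "set_start_servers /etc/httpd/conf.d/zz-spacewalk-server.conf 20", followed by a restart of the Satellite services. Writing the result back with cat instead of mv keeps the ownership and permissions of the original file.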
  
When attempting to kickstart more than ~5 systems in parallel, the first several HTTP requests return, but requests eventually start to fail. From that point on, subsequent HTTP requests fail and the WebUI becomes unavailable.

Tomcat appears to be "hung" (or at least waiting for something to happen), and restarting the Satellite services is the only way to restore service immediately. Restarting Apache also seems to restore service, but only after several minutes.

During this time, both memory and CPU utilization are at nominal values.

Modifying the AJP connector maxThreads value and ProxyTimeout has no effect.

Restoring the default StartServers value (8) apparently works much better.


Version-Release number of selected component (if applicable):

Satellite 5.7
spacewalk-schema-2.3.2-27.el6sat.noarch
satellite-schema-5.7.0.24-1.el6sat.noarch


How reproducible:

Easily, on a fresh install of Satellite 5.7.


Steps to Reproduce:

Change the number of StartServers in /etc/httpd/conf.d/zz-spacewalk-server.conf:

<IfModule prefork.c>
  StartServers         20
  
Restart Satellite:

# rhn-satellite restart

Simulate kickstart traffic:

[root@sat57 conf.d]# ab -n 1000 -c 20 http://<SATELLITE_FQDN>/ks/dist/org/1/file/does/not/exist
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking sat57 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
apr_poll: The timeout specified has expired (70007)
Total of 994 requests completed

Observe that the WebUI is unavailable and subsequent HTTP requests fail until Satellite services are restarted.


Actual results:

WebUI is unavailable and Tomcat appears "hung".


Expected results:

WebUI remains available and Tomcat does not hang.


Additional info:

Kickstarts use the following rewrite rule:

RewriteRule ^/ks/dist(.*)$ /rhn/common/DownloadFile.do?url=/ks/dist$1
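
To make the mapping concrete, the rule can be approximated with sed. This is only a sketch: rewrite_ks_path is a hypothetical helper, and it mimics just the pattern substitution shown above, not mod_rewrite's full behavior (flags, query string handling, and so on).

```shell
#!/bin/bash
# Sketch only: preview how the RewriteRule maps a kickstart path.
# rewrite_ks_path is a hypothetical helper; Apache is not involved here.
rewrite_ks_path() {
    # Same pattern and substitution as the rule, with $1 written as \1 for sed.
    # Paths that do not start with /ks/dist are printed unchanged.
    printf '%s\n' "$1" |
        sed 's#^/ks/dist\(.*\)$#/rhn/common/DownloadFile.do?url=/ks/dist\1#'
}
```

For the path used in the ab run, "rewrite_ks_path /ks/dist/org/1/file/does/not/exist" prints /rhn/common/DownloadFile.do?url=/ks/dist/org/1/file/does/not/exist, so every request in the test ends up at DownloadFile.do.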

At least in my testing, there appears to be some relation between the number of ESTABLISHED AJP connections and the point at which Tomcat "hangs": somewhere around 20, but it varies.
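
One way to watch this while the test runs is to count the ESTABLISHED connections on the Tomcat side of the AJP port. This is a sketch under assumptions: count_ajp_established is a hypothetical helper, port 8009 is assumed to be the AJP port (the usual Tomcat default, not confirmed above), and the input is assumed to be in "netstat -tan" column layout.

```shell
#!/bin/bash
# Sketch only: count ESTABLISHED TCP connections whose local port is the
# AJP port. Reads "netstat -tan" style output on stdin.
# count_ajp_established is a hypothetical helper; 8009 is an assumed default.
count_ajp_established() {
    local port="${1:-8009}"
    # Field 4 is the local address, field 6 is the state. Only the local
    # side is matched so that a loopback connection is not counted twice.
    awk -v p=":$port" '
        $6 == "ESTABLISHED" && substr($4, length($4) - length(p) + 1) == p { n++ }
        END { print n + 0 }
    '
}
```

It would be used as "netstat -tan | count_ajp_established", repeated every few seconds during the ab run to see what the count is when requests stop returning.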

I will attach a javacore captured while Tomcat is "hung".