Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 899402 (JBEWS-83)

Summary: Failover triggers 500 Internal Server Error, 503 Service Temporarily Unavailable
Product: [JBoss] JBoss Enterprise Web Server 1 Reporter: Michal Karm Babacek <mbabacek>
Component: unspecifiedAssignee: Jean-Frederic Clere <jfclere>
Status: CLOSED NOTABUG QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: EWS 1.0.2CC: ddoyle, jfclere, mbabacek, rajesh.rajasekaran
Target Milestone: ---   
Target Release: TBD EWS   
Hardware: Unspecified   
OS: Unspecified   
URL: http://jira.jboss.org/jira/browse/JBEWS-83
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
rhel, solaris
Last Closed: 2011-05-26 16:10:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
9-eap-5x-mod_cluster-rhel6-x86_64-failover-load-driver.zip
none
9-eap-5x-mod_cluster-rhel6-x86_64-failover-access_log.zip
none
9-eap-5x-mod_cluster-rhel6-x86_64-failover-error_log.zip none

Description Michal Karm Babacek 2011-05-04 08:13:48 UTC
project_key: JBEWS

Simple scenario: mod_cluster and 4 JBoss nodes are deployed. When a JBoss server fails (JVM is forcibly killed), slightly less then 1% of sessions result in 503.

Related to: https://issues.jboss.org/browse/JBPAPP-3614


Test results (EWS 1.0.2 CR1):
http://hudson.qa.jboss.com/hudson/job/eap-5x-mod_cluster-rhel6-x86_64-failover/9/
Test results (EWS 1.0.2 CR2):
http://hudson.qa.jboss.com/hudson/job/eap-5x-mod_cluster-rhel6-x86_64-failover/10/

perf01-perf04 = JBoss nodes
perf17 = mod_cluster + load driver node

Mod_Cluster configuration (httpd):
Listen 8888

<VirtualHost perf17:8888>
  ServerName perf17
KeepAliveTimeout 60
MaxKeepAliveRequests 0
ManagerBalancerName qacluster
AdvertiseGroup 224.0.2.105
AdvertiseFrequency 5
</VirtualHost>

Comment 1 Michal Karm Babacek 2011-05-04 08:13:48 UTC
Link: Added: This issue Cloned from JBPAPP-3614


Comment 2 Michal Karm Babacek 2011-05-04 08:21:35 UTC
Writer: Removed: mistysj Added: mbabacek


Comment 3 Michal Karm Babacek 2011-05-04 08:22:06 UTC
Affects: Removed: [Documentation (Ref Guide, User Guide, etc.)] 


Comment 4 Michal Karm Babacek 2011-05-04 08:22:45 UTC
Release Notes Docs Status: Removed: Documented as Known Issue 
Release Notes Text: Removed: Due to a race condition caused by a bug in mod_proxy, a <literal>503</literal> error may be returned when a node is killed, instead of failing over to another node.  


Comment 5 Michal Karm Babacek 2011-05-04 08:27:42 UTC
Log from the driver node perf17.

Comment 6 Michal Karm Babacek 2011-05-04 08:27:42 UTC
Attachment: Added: 8-eap-5x-mod_cluster-rhel6-x86_64-failover-log.zip


Comment 7 Michal Karm Babacek 2011-05-04 09:07:14 UTC
Attachment: Removed: 8-eap-5x-mod_cluster-rhel6-x86_64-failover-log.zip 


Comment 8 Michal Karm Babacek 2011-05-04 09:08:21 UTC
Load driver (client) log from perf17.

Comment 9 Michal Karm Babacek 2011-05-04 09:08:21 UTC
Attachment: Added: 9-eap-5x-mod_cluster-rhel6-x86_64-failover-load-driver.zip


Comment 10 Michal Karm Babacek 2011-05-04 09:09:17 UTC
Attachment: Added: 9-eap-5x-mod_cluster-rhel6-x86_64-failover-access_log.zip


Comment 11 Michal Karm Babacek 2011-05-04 09:09:40 UTC
Attachment: Added: 9-eap-5x-mod_cluster-rhel6-x86_64-failover-error_log.zip


Comment 12 Jean-Frederic Clere 2011-05-04 13:05:10 UTC
[jfclere@jfcpc TMP]$ grep 503 access_log 
10.16.90.52 - - [04/May/2011:04:45:41 -0400] "GET /st/SessionTestServlet?invalidate=true HTTP/1.1" 503 399 "-" "Jakarta Commons-HttpClient/3.1"
10.16.90.52 - - [04/May/2011:04:51:05 -0400] "GET /st/SessionTestServlet?invalidate=true HTTP/1.1" 503 399 "-" "Jakarta Commons-HttpClient/3.1"
[jfclere@jfcpc TMP]$ wc -l access_log
367040 access_log

Hm that is NOT 1%

Comment 13 Michal Karm Babacek 2011-05-04 13:41:05 UTC
@Jean-Frederic Clere:

Correction:

Around 0.0005449% of sessions result in 503.

:-)

Comment 14 Jean-Frederic Clere 2011-05-04 13:54:10 UTC
I can't find any clues in the error_log... The error comes from httpd I would be need a debug of httpd to find what goes wrong.

Comment 15 Michal Karm Babacek 2011-05-09 07:29:13 UTC
I've encountered this behaviour in several other scenarios on some other platforms as well, however I haven't been able to reproduce it deterministically. I'm gonna provide additional data as soon as I figure it out.

Comment 16 Michal Karm Babacek 2011-05-09 15:22:25 UTC
You may download these results:

*Httpd configuration:*
  http://scratch.englab.brq.redhat.com/mbabacek/34-eap-5x-mod_cluster-rhel6-i386-failover-conf.zip

*Httpd logs:*
  http://scratch.englab.brq.redhat.com/mbabacek/34-eap-5x-mod_cluster-rhel6-i386-failover-logs.zip

*Client's log:*
  http://scratch.englab.brq.redhat.com/mbabacek/34-eap-5x-mod_cluster-rhel6-i386-failover-console.zip

*The actual run link:*
  http://hudson.qa.jboss.com/hudson/job/eap-5x-mod_cluster-rhel6-i386-failover/34/


*The client's log shows the following types of errors:*
{noformat} 
  Error sampling data:  &lt;org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code: 500 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
  Error sampling data:  &lt;org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code: 503 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
  Error sampling data:  &lt;org.jboss.smartfrog.loaddriver.RequestProcessingException: IO error: java.net.BindException: Address already in use>
  Error sampling data:  &lt;org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 10, received 0, Runner: 1034>
{noformat} 
*Httpd error_log contains these types of errors:*
{noformat} 
  [error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
  [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.16.88.179:8009 (10.16.88.179) failed
  [error] (111)Connection refused: proxy: AJP: attempt to connect to 10.16.88.179:8009 (10.16.88.179) failed
  [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.16.88.180:8009 (10.16.88.180) failed
  [error] (111)Connection refused: proxy: AJP: attempt to connect to 10.16.88.180:8009 (10.16.88.180) failed
  [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.16.88.181:8009 (10.16.88.181) failed
  [error] (111)Connection refused: proxy: AJP: attempt to connect to 10.16.88.181:8009 (10.16.88.181) failed
  [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.16.88.182:8009 (10.16.88.182) failed
  [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.16.88.179:8009 (10.16.88.179)
  [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.16.88.181:8009 (10.16.88.181)
  [error] (120006)APR does not understand this error code: proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
  [error] (70014)End of file found: ajp_ilink_receive() can't receive header
  [error] ajp_cping_cpong: apr_socket_recv failed
  [error] ajp_handle_cping_cpong: ajp_ilink_receive failed
  [error] ajp_read_header: ajp_ilink_receive failed
  [error] ap_proxy_connect_backend disabling worker for (10.16.88.179)
  [error] ap_proxy_connect_backend disabling worker for (10.16.88.180)
  [error] ap_proxy_connect_backend disabling worker for (10.16.88.181)
  [error] ap_proxy_connect_backend disabling worker for (10.16.88.182)
  [error] proxy: ajp: disabled connection for (10.16.88.179)
  [error] proxy: AJP: disabled connection for (10.16.88.179)
  [error] proxy: AJP: failed to make connection to backend: 10.16.88.179
  [error] proxy: AJP: failed to make connection to backend: 10.16.88.180
  [error] proxy: AJP: failed to make connection to backend: 10.16.88.181
  [error] proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
{noformat} 
The test was a failover scenario, so "failed to make connection" are related to that...

Comment 17 Jean-Frederic Clere 2011-05-09 15:36:49 UTC
[jfclere@jfcpc TMP]$ grep 500 access_log
10.16.88.178 - - [09/May/2011:10:00:54 -0400] "GET /st/SessionTestServlet? HTTP/1.1" 500 609 "-" "Jakarta Commons-HttpClient/3.1"
10.16.88.178 - - [09/May/2011:10:01:01 -0400] "GET /st/SessionTestServlet? HTTP/1.1" 500 609 "-" "Jakarta Commons-HttpClient/3.1"
10.16.88.178 - - [09/May/2011:10:05:23 -0400] "GET /st/SessionTestServlet? HTTP/1.1" 500 609 "-" "Jakarta Commons-HttpClient/3.1"
[jfclere@jfcpc TMP]$ grep 503 access_log
10.16.88.178 - - [09/May/2011:10:02:40 -0400] "GET /st/SessionTestServlet? HTTP/1.1" 503 398 "-" "Jakarta Commons-HttpClient/3.1"


Comment 18 Jean-Frederic Clere 2011-05-09 16:07:47 UTC
the 500 correspond to something like:
+++
[Mon May 09 10:05:23 2011] [debug] proxy_util.c(2011): proxy: AJP: has acquired connection for (10.16.88.179)
[Mon May 09 10:05:23 2011] [debug] proxy_util.c(2067): proxy: connecting ajp://10.16.88.179:8009/st/SessionTestServlet? to 10.16.88.179:8009
[Mon May 09 10:05:23 2011] [debug] proxy_util.c(2193): proxy: connected /st/SessionTestServlet? to 10.16.88.179:8009
[Mon May 09 10:05:23 2011] [debug] ajp_utils.c(31): Into ajp_handle_cping_cpong
[Mon May 09 10:05:23 2011] [debug] ajp_utils.c(102): ajp_handle_cping_cpong: Done
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(224): Into ajp_marshal_into_msgb
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(290): ajp_marshal_into_msgb: Header[0] [User-Agent] = [Jakarta Commons-HttpClient/3.1]
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(290): ajp_marshal_into_msgb: Header[1] [Host] = [dev89:8888]
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(290): ajp_marshal_into_msgb: Header[2] [Cookie] = [$Version=0; JSESSIONID=jBhmXOxp9S191fthutFcNw__.node1; $Path=/st]
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(450): ajp_marshal_into_msgb: Done
[Mon May 09 10:05:23 2011] [debug] mod_proxy_ajp.c(265): proxy: APR_BUCKET_IS_EOS
[Mon May 09 10:05:23 2011] [debug] mod_proxy_ajp.c(270): proxy: data to read (max 8186 at 4)
[Mon May 09 10:05:23 2011] [debug] mod_proxy_ajp.c(285): proxy: got 0 bytes of data
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(687): ajp_read_header: ajp_ilink_received 09
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(697): ajp_parse_type: got 09
[Mon May 09 10:05:23 2011] [debug] mod_proxy_ajp.c(546): proxy: Processing of request failed backend: 1, output: 0
[Mon May 09 10:05:23 2011] [error] proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
+++
the backend seems to send a CPONG instead the headers.

Comment 19 Michal Karm Babacek 2011-05-18 08:47:24 UTC
I've just verified that the same behaviour takes place on Solaris 10 Sparc64 as well:
*Logs*
 - [31-eap-5x-mod_cluster-solaris10-sparc64-access_log.zip|http://scratch.englab.brq.redhat.com/mbabacek/31-eap-5x-mod_cluster-solaris10-sparc64-access_log.zip]
 - [31-eap-5x-mod_cluster-solaris10-sparc64-error_log.zip|http://scratch.englab.brq.redhat.com/mbabacek/31-eap-5x-mod_cluster-solaris10-sparc64-error_log.zip]

*Client's errors*
{noformat}
2011/05/17 11:54:40:794 EDT [WARN ][Runner - 20]  Invalid response code: 500 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
2011/05/17 11:54:46:010 EDT [WARN ][Runner - 194]  Invalid response code: 500 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
2011/05/17 12:02:24:789 EDT [WARN ][Runner - 225]  Invalid response code: 503 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
2011/05/17 12:02:24:819 EDT [WARN ][Runner - 225] SFCORE_LOG - Error sampling data:  &lt;Invalid response code: 503 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
2011/05/17 12:02:24:820 EDT [WARN ][Runner - 96]  Invalid response code: 503 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
2011/05/17 12:02:24:821 EDT [WARN ][Runner - 96] SFCORE_LOG - Error sampling data:  &lt;Invalid response code: 503 Content: &lt;!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
{noformat}

*Error_log digest*
{noformat}
[hudson@dev89 logs]$ cat error_log | grep -A4 -B3 "failed backend" 

[Mon May 09 10:00:54 2011] [debug] mod_proxy_ajp.c(285): proxy: got 0 bytes of data
[Mon May 09 10:00:54 2011] [debug] ajp_header.c(687): ajp_read_header: ajp_ilink_received 09
[Mon May 09 10:00:54 2011] [debug] ajp_header.c(697): ajp_parse_type: got 09
[Mon May 09 10:00:54 2011] [debug] mod_proxy_ajp.c(546): proxy: Processing of request failed backend: 1, output: 0
[Mon May 09 10:00:54 2011] [error] proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
[Mon May 09 10:00:54 2011] [debug] proxy_util.c(2029): proxy: AJP: has released connection for (10.16.88.179)
[Mon May 09 10:00:54 2011] [debug] mod_proxy_cluster.c(1245): proxy: byrequests balancer DONE (ajp://10.16.88.181:8009)
[Mon May 09 10:00:54 2011] [debug] mod_proxy.c(1015): Running scheme balancer handler (attempt 0)
--
[Mon May 09 10:01:01 2011] [debug] mod_proxy_ajp.c(285): proxy: got 0 bytes of data
[Mon May 09 10:01:01 2011] [debug] ajp_header.c(687): ajp_read_header: ajp_ilink_received 09
[Mon May 09 10:01:01 2011] [debug] ajp_header.c(697): ajp_parse_type: got 09
[Mon May 09 10:01:01 2011] [debug] mod_proxy_ajp.c(546): proxy: Processing of request failed backend: 1, output: 0
[Mon May 09 10:01:01 2011] [error] proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
[Mon May 09 10:01:01 2011] [debug] proxy_util.c(2029): proxy: AJP: has released connection for (10.16.88.179)
[Mon May 09 10:01:01 2011] [debug] mod_proxy_cluster.c(1739): cluster: Found value aVjGbiLTA6+bo4Kx9OmqDA__.node1 for stickysession JSESSIONID|jsessionid
[Mon May 09 10:01:01 2011] [debug] mod_proxy_cluster.c(2108): cluster: Using route node1
--
[Mon May 09 10:02:40 2011] [error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
[Mon May 09 10:02:40 2011] [debug] mod_proxy_ajp.c(531): (120006)APR does not understand this error code: ajp_read_header failed
[Mon May 09 10:02:40 2011] [error] ajp_handle_cping_cpong: ajp_ilink_receive failed
[Mon May 09 10:02:40 2011] [debug] mod_proxy_ajp.c(546): proxy: Processing of request failed backend: 1, output: 0
[Mon May 09 10:02:40 2011] [debug] ajp_utils.c(102): ajp_handle_cping_cpong: Done
[Mon May 09 10:02:40 2011] [error] (120006)APR does not understand this error code: proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
[Mon May 09 10:02:40 2011] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.16.88.179:8009 (10.16.88.179)
[Mon May 09 10:02:40 2011] [debug] proxy_util.c(2067): proxy: connecting ajp://10.16.88.179:8009/st/SessionTestServlet? to 10.16.88.179:8009
--
[Mon May 09 10:05:23 2011] [debug] mod_proxy_ajp.c(285): proxy: got 0 bytes of data
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(687): ajp_read_header: ajp_ilink_received 09
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(697): ajp_parse_type: got 09
[Mon May 09 10:05:23 2011] [debug] mod_proxy_ajp.c(546): proxy: Processing of request failed backend: 1, output: 0
[Mon May 09 10:05:23 2011] [error] proxy: dialog to 10.16.88.179:8009 (10.16.88.179) failed
[Mon May 09 10:05:23 2011] [debug] proxy_util.c(2029): proxy: AJP: has released connection for (10.16.88.179)
[Mon May 09 10:05:23 2011] [debug] ajp_utils.c(102): ajp_handle_cping_cpong: Done
[Mon May 09 10:05:23 2011] [debug] ajp_header.c(224): Into ajp_marshal_into_msgb
{noformat}

Comment 20 Michal Karm Babacek 2011-05-23 23:34:29 UTC
Link: Added: This issue Cloned to JBPAPP-6578


Comment 21 Rajesh Rajasekaran 2011-05-24 20:43:25 UTC
Michal, it would have been better to have reopened JBPAPP-3614 than continue the discussion in a new JIRA.
Or is this a new bug after fixing the mod_proxy in httpd2.2.17 as I see from the comments on JBPAPP-3614?

Comment 22 Michal Karm Babacek 2011-05-24 20:55:02 UTC
@[~rrajesh] I guess that this is a new, or at least slightly different problem.
@[~jfclere] What do you think?

Comment 23 Jean-Frederic Clere 2011-05-25 11:16:03 UTC
I have reproduced JBPAPP-6451, It looks like something different and a bug in the log on the ajp module will cause me more work to investigate it.

Comment 24 Michal Karm Babacek 2011-05-26 12:13:25 UTC
@[~jfclere] Is there anything I can do for you to facilitate the debug process?
I'm talking not only about [JBPAPP-6451|https://issues.jboss.org/browse/JBPAPP-6451], the [JBPAPP-6578|https://issues.jboss.org/browse/JBPAPP-6578] is an issue as well. As I see it, the 500 Http Error affects all platforms (didn't try Windoz) as long as the balancer is stressed a bit, or to be more specific, as long as there are enough sessions for a certain time.

Regarding the [JBPAPP-6578|https://issues.jboss.org/browse/JBPAPP-6578], would it help you if I run the same scenario with a very small number of sessions, quite high number of sessions and a very high number of sessions and cross-reference the *mod_proxy_ajp* logs?

Comment 25 Jean-Frederic Clere 2011-05-26 16:08:14 UTC
the 503 correspond to normal errors: we can't replay a request interrupted while waiting for the answer. The 500 are JBPAPP-6578

Comment 26 Jean-Frederic Clere 2011-05-26 16:10:19 UTC
The 503 are normal errors. The 500 are in JBPAPP-6578

Comment 27 Jiri Skrabal 2012-11-13 15:56:42 UTC
Writer: Removed: mbabacek