Bug 899173 (JBEWS-240)
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Failover triggers 503 Service Temporarily Unavailable | | |
| Product: | [JBoss] JBoss Enterprise Web Server 1 | Reporter: | Radoslav Husar <rhusar> |
| Component: | unspecified | Assignee: | Permaine Cheung <pcheung> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | Version: | unspecified |
| CC: | jfclere, misty, pcheung, pslavice, rajesh.rajasekaran | | |
| Target Milestone: | --- | Target Release: | EWS 1.0.2 |
| Hardware: | Unspecified | OS: | Unspecified |
| URL: | http://jira.jboss.org/jira/browse/JBEWS-240 | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | perf lab (and all other platforms as well) |
| Last Closed: | 2011-05-17 11:25:14 UTC | Type: | Bug |
| Regression: | --- | | |
| Attachments: | error_log.zip, patch.txt | | |
Description
Radoslav Husar, 2010-02-01 14:01:37 UTC
If the JVM is forcibly killed, all requests accepted by that node *should* result in a 50x error. It would be incorrect for these requests to fail over, since that may result in duplicated processing. In the case of a JVM crash, the best we can do is provide failover for subsequent requests. A problem would emerge if, for example, all transactions on the server committed successfully and the server crashed while (or before) sending the response to the client. To avoid duplicate processing, we don't fail over.

Valid point. Practically speaking, though, aren't most users who get such an error simply going to refresh? (The RFC says this is a temporary problem and suggests that the user retry after a while.) Couldn't applications that are sensitive to double processing cover this themselves (i.e., the request already carries a transaction ID, so the application would see it as already completed if processing were attempted twice)? Since most websites just want HA, shouldn't there be an option to fail over anyway?

503 (HTTP_SERVICE_UNAVAILABLE) causes a retry if maxAttempts is greater than one. See http://www.jboss.org/mod_cluster/java/properties.html: maxAttempts (default is 1: no retry).

Thanks for the hint Jean-Frederic, but I am still not happy :-)
Tested with maxAttempts set to 2; several sessions still return 503 to the client. This should not be happening (see test run [1]).
My speculation is that the 503s were retried, but all of the 500s (from JBPAPP-3615) were not retried, and those now resulted in a 503 on the second attempt. See the code:
```c
else if (access_status == HTTP_INTERNAL_SERVER_ERROR) {
    /* Unrecoverable server error.
     * We can not failover to another worker.
     * Mark the worker as unusable if member of load balancer
     */
    if (balancer) {
        worker->s->status |= PROXY_WORKER_IN_ERROR;
        worker->s->error_time = apr_time_now();
    }
    break;
}
```
Or the problem is why the 500s were returned on the first attempt in the first place (and therefore were not retried now).
[1] http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/8/console-perf08/
Actually mod_proxy (and mod_cluster, which is based on it) doesn't allow a retry based on the status returned by the worker. Note that access_status is not r->status. The 500 and 503 are probably in fact returned by the AS, so this is JBPAPP-3463 and not JBPAPP-3614; please check in the error_log or access_log.

The server was forcefully killed (-9) in this case. It cannot have returned anything, so that can't be the case. Maybe we are onto another problem, which is what I meant by JBPAPP-3615: it should return 503, not 500, in my opinion.

"It cannot have returned anything": how can you be so sure that no response was truncated, or that the request wasn't processed but the server killed before sending the response? Please provide an error_log (debug level is better) and I am sure I will prove that those errors can't be retried (HTTP can't tell whether the request has been processed or not).

You are right, I cannot be sure there wasn't a partial response. Those are the problem here, even with maxAttempts. I am attaching roughly the first 25% of the log; it's very verbose :-) Can you please grep through it and find the evidence for that? And is there another option to allow a retry even on these responses?

Attachment: Added: error_log.zip

Unfortunately the first part doesn't include such a request. Two of them (503) happen at: 07:55:23:944 and 07:58:20:494. Here are the full logs: http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/9/artifact/error_log.zip

According to the error_log, at least 14 requests shouldn't be retried, so you should see at least 14 errors. How many errors do you get?

Strange. Only 4. The test was running from 07:48:47:246 to 08:01:18:625; maybe the 14 included the test tear-down?
http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/9/artifact/report/ (summing up the 'sampling errors' field:) http://hudson.qa.jboss.com/hudson/view/mod_cluster-QE/job/eap-50-mod_cluster-failover-rhel4-x86_64/9/artifact/report/failover.txt

OK, let me have a closer look... OK, I was wrong; in fact there should be errors. It seems you are testing with 3 nodes and killing one, but for some reason the others can't accept the requests that are failed over. Check that the nodes are configured correctly to support the increased load. I think the lines like: Connection refused: proxy: AJP: attempt to connect to 10.16.88.182:8009 (10.16.88.182) ... are due to maxThreads being too small in <Connector/>.

Nope, not the case; here is the config:
```xml
<!-- An AJP 1.3 Connector on port 8009 -->
<Connector protocol="AJP/1.3" port="8009" address="${cluster.route.name}"
           redirectPort="8443" maxThreads="10000"/>
```
Yes, 4 nodes: one is down for a minute, then up again, then another fails, and so on.
I have found it... I think it is a bug in mod_proxy, not in mod_cluster. There are several lines like: [Thu Feb 11 07:58:20 2010] [debug] proxy_util.c(1989): proxy: BALANCER: worker for (10.16.88.182) has been marked for retry. They come too early: at the same moment the worker failed. I have commented on that on the httpd-dev list.

Hi Jean-Frederic, did you get any reply on the dev list?

This is a candidate, but not committed to EAP 5.1.1.

Release Notes Docs Status: Added: Needs More Info

It is not fixed, as it requires further investigation and a fix in upstream httpd. Note that the fix I proposed worked in my tests, but this is a race condition that can't be fixed easily...

Release Notes Docs Status: Removed: Needs More Info Added: Not Yet Documented
Writer: Added: mistysj
Release Notes Docs Status: Removed: Not Yet Documented Added: Needs More Info

Cause: mod_proxy bug, a race condition. Consequence: a 503 could be returned instead of failing over to another node after a node is killed. Fix: none. Workaround: none.

Release Notes Docs Status: Removed: Needs More Info Added: Documented as Known Issue
Release Notes Text: Added: Due to a race condition caused by a bug in mod_proxy, a <literal>503</literal> error may be returned when a node is killed, instead of failing over to another node.

Patch for httpd-2.2.17. Attachment: Added: patch.txt

Patch incorporated in: httpd22-2.2.17-11.ep5.el4, httpd-2.2.17-11.1.ep5.el5, httpd-2.2.17-11.2.ep5.el6. The patch is also applied in Hudson builds. The fix will be in EWS 1.0.2 CR2. Please verify in the EWS 1.0.2 CR2 build.

Link: Added: This issue Cloned to JBPAPP-6451

Release Notes Docs Status: Removed: Documented as Known Issue
Writer: Removed: mistysj
Release Notes Text: Removed: Due to a race condition caused by a bug in mod_proxy, a <literal>503</literal> error may be returned when a node is killed, instead of failing over to another node.