| Summary: | [GSS] (6.4.z) MODCLUSTER000022: Failed to drain n remaining pending requests | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Michal Karm Babacek <mbabacek> |
| Component: | mod_cluster | Assignee: | Jean-frederic Clere <jclere> |
| Status: | CLOSED WONTFIX | QA Contact: | Michal Karm Babacek <mbabacek> |
| Severity: | medium | Docs Contact: | Russell Dickenson <rdickens> |
| Priority: | medium | | |
| Version: | 6.2.0 | CC: | bmaxwell, bperkins, cdewolf, dmasirka, jclere, jkudrnac, jpallich, mbabacek, msochure, rhusar, rjanik, rpelisse, vtunka |
| Target Milestone: | DR2 | | |
| Target Release: | EAP 6.4.9 | | |
| Hardware: | sparc | | |
| OS: | Solaris | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Cause: ?? / Consequence: / Workaround (if any): / Result: | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1338645 (view as bug list) | Environment: | |
| Last Closed: | 2017-06-26 18:34:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1338645 | | |
| Attachments: | | | |
Description (Michal Karm Babacek, 2013-10-14 08:48:22 UTC)
To investigate the problem, an error_log at debug level is needed. It is a race condition; I will attach a proper reproducer and logs as soon as I nail it.

Created attachment 812860 [details]: Logs from httpd and EAP.

OK, I think I've got it.
- Solaris 11 SPARC or x86.
- Oracle JDK 7.
- stopContextTimeout of either 10 or 20 seconds; the exact value seems arbitrary.
- Many clients access the simplecontext webapp, which is just a static JSP.
- Abruptly, EAP begins its shutdown process.
- The dreaded warning appears: MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 20.0 seconds

I've failed to reproduce it with JDK 6 on the same box with the same test. That doesn't prove much, though, because it is clearly some kind of race condition; even though I executed the test several times, it might have been just a coincidence that JDK 6 did not seem affected.
Created attachment 812861 [details]: Relevant configs.

Adding some configs; there is nothing special in them except for stop-context-timeout="20". Please note that I managed to reproduce the issue with stop-context-timeout="10" as well.
@Jitka: AFAIK it's linked only to Solaris... but I can't be 100% sure about that. Giving it a second thought, I'm inclined to think it has nothing to do with your performance observations, because the MCMP messages have already been sent by the time the AS is draining sessions/requests:
{code}
manager_trans STOP-APP (/)
manager_handler STOP-APP (/) processing: "JVMRoute=jboss-eap-6.2&Alias=default-host%2Clocalhost%2Cexample.com&Context=%2Fsimplecontext-666"
process_appl_cmd: STOP-APP nbrequests -1
manager_handler STOP-APP OK
{code}
So, IMHO, we initialize it as:
{code}
ou->nbrequests = 0;
{code}
in the insert_update_context function, and then it passes out of our hands, because of:
{code}
/* We don't update nbrequests it belongs to mod_proxy_cluster logic */
{code}
in the insert_update function, right?
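For clarity, here is a minimal, self-contained sketch of that ownership split. The names (context_sketch_t, insert_update_context_sketch, upd_context_count_sketch) are hypothetical stand-ins, not the actual mod_manager/mod_proxy_cluster sources: mod_manager zeroes the counter once when the context is inserted, and only mod_proxy_cluster moves it afterwards, so a single extra decrement on that side is enough to push it to -1:
{code}
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for contextinfo_t; field names mirror the bug report. */
typedef struct {
    char context[80];
    int  nbrequests;   /* pending requests, the number reported by MODCLUSTER000022 */
} context_sketch_t;

/* "mod_manager" side: initialize the counter once, then hand it off. */
static void insert_update_context_sketch(context_sketch_t *ou, const char *path)
{
    strncpy(ou->context, path, sizeof(ou->context) - 1);
    ou->context[sizeof(ou->context) - 1] = '\0';
    ou->nbrequests = 0;   /* initialized here; further updates belong to mod_proxy_cluster */
}

/* "mod_proxy_cluster" side: the only place the counter moves (+1 / -1). */
static void upd_context_count_sketch(context_sketch_t *ou, int val)
{
    ou->nbrequests += val;
}

int main(void)
{
    context_sketch_t ou;
    insert_update_context_sketch(&ou, "/simplecontext-666");
    upd_context_count_sketch(&ou, +1);   /* request routed to the context */
    upd_context_count_sketch(&ou, -1);   /* request finished */
    upd_context_count_sketch(&ou, -1);   /* one unmatched decrement drives it to -1 */
    printf("%s: %d pending requests\n", ou.context, ou.nbrequests);
    return 0;
}
{code}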
I suggest we either investigate why it's being updated to -1, or just put something like:
{code}
if (ou->nbrequests < 0) {
    ou->nbrequests = 0;
}
{code}
just before this line:
{code}
ap_rprintf(r, "%.*s, Status: %s Request: %d ", (int) sizeof(ou->context), ou->context, status, ou->nbrequests);
{code}
Hmm?
Sorry, it's Friday today: there are actually two places in _our_ code where we decrease the number of requests: in proxy_cluster_pre_request, when we have found the right worker for the context (upd_context_count(context_id, -1);), and in proxy_cluster_post_request again. Please ignore the suggestion in [Comment 6]; I think this is better:
{code}
--- a/native/mod_proxy_cluster/mod_proxy_cluster.c
+++ b/native/mod_proxy_cluster/mod_proxy_cluster.c
@@ -3378,7 +3378,11 @@ static void upd_context_count(const char *id, int val)
     int ident = atoi(id);
     contextinfo_t *context;
     if (context_storage->read_context(ident, &context) == APR_SUCCESS) {
-        context->nbrequests = context->nbrequests + val;
+        int newVal = context->nbrequests + val;
+        if (newVal < 0) {
+            newVal = 0;
+        }
+        context->nbrequests = newVal;
     }
 }
{code}
WDYT? @Jean-Frederic: If you think it's weird/wrong/a hack, say so... If it's OK, I'll patch and test...

In proxy_cluster_pre_request() we do the -1 because it is a fail-over; that is correct, we will use another worker. In proxy_cluster_post_request() we do the -1 because we have finished the request. I still don't see how it could end up at -1 :-( Note that upd_context_count() is protected by an apr_thread_mutex_lock(lock), so it is not a direct concurrency problem.

I have revisited comment #10 and adjusted the code to move the apr_thread_mutex_lock(lock) outside the loop. See https://issues.jboss.org/browse/MODCLUSTER-383 too. It needs to be ported to 1.2.x.

THX, I'll test it and see whether I can get -1 again... EAP 6.4.0.ER1, mod_cluster 1.2.10.Final:
> 06:49:20,117 WARN [org.jboss.modcluster] (ServerService Thread Pool -- 14) MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 10.0 seconds
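For illustration, here is a rough, self-contained sketch of what "moving the apr_thread_mutex_lock(lock) outside the loop" means; the types, loop, and function names are hypothetical (not the actual mod_proxy_cluster code), and the clamp mirrors the diff above. The whole pass over the contexts runs under a single lock/unlock pair, so no other thread can interleave an update between iterations:
{code}
#include <stdio.h>
#include <apr_general.h>
#include <apr_pools.h>
#include <apr_thread_mutex.h>

/* Hypothetical stand-in for contextinfo_t. */
typedef struct {
    int nbrequests;
} context_sketch_t;

static apr_thread_mutex_t *lock;

/* Apply the same delta to every context in one critical section. */
static void upd_all_context_counts(context_sketch_t **contexts, int count, int val)
{
    int i;
    apr_thread_mutex_lock(lock);            /* taken once, outside the loop */
    for (i = 0; i < count; i++) {
        int newval = contexts[i]->nbrequests + val;
        if (newval < 0) {
            newval = 0;                     /* clamp, as in the diff above */
        }
        contexts[i]->nbrequests = newval;
    }
    apr_thread_mutex_unlock(lock);          /* released after the whole pass */
}

int main(void)
{
    apr_pool_t *pool;
    context_sketch_t a = { 0 }, b = { 1 };
    context_sketch_t *contexts[] = { &a, &b };

    apr_initialize();
    apr_pool_create(&pool, NULL);
    apr_thread_mutex_create(&lock, APR_THREAD_MUTEX_DEFAULT, pool);

    upd_all_context_counts(contexts, 2, -1);   /* 'a' would go to -1 without the clamp */
    printf("a=%d b=%d\n", a.nbrequests, b.nbrequests);

    apr_terminate();
    return 0;
}
{code}
Compiling against APR (e.g. with the flags reported by apr-1-config) should be enough to try it.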
> Red Hat Bugzilla, 2015-01-27 00:12:13 EST: Doc Type: Known Issue → Bug Fix
> Michal Babacek, 2015-01-27 04:49:27 EST: Doc Type: Bug Fix → Known Issue
> Red Hat Bugzilla, 2015-01-27 04:49:27 EST: Doc Type: Known Issue → Bug Fix
Dear Lucas, I put it in your hands...

Thanks, Michal. Bugzilla is very picky about how we have to change it! https://github.com/modcluster/mod_cluster/commit/dbda5776efbee2854a582a29d314b33d4af1779 needs to be ported to the 1.2.x branch; it looks easy.