It appears that the counter of pending requests keeps an illegal value of -1, thus leading to a very long wait (e.g. a timeout of 20 s). This happens per context, so for 200 contexts, shutdown takes 200*20 seconds. At the moment I can't say whether this issue is limited solely to Solaris, but it is definitely easiest to reproduce there. JDK 6 vs. JDK 7 doesn't seem to play any role in it. I've failed to reproduce it on RHEL to date; I'm going to keep investigating.
To investigate the problem, an error_log at debug level is needed.
It is a race condition; I will attach a proper reproducer and logs as soon as I nail it.
Created attachment 812860 [details]
Logs from httpd and EAP.

OK, I think I've got it:
- Solaris 11, SPARC or x86.
- Oracle JDK 7.
- Either stopContextTimeout 10 or 20 seconds; it seems arbitrary.
- Many clients access the simplecontext webapp, which is just a static JSP.
- Abruptly, EAP begins the shutdown process.
- There is the dreaded MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 20.0 seconds

I've failed to reproduce it on JDK 6 on the same box with the same test; yet, that doesn't prove much, because it is clearly some kind of race condition, and even though I executed the test several times, it might have been just a coincidence that JDK 6 didn't seem affected.
Created attachment 812861 [details]
Relevant configs.

Adding some configs; there is nothing special except for stop-context-timeout="20". Please note that I managed to reproduce the issue with stop-context-timeout="10" as well.
@Jitka: AFAIK it's linked only to Solaris... but I can't be 100% sure about that. Giving it a second thought, I'm inclined to think it has nothing to do with your performance observations, because the MCMP messages have already been sent by the time AS is draining sessions/requests...
{code}
manager_trans STOP-APP (/)
manager_handler STOP-APP (/)
processing: "JVMRoute=jboss-eap-6.2&Alias=default-host%2Clocalhost%2Cexample.com&Context=%2Fsimplecontext-666"
process_appl_cmd: STOP-APP nbrequests -1
manager_handler STOP-APP OK
{code}

So, IMHO, we initialize it in the insert_update_context function as:

{code}
ou->nbrequests = 0;
{code}

and then it passes out of our hands, because of this note in the insert_update function:

{code}
/* We don't update nbrequests it belongs to mod_proxy_cluster logic */
{code}

Right? I suggest we either investigate why it's being updated to -1, or just put something like:

{code}
if (ou->nbrequests < 0) {
    ou->nbrequests = 0;
}
{code}

just before this line:

{code}
ap_rprintf(r, "%.*s, Status: %s Request: %d ", (int) sizeof(ou->context), ou->context, status, ou->nbrequests);
{code}

Hmm?
Sorry, it's Friday today: there are actually two places in _our_ code where we decrease the number of requests: in proxy_cluster_pre_request, when we have found the right worker for the context:

{code}
upd_context_count(context_id, -1);
{code}

and in proxy_cluster_post_request again. Please ignore the suggestion in [Comment 6]; I think this is better:

{code}
--- a/native/mod_proxy_cluster/mod_proxy_cluster.c
+++ b/native/mod_proxy_cluster/mod_proxy_cluster.c
@@ -3378,7 +3378,11 @@ static void upd_context_count(const char *id, int val)
     int ident = atoi(id);
     contextinfo_t *context;
     if (context_storage->read_context(ident, &context) == APR_SUCCESS) {
-        context->nbrequests = context->nbrequests + val;
+        int newVal = context->nbrequests + val;
+        if (newVal < 0) {
+            newVal = 0;
+        }
+        context->nbrequests = newVal;
     }
 }
{code}

WDYT?
@Jean-Frederic: If you think it's weird/wrong/hack, say so... If it's OK, I'll patch and test...
In proxy_cluster_pre_request() we do -1 because it is a fail-over; that is correct, since we will use another worker. In proxy_cluster_post_request() we do -1 because we have finished the request. I still don't see how it could be -1 :-(
Note that upd_context_count() is protected by an apr_thread_mutex_lock(lock), so that is not a direct concurrency problem.
I have revisited comment #10 and adjusted the code to move the apr_thread_mutex_lock(lock) outside the loop.
See https://issues.jboss.org/browse/MODCLUSTER-383 too. It needs to be ported to 1.2.x.
THX, I'll test it and see whether I can get -1 again...
EAP 6.4.0.ER1, mod_cluster 1.2.10.Final

> 06:49:20,117 WARN [org.jboss.modcluster] (ServerService Thread Pool -- 14) MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 10.0 seconds
> Red Hat Bugzilla 2015-01-27 00:12:13 EST
> Doc Type: Known Issue → Bug Fix
> Michal Babacek 2015-01-27 04:49:27 EST
> Doc Type: Bug Fix → Known Issue
> Red Hat Bugzilla 2015-01-27 04:49:27 EST
> Doc Type: Known Issue → Bug Fix

Dear Lucas, I put it in your hands...
Thanks, Michal. Bugzilla is very picky about how we have to change it!
https://github.com/modcluster/mod_cluster/commit/dbda5776efbee2854a582a29d314b33d4af1779 needs to be ported to the 1.2.x branch; looks easy.