Bug 1018705 - [GSS](6.4.z) MODCLUSTER000022: Failed to drain n remaining pending requests
Status: CLOSED WONTFIX
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: mod_cluster
Version: 6.2.0
Hardware: sparc Solaris
Priority: medium
Severity: medium
Target Milestone: DR2
Target Release: EAP 6.4.9
Assigned To: Jean-frederic Clere
QA Contact: Michal Karm Babacek
Docs Contact: Russell Dickenson
Depends On:
Blocks: 1338645
 
Reported: 2013-10-14 04:48 EDT by Michal Karm Babacek
Modified: 2017-07-17 09:35 EDT
CC List: 13 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: ?? Consequence: Workaround (if any): Result:
Story Points: ---
Clone Of:
: 1338645
Environment:
Last Closed: 2017-06-26 14:34:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Logs from httpd and EAP. (69.07 KB, application/zip)
2013-10-16 06:46 EDT, Michal Karm Babacek
Relevant configs. (4.57 KB, application/zip)
2013-10-16 06:49 EDT, Michal Karm Babacek


External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker MODCLUSTER-383 Critical Resolved Session draining broken: requests counting broken on load-balancer 2018-05-08 22:06 EDT

Description Michal Karm Babacek 2013-10-14 04:48:22 EDT
It appears that the counter of pending requests keeps an illegal value of -1, which leads to a very long wait (e.g. the full 20 s timeout). This happens per context, so for 200 contexts, shutdown takes 200 * 20 seconds, i.e. more than an hour.

At the moment I can't say whether this issue is limited solely to Solaris, but it is definitely easiest to reproduce there. JDK 6 vs. JDK 7 doesn't seem to play any role in it. I have failed to reproduce it on RHEL so far; I will keep investigating.
Comment 1 Jean-frederic Clere 2013-10-14 04:53:29 EDT
To investigate the problem, an error_log at debug level is needed.
Comment 2 Michal Karm Babacek 2013-10-14 11:34:52 EDT
It is a race condition; I will attach a proper reproducer and logs as soon as I nail it down.
Comment 3 Michal Karm Babacek 2013-10-16 06:46:53 EDT
Created attachment 812860 [details]
Logs from httpd and EAP.

OK, I think I've got it.

 - Solaris 11 SPARC or x86.
 - Oracle JDK 7.
 - stopContextTimeout of either 10 or 20 seconds; the exact value seems arbitrary.
 - Many clients access the simplecontext webapp, which is just a static JSP.
 - Abruptly, EAP begins its shutdown process.
 - The dreaded message appears: MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 20.0 seconds
 
I have failed to reproduce it on JDK 6 on the same box with the same test; however, that doesn't prove much, because this is clearly some kind of race condition, and even though I executed the test several times, it might have been just a coincidence that JDK 6 didn't seem affected.
Comment 4 Michal Karm Babacek 2013-10-16 06:49:21 EDT
Created attachment 812861 [details]
Relevant configs.

Adding some configs; there is nothing special in them apart from stop-context-timeout="20". Please note that I managed to reproduce the issue with stop-context-timeout="10" as well.
Comment 5 Michal Karm Babacek 2013-10-24 03:46:51 EDT
@Jitka: AFAIK it's linked only to Solaris... but I can't be 100% sure about that.
On second thought, I'm inclined to think it has nothing to do with your performance observations, because the MCMP messages have already been sent by the time the AS is draining sessions/requests...
Comment 6 Michal Karm Babacek 2013-10-25 05:43:10 EDT
{code}
manager_trans STOP-APP (/)
manager_handler STOP-APP (/) processing: "JVMRoute=jboss-eap-6.2&Alias=default-host%2Clocalhost%2Cexample.com&Context=%2Fsimplecontext-666"
process_appl_cmd: STOP-APP nbrequests -1
manager_handler STOP-APP  OK
{code}

So, IMHO, we initialize it in the insert_update_context function as:
{code}
ou->nbrequests = 0;
{code}
and then it passes out of our hands, because of this comment in the insert_update function:
{code}
/* We don't update nbrequests it belongs to mod_proxy_cluster logic */
{code}
right?

I suggest we either investigate why it is being updated to -1, or just put something like:
{code}
if (ou->nbrequests < 0) {
    ou->nbrequests = 0;
}
{code}
just before this line:
{code}
ap_rprintf(r, "%.*s, Status: %s Request: %d ", (int) sizeof(ou->context), ou->context, status, ou->nbrequests);
{code}

Hmm?
Comment 7 Michal Karm Babacek 2013-10-25 05:55:15 EDT
Sorry, it's Friday today:

There are actually two places in _our_ code where we decrease the number of requests: in proxy_cluster_pre_request, once we have found the right worker for the context:
{code}
upd_context_count(context_id, -1);
{code}
and again in proxy_cluster_post_request.

Please ignore the suggestion in Comment 6; I think this is better:
{code}
--- a/native/mod_proxy_cluster/mod_proxy_cluster.c
+++ b/native/mod_proxy_cluster/mod_proxy_cluster.c
@@ -3378,7 +3378,11 @@ static void upd_context_count(const char *id, int val)
     int ident = atoi(id);
     contextinfo_t *context;
     if (context_storage->read_context(ident, &context) == APR_SUCCESS) {
-        context->nbrequests = context->nbrequests + val;
+        int newVal = context->nbrequests + val;
+        if (newVal < 0) {
+            newVal = 0;
+        }
+        context->nbrequests = newVal;
     }
 }
{code}

WDYT?
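
For illustration, here is a minimal standalone sketch of the clamping idea from the patch above. It is not the real mod_proxy_cluster code: fake_contextinfo_t and upd_count_clamped() are simplified, hypothetical stand-ins for contextinfo_t and upd_context_count(), without the context_storage lookup.
{code}
#include <stdio.h>

/* Simplified stand-in for mod_cluster's contextinfo_t; only the counter matters here. */
typedef struct {
    int nbrequests;
} fake_contextinfo_t;

/* Clamp-on-update, mirroring the patched upd_context_count(): the pending-request
 * counter never goes below zero, even if decrements outnumber increments. */
static void upd_count_clamped(fake_contextinfo_t *context, int val)
{
    int newVal = context->nbrequests + val;
    if (newVal < 0) {
        newVal = 0;
    }
    context->nbrequests = newVal;
}

int main(void)
{
    fake_contextinfo_t ctx = { 0 };

    upd_count_clamped(&ctx, 1);   /* request arrives          -> 1 */
    upd_count_clamped(&ctx, -1);  /* request finishes         -> 0 */
    upd_count_clamped(&ctx, -1);  /* spurious extra decrement -> stays 0, not -1 */

    printf("nbrequests = %d\n", ctx.nbrequests); /* prints 0 */
    return 0;
}
{code}
Note that the clamp only masks the symptom (the counter can no longer be reported as -1); why the extra decrement happens in the first place remains the open question discussed in the comments below.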
Comment 8 Michal Karm Babacek 2013-11-04 18:55:39 EST
@Jean-Frederic: If you think it's weird/wrong/a hack, say so... If it's OK, I'll patch and test...
Comment 9 Jean-frederic Clere 2013-11-05 11:06:41 EST
In proxy_cluster_pre_request() we decrement by 1 because it is a fail-over; that is correct, since we will use another worker.
In proxy_cluster_post_request() we decrement by 1 because we have finished the request.
I still don't see how the counter could end up at -1 :-(
Comment 10 Jean-frederic Clere 2013-11-05 11:12:36 EST
Note that upd_context_count() is protected by apr_thread_mutex_lock(lock), so this is not a direct concurrency problem.
Comment 11 Jean-frederic Clere 2014-02-02 03:51:10 EST
I have revisited comment #10 and adjusted the code to move the apr_thread_mutex_lock(lock) outside the loop.
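
As a rough standalone sketch of the kind of change described here (not the actual commit), the idea is that one lock/unlock pair wraps the whole loop, so the batch of counter updates becomes a single critical section instead of each call locking on its own. The nbrequests array and the upd_context_count_unlocked() helper below are invented stand-ins; the real code works against mod_proxy_cluster's context storage.
{code}
#include <stdio.h>
#include <apr_general.h>
#include <apr_pools.h>
#include <apr_thread_mutex.h>

#define NB_CONTEXTS 3

/* Hypothetical per-context counters standing in for context->nbrequests. */
static int nbrequests[NB_CONTEXTS];

/* Counter update with no locking of its own; the caller must hold the lock. */
static void upd_context_count_unlocked(int id, int val)
{
    nbrequests[id] += val;
}

int main(void)
{
    apr_pool_t *pool;
    apr_thread_mutex_t *lock;
    int i;

    apr_initialize();
    apr_pool_create(&pool, NULL);
    apr_thread_mutex_create(&lock, APR_THREAD_MUTEX_DEFAULT, pool);

    /* Before: the lock was taken and released inside each counter update, so
     * another thread could interleave between iterations of the loop.
     * After (as sketched here): the lock is held around the whole loop. */
    apr_thread_mutex_lock(lock);
    for (i = 0; i < NB_CONTEXTS; i++) {
        upd_context_count_unlocked(i, 1); /* e.g. a request routed to context i */
    }
    apr_thread_mutex_unlock(lock);

    for (i = 0; i < NB_CONTEXTS; i++) {
        printf("context %d: nbrequests = %d\n", i, nbrequests[i]);
    }

    apr_pool_destroy(pool);
    apr_terminate();
    return 0;
}
{code}
(Build against APR, e.g. gcc demo.c $(pkg-config --cflags --libs apr-1).)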
Comment 12 Jean-frederic Clere 2014-02-04 07:33:58 EST
See https://issues.jboss.org/browse/MODCLUSTER-383 too. It needs to be ported to 1.2.x.
Comment 13 Michal Karm Babacek 2014-02-05 06:04:15 EST
Thanks, I'll test it and see whether I can get -1 again...
Comment 16 Michal Karm Babacek 2015-01-20 10:15:15 EST
EAP 6.4.0.ER1, mod_cluster 1.2.10.Final

> 06:49:20,117 WARN  [org.jboss.modcluster] (ServerService Thread Pool -- 14) MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 10.0 seconds
Comment 17 Michal Karm Babacek 2015-01-27 04:51:06 EST
> Red Hat Bugzilla 2015-01-27 00:12:13 EST
> Doc Type: Known Issue → Bug Fix
> Michal Babacek 2015-01-27 04:49:27 EST
> Doc Type: Bug Fix → Known Issue
> Red Hat Bugzilla 2015-01-27 04:49:27 EST
> Doc Type: Known Issue → Bug Fix

Dear Lucas, I put it in your hands...
Comment 18 Lucas Costi 2015-01-27 17:45:41 EST
Thanks, Michal. Bugzilla is very picky about how we have to change it!
Comment 21 Jean-frederic Clere 2015-12-09 07:28:57 EST
https://github.com/modcluster/mod_cluster/commit/dbda5776efbee2854a582a29d314b33d4af1779 needs to be ported to the 1.2.x branch; it looks easy.
