Bug 1018705 - [GSS](6.4.z) MODCLUSTER000022: Failed to drain n remaining pending requests
Status: CLOSED WONTFIX
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: mod_cluster
Version: 6.2.0
Hardware: sparc Solaris
Priority: medium
Severity: medium
Target Milestone: DR2
Target Release: EAP 6.4.9
Assigned To: Jean-frederic Clere
QA Contact: Michal Karm Babacek
Docs Contact: Russell Dickenson
Depends On:
Blocks: 1338645
 
Reported: 2013-10-14 04:48 EDT by Michal Karm Babacek
Modified: 2017-07-17 09:35 EDT
CC List: 13 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: ?? Consequence: Workaround (if any): Result:
Story Points: ---
Clone Of:
: 1338645
Environment:
Last Closed: 2017-06-26 14:34:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Logs from httpd and EAP. (69.07 KB, application/zip)
2013-10-16 06:46 EDT, Michal Karm Babacek
Relevant configs. (4.57 KB, application/zip)
2013-10-16 06:49 EDT, Michal Karm Babacek


External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker MODCLUSTER-383 Critical Resolved Session draining broken: requests counting broken on load-balancer 2018-05-08 22:06 EDT

Description Michal Karm Babacek 2013-10-14 04:48:22 EDT
It appears that the counter of pending requests keeps an illegal value of -1, which leads to a very long wait (e.g. the full 20 s timeout). This happens per context, so for 200 contexts, shutdown takes 200 * 20 seconds, i.e. more than an hour.

At the moment I can't say whether this issue is limited solely to Solaris, but it is definitely easiest to reproduce there. JDK 6 vs. JDK 7 doesn't seem to play any role in it. I have failed to reproduce it on RHEL so far; I will keep investigating.
Comment 1 Jean-frederic Clere 2013-10-14 04:53:29 EDT
To investigate the problem, an error_log at debug level is needed.
Comment 2 Michal Karm Babacek 2013-10-14 11:34:52 EDT
It is a race condition; I will attach a proper reproducer and logs as soon as I nail it down.
Comment 3 Michal Karm Babacek 2013-10-16 06:46:53 EDT
Created attachment 812860 [details]
Logs from httpd and EAP.

OK, I think I've got it.

 - Solaris 11 SPARC or x86.
 - Oracle JDK 7.
 - stopContextTimeout of either 10 or 20 seconds; the exact value seems arbitrary.
 - Many clients access the simplecontext webapp, which is just a static JSP.
 - Abruptly, EAP begins its shutdown process.
 - The dreaded message appears: MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 20.0 seconds
 
I have failed to reproduce it on JDK 6 on the same box with the same test; however, that doesn't prove much, because this is clearly some kind of race condition, and even though I executed the test several times, it might have been just a coincidence that JDK 6 didn't seem affected.
Comment 4 Michal Karm Babacek 2013-10-16 06:49:21 EDT
Created attachment 812861 [details]
Relevant configs.

Adding some configs; there is nothing special in them apart from stop-context-timeout="20". Please note that I managed to reproduce the issue with stop-context-timeout="10" as well.
Comment 5 Michal Karm Babacek 2013-10-24 03:46:51 EDT
@Jitka: AFAIK it's linked only to Solaris... but I can't be 100% sure about that.
On second thought, I'm inclined to think it has nothing to do with your performance observations, because the MCMP messages have already been sent by the time the AS is draining sessions/requests...
Comment 6 Michal Karm Babacek 2013-10-25 05:43:10 EDT
{code}
manager_trans STOP-APP (/)
manager_handler STOP-APP (/) processing: "JVMRoute=jboss-eap-6.2&Alias=default-host%2Clocalhost%2Cexample.com&Context=%2Fsimplecontext-666"
process_appl_cmd: STOP-APP nbrequests -1
manager_handler STOP-APP  OK
{code}

So, IMHO, we initialize it in the insert_update_context function as:
{code}
ou->nbrequests = 0;
{code}
and then it passes out of our hands, because of this comment in the insert_update function:
{code}
/* We don't update nbrequests it belongs to mod_proxy_cluster logic */
{code}
right?

I suggest we either investigate why it is being updated to -1, or just put something like:
{code}
if (ou->nbrequests < 0) {
    ou->nbrequests = 0;
}
{code}
just before this line:
{code}
ap_rprintf(r, "%.*s, Status: %s Request: %d ", (int) sizeof(ou->context), ou->context, status, ou->nbrequests);
{code}

Hmm?
Comment 7 Michal Karm Babacek 2013-10-25 05:55:15 EDT
Sorry, it's Friday today:

There are actually two places in _our_ code where we decrease the number of requests: in proxy_cluster_pre_request, once we have found the right worker for the context:
{code}
upd_context_count(context_id, -1);
{code}
and again in proxy_cluster_post_request.

Please ignore the suggestion in Comment 6; I think this is better:
{code}
--- a/native/mod_proxy_cluster/mod_proxy_cluster.c
+++ b/native/mod_proxy_cluster/mod_proxy_cluster.c
@@ -3378,7 +3378,11 @@ static void upd_context_count(const char *id, int val)
     int ident = atoi(id);
     contextinfo_t *context;
     if (context_storage->read_context(ident, &context) == APR_SUCCESS) {
-        context->nbrequests = context->nbrequests + val;
+        int newVal = context->nbrequests + val;
+        if (newVal < 0) {
+            newVal = 0;
+        }
+        context->nbrequests = newVal;
     }
 }
{code}

WDYT?
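
For illustration, here is a minimal standalone sketch of the clamping idea from the patch above. It is not the real mod_proxy_cluster code: fake_contextinfo_t and upd_count_clamped() are simplified, hypothetical stand-ins for contextinfo_t and upd_context_count(), without the context_storage lookup.
{code}
#include <stdio.h>

/* Simplified stand-in for mod_cluster's contextinfo_t; only the counter matters here. */
typedef struct {
    int nbrequests;
} fake_contextinfo_t;

/* Clamp-on-update, mirroring the patched upd_context_count(): the pending-request
 * counter never goes below zero, even if decrements outnumber increments. */
static void upd_count_clamped(fake_contextinfo_t *context, int val)
{
    int newVal = context->nbrequests + val;
    if (newVal < 0) {
        newVal = 0;
    }
    context->nbrequests = newVal;
}

int main(void)
{
    fake_contextinfo_t ctx = { 0 };

    upd_count_clamped(&ctx, 1);   /* request arrives          -> 1 */
    upd_count_clamped(&ctx, -1);  /* request finishes         -> 0 */
    upd_count_clamped(&ctx, -1);  /* spurious extra decrement -> stays 0, not -1 */

    printf("nbrequests = %d\n", ctx.nbrequests); /* prints 0 */
    return 0;
}
{code}
Note that the clamp only masks the symptom (the counter can no longer be reported as -1); why the extra decrement happens in the first place remains the open question discussed in the comments below.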
Comment 8 Michal Karm Babacek 2013-11-04 18:55:39 EST
@Jean-Frederic: If you think it's weird/wrong/a hack, say so... If it's OK, I'll patch and test...
Comment 9 Jean-frederic Clere 2013-11-05 11:06:41 EST
In proxy_cluster_pre_request() we decrement by 1 because it is a fail-over; that is correct, since we will use another worker.
In proxy_cluster_post_request() we decrement by 1 because we have finished the request.
I still don't see how the counter could end up at -1 :-(
Comment 10 Jean-frederic Clere 2013-11-05 11:12:36 EST
Note that upd_context_count() is protected by apr_thread_mutex_lock(lock), so this is not a direct concurrency problem.
Comment 11 Jean-frederic Clere 2014-02-02 03:51:10 EST
I have revisited comment #10 and adjusted the code to move the apr_thread_mutex_lock(lock) outside the loop.
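
As a rough standalone sketch of the kind of change described here (not the actual commit), the idea is that one lock/unlock pair wraps the whole loop, so the batch of counter updates becomes a single critical section instead of each call locking on its own. The nbrequests array and the upd_context_count_unlocked() helper below are invented stand-ins; the real code works against mod_proxy_cluster's context storage.
{code}
#include <stdio.h>
#include <apr_general.h>
#include <apr_pools.h>
#include <apr_thread_mutex.h>

#define NB_CONTEXTS 3

/* Hypothetical per-context counters standing in for context->nbrequests. */
static int nbrequests[NB_CONTEXTS];

/* Counter update with no locking of its own; the caller must hold the lock. */
static void upd_context_count_unlocked(int id, int val)
{
    nbrequests[id] += val;
}

int main(void)
{
    apr_pool_t *pool;
    apr_thread_mutex_t *lock;
    int i;

    apr_initialize();
    apr_pool_create(&pool, NULL);
    apr_thread_mutex_create(&lock, APR_THREAD_MUTEX_DEFAULT, pool);

    /* Before: the lock was taken and released inside each counter update, so
     * another thread could interleave between iterations of the loop.
     * After (as sketched here): the lock is held around the whole loop. */
    apr_thread_mutex_lock(lock);
    for (i = 0; i < NB_CONTEXTS; i++) {
        upd_context_count_unlocked(i, 1); /* e.g. a request routed to context i */
    }
    apr_thread_mutex_unlock(lock);

    for (i = 0; i < NB_CONTEXTS; i++) {
        printf("context %d: nbrequests = %d\n", i, nbrequests[i]);
    }

    apr_pool_destroy(pool);
    apr_terminate();
    return 0;
}
{code}
(Build against APR, e.g. gcc demo.c $(pkg-config --cflags --libs apr-1).)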
Comment 12 Jean-frederic Clere 2014-02-04 07:33:58 EST
See https://issues.jboss.org/browse/MODCLUSTER-383 too. It needs to be ported to 1.2.x.
Comment 13 Michal Karm Babacek 2014-02-05 06:04:15 EST
Thanks, I'll test it and see whether I can get -1 again...
Comment 16 Michal Karm Babacek 2015-01-20 10:15:15 EST
EAP 6.4.0.ER1, mod_cluster 1.2.10.Final

> 06:49:20,117 WARN  [org.jboss.modcluster] (ServerService Thread Pool -- 14) MODCLUSTER000022: Failed to drain -1 remaining pending requests from default-host:/simplecontext-666 within 10.0 seconds
Comment 17 Michal Karm Babacek 2015-01-27 04:51:06 EST
> Red Hat Bugzilla 2015-01-27 00:12:13 EST
> Doc Type: Known Issue → Bug Fix
> Michal Babacek 2015-01-27 04:49:27 EST
> Doc Type: Bug Fix → Known Issue
> Red Hat Bugzilla 2015-01-27 04:49:27 EST
> Doc Type: Known Issue → Bug Fix

Dear Lucas, I put it in your hands...
Comment 18 Lucas Costi 2015-01-27 17:45:41 EST
Thanks, Michal. Bugzilla is very picky about how we have to change it!
Comment 21 Jean-frederic Clere 2015-12-09 07:28:57 EST
https://github.com/modcluster/mod_cluster/commit/dbda5776efbee2854a582a29d314b33d4af1779 needs to be ported to the 1.2.x branch; it looks easy.
