Bug 1204055

Summary:	EJB client regression in 6.4.0.CR1
Product:	[JBoss] JBoss Enterprise Application Platform 6	Reporter:	Ladislav Thon <lthon>
Component:	Clustering, EJB	Assignee:	Tomas Hofman <thofman>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Jitka Kozana <jkudrnac>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	6.4.0	CC:	cdewolf, jmartisk, kkhan, mvinkler, myarboro, rachmato, thofman, wfink
Target Milestone:	CR2	Keywords:	Regression
Target Release:	EAP 6.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-08-19 12:43:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1206597

Description Ladislav Thon 2015-03-20 09:44:56 UTC

We are seeing a behavioral regression in 6.4.0.CR1. This only manifests in a single test case: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/

In this test, a standalone client invokes an EJB in a cluster ("forwarder") and that EJB itself invokes another EJB in a different cluster ("target"). During that test, we repeatedly bring one node down (:shutdown), wait a while, bring it back up, and continue with another node.

Specifically, in our setting, perf17 is the standalone client, perf18-19 are the "target" cluster and perf20-21 are the "forwarder" cluster.

See https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/58/artifact/report/failover.txt for a summary of the test from perf17 point of view, running against 6.4.0.ER3, and https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/59/artifact/report/failover.txt for the same summary against CR1.

You can see that with CR1, when perf19 is down, the standalone client stops receiving responses. We set a 2-minute invocation timeout on the standalone client (the "invocation.timeout" property in jboss-ejb-client.properties), and you can see in the perf17 log that this deadline is not met frequently (TimeoutException: No invocation response received in 120000 milliseconds).

During our investigation, we concluded that this is caused by the fix for bug 1192471, because when we reverted the commit (https://github.com/jbossas/jboss-ejb-client/commit/61bd6290), things went back to normal (see https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/lthon___eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/15/artifact/report/failover.txt). Note that even though we only reverted that commit in the in-server EJB client library (and the standalone client still runs vanilla 1.0.29), the problem indeed disappeared.

The commit itself looks suspicious, because it doesn't hold a lock while copying synchronized collections. Yet, we didn't see any ConcurrentModificationExceptions, and when we tried to copy those collections properly (i.e., under the respective lock), it didn't make a difference.

(What follows is a huge speculation based on my very limited understanding of the EJB client library. It can easily be wrong or misleading.)

I believe that while the commit fixes problematic behavior in one situation, it causes problematic behavior in an opposite situation. Originally, when a single invocation figured out that a cluster node is not available, even before topology change was propagated, that single invocation deleted that cluster node from the collections and the other invocations didn't bother trying that node. This doesn't happen anymore. Until a topology change is propagated, all invocations will keep trying that node, even if it is down. This will cause a reconnection process to start in the background, and all the invocations will wait for that reconnection to happen. It only happens when the node is back up, which effectively means that client failover won't happen when it should.

Comment 5 Richard Achmatowicz 2015-03-23 15:08:11 UTC

Been on PTO for the past two days. Let me have a look this afternoon.

Comment 6 Richard Achmatowicz 2015-03-23 21:19:21 UTC

Ladislav

Could you please run the test again with "debug" level logging turned on for these classes:
org.jboss.ejb.client.remoting.MaxAttemptsReconnectHandler
org.jboss.ejb.client.EJBClientContext
org.jboss.ejb.client.ClusterContext

This should give us some more information on what is happening when the EJBReceiver is being selected. To be safe, add this on both the original client as well as the forwarder.

Comment 7 Ladislav Thon 2015-03-24 08:52:38 UTC

The original runs mentioned above already have debug logging enabled for the org.jboss.ejb.client package, but only for the in-server EJB client.

Here's a run with the same debug logging enabled for the standalone client as well: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/lthon___debug___eap-6x-failover-ejb-2clusters-ejbremote-shutdown-repl-async/1/ Note that I had to run it on different machines: here, perf29 is the standalone client, perf30-31 is the target cluster and perf32-33 is the forwarder cluster (so it's just like the original + 12).

Comment 8 Richard Achmatowicz 2015-03-24 15:08:57 UTC

I can see where the problem is. When the nodes {perf30,perf31} are taken down individually, module un-availability reports are sent to the clients {perf32, perf33}. For some reason, the module un-availability reports for clusterbench-ee6, clusterbench-ee6-we, and clusterbench-ee6-web-granular arrive in together (on perf32, for example), yet the un-availability report for clusterbench-ee6-ejb (the app receiving the invocations and causing the problems) arrives 33 seconds later. This means that perf32 thinks the module is available during that time, when it is not - it has been undeployed or is in the process of being undeployed.

The perf32 client receives exceptional return results to that effect when the invocation fails and, at this stage, it should try the other node perf31. But it doesn't - it repeatedly tries perf30.

After the unavailability report arrives, things return to normal. 

Investigating further.

Comment 9 Richard Achmatowicz 2015-03-25 00:54:36 UTC

Ladislav

I could really benefit from being able to pass in my own build (with changes to the EJB client libraries) and run the test case against that build. Would it be possible to set up a copy of the SmartFrog Jenkins job which runs this test so that I can pass in my own copy of either the whole build or just a replacement EJBClient jar?

How busy is the lab these days for running such jobs? If I could get a run in every couple of hours, that would be fine. 

I'll be up early tomorrow so that I might have more overlap with Europe.

Comment 19 Kabir Khan 2015-03-27 18:26:04 UTC

Upgrade to ejb client 1.0.30 BZ1206597 backs out the fix for BZ1192471 and so this regression should be gone in CR2

Comment 22 Ladislav Thon 2015-03-31 09:31:08 UTC

Verified with EAP 6.4.0.CR2.