1280507 – [GSS](6.4.z) Endpoint can be closed before doConnect tasks finished causing AbstractHandleableCloseable.close to wait forever

Bug 1280507 - [GSS](6.4.z) Endpoint can be closed before doConnect tasks finished causing AbstractHandleableCloseable.close to wait forever

Summary: [GSS](6.4.z) Endpoint can be closed before doConnect tasks finished causing A...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	Remoting
Sub Component:
Version:	6.4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	CR1
Target Release:	EAP 6.4.8
Assignee:	Enrique Gonzalez Martinez
QA Contact:	Peter Mackay
Docs Contact:
URL:
Whiteboard:
Depends On:	1278606
Blocks:	eap648-payload 1289243 1297478 1316385 1329530
TreeView+	depends on / blocked

Reported:	2015-11-11 23:20 UTC by Brad Maxwell
Modified:	2019-10-10 10:29 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Clone Of:
Clones:	1297478 1316385 (view as bug list)
Environment:
Last Closed:	2017-01-17 12:33:03 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)
Test project using EJB client (29.05 KB, application/zip) 2015-12-03 09:00 UTC, Osamu Nagano	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	JBEAP-2017	Critical	Verified	[GSS](7.0.z) Endpoint can be closed before doConnect tasks finished causing AbstractHandleableCloseable.close to wait fo...	2020-12-04 06:06:21 UTC
Red Hat Issue Tracker	REM3-216	Major	Resolved	Endpoint can be closed before doConnect tasks finished causing AbstractHandleableCloseable.close to wait forever	2020-12-04 06:06:21 UTC
Red Hat Issue Tracker	REM3-221	Major	Resolved	Timeout errors in JCA bits in testsuite after JBEAP-2017	2020-12-04 06:06:47 UTC
Red Hat Knowledge Base (Solution)	2136871	None	None	None	2016-01-22 22:46:39 UTC

Description Brad Maxwell 2015-11-11 23:20:52 UTC

Endpoint can be closed before doConnect tasks finished causing AbstractHandleableCloseable.close to wait forever.

Seen when application is making remote naming calls to a remote server.  

Such as this:

        Properties props = new Properties();
        props.put("java.naming.factory.initial", "org.jboss.naming.remote.client.InitialContextFactory");
        ...
        Context remoteContext = null; 
        Object object = null;
        try {
           remoteContext = new InitialContext(props);
           object = remoteContext.lookup(lookup);
        } finally {
           remoteContext.close();
        }
        ...
           
With cpu starvation under load, the timing of the connection start up tasks can be delayed.  so JBoss received a connection before the endpoint close, but the full connection start (handleDone) didn't occur until after the endpoint close was initiated. This gets us into a state where the connection is able to come in, endpoint closes, then the connection finishes adding itself to the closed endpoint's connections map and will never be closed.

EndpointImpl.doConnect 
 - connHandlerFuture.getIoFuture().addNotifier(...
   - public void handleDone

EndpointImpl.closeAction()

https://github.com/jboss-remoting/jboss-remoting/blob/3.3.6.Final/src/main/java/org/jboss/remoting3/EndpointImpl.java#L277


"Remoting "master:client-one:MANAGEMENT" task-3" prio=10 tid=0x00007f4a100fc800 nid=0x44b3 in Object.wait() [0x00007f4a68999000]
   java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method) 
  - waiting on <0x00000000fe87aee0> (a java.lang.Object)
  at java.lang.Object.wait(Object.java:485)
  at org.jboss.remoting3.spi.AbstractHandleableCloseable.close(AbstractHandleableCloseable.java:190)
  - locked <0x00000000fe87aee0> (a java.lang.Object)
  at org.xnio.IoUtils.safeClose(IoUtils.java:137)
  at org.jboss.remoting3.EndpointImpl$5.handleDone(EndpointImpl.java:317)
  at org.jboss.remoting3.EndpointImpl$5.handleDone(EndpointImpl.java:293)
  at org.xnio.IoFuture$HandlingNotifier.notify(IoFuture.java:212)
  at org.xnio.AbstractIoFuture$NotifierRunnable.run(AbstractIoFuture.java:729)
  at org.xnio.IoUtils$2.execute(IoUtils.java:72)
  at org.xnio.AbstractIoFuture.runNotifier(AbstractIoFuture.java:702)
  at org.xnio.AbstractIoFuture$NotifierState.doNotify(AbstractIoFuture.java:275)
  at org.xnio.AbstractIoFuture$NotifierState.notifyDone(AbstractIoFuture.java:256)
  at org.xnio.AbstractIoFuture.setResult(AbstractIoFuture.java:628)
  at org.xnio.FutureResult.setResult(FutureResult.java:83)
  at org.xnio.IoUtils$ResultNotifier.handleDone(IoUtils.java:684)
  at org.xnio.IoUtils$ResultNotifier.handleDone(IoUtils.java:673)
  at org.xnio.IoFuture$HandlingNotifier.notify(IoFuture.java:212)
  at org.xnio.AbstractIoFuture$NotifierRunnable.run(AbstractIoFuture.java:729)
  at org.xnio.IoUtils$2.execute(IoUtils.java:72)
  at org.xnio.AbstractIoFuture.runNotifier(AbstractIoFuture.java:702)
  at org.xnio.AbstractIoFuture$NotifierState.doNotify(AbstractIoFuture.java:275)
  at org.xnio.AbstractIoFuture$NotifierState.notifyDone(AbstractIoFuture.java:256)
  at org.xnio.AbstractIoFuture$NotifierState.notifyDone(AbstractIoFuture.java:257)
  at org.xnio.AbstractIoFuture$CancellableState.notifyDone(AbstractIoFuture.java:334)
  at org.xnio.AbstractIoFuture.setResult(AbstractIoFuture.java:628)
  at org.xnio.FutureResult.setResult(FutureResult.java:83)
  at org.jboss.remoting3.remote.ClientConnectionOpenListener$Authentication$2.run(ClientConnectionOpenListener.java:763)
  at org.jboss.remoting3.EndpointImpl$TrackingExecutor$1.run(EndpointImpl.java:731)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
  at java.lang.Thread.run(Thread.java:662)

Comment 9 Enrique Gonzalez Martinez 2015-11-24 11:48:15 UTC

Hi Aaron:

That stacktrace is different from the one in the description of this BZ (it is the one in BZ#1278606). 

This line:

https://github.com/bmaxwell/jboss-remoting/blob/dfe063d954038f780026f58e540dfd64d614ae53/src/main/java/org/jboss/remoting3/EndpointImpl.java#L317

is causing the worker thread to get stuck in there because it tries to close the connection in a sync way. In spite the other end sends back the close signal it is never processed by the client side (that's the reason why removing the line makes remoting to process properly the sequence of closing the connection).

From my point of view we should go back to the first bug and try to see the problem in there (naming component).

Comment 11 Enrique Gonzalez Martinez 2015-11-26 12:32:38 UTC

PR upstream https://github.com/jboss-remoting/jboss-remoting/pull/56
PR 6.4.z: https://github.com/jboss-remoting/jboss-remoting/pull/57

Comment 15 Osamu Nagano 2015-12-03 09:00:57 UTC

Created attachment 1101716 [details]
Test project using EJB client

An EJB and its client is included.  Remoting library is updated to 3.3.6.Final (EAP 6.4.5 equivalent).

Thread dumps and log file are included.  There are 3 snapshots in breakpoints.jstack: at first break point, at second break point, and when the client hangs at close().

breakpoints.jstack
breakpoints.log

Comment 22 Enrique Gonzalez Martinez 2016-01-11 13:16:01 UTC

merged: 

https://github.com/jboss-remoting/jboss-remoting/commit/083c125bdcadd4b65da77ddafa1e71e5aa3dc8e0

Comment 24 JBoss JIRA Server 2016-01-27 14:39:04 UTC

Enrique González Martínez <elguardian> updated the status of jira JBEAP-2017 to Resolved

Comment 25 JBoss JIRA Server 2016-02-17 15:00:04 UTC

Jason Greene <jason.greene> updated the status of jira JBEAP-2017 to Reopened

Comment 27 Enrique Gonzalez Martinez 2016-02-19 11:15:05 UTC

the fix uncovered another issue in upstream; that's the reason why it was reopened. 

https://issues.jboss.org/browse/REM3-216

Another fix was pushed in 3.3.x branch. 
see https://bugzilla.redhat.com/show_bug.cgi?id=1289243#c1

That commmit is related to REM3-216, but for some reason the integration test still hangs sometimes (I wasn't able to reproduce the problem so far)

This let 7.0.0 and 6.4.z out of sync as the issue in upstream was moved to 7.0.z

PS: The fix is still in the branch so current status of this BZ is not matching the PR state. I don't think we have a procedure for these cases.

Comment 28 Enrique Gonzalez Martinez 2016-02-19 11:26:17 UTC

I missed the right link. The uncovered bug is this one:

https://issues.jboss.org/browse/REM3-221

Comment 35 JBoss JIRA Server 2016-03-15 12:05:45 UTC

Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-2017 to Resolved

Comment 36 JBoss JIRA Server 2016-03-30 08:12:37 UTC

Ladislav Thon <lthon> updated the status of jira JBEAP-2017 to Reopened

Comment 38 Peter Mackay 2016-05-05 12:06:02 UTC

Verified with EAP 6.4.8.CP.CR2

Comment 39 JBoss JIRA Server 2016-05-13 13:29:31 UTC

Enrique González Martínez <elguardian> updated the status of jira JBEAP-2017 to Resolved

Comment 40 JBoss JIRA Server 2016-09-19 08:50:18 UTC

Enrique González Martínez <elguardian> updated the status of jira JBEAP-2017 to Resolved

Comment 41 Petr Penicka 2017-01-17 12:33:03 UTC

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Note You need to log in before you can comment on or make changes to this bug.