888429 – JGroups: SuccessfulResponse not serializable exception on shutdown

Bug 888429 - JGroups: SuccessfulResponse not serializable exception on shutdown

Summary: JGroups: SuccessfulResponse not serializable exception on shutdown

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	6.2.0
Assignee:	Tristan Tarrant
QA Contact:	Nobody
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-12-18 17:08 UTC by Michal Linhard
Modified:	2025-02-10 03:27 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2025-02-10 03:27:11 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-2751	0	Minor	Open	NPE after JBossMarshaller stops	2014-07-30 13:42:42 UTC

Internal Links: 917599

Description Michal Linhard 2012-12-18 17:08:27 UTC

http://www.qa.jboss.com/~mlinhard/hyperion3/run0015/report/loganalysis/server/

this actually happens in a catch block for Throwable, so it might be some different exception

Comment 1 Michal Linhard 2012-12-18 17:09:53 UTC

testcase:
- start 8 nodes
- wait for cluster
- load some entries to hotrod test caches
- stop all nodes gracefully
- wait for all java processes to die

Comment 2 Michal Linhard 2012-12-18 18:15:24 UTC

And it is different exception. It's something about externalizer shutting down.

About incorrect log message in JGroups I filed https://issues.jboss.org/browse/JGRP-1556

Comment 3 Michal Linhard 2013-01-23 16:46:20 UTC

Resolution of JGRP-1556 only helped to show that the root cause behind this a NPE: http://www.qa.jboss.com/~mlinhard/hyperion3/run0038-startup-ER9/report/loganalysis/server/categories/cat8_entry1.txt

the actual NPE cause is ISPN-2751

so this is still present in JDG 6.1.0.ER9, however impact of this bug is near zero.

Comment 4 JBoss JIRA Server 2013-01-28 16:43:18 UTC

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2751

Michal, do you have the full stacktrace of the NPE?

Comment 5 JBoss JIRA Server 2013-01-28 17:34:31 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2751

Because this exception is reported by JGroups, it's output only as one line ERROR message (see the link).
I managed to catch it via exception breakpoint and see it directly in Eclipse, but I can't find my notes, where I pasted the stacktrace and it's quite a long process to recreate it.
The JGroups code triggering it is this one:
https://github.com/belaban/JGroups/blob/master/src/org/jgroups/blocks/RequestCorrelator.java#L514

Comment 6 JBoss JIRA Server 2013-01-29 10:30:57 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-2751

Interesting, we were just looking at the ISPN-2572 stack trace yesterday, which shows that all outstanding requests are "closed" by JGroups (by returning an exception) before actually shutting down the channel. The ExternalizerTable component only shuts down after the JGroups channel, so it shouldn't be possible to have active requests after the ExternalizerTable has stopped.

Do you have TRACE logs?

Comment 7 JBoss JIRA Server 2013-01-29 10:31:38 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-2751

Interesting, we were just looking at the ISPN-2752 stack trace yesterday, which shows that all outstanding requests are "closed" by JGroups (by returning an exception) before actually shutting down the channel. The ExternalizerTable component only shuts down after the JGroups channel, so it shouldn't be possible to have active requests after the ExternalizerTable has stopped.

Do you have TRACE logs?

Comment 8 JBoss JIRA Server 2013-01-29 11:34:48 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2751

serverlogs.zip trace logs from run of 6 nodes on my laptop. the NPE occured 3 times:

{code}
node05.log	872902	12:15:03,782 ERROR [org.jgroups.blocks.RequestCorrelator] failed marshalling rsp (SuccessfulResponse{responseValue={}} ): java.lang.NullPointerException
node06.log	892500	12:15:03,865 ERROR [org.jgroups.blocks.RequestCorrelator] failed marshalling rsp (SuccessfulResponse{responseValue={}} ): java.lang.NullPointerException
node03.log	1080114	12:15:04,639 ERROR [org.jgroups.blocks.RequestCorrelator] failed marshalling rsp (SuccessfulResponse{responseValue={}} ): java.lang.NullPointerException
{code}

there were also some other issues during the run, but hopefully this run illustrates what you need.

Comment 9 JBoss JIRA Server 2013-01-29 14:30:11 UTC

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2751

In the past we had issues where the RPC dispatcher was stopping *after* the cache manager had stopped. This lead to NPEs in ExternalizerTable (ISPN-2025) as a result of trying to deserialize stuff after the cache managed had stopped. At the time we added a workaround to handle null commands, which effectively are the result of externalizer table not being able to deserialize objects (normally due to the sequence of events mentioned) and we reverted it (ISPN-2063) once AS7-3180 had been implemented.

The problem now does not appear to be that we cannot deserialize something, but rather than while sending a response to *something*, the cache manager was stopped leading to a null externalizer table. The circumstances look very similar to ISPN-2752 actually and the result seems to be circumstantial:

* In ISPN-2752 we see that the cache manager has started stopping but the externalizer table is still there, because the response is being marshalled.
* In this JIRA, we see that the cache manager has fully stopped, the externalizer table is not there any more, but the JGroups transport (or RPC dispatcher) is still actively trying to send a reply to a command.

Some TRACE logs, including not only org.jgroups and org.infinispan, but also org.jboss.as.clustering would be very handy to try to better understand how to fix this.

Comment 10 JBoss JIRA Server 2013-01-29 16:46:41 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2751

serverlogs2.zip - trace logs with "org.jboss.as.clustering" category this time only 5 nodes. See node04 for NPE.

Comment 11 JBoss JIRA Server 2013-01-29 17:40:59 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-2751

I was actually able to reproduce this while running a test from the IDE. 

I think I (we) understood the ISPN-2752 stacktrace incorrectly - {{GroupRequest.requests}} represents the requests sent from the local node, so the {{IllegalStateException}} is thrown when the JChannel shuts down and there are still requests waiting for responses from the other nodes. In that case, we send a CH_UPDATE from a background thread and we don't wait for that thread to end before shutting down the JChannel.

There is a somewhat similar mechanism on the receiving end, as well - when stopping, JChannel calls {{shutdownNow()}} on the TP thread pools, which interrupts the threads. But if the threads don't do any I/O, they don't get a {{InterruptedException}} and they proceed with their work. 

In this case the OOB thread was actually was calling {{getorCreateComponent()}} on the GlobalComponentRegistry, which is synchronized. But the main thread was calling {{stop()}} on the GlobalComponentRegistry, which is also synchronized, at the same time. So the OOB thread could not proceed to do anything until the main thread finished stopping the cache manager (which included waiting 3 seconds for the OOB thread pool to shut down).

I think the problem is that {{AbstractComponentRegistry.stop()}} holds the component registry locked for too long. It should not hold the lock while it is invoking {{stop()}} on all the components.

Comment 13 Misha H. Ali 2013-05-07 03:26:27 UTC

Nominated for 6.2 release notes.

Comment 18 Red Hat Bugzilla 2025-02-10 03:27:11 UTC

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.

Note You need to log in before you can comment on or make changes to this bug.