Bug 760895

Summary: Reopened: Error detecting crashed member during shutdown of EDG 6.0.0.Beta
Product: [JBoss] JBoss Data Grid 6 Reporter: Ondrej Nevelik <onevelik>
Component: Infinispan Assignee: Tristan Tarrant <ttarrant>
Status: VERIFIED --- QA Contact: Martin Gencur <mgencur>
Severity: low Docs Contact:
Priority: low    
Version: 6.0.0 CC: galder.zamarreno, jdg-bugs, nobody, oskutka, pjha
Target Milestone: ER6 Keywords: Reopened
Target Release: 6.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Occasionally, when shutting down nodes in a cluster, the following message is reported: "ERROR [org.infinispan.server.hotrod.HotRodServer] ISPN006002: Error detecting crashed member: java.lang.IllegalStateException: Cache '___hotRodTopologyCache' is in 'STOPPING' state and this is an invocation not belonging to an on-going transaction, so it does not accept new invocations. Either restart it or recreate the cache container." </para><para> This is due to the fact that a node has detected another node's shutdown and is attempting to update the topology cache while itself is also shutting down. The message is harmless, and it will be removed in a future release
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-04-04 12:58:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ondrej Nevelik 2011-12-07 09:59:12 UTC
Description of problem:
Given a cluster of a few EDG servers, an error occurs in ~25% of our performance test runs (independent of the client type tested) while gracefully shutting down the servers, which are stopped in parallel (kill "edg_pid", without the -9 switch):
ERROR [org.infinispan.server.hotrod.HotRodServer] ISPN006002: Error detecting crashed member: java.lang.IllegalStateException: Cache '___hotRodTopologyCache' is in 'STOPPING' state and this is an invocation not belonging to an on-going transaction, so it does not accept new invocations. Either restart it or recreate the cache container.

The whole log can be found at http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-client-stress-test-rest/13/console-edg-perf02/

Comment 1 Tristan Tarrant 2012-03-02 08:22:10 UTC
The log referenced in the above comment is missing. Does this still happen?

Comment 2 mark yarborough 2012-04-04 12:58:47 UTC
Tristan Tarrant indicates this has not been reproduced in recent builds. Reopen if necessary.

Comment 3 Ondrej Nevelik 2012-04-05 06:21:26 UTC
I am seeing this exception again in ER6 (rest client stress test) - see server log of node01: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/edg-60-perf-client-stress-test-rest/68/artifact/report/size4/serverlogs.zip

Comment 4 Tristan Tarrant 2012-04-20 13:07:10 UTC
I think this ERROR is harmless. Asking Galder

Comment 5 Galder Zamarreño 2012-05-23 13:39:34 UTC
The error log is noisy but should be harmless. What happens is that node A has detected that node B has gone down, and node A is trying to remove node B from its address cache. However, while doing that, node A is shutting down too, so it cannot update the address cache. This is harmless because each node does this locally, so any node that is still running will still remove node B from its own address cache.

I can probably improve the code in CrashedMemberDetectorListener to check whether invocations are allowed, rather than only checking whether the cache is terminated. I'll add a jira for this.
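
For illustration only, the difference between the two checks can be sketched as below. This is a minimal, self-contained model under stated assumptions: the Status enum, TopologyCache class and both method names are hypothetical stand-ins invented for this example, not the actual Infinispan classes or the code in CrashedMemberDetectorListener.

```java
import java.util.ArrayList;
import java.util.List;

public class TopologyUpdateGuard {

    /** Hypothetical lifecycle status; a stand-in, not Infinispan's own type. */
    public enum Status {
        RUNNING, STOPPING, TERMINATED;

        public boolean isTerminated() { return this == TERMINATED; }

        /** In this model, only a running cache accepts new invocations. */
        public boolean allowInvocations() { return this == RUNNING; }
    }

    /** Hypothetical stand-in for the topology (address) cache. */
    public static class TopologyCache {
        public Status status = Status.RUNNING;
        public final List<String> members = new ArrayList<>();

        public void remove(String member) {
            if (!status.allowInvocations()) {
                throw new IllegalStateException("Cache is in '" + status + "' state");
            }
            members.remove(member);
        }
    }

    /** Check that only looks at termination: still invokes the cache while STOPPING. */
    public static boolean removeIfNotTerminated(TopologyCache cache, String crashed) {
        if (cache.status.isTerminated()) {
            return false;
        }
        cache.remove(crashed); // throws IllegalStateException while STOPPING
        return true;
    }

    /** Check on whether invocations are allowed: skips the update while STOPPING. */
    public static boolean removeIfInvocationsAllowed(TopologyCache cache, String crashed) {
        if (!cache.status.allowInvocations()) {
            return false;
        }
        cache.remove(crashed);
        return true;
    }

    public static void main(String[] args) {
        TopologyCache cache = new TopologyCache();
        cache.members.add("nodeB");
        cache.status = Status.STOPPING; // node A is itself shutting down

        System.out.println("guarded update attempted: "
                + removeIfInvocationsAllowed(cache, "nodeB"));
        try {
            removeIfNotTerminated(cache, "nodeB");
        } catch (IllegalStateException e) {
            System.out.println("unguarded update failed: " + e.getMessage());
        }
    }
}
```

In this model the guarded variant simply skips the removal on a stopping node, which matches the reasoning above that any node still running performs the removal locally anyway.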

Comment 6 mark yarborough 2012-06-06 13:32:05 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Tristan will provide CCFR or will route to appropriate developer.

Comment 7 Tristan Tarrant 2012-06-12 15:36:12 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-Tristan will provide CCFR or will route to appropriate developer.
+Occasionally, when shutting down nodes in a cluster, the following message is reported: 
+"ERROR [org.infinispan.server.hotrod.HotRodServer] ISPN006002: Error detecting crashed member: java.lang.IllegalStateException: Cache '___hotRodTopologyCache' is in 'STOPPING' state and this is an invocation not belonging to an on-going transaction, so it does not accept new invocations. Either restart it or recreate the cache container."
+This is due to the fact that a node has detected another node's shutdown and is attempting to update the topology cache while itself is also shutting down. The message is harmless, and it will be removed in a future release

Comment 8 Misha H. Ali 2012-06-12 15:39:07 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,4 @@
 Occasionally, when shutting down nodes in a cluster, the following message is reported: 
 "ERROR [org.infinispan.server.hotrod.HotRodServer] ISPN006002: Error detecting crashed member: java.lang.IllegalStateException: Cache '___hotRodTopologyCache' is in 'STOPPING' state and this is an invocation not belonging to an on-going transaction, so it does not accept new invocations. Either restart it or recreate the cache container."
+</para><para>
 This is due to the fact that a node has detected another node's shutdown and is attempting to update the topology cache while itself is also shutting down. The message is harmless, and it will be removed in a future release

Comment 9 mark yarborough 2012-11-14 14:42:32 UTC
ttarrant will add jira links as appropriate.

Comment 10 Michal Linhard 2012-12-18 16:40:58 UTC
Not present in 6.1.0.ER6

tested: 
- start 8 nodes (with hotrod endpoint + some test caches)
- fill some values to test cache
- stop (gracefully) all nodes
- wait till all java processes naturally die
- couldn't see any IllegalStateException in any of the server logs
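
The last step above (checking every server log for the exception) could be automated roughly as follows. This is only a sketch: the ShutdownLogCheck class, the ".log" file suffix and the directory argument are assumptions made for this example and do not describe the actual test harness. It searches for the ISPN006002 message code quoted in the description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ShutdownLogCheck {

    /** Message code of the error reported in this bug. */
    static final String MARKER = "ISPN006002";

    /** Returns the lines that contain the marker. */
    public static List<String> findMatches(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains(MARKER))
                .collect(Collectors.toList());
    }

    /** Scans every *.log file directly under dir (suffix is an assumption). */
    public static List<String> scan(Path dir) throws IOException {
        List<String> hits = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            List<Path> logs = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".log"))
                    .collect(Collectors.toList());
            for (Path log : logs) {
                // Assumes UTF-8 logs; readAllLines throws on malformed input.
                for (String match : findMatches(Files.readAllLines(log))) {
                    hits.add(log.getFileName() + ": " + match);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Directory holding the collected server logs; defaults to the current one.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> hits = scan(dir);
        hits.forEach(System.out::println);
        System.out.println(hits.isEmpty() ? "no occurrences found" : hits.size() + " occurrence(s) found");
    }
}
```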