Bug 892565
Summary: | StateConsumerImpl can build InvalidateL1Command with replicated cache | ||
---|---|---|---|
Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Radim Vansa <rvansa> |
Component: | Infinispan | Assignee: | Tristan Tarrant <ttarrant> |
Status: | VERIFIED --- | QA Contact: | Martin Gencur <mgencur> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 6.1.0 | CC: | jdg-bugs |
Target Milestone: | DR2 | ||
Target Release: | 6.2.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
On cache shutdown, JBoss Data Grid sometimes logs a NullPointerException as a harmless warning; the exception is never propagated to user code. During shutdown the node sometimes receives a new consistent hash in which all of its data segments are removed, and it then attempts to discard that data, or move it to L1 if L1 is enabled. This fails with a NullPointerException because some cache components, such as the DistributionManager, are already shut down. The issue affects both distributed and replicated caches.

This final consistent hash update is irrelevant and is not always received, so the issue is not always reproducible, and failing to process the update is only logged as a warning. The issue can only be avoided by ensuring that the LocalTopologyManager does not deliver consistent hash updates after the node leaves.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Radim Vansa
2013-01-07 10:55:21 UTC
Radim Vansa <rvansa> updated the status of jira ISPN-2692 to Closed

Radim Vansa <rvansa> made a comment on jira ISPN-2692: This is basically a duplicate of ISPN-2589 (sorry, noticed that too late), closely related to ISPN-2691.

Mircea Markus <mmarkus> made a comment on jira ISPN-2589: this is a replicated cache though, L1 logic should not be present at all.

Adrian Nistor <anistor> made a comment on jira ISPN-2589: Indeed, this being a repl cache there should be no L1 logic. But even if L1 invalidation logic exists (and it's there because the same code is used for dist mode), it should be harmless because that code path should never be reached, since in repl mode no segments are ever removed. So I'm thinking the whole issue is an anomaly that disappears after we solve ISPN-2691.

Michal Linhard <mlinhard> made a comment on jira ISPN-2589: Occurrence in JDG 6.1.0.ER9 (infinispan 5.2.0.CR2): http://www.qa.jboss.com/~mlinhard/hyperion3/run0037-startup-ER9/report/loganalysis/server/categories/cat8_entry0.txt

Michal Linhard <mlinhard> made a comment on jira ISPN-2589: sorry, this is not a REPL_SYNC case, should this be a new JIRA?

Michal Linhard <mlinhard> made a comment on jira ISPN-2589: This is my config: http://www.qa.jboss.com/~mlinhard/hyperion3/run0037-startup-ER9/report/configs.zip (added standalone_node0001.xml)

Adrian Nistor <anistor> made a comment on jira ISPN-2589: Can't access www.qa.jboss.com even with the vpn. So what cache mode is it if not REPL_SYNC?

Michal Linhard <mlinhard> made a comment on jira ISPN-2589: from the attached file standalone_node0001.xml:
{code}
<subsystem default-cache-container="default" xmlns="urn:jboss:datagrid:infinispan:6.1">
    <cache-container default-cache="testCache" name="default">
        <transport executor="infinispan-transport" lock-timeout="600000"/>
        <distributed-cache batching="false" indexing="NONE" l1-lifespan="0" mode="SYNC" name="testCache"
                           owners="2" remote-timeout="60000" segments="40" start="EAGER">
            <locking acquire-timeout="3000" concurrency-level="1000" isolation="REPEATABLE_READ" striping="false"/>
            <transaction mode="NONE"/>
            <state-transfer enabled="true" timeout="600000"/>
            <eviction max-entries="-1" strategy="NONE"/>
        </distributed-cache>
    </cache-container>
</subsystem>
{code}

Adrian Nistor <anistor> made a comment on jira ISPN-2589: And the NPE is thrown from exactly the same line of InvalidateL1Command? (sorry to ask so many details - cannot access the qa machine for logs)

Michal Linhard <mlinhard> made a comment on jira ISPN-2589: No probs. Does the link http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion3/run0037-startup-ER9/report/loganalysis/server/categories/cat8_entry0.txt work for you? If not, I'll paste it here.

Adrian Nistor <anistor> made a comment on jira ISPN-2589: Works. Thanks!

Michal Linhard <mlinhard> made a comment on jira ISPN-2589: And one more occurrence in 5.2.0.CR2 with DIST_SYNC mode, now originating in the JGroups stack: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/view/EDG-REPORTS-PERF/job/jdg-cs-perf-client-stress-test-hotrod/3/artifact/report/size8/loganalysis/server/categories/cat8_entry0.txt

Adrian Nistor <anistor> made a comment on jira ISPN-2589: It seems this exception is actually a warning logged during cache shutdown. The exception is benign and is never propagated to user code. All we can do about it is add a simple check to prevent it happening.
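The sketch below illustrates the kind of "simple check" mentioned in the comment above: skip the removed-segment handling once the components it depends on have already been stopped, instead of letting it fail with an NPE. It is not the actual Infinispan patch; the class and method names are hypothetical and chosen only to mirror the discussion.
{code}
import java.util.Set;

public class RemovedSegmentHandler {

    /** Stand-in for the DistributionManager component; becomes null once the cache is stopped. */
    private volatile Object distributionManager;

    public RemovedSegmentHandler(Object distributionManager) {
        this.distributionManager = distributionManager;
    }

    /** Simulates component shutdown, as happens during cache stop. */
    public void stop() {
        this.distributionManager = null;
    }

    /**
     * Called when a consistent hash update removes segments from this node.
     * The added check skips the work (instead of throwing an NPE) when the
     * components it needs are already shut down.
     */
    public void onSegmentsRemoved(Set<Integer> removedSegments) {
        if (distributionManager == null) {
            // Cache is shutting down; the final CH update is irrelevant, so just log and return.
            System.out.println("Ignoring removed segments " + removedSegments + " during shutdown");
            return;
        }
        // Normal path: discard the data of the removed segments, or move it to L1 if enabled.
        System.out.println("Discarding (or moving to L1) data of segments " + removedSegments);
    }

    public static void main(String[] args) {
        RemovedSegmentHandler handler = new RemovedSegmentHandler(new Object());
        handler.onSegmentsRemoved(Set.of(1, 2, 3)); // normal operation
        handler.stop();
        handler.onSegmentsRemoved(Set.of(4, 5));    // during shutdown: skipped instead of NPE
    }
}
{code}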
Adrian Nistor <anistor> updated the status of jira ISPN-2589 to Coding In Progress

Adrian Nistor <anistor> made a comment on jira ISPN-2589: On cache leave (shutdown) the node _sometimes_ receives the new CH in which all its data segments are removed, and it will (needlessly) attempt to discard that data (or move it to L1 if enabled) but fails due to an NPE because some cache components are already shut down (DistributionManager). This affects both DIST and REPL caches because the state transfer code is common, but it was never expected that segments would ever be removed in a REPL cache (which can happen only during shutdown). This last CH update is irrelevant and is not always received, thus not always reproducible, and failing to process it is just logged as a warning. The best way to fix this is to ensure LocalTopologyManager does not deliver CH updates after leave.
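To illustrate the proposed fix, the following is a minimal, hypothetical sketch of a LocalTopologyManager-style filter that remembers when the local node has left a cache and then drops any consistent hash updates that arrive afterwards. It is not the actual Infinispan change; all names are illustrative.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TopologyUpdateFilter {

    // Caches the local node has already left; CH updates for them must not be delivered.
    private final Map<String, Boolean> leftCaches = new ConcurrentHashMap<>();

    /** Called when the local node leaves the given cache. */
    public void leave(String cacheName) {
        leftCaches.put(cacheName, Boolean.TRUE);
    }

    /** Called for every incoming consistent hash (topology) update. */
    public void handleConsistentHashUpdate(String cacheName, int topologyId) {
        if (leftCaches.containsKey(cacheName)) {
            // The node already left: the update is irrelevant and would hit
            // components that are shut down, so it is dropped.
            System.out.printf("Dropping CH update %d for cache %s after leave%n", topologyId, cacheName);
            return;
        }
        System.out.printf("Applying CH update %d for cache %s%n", topologyId, cacheName);
    }

    public static void main(String[] args) {
        TopologyUpdateFilter filter = new TopologyUpdateFilter();
        filter.handleConsistentHashUpdate("testCache", 7);  // applied normally
        filter.leave("testCache");
        filter.handleConsistentHashUpdate("testCache", 8);  // dropped after leave
    }
}
{code}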