Bug 1224770
Field | Value
---|---
Summary | Backup Failure policy setting is ignored
Product | [JBoss] JBoss Data Grid 6
Reporter | Giuseppe Bonocore <gbonocor>
Component | unspecified
Assignee | Pedro Ruivo <pruivo>
Status | CLOSED CURRENTRELEASE
QA Contact | Martin Gencur <mgencur>
Severity | unspecified
Docs Contact |
Priority | unspecified
Version | 6.4.1
CC | dmehra, dstahl, jdg-bugs, vjuranek, wfink
Target Milestone | CR1
Target Release | 6.5.0
Hardware | Unspecified
OS | Unspecified
Whiteboard |
Fixed In Version |
Doc Type | Bug Fix
Doc Text |
Story Points | ---
Clone Of |
 | 1231926 1231928 (view as bug list)
Environment |
Last Closed | 2015-06-23 12:25:35 UTC
Type | Bug
Regression | ---
Mount Type | ---
Documentation | ---
CRM |
Verified Versions |
Category | ---
oVirt Team | ---
RHEL 7.3 requirements from Atomic Host |
Cloudforms Team | ---
Target Upstream Version |
Embargoed |
Bug Depends On |
Bug Blocks | 1225028, 1231926, 1231928
Attachments | 1029506: Reproducer based on docker (see readme)
Update: The problem is not in Infinispan but in JGroups. The JGroups bundler puts messages with a regular address and messages with a site_address in the same batch. If the first message has a site_address, all messages in the batch are delivered with that address. When Infinispan sees a site_address, it assumes the message came from a remote site, which is not the case here, and the IllegalStateException is thrown. JGroups JIRA: https://issues.jboss.org/browse/JGRP-1930. I also made a pull request with a fix (tested using the docker images, thanks for those).

Thank you very much for your feedback. Not sure whether it is relevant at this stage, but we noticed that the problem apparently does not occur with Java Hot Rod clients. Sample Java client snippet:

```java
...
ConfigurationBuilder builder = new ConfigurationBuilder();
builder.addServers( ... );
cacheManager = new RemoteCacheManager(builder.build());
cache = cacheManager.getCache();
// put random entries in a loop
while (true) {
    cache.put(UUID.randomUUID().toString(), UUID.randomUUID().toString());
}
...
```

It does, however, occur systematically with a simple C++ Hot Rod client:

```cpp
...
infinispan::hotrod::ConfigurationBuilder builder;
builder.addServers(address);
infinispan::hotrod::RemoteCacheManager cacheManager(builder.build());
infinispan::hotrod::RemoteCache<std::string, std::string> cache(
    cacheManager.getCache<std::string, std::string>(false));
boost::uuids::random_generator generator;
// put random entries in a loop
while (true) {
    std::string uuid_key = boost::lexical_cast<std::string>(generator());
    std::string uuid_value = boost::lexical_cast<std::string>(generator());
    cache.put(uuid_key, uuid_value);
}
...
```

I'm not sure about the C++ client, but the Java client retries a failed operation on another server. The issue only occurs when the operation is executed against node_2, so I believe the Java client catches the exception from node_2 and retries on node_1.

The fix from https://issues.jboss.org/browse/JGRP-1930 has been incorporated into the 3.6.3.Final-redhat-3 build (https://bugzilla.redhat.com/show_bug.cgi?id=1225028).

Just FYI, a short note about the reproducer: if you want the environment to be reproducible (e.g. because you include a compiled C++ binary), please pin a version in your Dockerfiles rather than using just "FROM centos".

I updated the reproducer for JDG 6.5.0.CR1. The stack trace is no longer logged and the error is not propagated within the cluster or to the client, but the primary node still logs ERROR messages:

```
11:50:37,455 ERROR [org.jgroups.protocols.relay.RELAY2] (OOB-15,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,455 ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-5) node-1/PRIMARY: no route to DR: dropping message
11:50:37,471 ERROR [org.jgroups.protocols.relay.RELAY2] (OOB-18,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,484 ERROR [org.jgroups.protocols.relay.RELAY2] (OOB-18,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,472 ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-1) node-1/PRIMARY: no route to DR: dropping message
11:50:37,499 ERROR [org.jgroups.protocols.relay.RELAY2] (Incoming-5,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,502 ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-2) node-1/PRIMARY: no route to DR: dropping message
11:50:37,509 ERROR [org.jgroups.protocols.relay.RELAY2] (Incoming-5,shared=udp) node-1/PRIMARY: no route to DR: dropping message
```

I can't do much about those errors in the logs. What you can do is turn off logging for RELAY2.
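A minimal sketch of how RELAY2 logging could be silenced in the server's logging subsystem. This is not taken from the report; the subsystem namespace version and the file to edit (standalone.xml or clustered.xml) depend on the JDG release, so treat it as an assumption to adapt:

```xml
<!-- Sketch only: raise the level for the RELAY2 category so its ERROR
     "no route to DR: dropping message" entries are no longer logged.
     The namespace version (1.5 here) varies by JDG/EAP release. -->
<subsystem xmlns="urn:jboss:domain:logging:1.5">
    <!-- existing handlers and root logger stay unchanged -->
    <logger category="org.jgroups.protocols.relay.RELAY2">
        <!-- FATAL suppresses ERROR and below for this category only -->
        <level name="FATAL"/>
    </logger>
</subsystem>
```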
Since the backup is done asynchronously, Infinispan is never aware that the other site is down, so it never stops trying to send data there.

Giuseppe, could you please comment on whether you are fine with the ERRORs in the log? Otherwise I didn't find any other issue, so if you are OK with that, I'll mark it as verified. Thanks.

Hello Vojtech. If the error message is harmless, it's fine for us to keep it. Thanks.

Giuseppe, thanks; marking as verified then.

*** Bug 1231926 has been marked as a duplicate of this bug. ***
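As an aside to the asynchronous-backup point above: Infinispan's cross-site configuration also offers a take-offline element that stops backup attempts after repeated failures. The following is a sketch only, not from this report; the element and attribute names are assumed from the JDG 6.x server schema and should be checked against the server's XSD:

```xml
<!-- Sketch only: automatically take the DR backup offline after repeated
     failures so the primary stops trying to replicate to a site that is down.
     Element/attribute names assumed from the JDG 6.x server schema. -->
<backups>
    <backup site="DR" strategy="ASYNC" failure-policy="IGNORE">
        <take-offline after-failures="100" min-wait="60000"/>
    </backup>
</backups>
```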
Created attachment 1029506 [details]: Reproducer based on docker (see readme)

Description of problem:

In a 2-node cluster with a DR backup site, when the DR site is down the Backup Failure policy setting is ignored and the exception is logged and raised to the client.

Version-Release number of selected component (if applicable):

How reproducible:

1) Create a 2-node UDP cluster (default configuration) and a backup node (not part of the same cluster).
2) Create a distributed cache with a backup, e.g.:

```xml
...
<distributed-cache name="default" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
    <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
    <transaction mode="NONE"/>
    <backups>
        <backup site="DR" strategy="ASYNC" failure-policy="IGNORE"/>
    </backups>
</distributed-cache>
...
```

3) Start putting entries into the cache with a Hot Rod C++ client (e.g. in a loop).
4) Kill the backup node (see the attached reproducer for details).

Steps to Reproduce:

A full reproducer based on docker is attached.

Actual results:

The primary cluster node starts logging ERRORs, the other node logs warnings, and the exception is raised to the client. This happens even though the backup is configured with strategy="ASYNC" failure-policy="IGNORE" (so in theory not even logging should occur).

Exception logged (MASTER):

```
14:52:54,131 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (HotRodServerWorker-7) ISPN000136: Execution error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from node-2/PRIMARY, see cause for remote stack trace
    at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:44) [infinispan-core-6.2.1.Final-redhat-2.jar:6.2.1.Final-redhat-2]
....
Caused by: org.infinispan.commons.CacheException: Problems invoking command.
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:219) [infinispan-core-6.2.1.Final-redhat-2.jar:6.2.1.Final-redhat-2]
    at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:460) [jgroups-3.6.2.Final-redhat-2.jar:3.6.2.Final-redhat-2]
...
Caused by: java.lang.IllegalStateException: Only XSiteReplicateCommand commands expected as a result of xsite calls but got org.infinispan.commands.remote.SingleRpcCommand
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromRemoteSite(CommandAwareRpcDispatcher.java:229) [infinispan-core-6.2.1.Final-redhat-2.jar:6.2.1.Final-redhat-2]
```

Expected results:

The cache should work correctly even when the DR site is down (given that the policy is set to failure-policy="IGNORE").

Additional info:
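Not part of the original report, but for readers following step 1 of the reproduction: the cross-site wiring between the PRIMARY cluster and the DR site is done with a RELAY2 relay in the server's JGroups subsystem. A minimal sketch, assuming the JDG 6.x clustered.xml layout; the namespace version, stack names, and the "global" bridge cluster name are placeholders to adapt:

```xml
<!-- Sketch only (not from the report): relay the PRIMARY site to the DR site
     over a TCP bridge stack. Namespace version, stack names and the bridge
     cluster name are assumptions to adapt to the actual setup. -->
<subsystem xmlns="urn:infinispan:server:jgroups:6.1" default-stack="udp">
    <stack name="udp">
        <transport type="UDP" socket-binding="jgroups-udp"/>
        <!-- ...default UDP protocol list... -->
        <relay site="PRIMARY">
            <remote-site name="DR" stack="tcp" cluster="global"/>
        </relay>
    </stack>
    <stack name="tcp">
        <transport type="TCP" socket-binding="jgroups-tcp"/>
        <!-- ...default TCP protocol list... -->
    </stack>
</subsystem>
```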