Bug 1224770

Summary: Backup Failure policy setting is ignored
Product: [JBoss] JBoss Data Grid 6
Component: unspecified
Version: 6.4.1
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Type: Bug
Doc Type: Bug Fix
Target Milestone: CR1
Target Release: 6.5.0
Hardware: Unspecified
OS: Unspecified
Reporter: Giuseppe Bonocore <gbonocor>
Assignee: Pedro Ruivo <pruivo>
QA Contact: Martin Gencur <mgencur>
CC: dmehra, dstahl, jdg-bugs, vjuranek, wfink
Clones: 1231926, 1231928
Bug Blocks: 1225028, 1231926, 1231928
Last Closed: 2015-06-23 12:25:35 UTC

Attachments:
Reproducer based on docker (see readme)

Description Giuseppe Bonocore 2015-05-25 15:37:06 UTC
Created attachment 1029506 [details]
Reproducer based on docker (see readme)

Description of problem:

In a 2-node cluster with a DR backup site, the backup failure-policy setting is ignored when the DR site is down: the exception is logged and raised to the client.

Version-Release number of selected component (if applicable):


How reproducible:

1) Create a 2-node UDP cluster (default configuration) and a backup node (not part of the same cluster)

2) Create a distributed cache with a backup, e.g.:

...

<distributed-cache name="default" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
    <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
    <transaction mode="NONE"/>
    <backups>
        <backup site="DR" strategy="ASYNC" failure-policy="IGNORE"/>
    </backups>
</distributed-cache>

...

3) Start putting entries in the cache using a Hot Rod C++ client (e.g. in a loop)

4) Kill the backup node

(see the attached reproducer for details)


Steps to Reproduce:

A full reproducer based on docker is attached.

Actual results:

The primary cluster node starts logging ERRORs, the other one logs WARNINGs, and the exception is raised to the client. This happens even though the backup is configured with strategy="ASYNC" failure-policy="IGNORE" (so in theory not even logging should occur).

Exception logged:

(MASTER)

14:52:54,131 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (HotRodServerWorker-7) ISPN000136: Execution error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from node-2/PRIMARY, see cause for remote stack trace
        at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:44) [infinispan-core-6.2.1.Final-redhat-2.jar:6.2.1.Final-redhat-2]

....

Caused by: org.infinispan.commons.CacheException: Problems invoking command.
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:219) [infinispan-core-6.2.1.Final-redhat-2.jar:6.2.1.Final-redhat-2]
        at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:460) [jgroups-3.6.2.Final-redhat-2.jar:3.6.2.Final-redhat-2]

...
Caused by: java.lang.IllegalStateException: Only XSiteReplicateCommand commands expected as a result of xsite calls but got org.infinispan.commands.remote.SingleRpcCommand
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromRemoteSite(CommandAwareRpcDispatcher.java:229) [infinispan-core-6.2.1.Final-redhat-2.jar:6.2.1.Final-redhat-2]


Expected results:

The cache should work correctly even when the DR site is down (given that the policy is set to failure-policy="IGNORE").


Additional info:

Comment 3 Pedro Ruivo 2015-05-27 17:14:33 UTC
Update:

The problem is not in Infinispan but in JGroups.

The JGroups bundler puts messages with a regular address and messages with a site_address in the same batch. If the first message has a site_address, all the messages in the batch are delivered with that address. When Infinispan sees a site_address it assumes the message comes from a remote site, which is not the case, and the IllegalStateException is thrown.

JGroups' JIRA: https://issues.jboss.org/browse/JGRP-1930

I also made a pull request with a fix (tested using the docker images, thanks for those).
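
To illustrate the idea, here is a purely conceptual Java sketch of the batching problem described above. It is not the actual JGRP-1930 patch, the class and method names are made up, and it assumes the destination address is what marks a message as cross-site traffic:

    // Conceptual sketch only -- not the JGRP-1930 fix. The idea: never mix
    // messages destined to a SiteAddress (cross-site/RELAY2 traffic) with
    // messages destined to a regular cluster member in the same batch, so the
    // receiver cannot mistake a local command for an xsite command.
    import java.util.ArrayList;
    import java.util.List;

    import org.jgroups.Message;
    import org.jgroups.protocols.relay.SiteAddress;

    public class BatchSplitter {

        /** Splits a would-be batch into site-addressed and regular messages. */
        public static List<List<Message>> splitBySiteAddress(List<Message> pending) {
            List<Message> siteMsgs = new ArrayList<>();
            List<Message> regularMsgs = new ArrayList<>();
            for (Message msg : pending) {
                if (msg.getDest() instanceof SiteAddress) {
                    siteMsgs.add(msg);      // cross-site traffic
                } else {
                    regularMsgs.add(msg);   // regular intra-cluster traffic
                }
            }
            List<List<Message>> batches = new ArrayList<>();
            if (!regularMsgs.isEmpty()) batches.add(regularMsgs);
            if (!siteMsgs.isEmpty())    batches.add(siteMsgs);
            return batches;
        }
    }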

Comment 4 Giuseppe Bonocore 2015-05-28 10:04:00 UTC
Thank you very much for your feedback. 

I don't know if it is relevant at this stage, but we noticed that the problem apparently does not occur with Java Hot Rod clients.

Sample Java Client Snippet:

...
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.addServers( ... );
        cacheManager = new RemoteCacheManager(builder.build());
        cache = cacheManager.getCache();
        while (true) {
            cache.put(UUID.randomUUID().toString(), UUID.randomUUID().toString());
        }
...



Instead, it occurs systematically with simple C++ Hot Rod clients:

....
    infinispan::hotrod::ConfigurationBuilder builder;
    builder.addServers(address);
    infinispan::hotrod::RemoteCacheManager cacheManager(builder.build());
    infinispan::hotrod::RemoteCache<std::string, std::string> cache(cacheManager.getCache<std::string, std::string>(false));
    boost::uuids::random_generator generator;
    while (true) {
        std::string uuid_key = boost::lexical_cast<std::string>(generator());
        std::string uuid_value = boost::lexical_cast<std::string>(generator());
        cache.put(uuid_key, uuid_value);
    }
...

Comment 5 Pedro Ruivo 2015-05-28 10:26:02 UTC
I'm not sure about the C++ client, but the Java client retries the operation on another server if it fails.

The issue only occurs if the operation is executed against node_2. I believe the Java client catches the exception from node_2 and retries on node_1.
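
For illustration only, this is roughly what such a catch-and-retry would look like if done by hand around the put loop from comment 4. The server addresses are placeholders, and the Java client already performs this failover internally, so there is nothing the reporter needs to add:

    // Illustrative sketch only: manual retry around a Hot Rod put, mimicking the
    // failover the Java client performs internally. Addresses are placeholders.
    import java.util.UUID;

    import org.infinispan.client.hotrod.RemoteCache;
    import org.infinispan.client.hotrod.RemoteCacheManager;
    import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;
    import org.infinispan.client.hotrod.exceptions.HotRodClientException;

    public class PutWithRetry {
        public static void main(String[] args) {
            ConfigurationBuilder builder = new ConfigurationBuilder();
            builder.addServers("node-1:11222;node-2:11222"); // placeholder addresses
            RemoteCacheManager cacheManager = new RemoteCacheManager(builder.build());
            RemoteCache<String, String> cache = cacheManager.getCache();

            while (true) {
                String key = UUID.randomUUID().toString();
                String value = UUID.randomUUID().toString();
                try {
                    cache.put(key, value);
                } catch (HotRodClientException e) {
                    // If the put fails on one node, simply retry once; the real
                    // client would route the retry to another server instead.
                    cache.put(key, value);
                }
            }
        }
    }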

Comment 6 Dave Stahl 2015-05-28 18:31:32 UTC
Fix from https://issues.jboss.org/browse/JGRP-1930 incorporated into 3.6.3.Final-redhat-3 build (https://bugzilla.redhat.com/show_bug.cgi?id=1225028)

Comment 7 Vojtech Juranek 2015-06-03 11:44:34 UTC
Just FYI, a short note about the reproducer: if you want to make sure the environment is reproducible (e.g. when you include a compiled C++ binary), please use a specific version in your Dockerfiles, not just "FROM centos".
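
For example (illustrative only), the base image can be pinned to a fixed tag in the Dockerfile:

    # Pin the base image to a specific tag instead of the moving default
    FROM centos:7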

Comment 8 Vojtech Juranek 2015-06-03 12:06:01 UTC
I updated the reproducer for JDG 6.5.0.CR1. The stack trace is no longer logged and the error is not propagated within the cluster or to the client, but the primary node still logs ERROR messages:

11:50:37,455 ERROR [org.jgroups.protocols.relay.RELAY2] (OOB-15,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,455 ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-5) node-1/PRIMARY: no route to DR: dropping message
11:50:37,471 ERROR [org.jgroups.protocols.relay.RELAY2] (OOB-18,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,484 ERROR [org.jgroups.protocols.relay.RELAY2] (OOB-18,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,472 ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-1) node-1/PRIMARY: no route to DR: dropping message
11:50:37,499 ERROR [org.jgroups.protocols.relay.RELAY2] (Incoming-5,shared=udp) node-1/PRIMARY: no route to DR: dropping message
11:50:37,502 ERROR [org.jgroups.protocols.relay.RELAY2] (HotRodServerWorker-2) node-1/PRIMARY: no route to DR: dropping message
11:50:37,509 ERROR [org.jgroups.protocols.relay.RELAY2] (Incoming-5,shared=udp) node-1/PRIMARY: no route to DR: dropping message

Comment 9 Pedro Ruivo 2015-06-03 13:11:42 UTC
I can't do much about those errors in the logs. What you can do is turn off logging for RELAY2.

Since the backup is done asynchronously, Infinispan is never aware that the other site is down. So, it never stops trying to send data there.
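
For reference, silencing those RELAY2 errors could look roughly like this in the server's logging subsystem configuration (illustrative sketch; adjust the category and level as needed):

    <!-- Illustrative: raise the level for the RELAY2 category so the
         "no route to DR: dropping message" errors are no longer logged -->
    <logger category="org.jgroups.protocols.relay.RELAY2">
        <level name="FATAL"/>
    </logger>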

Comment 10 Vojtech Juranek 2015-06-03 14:08:42 UTC
Giuseppe, could you please comment on whether you are fine with these ERRORs in the log? Otherwise I didn't find any other issue, so if you are OK with that, I'll mark the bug as verified.
Thanks

Comment 11 Giuseppe Bonocore 2015-06-04 11:59:09 UTC
Hello Vojtech. If the error message is harmless, it's OK for us to keep it.

Comment 13 Vojtech Juranek 2015-06-04 14:26:32 UTC
Thanks Giuseppe, marking as verified then.

Comment 18 wfink 2015-06-19 12:01:31 UTC
*** Bug 1231926 has been marked as a duplicate of this bug. ***