Description of problem:
After a split-brain merge, a client doesn't recognize the new cluster view and keeps using the old view.

Version-Release number of selected component (if applicable):
JDG 6.4.1 server and Java client (JDG 6.5.1 as well)

How reproducible:
Always

Steps to Reproduce:
1. Start a 2-node cluster, 127.0.0.1:11222 and 127.0.1.1:11222 in this example.
2. Connect to 127.0.0.1:11222 and confirm that the current view has 2 members.
~~~
15:06:43,636 INFO [org.infinispan.client.hotrod.impl.protocol.Codec21] (main) ISPN004006: /127.0.0.1:11222 sent new topology view (id=2) containing 2 addresses: [/127.0.1.1:11222, /127.0.0.1:11222]
15:06:43,637 INFO [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (main) ISPN004014: New server added(/127.0.1.1:11222), adding to the pool.
~~~
You can use the attached client as follows.
$ make compile download run
hoge> connect 127.0.0.1
hoge> get hoge (goes to the first server)
hoge> get buzz (goes to the second server)
3. Stop (Ctrl-z) 127.0.1.1:11222 and wait until this member has been dropped.
4. Access the cluster and confirm that the new view with 1 member is received.
~~~
15:08:08,087 INFO [org.infinispan.client.hotrod.impl.protocol.Codec21] (main) ISPN004006: /127.0.0.1:11222 sent new topology view (id=3) containing 1 addresses: [/127.0.0.1:11222]
15:08:08,088 INFO [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (main) ISPN004016: Server not in cluster anymore(/127.0.1.1:11222), removing from the pool.
~~~
5. Restart (type "fg") 127.0.1.1:11222 and wait for the merge.
6. Access the cluster from the client; no new view is received.

Expected results:
The new merged view should be received at step 6.

Additional info:
With the following log setting, you can observe which member is reached by a client.
~~~
<console-handler name="CONSOLE">
    <level name="TRACE"/>
    ...
<logger category="org.infinispan.interceptors.CallInterceptor">
    <level name="TRACE"/>
</logger>
~~~
Created attachment 1085419 [details] hotrodclient.zip
I have tried this case with community Infinispan 8.1.0.Alpha1 and the issue is not there, so probably this was already fixed before? Have you tried with latest JDG version? :|

10:54:37,805 INFO [com.example.HotRodClient] (main) connect called: + serverList
10:54:38,026 INFO [org.infinispan.client.hotrod.impl.protocol.Codec21] (main) ISPN004006: /127.0.0.1:11222 sent new topology view (id=5) containing 2 addresses: [/127.0.0.1:11222, /127.0.0.1:12222]
10:54:38,027 INFO [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (main) ISPN004014: New server added(/127.0.0.1:12222), adding to the pool.
10:54:38,029 INFO [org.infinispan.client.hotrod.RemoteCacheManager] (main) ISPN004021: Infinispan version: 8.1.0.Alpha1
10:54:38,029 INFO [com.example.HotRodClient] (main) Connected.
10:54:38,111 INFO [com.example.HotRodClient] (main) Selected cache:
hoge> get hoge
null
hoge> get buzz
null
hoge> get hoge
10:55:44,306 INFO [org.infinispan.client.hotrod.impl.protocol.Codec21] (main) ISPN004006: /127.0.0.1:11222 sent new topology view (id=6) containing 1 addresses: [/127.0.0.1:11222]
10:55:44,307 INFO [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (main) ISPN004016: Server not in cluster anymore(/127.0.0.1:12222), removing from the pool.
null
hoge> get buzz
null
hoge> get hoge
10:56:10,619 INFO [org.infinispan.client.hotrod.impl.protocol.Codec21] (main) ISPN004006: /127.0.0.1:11222 sent new topology view (id=8) containing 2 addresses: [/127.0.0.1:11222, /127.0.0.1:12222]
10:56:10,620 INFO [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (main) ISPN004014: New server added(/127.0.0.1:12222), adding to the pool.
null
hoge> get buzz
null
hoge>
I've re-run the test and I've been able to replicate it. I had mixed up the suspend and kill commands.
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-5889 to Coding In Progress
@Galder, I've tested with JDG 6.4.1, JDG 6.5.1, and Infinispan 8.1.0.Alpha2, and all show the same behaviour. Using Ctrl-z rather than killing the process is important, because it imitates a long GC pause.

This issue results in data inconsistency. For example, a client which connects to the first server always receives a 1-member view after the merge. All of its put operations, including those for a key which was originally directed to the second server, are directed to the first server. Meanwhile, a client which connects to the second server receives a 2-member view after the merge. This client cannot read the value of a key put by the former client.
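
To make the inconsistency above concrete, here is a toy Java model of two clients holding different views. It is only a sketch and not Infinispan code: the class and method names (StaleViewDemo, Server, Client, keyOwnedBySecond) are hypothetical, routing is a plain hash modulo the view size rather than Hot Rod's consistent hashing, and each server is assumed to keep only what it receives directly, with no forwarding or replication between servers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of clients with diverging topology views. Not Infinispan code. */
final class StaleViewDemo {

    /** Assumption: a server stores only what it receives directly (no forwarding). */
    static final class Server {
        final String address;
        final Map<String, String> store = new HashMap<>();

        Server(String address) {
            this.address = address;
        }
    }

    /** A client that routes every key using only its own copy of the view. */
    static final class Client {
        final List<Server> view;

        Client(List<Server> view) {
            this.view = new ArrayList<>(view);
        }

        /** Hypothetical routing: hash modulo view size (real Hot Rod differs). */
        Server route(String key) {
            return view.get(Math.floorMod(key.hashCode(), view.size()));
        }

        void put(String key, String value) {
            route(key).store.put(key, value);
        }

        String get(String key) {
            return route(key).store.get(key);
        }
    }

    /** Finds a key of the form "key-N" that the given client routes to 'second'. */
    static String keyOwnedBySecond(Client client, Server second) {
        for (int i = 0; ; i++) {
            String key = "key-" + i;
            if (client.route(key) == second) {
                return key;
            }
        }
    }

    public static void main(String[] args) {
        Server first = new Server("127.0.0.1:11222");
        Server second = new Server("127.0.1.1:11222");

        // Client stuck on the 1-member view after the merge.
        Client stale = new Client(Arrays.asList(first));
        // Client that received the merged 2-member view.
        Client fresh = new Client(Arrays.asList(first, second));

        String key = keyOwnedBySecond(fresh, second);
        stale.put(key, "v1"); // lands on the first server

        System.out.println("stale client reads: " + stale.get(key)); // v1
        System.out.println("fresh client reads: " + fresh.get(key)); // null
    }
}
```

In this model the stale client reads its own write back from the first server, while the client with the merged view looks for the same key on the second server and finds nothing.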
PR #3798 has been merged into infinispan:master. I built and tested it, but the issue in the description still remains. Is there more work to be done on the issue?
Dan Berindei <dberinde> updated the status of jira ISPN-5889 to Reopened
PR: https://github.com/infinispan/jdg/pull/805

I've added a test method that does an "overlapping" merge. I've also tested with Ctrl+Z, and the client receives the 2nd node's address when it is resumed.
Tested using the provided application and reproduced the issue with ER2. The problem is no longer present in ER3. Marking as verified.
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.