Created attachment 631337 [details] ispn config hinting-machine This problem occurs for rack and site as well as machine. The following jdg-node configs are taken from the server hinting machine test: node0: port.offset=0 siteId=primary rackId=primary machineId=primary node1: port.offset=100 siteId=primary rackId=primary machineId=primary node2: port.offset=200 siteId=primary rackId=primary machineId=secondary We have set owners="2", so the expectation is that entries are distributed between node0 and node1 (in this particular case) and replicated to node2. For example, we should end up with 2 entries on node0, 3 entries on node1, and all 5 entries on node2. Unfortunately there is NO replication to node2; entries are only replicated between node0 and node1, whereas the cache should be distributed with replication to node2. For details, please see the attached config of the ispn subsystem for this particular case.
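For reference, a minimal sketch of the two pieces of configuration involved (element and attribute names follow the JDG standalone-ha.xml subsystem schemas and may differ slightly between versions; the topology values mirror node2 from the list above):

```xml
<!-- jgroups subsystem: server hinting is set via topology attributes
     on the transport element (one standalone-ha.xml per node) -->
<transport type="UDP" socket-binding="jgroups-udp"
           site="primary" rack="primary" machine="secondary"/>

<!-- infinispan subsystem: distributed cache keeping two copies of each entry -->
<distributed-cache name="default" mode="SYNC" owners="2"/>
```

With owners="2" and a distinct machine value on node2, the topology-aware consistent hash is expected to place the second copy of each entry on node2.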
I did a short test too, and it seems that the site, rack, and machine attributes in the JGroups subsystem are currently completely ignored by Infinispan. Entries are not replicated to a different site/rack/machine as they should be.
Created attachment 631418 [details] standalone-ha.xml for node0 and node1
Created attachment 631419 [details] standalone-ha.xml for node2 Attaching the whole config files, including the jgroups subsystem.
Dan Berindei <dberinde> made a comment on jira ISPN-2318 I implemented a version on top of DefaultConsistentHashFactory, and another version based on SyncConsistentHashFactory. I needed to modify both in order to reuse some code.
Unfortunately, this was not fixed; it is still failing with the current ER3. Setting back to ON_DEV. Comment/ping/mail me if any logs/info are needed; they will be provided ASAP.
Tomas, if you have trace logs, please post them here.
Created attachment 645826 [details] TRACE log for Dan, server hinting - rack - from our test suite Hi Dan, here is a TRACE log from our test suite. Additional info: please see https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/xml-configuration/server-hinting/rack/src/test/java/com/jboss/datagrid/test/configuration/ServerHintingConfigurationTest.java; this test is failing on this line: assertTrue("server 1 elements are not contained in server 2", s2Bulk.containsAll(s1Bulk)); Hope that helps. Let me know if you need anything else: another log, something more specific, etc.
Thanks Tomas, it's clear now. We have a unit test for TopologyAwareConsistentHashFactory, but it wasn't showing this problem because it always creates the consistent hash from scratch instead of through repeated rebalance operations, as would normally happen.
Created attachment 658832 [details] TRACE log, server hinting - rack - from our test suite ER5 Still no luck. Setting back to ON_DEV. Attaching a new TRACE log from the rack test - ER5. I don't know how I can help more at this point; just let me know and I will do my best.
Tomas, I don't think the fix is included in JDG ER5/Infinispan 5.2.0.Beta5. It will be included in ER6/Beta6.
Thanks Dan! I totally missed this fact. Will verify with ER6 then ;)
Created attachment 674014 [details] ER7 machine case - problem case TRACE log
Created attachment 674015 [details] ER7 site case - OK / passing case TRACE log
Hi Dan, please see the two latest TRACE logs for more information. In ER6/7, server hinting was fixed for rack and site. The machine case still does not seem to work. When looking into the logs I found these differences: for MACHINE (test is not passing): 14:22:16,074 TRACE [org.infinispan.statetransfer.StateTransferManagerImpl] (OOB-76,null) Installing new cache topology CacheTopology{id=4, currentCH=DefaultConsistentHash{numSegments=1, numOwners=2, members=[node0/default(primary), node1/default(primary), node2/default(primary)], owners={0: 0 2}, pendingCH=null} on cache topology for SITE (is OK, was fixed in ER6): 14:20:25,315 TRACE [org.infinispan.statetransfer.StateTransferManagerImpl] (OOB-76,null) Installing new cache topology CacheTopology{id=4, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[node0/default(primary), node1/default(primary), node2/default(secondary)], owners={0: 0 2, 1: 0 2, 2: 0 2, 3: 0 2, 4: 0 2, 5: 0 2, 6: 0 2, 7: 0 2, 8: 0 2, 9: 0 2, 10: 0 2, 11: 0 2, 12: 0 2, 13: 0 2, 14: 0 2, 15: 0 2, 16: 0 2, 17: 0 2, 18: 0 2, 19: 0 2, 20: 0 2, 21: 0 2, 22: 0 2, 23: 0 2, 24: 0 2, 25: 0 2, 26: 0 2, 27: 2 0, 28: 2 0, 29: 2 0, 30: 2 0, 31: 2 0, 32: 2 0, 33: 2 0, 34: 2 0, 35: 2 0, 36: 2 0, 37: 2 0, 38: 2 0, 39: 2 0, 40: 1 2, 41: 1 2, 42: 1 2, 43: 1 2, 44: 1 2, 45: 1 2, 46: 1 2, 47: 1 2, 48: 1 2, 49: 1 2, 50: 1 2, 51: 1 2, 52: 1 2, 53: 1 2, 54: 1 2, 55: 1 2, 56: 1 2, 57: 1 2, 58: 1 2, 59: 1 2, 60: 1 2, 61: 1 2, 62: 1 2, 63: 1 2, 64: 1 2, 65: 1 2, 66: 1 2, 67: 2 1, 68: 2 1, 69: 2 1, 70: 2 1, 71: 2 1, 72: 2 1, 73: 2 1, 74: 2 1, 75: 2 1, 76: 2 1, 77: 2 1, 78: 2 1, 79: 2 1}, pendingCH=null} on cache topology Could this be the problem? Thank you very much for your investigation. If you need any other info, let me know. Setting back to ON_DEV for now, even though 2 of the 3 cases were fixed and verified.
Martin Gencur <mgencur> made a comment on jira ISPN-2703 This should actually be fixed in CR2, but this option is not there yet.
Dan Berindei <dberinde> made a comment on jira ISPN-2703 [~mgencur], I have looked at the attached log, but I didn't see any consistent hash with more than 2 members, so I'm not sure what the problem is. Could you describe the issue in more detail? What is the cluster configuration, what do you expect to see, and what do you get instead?
Martin Gencur <mgencur> made a comment on jira ISPN-2703 Dan, let me describe the problem more thoroughly. We are starting 3 servers, but numOwners is still only 2. We expect that whenever we put a certain key/value pair into the cache on node0 or node1, it will be replicated to node2, because node2 has a different "machine" attribute (rack and site are the same for all 3 nodes). The site/rack/machine attributes are configured in the following way: node0: site=primary rack=primary machine=primary node1: site=primary rack=primary machine=primary node2: site=primary rack=primary machine=secondary The test we are running is located at https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/xml-configuration/server-hinting/machine/src/test/java/com/jboss/datagrid/test/configuration/ServerHintingConfigurationTest.java After I added a little more logging to the test, I got the following: Entry count - node0: 1 ,node1: 0, node2: 1 Entry count - node0: 2 ,node1: 0, node2: 2 Entry count - node0: 3 ,node1: 0, node2: 3 Entry count - node0: 4 ,node1: 0, node2: 4 Entry count - node0: 5 ,node1: 0, node2: 5 Entry count - node0: 6 ,node1: 0, node2: 6 Entry count - node0: 7 ,node1: 0, node2: 7 Entry count - node0: 8 ,node1: 0, node2: 8 Entry count - node0: 9 ,node1: 0, node2: 9 ...still the same even with more iterations. The same test already works correctly for both the rack and site attributes (i.e., replicating to a different site/rack), and the output then looks like this: Entry count - node0: 0 ,node1: 1, node2: 1 Entry count - node0: 0 ,node1: 2, node2: 2 Entry count - node0: 1 ,node1: 2, node2: 3 Entry count - node0: 1 ,node1: 3, node2: 4 Entry count - node0: 2 ,node1: 3, node2: 5
Dan Berindei <dberinde> made a comment on jira ISPN-2703 [~mgencur], I looked at the test but I don't really understand how it works. Where is the topology information configured? Are you sure that all 3 servers are up and running during the test? I have looked again at the attached log, and I didn't see any signs of a 3rd node starting. This is the only mention of {{node2}}: {noformat} [INFO] --- maven-resources-plugin:2.5:copy-resources (node2) @ server-hinting-site-tests --- {noformat} Could you run the test again and attach the logs from all the servers? Note that we already have a test for the machine attribute and it works (TopologyAwareConsistentHashFactoryTest.testDifferentMachines), so I'm pretty sure this is just a setup issue.
Martin Gencur <mgencur> made a comment on jira ISPN-2703 Dan, I'm attaching DEBUG logs from all three servers as well as the maven build output. The topology information is configured in the JGroups subsystem of JDG; this is the configuration snippet: https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/common-utils/src/test/resources/jgroups-with-topology.xml Maven puts the individual pieces of configuration together, and the final standalone-ha.xml configuration file is located at ${JDG_HOME}/standalone/configuration/standalone-ha.xml The previously attached log was provided by Tomas, and I guess it's not correct. We wouldn't consider this a bug, except that this test was passing with JDG 6.0.1 (based on ISPN 5.1.7).
Martin Gencur <mgencur> made a comment on jira ISPN-2703 Actually, the originally attached logs were taken from the related Bugzilla and were supposed to show the problem just for the "site" attribute. That has recently been fixed, and now only the "machine" attribute is not working.
Martin Gencur <mgencur> made a comment on jira ISPN-2703 Dan, let me test our scenario with more than 3 nodes. I checked your test (TopologyAwareConsistentHashFactoryTest.testNumberOfOwners), added more nodes, and it seems it distributes the keys more or less uniformly: one copy is always on a different machine, and the other copy is evenly distributed among the remaining nodes (if all but one node have the same value for the machine attribute). I'll let you know the results.
Martin Gencur <mgencur> made a comment on jira ISPN-2703 So I revised the test configuration and found out that the numSegments attribute was set to 1. I increased the number to 30 and the test is now passing. Is it really expected not to work when numSegments=1? Thanks in advance for the explanation.
Dan Berindei <dberinde> made a comment on jira ISPN-2703 Ah, sorry, I looked at the test's infinispan.xml but I didn't notice numSegments was set to 1. Yes, this is expected (but perhaps under-documented). The consistent hash doesn't actually distribute keys among the nodes, it only distributes segments. Each key is mapped to a segment, and the segment is mapped to a list of owners. So if there is only one segment, all the keys will map to that segment, and all of them will have the same owners.
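The explanation above can be sketched in a few lines of Java. This is an illustration only, not Infinispan's actual implementation: the class, the modular hash function, and the hard-coded owner lists are made up to show why numSegments=1 pins every key to the same owners.

```java
import java.util.List;

public class SegmentDemo {
    // Simplified key -> segment mapping: hash the key, then reduce it
    // modulo the segment count (Infinispan's real mapping differs).
    static int segmentOf(Object key, int numSegments) {
        return (key.hashCode() & Integer.MAX_VALUE) % numSegments;
    }

    public static void main(String[] args) {
        // Hypothetical owner table: index = segment, value = owning nodes.
        // With numSegments = 1 there is exactly one entry.
        List<List<String>> owners = List.of(List.of("node0", "node2"));

        for (String key : new String[] {"k1", "k2", "k3"}) {
            int segment = segmentOf(key, owners.size());
            // Every key maps to segment 0, hence to the same owner list,
            // so no key ever lands on node1.
            System.out.println(key + " -> segment " + segment
                    + " owned by " + owners.get(segment));
        }
    }
}
```

With 30 or 80 segments, different keys fall into different segments with different owner lists, which is what makes the entry counts spread across all three nodes in the passing tests.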
Dan Berindei <dberinde> updated the status of jira ISPN-2703 to Resolved
Dan Berindei <dberinde> made a comment on jira ISPN-2703 I'm rejecting the bug, as this is expected when numSegments==1. Martin, thanks for the explanation on the build system, I'll try to use it on the next JDG-related bug :)
Closing the bug as this was a configuration issue.
Martin Gencur <mgencur> made a comment on jira ISPN-2703 Thanks Dan, good to know that.