Bug 868832 - Server hinting does not replicate to "other" machines/rack/sites
Summary: Server hinting does not replicate to "other" machines/rack/sites
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Server
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ER10
Target Release: 6.1.0
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-10-22 08:58 UTC by Tomas Sykora
Modified: 2013-01-23 11:08 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-01-23 11:07:41 UTC
Type: Bug
Embargoed:


Attachments
ispn config hinting-machine (1.23 KB, text/xml)
2012-10-22 08:58 UTC, Tomas Sykora
standalone-ha.xml for node0 and node1 (18.93 KB, text/xml)
2012-10-22 11:30 UTC, Tomas Sykora
standalone-ha.xml for node2 (18.94 KB, text/xml)
2012-10-22 11:31 UTC, Tomas Sykora
TRACE log for Dan, server hinting - rack - from our test suite (2.72 MB, text/plain)
2012-11-15 18:46 UTC, Tomas Sykora
TRACE log, server hinting - rack - from our test suite ER5 (2.68 MB, text/plain)
2012-12-06 16:06 UTC, Tomas Sykora
ER7 machine case - problem case TRACE log (1.70 MB, text/plain)
2013-01-07 13:36 UTC, Tomas Sykora
ER7 site case - OK / passing case TRACE log (2.89 MB, text/plain)
2013-01-07 13:37 UTC, Tomas Sykora


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | ISPN-2318 | 0 | Blocker | Resolved | Reimplement a Topology-Aware Consistent Hash | 2015-11-06 21:56:23 UTC
Red Hat Issue Tracker | ISPN-2566 | 0 | Critical | Resolved | TopologyAwareConsistentHashFactory rebalance doesn't redistribute data properly | 2015-11-06 21:56:23 UTC
Red Hat Issue Tracker | ISPN-2703 | 0 | Critical | Resolved | The "machine" attribute of server hinting does not work | 2015-11-06 21:56:22 UTC

Description Tomas Sykora 2012-10-22 08:58:28 UTC
Created attachment 631337 [details]
ispn config hinting-machine

The problem is the same for rack, site and machine. The following jdg-node configs are taken from the server hinting machine test:

node0:
port.offset=0
siteId=primary
rackId=primary
machineId=primary

node1:
port.offset=100
siteId=primary
rackId=primary
machineId=primary

node2:
port.offset=200
siteId=primary
rackId=primary
machineId=secondary


We have set owners="2", so IMO the expectation is that entries are distributed between node0 and node1 (in this particular case) and replicated to node2. So we should end up with, for example, 2 entries on node0, 3 entries on node1 and all 5 entries on node2.

Unfortunately, there is NO replication to node2. We end up with replication only between node0 and node1, whereas the cache should be distributed across node0 and node1 with replication to node2.

For details, please see the attached config of the ispn subsystem for this particular case.

Comment 1 Martin Gencur 2012-10-22 11:28:12 UTC
I did a short test too, and it seems that the site, rack and machine attributes in the JGroups subsystem are currently completely ignored by Infinispan. Entries are not replicated to a different site/rack/machine as they should be.

Comment 2 Tomas Sykora 2012-10-22 11:30:10 UTC
Created attachment 631418 [details]
standalone-ha.xml for node0 and node1

Comment 3 Tomas Sykora 2012-10-22 11:31:39 UTC
Created attachment 631419 [details]
standalone-ha.xml for node2

Attaching the whole config files, including the jgroups subsystem.

Comment 4 JBoss JIRA Server 2012-10-24 09:26:06 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2318

I implemented a version on top of DefaultConsistentHashFactory, and another version based on SyncConsistentHashFactory. I needed to modify both in order to reuse some code.

Comment 5 Tomas Sykora 2012-11-15 13:49:00 UTC
Unfortunately, this was not fixed. It is still failing with the current ER3. Setting back to ON_DEV.

Comment/ping/mail me if any logs/info are needed.
They will be provided ASAP.

Comment 6 Dan Berindei 2012-11-15 14:44:12 UTC
Tomas, if you have trace logs, please post them here.

Comment 7 Tomas Sykora 2012-11-15 18:46:54 UTC
Created attachment 645826 [details]
TRACE log for Dan, server hinting - rack - from our test suite

Hi Dan,

here is the TRACE log from our test suite.
Additional info:

please see https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/xml-configuration/server-hinting/rack/src/test/java/com/jboss/datagrid/test/configuration/ServerHintingConfigurationTest.java; this test is failing on this line:

 assertTrue("server 1 elements are not contained in server 2", s2Bulk.containsAll(s1Bulk));

Hope that helps.
Let me know if you need something else, another log, more specific etc.

Comment 8 Dan Berindei 2012-11-16 08:09:48 UTC
Thanks Tomas, it's clear now. 

We have a unit test for TopologyAwareConsistentHashFactory, but it wasn't showing this problem because it always creates the consistent hash from scratch instead of going through repeated rebalance operations, as would normally happen.

Comment 9 Tomas Sykora 2012-12-06 16:06:39 UTC
Created attachment 658832 [details]
TRACE log, server hinting - rack - from our test suite ER5

Still no luck. Setting back to ON_DEV. Attaching a new TRACE log from the rack test on ER5. I don't know how else I can help right now; just let me know and I will do my best.

Comment 10 Dan Berindei 2012-12-10 11:43:03 UTC
Tomas, I don't think the fix is included in JDG ER5/Infinispan 5.2.0.Beta5. It will be included in ER6/Beta6.

Comment 11 Tomas Sykora 2012-12-10 12:34:58 UTC
Thanks Dan! I totally missed this fact. Will verify with ER6 then ;)

Comment 12 Tomas Sykora 2013-01-07 13:36:34 UTC
Created attachment 674014 [details]
ER7 machine case - problem case TRACE log

Comment 13 Tomas Sykora 2013-01-07 13:37:11 UTC
Created attachment 674015 [details]
ER7 site case - OK / passing case TRACE log

Comment 14 Tomas Sykora 2013-01-07 13:41:01 UTC
Hi Dan,

please see the two latest TRACE logs for more information. In ER6/7, server hinting was fixed for rack and site. Machine still does not seem to be working.

When looking into the logs, I found these differences:

for MACHINE (test is not passing):

14:22:16,074 TRACE [org.infinispan.statetransfer.StateTransferManagerImpl] (OOB-76,null) Installing new cache topology CacheTopology{id=4, currentCH=DefaultConsistentHash{numSegments=1, numOwners=2, members=[node0/default(primary), node1/default(primary), node2/default(primary)], owners={0: 0 2}, pendingCH=null} on cache topology

for SITE (is ok, was fixed in ER6):

14:20:25,315 TRACE [org.infinispan.statetransfer.StateTransferManagerImpl] (OOB-76,null) Installing new cache topology CacheTopology{id=4, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[node0/default(primary), node1/default(primary), node2/default(secondary)], owners={0: 0 2, 1: 0 2, 2: 0 2, 3: 0 2, 4: 0 2, 5: 0 2, 6: 0 2, 7: 0 2, 8: 0 2, 9: 0 2, 10: 0 2, 11: 0 2, 12: 0 2, 13: 0 2, 14: 0 2, 15: 0 2, 16: 0 2, 17: 0 2, 18: 0 2, 19: 0 2, 20: 0 2, 21: 0 2, 22: 0 2, 23: 0 2, 24: 0 2, 25: 0 2, 26: 0 2, 27: 2 0, 28: 2 0, 29: 2 0, 30: 2 0, 31: 2 0, 32: 2 0, 33: 2 0, 34: 2 0, 35: 2 0, 36: 2 0, 37: 2 0, 38: 2 0, 39: 2 0, 40: 1 2, 41: 1 2, 42: 1 2, 43: 1 2, 44: 1 2, 45: 1 2, 46: 1 2, 47: 1 2, 48: 1 2, 49: 1 2, 50: 1 2, 51: 1 2, 52: 1 2, 53: 1 2, 54: 1 2, 55: 1 2, 56: 1 2, 57: 1 2, 58: 1 2, 59: 1 2, 60: 1 2, 61: 1 2, 62: 1 2, 63: 1 2, 64: 1 2, 65: 1 2, 66: 1 2, 67: 2 1, 68: 2 1, 69: 2 1, 70: 2 1, 71: 2 1, 72: 2 1, 73: 2 1, 74: 2 1, 75: 2 1, 76: 2 1, 77: 2 1, 78: 2 1, 79: 2 1}, pendingCH=null} on cache topology

Could this be a potential problem?
Thank you very much for your investigation. If you need any other info, let me know.

Setting back to ON_DEV for now, even though 2 of the 3 cases were fixed and verified.
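
To compare two "Installing new cache topology" lines like the ones quoted above, the owners map can be summarised per member. The following is a minimal Java sketch, not part of the test suite: the class and method names are hypothetical, and it assumes that each entry in owners={...} has the form "segment: memberIndex memberIndex ...", with the indices referring to positions in the members list, as the quoted logs suggest.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, written only to illustrate the log comparison.
class OwnersMapParserSketch {

    // Captures the text between "owners={" and the next "}".
    private static final Pattern OWNERS = Pattern.compile("owners=\\{([^}]*)\\}");

    // Returns, for each member index, the number of segments it owns
    // (as primary or backup) according to the trace line.
    static SortedMap<Integer, Integer> segmentsPerMember(String traceLine) {
        SortedMap<Integer, Integer> counts = new TreeMap<>();
        Matcher matcher = OWNERS.matcher(traceLine);
        if (!matcher.find()) {
            return counts; // no owners map on this line
        }
        for (String entry : matcher.group(1).split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                continue; // skip empty or malformed entries
            }
            for (String index : parts[1].trim().split("\\s+")) {
                counts.merge(Integer.parseInt(index), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the MACHINE and SITE lines quoted above.
        System.out.println(segmentsPerMember("numSegments=1, owners={0: 0 2}, pendingCH=null"));
        System.out.println(segmentsPerMember("owners={0: 0 2, 1: 1 2, 2: 2 0}, pendingCH=null"));
    }
}
```

Under these assumptions, a member index that never shows up in the result owns no segment at all, which is what the MACHINE line shows for member 1.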

Comment 17 JBoss JIRA Server 2013-01-09 16:44:03 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

This should be actually fixed in CR2 but this option is not there yet.

Comment 18 JBoss JIRA Server 2013-01-15 12:52:19 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2703

[~mgencur], I have looked at the attached log but I didn't see any consistent hash with more than 2 members, so I'm not sure what the problem is.

Could you describe the issue with more details? What the cluster configuration is, what you expect to see, and what you get instead?

Comment 20 JBoss JIRA Server 2013-01-15 14:17:53 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

Dan, let me describe the problem more thoroughly.

We are starting 3 servers but numOwners is still only 2. We expect that whenever we put a key/value pair into the cache on node0 or node1, it will be replicated to node2 because node2 has a different "machine" attribute (rack and site are the same for all 3 nodes).

The site/rack/machine attributes are configured in the following way:

node0:
site=primary
rack=primary
machine=primary

node1:
site=primary
rack=primary
machine=primary

node2:
site=primary
rack=primary
machine=secondary

The test we are running is located at https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/xml-configuration/server-hinting/machine/src/test/java/com/jboss/datagrid/test/configuration/ServerHintingConfigurationTest.java

After I added a little more logging to the test, I got the following:
Entry count - node0: 1 ,node1: 0, node2: 1
Entry count - node0: 2 ,node1: 0, node2: 2
Entry count - node0: 3 ,node1: 0, node2: 3
Entry count - node0: 4 ,node1: 0, node2: 4
Entry count - node0: 5 ,node1: 0, node2: 5
Entry count - node0: 6 ,node1: 0, node2: 6
Entry count - node0: 7 ,node1: 0, node2: 7
Entry count - node0: 8 ,node1: 0, node2: 8
Entry count - node0: 9 ,node1: 0, node2: 9
...still the same even with more iterations.

The same test already works correctly for both rack and site attributes (I mean replicating to a different site/rack) and the output then looks like this:
Entry count - node0: 0 ,node1: 1, node2: 1
Entry count - node0: 0 ,node1: 2, node2: 2
Entry count - node0: 1 ,node1: 2, node2: 3
Entry count - node0: 1 ,node1: 3, node2: 4
Entry count - node0: 2 ,node1: 3, node2: 5

Comment 21 JBoss JIRA Server 2013-01-22 10:51:38 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2703

[~mgencur], I looked at the test but I don't really understand how it works. Where is the topology information configured? 

Are you sure that all 3 servers are up and running during the test? I have looked again at the attached log, and I didn't see any signs of a 3rd node starting. This is the only mention of {{node2}}:

{noformat}
[INFO] --- maven-resources-plugin:2.5:copy-resources (node2) @ server-hinting-site-tests ---
{noformat}

Could you run the test again and attach the logs from all the servers?

Note that we already have a test for the machine attribute and it works (TopologyAwareConsistentHashFactoryTest.testDifferentMachines), so I'm pretty sure this is just a setup issue.

Comment 22 JBoss JIRA Server 2013-01-22 12:00:16 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

Dan, I'm attaching DEBUG logs from all three servers as well as the maven build output. The topology information is configured in JGroups subsystem of JDG and this is the configuration snippet: https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/common-utils/src/test/resources/jgroups-with-topology.xml

Maven puts together individual pieces of configuration and the final standalone-ha.xml configuration file is located at ${JDG_HOME}/standalone/configuration/standalone-ha.xml

The previously attached log was provided by Tomas and I guess it's not correct.

We wouldn't consider it a bug, but this test was passing for JDG 6.0.1 (based on ISPN 5.1.7).

Comment 24 JBoss JIRA Server 2013-01-22 12:07:10 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

Actually, the originally attached logs were taken from related BugZilla and were supposed to show the problem just for "site" attribute. This has been recently fixed and now only the "machine" attribute is not working.

Comment 25 JBoss JIRA Server 2013-01-22 15:21:36 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

Dan, let me test our scenario with more than 3 nodes. I checked your test (TopologyAwareConsistentHashFactoryTest.testNumberOfOwners), added more nodes, and it seems it distributes the keys more or less uniformly: one copy is always on a different machine, and the other copy is evenly distributed among the remaining nodes (if all but one node have the same value for the machine attribute). I'll let you know the results.
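
The placement rule described here (one copy always on a different machine) can be sketched in Java as follows. This is a deliberately simplified illustration and not the algorithm used by TopologyAwareConsistentHashFactory: the Node class, the assignOwners method and the round-robin choice of primary owner are all hypothetical, and the sketch makes no attempt to balance the backup copies evenly.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; all names here are hypothetical.
class TopologyAwareOwnersSketch {

    static final class Node {
        final String name;
        final String machine;

        Node(String name, String machine) {
            this.name = name;
            this.machine = machine;
        }
    }

    // For each segment, picks a primary owner round-robin and then a backup
    // owner, preferring the next member that runs on a different machine.
    static List<List<Node>> assignOwners(List<Node> members, int numSegments) {
        List<List<Node>> owners = new ArrayList<>();
        int n = members.size();
        for (int segment = 0; segment < numSegments; segment++) {
            Node primary = members.get(segment % n);
            Node backup = null;
            for (int i = 1; i < n && backup == null; i++) {
                Node candidate = members.get((segment + i) % n);
                if (!candidate.machine.equals(primary.machine)) {
                    backup = candidate;
                }
            }
            // Fall back to any other member if all share the same machine.
            if (backup == null && n > 1) {
                backup = members.get((segment + 1) % n);
            }
            owners.add(backup == null ? List.of(primary) : List.of(primary, backup));
        }
        return owners;
    }

    public static void main(String[] args) {
        List<Node> members = List.of(
                new Node("node0", "primary"),
                new Node("node1", "primary"),
                new Node("node2", "secondary"));
        for (List<Node> segmentOwners : assignOwners(members, 6)) {
            System.out.println(segmentOwners.get(0).name + " " + segmentOwners.get(1).name);
        }
    }
}
```

With the three-node layout from this bug, every segment in the sketch ends up with node2 as one of its two owners, because node2 is the only member on a different machine.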

Comment 26 JBoss JIRA Server 2013-01-22 15:46:46 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

So I revised the test configuration and found out that the numSegments attribute was set to 1. I increased it to 30 and the test is now passing.

Is this really expected not to work when numSegments=1? Thanks for the explanation.

Comment 27 JBoss JIRA Server 2013-01-23 10:58:18 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2703

Ah, sorry, I looked at the test's infinispan.xml but I didn't notice numSegments was set to 1.

Yes, this is expected (but perhaps under-documented). The consistent hash doesn't actually distribute keys among the nodes, it only distributes segments. Each key is mapped to a segment, and the segment is mapped to a list of owners. So if there is only one segment, all the keys will map to that segment, and all of them will have the same owners.
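
The key -> segment -> owners mapping described above can be illustrated with a minimal Java sketch. It is not Infinispan code: the class name, the hash-modulo key mapping and the hand-written owner lists are assumptions made only for illustration. It shows why a single segment puts every entry on the same two nodes, which matches the "node0: 9, node1: 0, node2: 9" counts reported earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only; Infinispan's real hashing differs.
class SegmentOwnershipSketch {

    // Hypothetical key-to-segment mapping: non-negative hash modulo numSegments.
    static int segmentOf(Object key, int numSegments) {
        return Math.floorMod(key.hashCode(), numSegments);
    }

    // Counts how many of the given keys each member stores, where
    // segmentOwners.get(s) is the list of owners of segment s.
    static Map<String, Integer> entriesPerNode(List<String> keys,
                                               List<String> members,
                                               List<List<String>> segmentOwners) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String member : members) {
            counts.put(member, 0);
        }
        for (String key : keys) {
            int segment = segmentOf(key, segmentOwners.size());
            for (String owner : segmentOwners.get(segment)) {
                counts.merge(owner, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> members = List.of("node0", "node1", "node2");
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 9; i++) {
            keys.add("key" + i);
        }

        // numSegments=1: the only segment is owned by node0 and node2.
        List<List<String>> oneSegment = List.of(List.of("node0", "node2"));
        System.out.println(entriesPerNode(keys, members, oneSegment));
        // prints {node0=9, node1=0, node2=9}

        // Several segments: node2 is an owner of each, node0 and node1 share the rest.
        List<List<String>> fourSegments = List.of(
                List.of("node0", "node2"), List.of("node1", "node2"),
                List.of("node2", "node0"), List.of("node2", "node1"));
        System.out.println(entriesPerNode(keys, members, fourSegments));
    }
}
```

In the multi-segment case node2 still holds all nine entries, while the split between node0 and node1 depends on how the keys hash. With only one segment there is nothing to split.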

Comment 28 JBoss JIRA Server 2013-01-23 11:01:14 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-2703 to Resolved

Comment 29 JBoss JIRA Server 2013-01-23 11:01:14 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2703

I'm rejecting the bug, as this is expected when numSegments==1.

Martin, thanks for the explanation on the build system, I'll try to use it on the next JDG-related bug :)

Comment 30 Martin Gencur 2013-01-23 11:07:41 UTC
Closing the bug as this was a configuration issue.

Comment 31 JBoss JIRA Server 2013-01-23 11:08:12 UTC
Martin Gencur <mgencur> made a comment on jira ISPN-2703

Thanks Dan, good to know that.

