Bug 809631

Summary: Uneven request balancing after node restore
Product: JBoss Data Grid 6
Reporter: Michal Linhard <mlinhard>
Component: Infinispan
Assignee: Tristan Tarrant <ttarrant>
Status: VERIFIED
QA Contact: Nobody <nobody>
Severity: high
Priority: high
Version: 6.0.0
CC: jdg-bugs, nobody
Target Release: 6.0.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Doc Text:
Previously, after a node crashed and rejoined the cluster, it did not receive client load at the same level as the other nodes. The Hot Rod server is now fixed so that the view identifier is not updated until the topology cache contains the addresses of all nodes.
Attachments:
Fix for the uneven req balancing

Description Michal Linhard 2012-04-03 21:02:15 UTC
This happened in an 8-node resilience test where 2 nodes were crashed.
Clustering mode DIST, numOwners=3
http://www.qa.jboss.com/~mlinhard/hyperion/run51-resi-dist-size8-3own

This is what happens with the views:
http://www.qa.jboss.com/~mlinhard/hyperion/run51-resi-dist-size8-3own/table.html

After the crash, one of the nodes gets significantly lower load than the others:
http://www.qa.jboss.com/~mlinhard/hyperion/run51-resi-dist-size8-3own/report/stats-throughput.png

Keys are balanced evenly, though:
http://www.qa.jboss.com/~mlinhard/hyperion/run51-resi-dist-size8-3own/report/cache_entries.png

Comment 2 Tristan Tarrant 2012-04-04 12:58:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
After a node crashes and rejoins the cluster, it does not receive client load at the same level as the other nodes.
This issue does not affect data integrity and distribution in the cluster.

Comment 3 JBoss JIRA Server 2012-04-20 13:25:50 UTC
Manik Surtani <manik.surtani> made a comment on jira ISPN-1995

Could this be due to topology not being properly updated on the clients?  I presume this is with Hot Rod?  Can you see, in the client logs, that the topology update has been received?

Comment 4 Michal Linhard 2012-04-25 09:24:22 UTC
Unfortunately, I turned off this kind of message in those runs.
I'll try to rerun with these enabled, probably next week. Right now I'm swamped.

Comment 5 Michal Linhard 2012-05-02 15:14:48 UTC
Happened again in 16 node resilience test for ER7:
http://www.qa.jboss.com/~mlinhard/hyperion/run81-resi-dist-16/report/stats-throughput.png

I'll rerun with more detailed client logs ASAP.

Comment 6 Michal Linhard 2012-05-03 08:37:16 UTC
Hyperion lab experiencing some problems, so this will take longer.

Comment 7 Michal Linhard 2012-05-04 16:37:55 UTC
OK, I've rerun it in Hyperion and reproduced it again:
http://www.qa.jboss.com/~mlinhard/hyperion/run85-resi-dist-16/report/stats-throughput.png

this time with some log analysis:
http://www.qa.jboss.com/~mlinhard/hyperion/run85-resi-dist-16/report/loganalysis/client/

Config: 16 nodes, DIST mode, numOwners 3, crashing 2 nodes

The steps of the resilience test are as follows:

1. Start complete cluster node0001 - node0016, wait till it forms (View 6 around 02:44:32,594)
2. Wait 5 min
3. Kill node0002, node0003, wait till survivor cluster forms (View 7 around 02:50:54,200)
4. Wait 5 min
5. Restore node0002, node0003, wait till complete cluster forms again (View 9 around 02:56:10,164)

These views are created:
http://www.qa.jboss.com/~mlinhard/hyperion/run85-resi-dist-16/report/loganalysis/views.html

There are two anomalies during the test:

When the nodes are killed, 2 clients (333 and 459) remove them from the topology, add them back, and remove them again within a few seconds. (That's why we're seeing 502 node adds and removes in the client logs even though there are only 500 clients.)

After the two nodes are restored, the clients start to obtain the topology information,
first about node0003 (starting around 02:55:57,302) and then about node0002 (starting around 02:56:03,581).

However, in 165 cases (33% of the clients) the clients don't obtain the information about node0002 being added,
which corresponds to the load (throughput) of node0002 being about 33% lower than that of the other nodes (in the throughput chart).

Comment 8 JBoss JIRA Server 2012-05-09 06:04:12 UTC
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-1995 to Coding In Progress

Comment 9 mark yarborough 2012-05-09 13:23:22 UTC
Per Tristan triage, moved from ER8 to ER9 target release.

Comment 10 JBoss JIRA Server 2012-05-09 15:03:04 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

@Michal, thx for generating the logs. I see two main issues here:

1. Even after nodes have been killed, some servers might still send old views. This can only happen if clients are sending a different view id from the one in the server. The only explanation I see is this (and the logs seem to indicate so): Client-333 sends a view id, let's say 1, and the server has been updated to view id 2, so it sends back the new view. Now, Client-333 goes to a different node where the view has not yet been updated and it sends view id 2. That server still has view id 1, and so, as it considers it 'different', it sends back the old view. The client then cannot connect to the servers that have been killed.

Removing crashed members happens in an async view change listener, which works on the local cache only. This is done this way to avoid issues with transport view id changing before the cache has been updated, see ISPN-1654 for more info.

So, this particular issue can be easily solved by having servers skip sending view changes if clients have a 'newer' view id; by newer I mean a strictly higher view id.

"New topology..." info message could also with adding what the view id is, plus which node this topology comes from.

2. The second problem is a weirder one. The following doesn't make much sense:

{code}02:56:03,620 INFO  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Client-333) ISPN004014: New server added(/192.168.87.92:11222), adding to the pool.
...
02:56:06,621 WARN  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Client-333) ISPN004008: Could not connect to server: /192.168.87.91:11222{code}

Assuming that 192.168.87.91 is node0002, and node0002 has been restarted, it would appear that either:
a) the cluster formed and the server view in another node was updated before the endpoint was fully started on .91, or
b) the view was updated around the cluster, but the server endpoint failed to start.

Looking at the server.log for .91, the problem is a):

{code}02:55:56,711 INFO  [stdout] (pool-2-thread-1) GMS: address=node0002/default, cluster=default, physical address=192.168.87.91:55200
02:55:56,711 INFO  [stdout] (pool-2-thread-1) -------------------------------------------------------------------
02:55:56,913 INFO  [com.jboss.datagrid] (MSC service thread 1-16) REST Server starting
02:55:56,914 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-15) MemcachedServer starting
02:55:56,914 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-5) HotRodServer starting 
...
02:56:10,146 INFO  [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-5) HotRodServer started in 13232ms
{code}

What's happening is the cache view installation, state transfer, etc., which is delaying the endpoint being started. At first glance, moving the registration of the node to after the transport has started could work, but what if a request comes in and the view id is not yet set? There's a second problem here: the view id is set with a view listener, so if the transport is started before the view listener is installed, we'd miss view ids. So, this needs a bit more thinking.
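
To illustrate the ordering constraint, a minimal sketch (all names here are hypothetical stand-ins, not the actual server classes): the view listener must be registered before the transport starts, and the node must not advertise itself in the topology until the endpoint can actually serve requests.

{code}
import java.util.concurrent.atomic.AtomicInteger;

public class EndpointStartupSketch {
    private final AtomicInteger viewId = new AtomicInteger(-1); // -1 = no view received yet

    public static void main(String[] args) {
        new EndpointStartupSketch().start();
    }

    void start() {
        installViewListener();      // 1. listener first, so no view id delivered during startup is missed
        startTransport();           // 2. JGroups joins the cluster; the listener records the view id
        startHotRodEndpoint();      // 3. bind the Hot Rod socket so requests can actually be served
        registerInTopologyCache();  // 4. only now advertise this node's address to clients
    }

    void installViewListener()     { /* would register a view-change listener that calls viewId.set(...) */ }
    void startTransport()          { viewId.set(1); /* stand-in for the listener firing on the first view */ }
    void startHotRodEndpoint()     { /* would open the endpoint's server socket */ }
    void registerInTopologyCache() { /* would put this node's address into the topology cache */ }
}
{code}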

Comment 11 Michal Linhard 2012-05-10 08:58:34 UTC
Another appearance in ER7, 32 node resilience tests numOwners5, failing 4 nodes.
http://www.qa.jboss.com/~mlinhard/hyperion/run96-resi-dist-32-numowners5/

Comment 12 Tristan Tarrant 2012-05-11 06:25:24 UTC
Created attachment 583745 [details]
Fix for the uneven req balancing

The zip contains both HotRod client and server jars. They must be put in place of the equivalent jar files in JDG ER8.1. The client jar has to be used for the test drivers.

Comment 13 Michal Linhard 2012-05-15 09:25:54 UTC
Tried a 32-node resilience test with numOwners=5 and 4-node failures:

I couldn't spot the original issue:
http://www.qa.jboss.com/~mlinhard/hyperion/run101-resi-dist-32-numowners5/report/stats-throughput.png

but another problem shows up, probably connected with the Netty upgrade:
http://www.qa.jboss.com/~mlinhard/hyperion/run101-resi-dist-32-numowners5/logs/analysis/server/categories/cat6.html

Comment 14 JBoss JIRA Server 2012-05-15 09:35:09 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

Hmmmm, I need to check with Tristan, but the Netty upgrade required changes to Infinispan itself. I dunno whether Tristan ported over those changes too. I'm gonna send the pull req for that now.

Comment 15 Michal Linhard 2012-05-15 10:02:18 UTC
16-node resilience test with numOwners=3 and 2-node failures:

The original issue is also gone:
http://www.qa.jboss.com/~mlinhard/hyperion/run102-resi-dist-16-er8.1-patch/report/stats-throughput.png

But we've got lots of our invalid magic number friend (https://bugzilla.redhat.com/show_bug.cgi?id=807741):

http://www.qa.jboss.com/~mlinhard/hyperion/run102-resi-dist-16-er8.1-patch/logs/analysis/client/

Comment 16 Michal Linhard 2012-05-15 13:42:27 UTC
Reran with the Netty fix, but I got a failed prepare; I need to set different timeouts and rerun:

http://www.qa.jboss.com/~mlinhard/hyperion/run103-resi-dist-32-numowners5-ER8.1-nettyfix/logs/analysis/views.html

The netty problem didn't appear this time.

Comment 18 Michal Linhard 2012-05-15 16:31:59 UTC
The same problem in the size-16, numOwners=3, fail-2 tests:

http://www.qa.jboss.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/report/stats-throughput.csv

node0003 doesn't recover its throughput at all.

Comment 20 JBoss JIRA Server 2012-05-16 08:18:45 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

@Michal, re: the 16-node test: is the throughput stats collection right? According to the stats, the node that doesn't seem to recover is node0002 rather than node0003. It might be that the stats are only collected until view5 has been installed (http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/logs/analysis/views.html) but not view6? I can't map the relative times in the stats to absolute times in the logs. Otherwise, I don't see any errors that would explain this.

re: the 32-node tests: the stats issue looks similar, but it's probably just a false alarm. It would appear as if the stats finish in view 12, and view 13 does not appear.

Assuming the stats are right, it seems like the issue is the same: the last view (view6 in the 16-node test, view 13 in the 32-node test) is not installed. This could either be because:

a) the server does not send it
b) the client does not install it

To cover all bases, repeat the test with:

1. In server, enable TRACE on {code}org.infinispan.server.hotrod.Encoders$Encoder11${code}
2. In client, enable TRACE on {code}org.infinispan.client.hotrod.impl.protocol.Codec11{code}
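
For the client-side test drivers, a minimal sketch of raising those two categories to TRACE, assuming a log4j 1.x backend (an assumption about the drivers, not something stated above); on the server side the category would normally be set in the JDG server's logging configuration instead:

{code}
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class EnableTopologyTrace {
    public static void main(String[] args) {
        // Client side: wire-level decoding of the Hot Rod topology header
        Logger.getLogger("org.infinispan.client.hotrod.impl.protocol.Codec11")
              .setLevel(Level.TRACE);
        // Server-side category named above, shown here only for completeness
        Logger.getLogger("org.infinispan.server.hotrod.Encoders$Encoder11$")
              .setLevel(Level.TRACE);
    }
}
{code}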

Comment 21 Tristan Tarrant 2012-05-17 14:15:27 UTC
Tentatively marking as resolved

Comment 22 Michal Linhard 2012-05-18 06:48:21 UTC
Fail.

Tested with ER9 (md5sum 4d277c8626cdada863a3d26442044775)
Client-side Infinispan version: 5.1.5.CR1


32 node resilience tests numOwners 5 fail 4:
http://www.qa.jboss.com/~mlinhard/hyperion/run110-resi-dist-32-ER9-own5/chart-throughput-killed-nodes.png

After all four nodes [2,5,11,13] are restored, only node 13 gets updated in the clients.
This always seems to correspond with the order of the views in which the final restored view is installed:
http://www.qa.jboss.com/~mlinhard/hyperion/run110-resi-dist-32-ER9-own5/logs/analysis/views.html

The crashed view - 8: All - [2,5,11,13]
View 9:               View 8 + [13]
Restored view:        All

Comment 23 Michal Linhard 2012-05-18 06:57:52 UTC
I was able to reproduce it on my laptop with 4 nodes and 1 client.
After the 2 nodes were restored (with one intermediate view) the node that was added in the second view wasn't reported back to the client. Trace logs here:
http://www.qa.jboss.com/~mlinhard/test_results/bz809631-reproduce.zip

Comment 24 Michal Linhard 2012-05-18 08:27:44 UTC
@Galder, just in case you don't already know: I'm looking at the trace logs as well and on node01 I'm seeing this:

HotRodServer.viewId is set to 2 for both events - the adding of node02 and the adding of node03:

14:34:14,429 TRACE [org.infinispan.server.hotrod.HotRodServer] (undefined) Address cache had CACHE_ENTRY_CREATED for key node03/default. View id is now 2
14:34:14,468 TRACE [org.infinispan.server.hotrod.HotRodServer] (undefined) Address cache had CACHE_ENTRY_CREATED for key node02/default. View id is now 2

Since the viewId didn't change, the HotRodServer doesn't think it needs to send info about node02 to the client.

Comment 25 Michal Linhard 2012-05-18 08:34:07 UTC
This is caused by the discrepancy between the JGroups and Infinispan CacheView ids.

There are only 2 JGroups view ids in this test:

14:31:06,253 [node01/default|1] [node01/default, node04/default]
14:33:41,943 [node01/default|2] [node01/default, node04/default, node03/default, node02/default]

but there are 4 view ids when it comes to the CacheView for ___hotRodTopologyCache.

Comment 26 JBoss JIRA Server 2012-05-18 11:34:14 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

Michal, I'm looking at the logs now. At first glance, the fact that JGroups incorporates several nodes into a single view shouldn't be problematic, unless the cache is filled up with the nodes in view 2 after the view id is set. Then what happens is that you have an updated view id but not all the nodes. Cache views are a different topic, which should not have an influence here.

Comment 27 JBoss JIRA Server 2012-05-18 12:05:29 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

The cache filling up after the view id is set is exactly what's happening here. This is the sequence:

a) Before ISPN-1995, the order of startup was:
1. Add node to topology
2. Start cache manager + JGroups transport
3. Start server endpoint.

The problem with this was that other nodes were finding nodes in the topology that might not yet have started the endpoint, resulting in connection refused exceptions.

b) The fix I implemented changed this around to be:
1. Start cache manager + JGroups transport
2. Start server endpoint.
3. Add node to topology

The problem now is that the JGroups transport starts before the node is added, and so the topology id (which is updated when a node is added to the cache) can be updated, if all nodes have formed the cluster already, before the other nodes have added themselves to the topology cache, hence resulting in missing nodes.

I think the order we have now is fine. I can see two ways to fix this:
1. Make ViewIdUpdater only update the view id when the view matches all the nodes in the cache. IOW, in the above example, the view id would only be set to 2 once we've verified that the address cache contains all 4 nodes in the JGroups view. The problem here is that the verification would need to be done in the ViewIdUpdater, and I'm not sure the cache will have all the entries (annoyances of cache listeners).
2. Instead of relying on the transport view id, use the cache view id for the Hot Rod cache. I need to check with Dan, but so far this is my preferred option.
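
A minimal sketch of option 1 (hypothetical names, not the actual ViewIdUpdater code): the view id is only bumped once the topology cache holds an address for every member of the JGroups view, so a client can never be handed a new view id paired with an incomplete member list.

{code}
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class ViewIdUpdaterSketch {
    /** Returns the view id to publish: the candidate if the cache is complete, otherwise the current one. */
    static int updateViewId(int currentViewId, int candidateViewId,
                            Collection<String> jgroupsViewMembers,
                            Set<String> topologyCacheAddresses) {
        if (topologyCacheAddresses.containsAll(jgroupsViewMembers)) {
            return candidateViewId; // every member has registered its endpoint address
        }
        return currentViewId;       // hold back until the missing members appear in the cache
    }

    public static void main(String[] args) {
        Collection<String> view2 = Arrays.asList("node01", "node04", "node03", "node02");
        Set<String> cache = new HashSet<>(Arrays.asList("node01", "node04", "node03"));
        System.out.println(updateViewId(1, 2, view2, cache)); // 1: node02 not yet in the topology cache
        cache.add("node02");
        System.out.println(updateViewId(1, 2, view2, cache)); // 2: now safe to publish the new view id
    }
}
{code}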

Comment 29 JBoss JIRA Server 2012-05-18 17:20:16 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

Ok, I'll submit the pull req. Thx Michal!

Comment 30 JBoss JIRA Server 2012-05-18 17:24:23 UTC
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-1995 to Reopened

Comment 31 Michal Linhard 2012-05-22 21:13:20 UTC
Verified. Not present in JDG 6.0.0.ER10

Comment 32 Misha H. Ali 2012-06-05 22:19:12 UTC
This bug is marked for inclusion in the JDG 6 GA Release Notes. Setting NEEDINFO for the assignee to request fix information and to update the problem statement if the technical note is outdated.

Comment 33 Tristan Tarrant 2012-06-06 09:57:03 UTC
The Hot Rod server has been fixed so that the view identifier is not updated until the topology cache contains the addresses of all nodes.

Comment 34 Tristan Tarrant 2012-06-06 09:57:04 UTC
Deleted Technical Notes Contents.

Old Contents:
After a node crashes and rejoins the cluster, it does not receive client load at the same level as the other nodes.
This issue does not affect data integrity and distribution in the cluster.

Comment 35 Misha H. Ali 2012-06-07 03:11:16 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, after a node crashed and rejoined the cluster, it did not receive client load at the same level as the other nodes. The Hot Rod server is now fixed so that the view identifier is not updated until the topology cache contains the addresses of all nodes.