Bug 809631
| Summary: | Uneven request balancing after node restore |
|---|---|
| Product: | [JBoss] JBoss Data Grid 6 |
| Component: | Infinispan |
| Version: | 6.0.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Status: | CLOSED UPSTREAM |
| Severity: | high |
| Priority: | high |
| Reporter: | Michal Linhard <mlinhard> |
| Assignee: | Tristan Tarrant <ttarrant> |
| QA Contact: | Nobody <nobody> |
| Docs Contact: | |
| CC: | jdg-bugs, nobody |
| Target Milestone: | --- |
| Target Release: | 6.0.0 |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, after a node crashed and rejoined the cluster, it did not receive client load at the same level as the other nodes. The Hot Rod server is now fixed so that the view identifier is not updated until the topology cache contains the addresses of all nodes. |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2025-02-10 03:19:31 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Attachments: | |
Description
Michal Linhard 2012-04-03 21:02:15 UTC
Same thing happened in the 16 node test: http://www.qa.jboss.com/~mlinhard/hyperion/run57-resi-dist-size16-3own/report/stats-throughput.png

Again, the entry counts are all right: http://www.qa.jboss.com/~mlinhard/hyperion/run57-resi-dist-size16-3own/report/cache_entries.png

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: After a node crashes and rejoins the cluster, it does not receive client load at the same level as the other nodes. This issue does not affect data integrity and distribution in the cluster.

Manik Surtani <manik.surtani> made a comment on jira ISPN-1995

Could this be due to the topology not being properly updated on the clients? I presume this is with Hot Rod? Can you see, in the client logs, that the topology update has been received?

Unfortunately I turned off these kinds of messages in those runs. I'll try to rerun with them enabled, probably next week. Right now I'm swamped.

Happened again in the 16 node resilience test for ER7: http://www.qa.jboss.com/~mlinhard/hyperion/run81-resi-dist-16/report/stats-throughput.png I'll rerun with more detailed client logs ASAP.

The Hyperion lab is experiencing some problems, so this will take longer.

OK, I've rerun in Hyperion and reproduced it again: http://www.qa.jboss.com/~mlinhard/hyperion/run85-resi-dist-16/report/stats-throughput.png

This time with some log analysis: http://www.qa.jboss.com/~mlinhard/hyperion/run85-resi-dist-16/report/loganalysis/client/

Config: 16 nodes, DIST mode, numOwners 3, crashing 2 nodes. The steps of the resilience test are as follows:

1. Start the complete cluster node0001 - node0016 and wait till it forms (View 6 around 02:44:32,594)
2. Wait 5 min
3. Kill node0002 and node0003 and wait till the survivor cluster forms (View 7 around 02:50:54,200)
4. Wait 5 min
5. Restore node0002 and node0003 and wait till the complete cluster forms again (View 9 around 02:56:10,164)

These views are created: http://www.qa.jboss.com/~mlinhard/hyperion/run85-resi-dist-16/report/loganalysis/views.html

There are two anomalies during the test:

1. Two clients (333 and 459), when the nodes are killed, remove them from the topology, add them back and remove them again within a few seconds (that's why we're seeing 502 node adds and removes in the client logs even though there are only 500 clients).
2. After the two nodes were restored, the clients start to obtain the topology information first about node0003 (starting around 02:55:57,302) and then about node0002 (starting around 02:56:03,581). However, in 165 cases the clients don't obtain the information about node0002 being added, which is 33% of the clients and corresponds to the load (throughput) of node0002 being roughly 33% lower than that of the other nodes (in the throughput chart).

Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-1995 to Coding In Progress

Per Tristan triage, moved from ER8 to ER9 target release.

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

@Michal, thx for generating the logs. I see two main issues here:
1. Even after nodes have been killed, some servers might still send old views. This can only happen if clients are sending a different view id than the one in the server. The only thing that I see possible is this (and the logs seem to indicate so): Client-333 sends a view id, let's say 1, and the server has been updated to view id 2, so it sends back the new view. Now Client-333 goes to a different node where the view has not yet been updated, and it sends view id 2. That server still has view id 1, and since it considers it 'different', it sends back the old view. The client cannot connect to the servers that have been killed. Removing crashed members happens in an async view change listener, which works on the local cache only. This is done this way to avoid issues with the transport view id changing before the cache has been updated; see ISPN-1654 for more info. So, this particular issue can easily be solved by servers skipping sending view changes if clients have a 'newer' view id, where by newer I mean a strictly higher view id (see the sketch after this comment). The "New topology..." info message could also do with adding what the view id is, plus which node the topology comes from.

2. The second problem is a weirder one. The following doesn't make much sense:

{code}02:56:03,620 INFO  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Client-333) ISPN004014: New server added(/192.168.87.92:11222), adding to the pool.
...
02:56:06,621 WARN  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Client-333) ISPN004008: Could not connect to server: /192.168.87.91:11222{code}

Assuming that 192.168.87.91 is node0002, and node0002 has been restarted, it would appear that either:
a) the cluster formed and the server view in another node was updated before the endpoint was fully started in .91, or
b) the view was updated around the cluster, but the server endpoint failed to start.

Looking at the server.log for .91, the problem is a):

{code}02:55:56,711 INFO [stdout] (pool-2-thread-1) GMS: address=node0002/default, cluster=default, physical address=192.168.87.91:55200
02:55:56,711 INFO [stdout] (pool-2-thread-1) -------------------------------------------------------------------
02:55:56,913 INFO [com.jboss.datagrid] (MSC service thread 1-16) REST Server starting
02:55:56,914 INFO [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-15) MemcachedServer starting
02:55:56,914 INFO [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-5) HotRodServer starting
...
02:56:10,146 INFO [com.jboss.datagrid.endpoint.ProtocolServerService] (MSC service thread 1-5) HotRodServer started in 13232ms{code}

What's happening is the cache view installation, state transfer, etc., which is delaying the endpoint being started. At first glance, moving the registration of the node to after the transport has started could work, but what if a request comes in and the view id is not yet set? There's a second problem here, and that is the fact that the view id is set by a view listener, so if the transport is started before the view listener is installed, we'd miss view ids. So, this needs a bit more thinking.
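A minimal sketch of the check described in issue 1: the class and method names below are hypothetical (this is not the actual Infinispan server code), and it assumes the server compares the view id carried in the client's request header against its own current view id. The point is that a topology update should only be sent when the server's id is strictly higher, not merely different.

{code}
// Hypothetical sketch, not the actual Infinispan server code.
public final class TopologyUpdateCheck {

    /** Returns true if the server should piggyback a new topology on the response. */
    static boolean shouldSendTopologyUpdate(int clientViewId, int serverViewId) {
        // "different" is not enough: a client that already holds a newer view than
        // this server must not be downgraded to the server's stale view.
        return serverViewId > clientViewId;
    }

    public static void main(String[] args) {
        System.out.println(shouldSendTopologyUpdate(1, 2)); // true: client is behind, send the new view
        System.out.println(shouldSendTopologyUpdate(2, 1)); // false: server is behind, send nothing
    }
}
{code}

With a check like this, the Client-333 scenario above (bouncing between servers holding view ids 1 and 2) would no longer cause the stale view to be re-sent.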
Another appearance in ER7, 32 node resilience tests, numOwners 5, failing 4 nodes: http://www.qa.jboss.com/~mlinhard/hyperion/run96-resi-dist-32-numowners5/

Created attachment 583745 [details]
Fix for the uneven req balancing

The zip contains both HotRod client and server jars. They must be put in place of the equivalent jar files in JDG ER8.1. The client jar has to be used for the test drivers.
Tried the 32 node resilience test with numOwners 5 and 4 node failures. I couldn't spot the original issue: http://www.qa.jboss.com/~mlinhard/hyperion/run101-resi-dist-32-numowners5/report/stats-throughput.png

But another problem shows up, probably connected with the upgrade of Netty: http://www.qa.jboss.com/~mlinhard/hyperion/run101-resi-dist-32-numowners5/logs/analysis/server/categories/cat6.html

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

Hmmmm, I need to check with Tristan, but the Netty upgrade required changes to Infinispan itself. I dunno whether Tristan ported over these changes too. I'm gonna send the pull req for that now.

16 node resilience test with numOwners 3 and 2 node failures: the original issue is also gone: http://www.qa.jboss.com/~mlinhard/hyperion/run102-resi-dist-16-er8.1-patch/report/stats-throughput.png

But we've got lots of our invalid magic number friend (https://bugzilla.redhat.com/show_bug.cgi?id=807741): http://www.qa.jboss.com/~mlinhard/hyperion/run102-resi-dist-16-er8.1-patch/logs/analysis/client/

Reran with the Netty fix but I've got a failed prepare; I need to set different timeouts and rerun: http://www.qa.jboss.com/~mlinhard/hyperion/run103-resi-dist-32-numowners5-ER8.1-nettyfix/logs/analysis/views.html The Netty problem didn't appear this time.

OK, now the traffic on the restored nodes doesn't recover at all: http://www.qa.jboss.com/~mlinhard/hyperion/run104-resi-dist-32-ER8.1-nettyfix/report/stats-throughput.csv but the view is recovered: http://www.qa.jboss.com/~mlinhard/hyperion/run104-resi-dist-32-ER8.1-nettyfix/logs/analysis/views.html

The same problem appears in the size 16, numOwners 3, fail 2 tests: http://www.qa.jboss.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/report/stats-throughput.csv node0003 doesn't recover its throughput at all.

Neither of the tests contains the Netty problem, though:
http://www.qa.jboss.com/~mlinhard/hyperion/run104-resi-dist-32-ER8.1-nettyfix/logs/analysis/server/
http://www.qa.jboss.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/logs/analysis/server/

Also, this time I couldn't reproduce the invalid magic number problem:
http://www.qa.jboss.com/~mlinhard/hyperion/run104-resi-dist-32-ER8.1-nettyfix/logs/analysis/client/
http://www.qa.jboss.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/logs/analysis/client/

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

@Michal, re: the 16 node test: Is the throughput stats collection right? According to the stats, the node that doesn't seem to recover is node0002 rather than node0003. It might be the case that the stats are only collected until view 5 has been installed (http://dev39.mw.lab.eng.bos.redhat.com/~mlinhard/hyperion/run105-resi-dist-16-ER8.1-nettyfix/logs/analysis/views.html) but not view 6? I can't map relative times in the stats to absolute times in the logs. Otherwise, I don't see any errors that would explain this.

re: the 32 node tests: The stats issue looks similar, but it's probably just a false alarm. It would appear as if the stats finish in view 12, and view 13 does not appear.

Assuming the stats are right, it seems like the issue is the same: the last view (view 6 in the 16 node test, view 13 in the 32 node test) is not installed. This could be because either:
a) the server does not send it, or
b) the client does not install it.

To cover all bases, repeat the test with:
1. In the server, enable TRACE on {code}org.infinispan.server.hotrod.Encoders$Encoder11${code}
2. In the client, enable TRACE on {code}org.infinispan.client.hotrod.impl.protocol.Codec11{code}
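For the client-side test driver, one way to raise that category to TRACE programmatically is sketched below. It assumes the driver logs through a log4j 1.x backend, which is an assumption here; the server-side category would normally be raised in the server's logging configuration rather than in code.

{code}
// Sketch only: assumes the test driver uses a log4j 1.x backend.
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public final class EnableHotRodClientTrace {
    public static void main(String[] args) {
        // Client-side Hot Rod codec category mentioned in the comment above.
        Logger.getLogger("org.infinispan.client.hotrod.impl.protocol.Codec11")
              .setLevel(Level.TRACE);
    }
}
{code}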
Tentatively marking as resolved.

Fail. Tested with ER9 (md5sum 4d277c8626cdada863a3d26442044775), client side Infinispan version 5.1.5.CR1, 32 node resilience tests, numOwners 5, fail 4: http://www.qa.jboss.com/~mlinhard/hyperion/run110-resi-dist-32-ER9-own5/chart-throughput-killed-nodes.png

After all four nodes [2,5,11,13] are restored, only node 13 gets updated in the clients. This always seems to correspond with the order of the views in which the final restored view is installed: http://www.qa.jboss.com/~mlinhard/hyperion/run110-resi-dist-32-ER9-own5/logs/analysis/views.html
The crashed view - 8: All - [2,5,11,13]
View 9: View 8 + [13]
Restored view: All

I was able to reproduce it on my laptop with 4 nodes and 1 client. After the 2 nodes were restored (with one intermediate view), the node that was added in the second view wasn't reported back to the client. Trace logs here: http://www.qa.jboss.com/~mlinhard/test_results/bz809631-reproduce.zip

@Galder, just in case you don't already know: I'm looking at the trace logs as well, and on node01 I'm seeing this: HotRodServer.viewId is set to 2 for both events - the adding of node03 and the adding of node02:

{code}14:34:14,429 TRACE [org.infinispan.server.hotrod.HotRodServer] (undefined) Address cache had CACHE_ENTRY_CREATED for key node03/default. View id is now 2
14:34:14,468 TRACE [org.infinispan.server.hotrod.HotRodServer] (undefined) Address cache had CACHE_ENTRY_CREATED for key node02/default. View id is now 2{code}

Since the viewId didn't change, the HotRodServer doesn't think it needs to send info about node02 to the client. This is caused by the discrepancy between JGroups and Infinispan CacheView ids. There are only 2 JGroups view ids in this test:

{code}14:31:06,253 [node01/default|1] [node01/default, node04/default]
14:33:41,943 [node01/default|2] [node01/default, node04/default, node03/default, node02/default]{code}

but 4 view ids when it comes to the CacheView for ___hotRodTopologyCache.

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

Michal, I'm looking at the logs now. At first glance, the fact that JGroups incorporates several nodes into a single view shouldn't be problematic, unless the cache is filled up with the nodes in view 2 after the view id is set. Then what happens is that you have an updated view id but not all the nodes in the cache. Cache views are a different topic, which should not have an influence here.

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

The cache filling up after the view id is set is actually what's happening here. This is what's happening:

a) Before ISPN-1995, the order of startup was:
1. Add the node to the topology
2. Start the cache manager + JGroups transport
3. Start the server endpoint

The problem with this was that other nodes were finding nodes in the topology that might not yet have started the endpoint, resulting in connection refused exceptions.

b) The fix I implemented changed this around to be:
1. Start the cache manager + JGroups transport
2. Start the server endpoint
3. Add the node to the topology

The problem now is that the JGroups transport starts before the node is added, and so the view id (which is updated when a node is added to the cache) can be updated, if all nodes have formed the cluster already, before the other nodes have added themselves to the topology cache, hence resulting in missing nodes. I think the order we have now is fine. I can see two ways to fix this:

1. Make ViewIdUpdater only update the view id when the view matches all the nodes in the cache (see the sketch after this comment). IOW, in the above example, the view id would only be set to 2 once we've verified that the address cache contains all 4 nodes in the JGroups view. The problem here is that the verification would need to be done in the ViewIdUpdater, and I'm not sure the cache will have all the entries (annoyances of cache listeners).
2. Instead of relying on the transport view id, use the cache view id for the Hot Rod cache. I need to check with Dan, but so far this is my preferred option.
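A rough sketch of option 1, with hypothetical type and method names (this is not the actual patch): the view id exposed to Hot Rod clients is only bumped once the topology (address) cache contains an entry for every member of the current JGroups view, which matches the behaviour the eventual fix describes in the doc text above.

{code}
// Hypothetical sketch of fix option 1, not the actual ViewIdUpdater code.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public final class ViewIdUpdaterSketch {

    private final AtomicInteger publishedViewId = new AtomicInteger(-1);

    /** Invoked for every entry added to the topology ("address") cache. */
    void onTopologyCacheEntryAdded(int jgroupsViewId,
                                   Set<String> jgroupsViewMembers,
                                   Map<String, ?> topologyCache) {
        // Do not expose the new view id to clients until every cluster member has
        // registered its endpoint address; otherwise clients see the "new" id next
        // to an incomplete server list and never learn about late registrations
        // (node0002 in the runs above).
        if (topologyCache.keySet().containsAll(jgroupsViewMembers)) {
            publishedViewId.set(jgroupsViewId);
        }
    }

    int currentViewId() {
        return publishedViewId.get();
    }
}
{code}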
Tried two resilience tests with ER9 patched with Galder's fix: https://github.com/galderz/infinispan/tree/t_1995_5

And the issue disappeared:
http://www.qa.jboss.com/~mlinhard/hyperion/run114-resi-dist-32-ER9-unevenfix/report/stats-throughput.png
http://www.qa.jboss.com/~mlinhard/hyperion/run115-resi-dist-16-ER9-unevenfix/report/stats-throughput.png

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-1995

Ok, I'll submit the pull req. Thx Michal!

Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-1995 to Reopened

Verified. Not present in JDG 6.0.0.ER10.

This bug is marked for inclusion in the JDG 6 GA Release Notes. Setting a NEEDINFO for the assignee to request fix information and to update the problem statement, if the technical note is outdated.

The Hot Rod server has been fixed so that the view identifier is not updated until the topology cache contains the addresses of all nodes.

Deleted Technical Notes Contents. Old Contents: After a node crashes and rejoins the cluster, it does not receive client load at the same level as the other nodes. This issue does not affect data integrity and distribution in the cluster.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, after a node crashed and rejoined the cluster, it did not receive client load at the same level as the other nodes. The Hot Rod server is now fixed so that the view identifier is not updated until the topology cache contains the addresses of all nodes.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.