902794 – Joining node ignored by hotrod clients in REPL clustering mode

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ER10
Target Release:	6.1.0
Assignee:	Tristan Tarrant
QA Contact:	Nobody
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-01-22 12:12 UTC by Michal Linhard
Modified:	2025-02-10 03:27 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2025-02-10 03:27:20 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-2738	0	Blocker	Resolved	Joining node ignored by hotrod clients in REPL clustering mode	2015-05-21 18:35:32 UTC

Description Michal Linhard 2013-01-22 12:12:21 UTC

Resilience test 4-3-4, REPL mode, JDG 6.1.0.ER9:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-resilience-repl-4-3/31/artifact/report/stats-throughput.png

Comment 1 JBoss JIRA Server 2013-01-22 12:46:46 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2738

working on trace logs

Comment 2 JBoss JIRA Server 2013-01-22 17:52:46 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2738

client logs: http://www.qa.jboss.com/~mlinhard/test_results/driver0-ISPN-2738.zip
server logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard/268/artifact/report/serverlogs.zip

Comment 3 JBoss JIRA Server 2013-01-22 17:53:36 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2738

[~galder.zamarreno]

Comment 4 JBoss JIRA Server 2013-01-22 17:54:30 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2738

[~galder.zamarreno] or [~dan.berindei], could you please have a look at this? I think this might be connected with the solution to ISPN-2632 we've talked about.

Comment 5 JBoss JIRA Server 2013-01-23 09:50:16 UTC

Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2738 to Coding In Progress

Comment 6 JBoss JIRA Server 2013-01-23 11:06:34 UTC

Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2738

The problem does indeed look related to ISPN-2632, and I think it's linked to the removal of coordination between the address cache and the topology id update. The Hot Rod server seems to send a new topology id before the cache has been updated, so when a new node is added, the server says "here's the new topology ID" while the cache has not yet been updated. The client now has a new id but the members are the same. When the cache is eventually updated with the new node, the topology ID is not increased, so clients will never talk to it. Here's a snippet from node01.log that proves what I say:

{code}12:43:03,137 TRACE [org.infinispan.server.hotrod.HotRodDecoder] (HotRodServerWorker-119) Decoded header HotRodHeader{op=GetRequest, version=12, messageId=1974, cacheName=testCache, flag=0, clientIntelligence=3, topologyId=8}
...
12:43:03,229 TRACE [org.infinispan.server.hotrod.HotRodDecoder] (HotRodServerWorker-107) Decoded header HotRodHeader{op=GetRequest, version=12, messageId=2626, cacheName=testCache, flag=0, clientIntelligence=3, topologyId=9}
...
12:43:03,753 TRACE [org.infinispan.container.entries.ReadCommittedEntry] (OOB-197,null) Updating entry (key=node02/default removed=false valid=true changed=true created=true loaded=false value=172.18.1.3:11222]
...
node01.log:86873:12:43:03,780 TRACE [org.infinispan.server.hotrod.HotRodDecoder] (HotRodServerWorker-119) Decoded header HotRodHeader{op=PutRequest, version=12, messageId=1992, cacheName=testCache, flag=6, clientIntelligence=3, topologyId=9}{code}

@Dan, this is precisely the reason why the interceptor in HotRodServer was created: to coordinate and make sure that the new topology ID is not sent before the cache has been updated. This is a crucial part of the code I added to deal with resilience testing in the previous testing round.
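
For readers following the thread, the ordering problem described above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not Infinispan code: the `Server` and `Client` classes, the `run` method, and the node names `node03` and `node04` are hypothetical; only the topology ids 8 and 9 and the names `node01` and `node02` come from the log snippet. It assumes the server attaches the member list to a response only when the client's topology id differs from its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class StaleTopologyDemo {

    /** Hypothetical stand-in for the server: a topology id plus the address cache contents. */
    static final class Server {
        int topologyId = 8;
        final List<String> addressCache =
                new ArrayList<>(List.of("node01", "node03", "node04"));

        /** Returns the member list only when the client's topology id is out of date. */
        Optional<List<String>> respond(int clientTopologyId) {
            if (clientTopologyId == topologyId) {
                return Optional.empty();
            }
            return Optional.of(new ArrayList<>(addressCache));
        }
    }

    /** Hypothetical stand-in for a topology-aware client. */
    static final class Client {
        int topologyId = 8;
        List<String> members = List.of("node01", "node03", "node04");

        void request(Server server) {
            server.respond(topologyId).ifPresent(update -> {
                members = update;
                topologyId = server.topologyId;
            });
        }
    }

    /** Replays the faulty ordering and returns the members the client ends up with. */
    static List<String> run() {
        Server server = new Server();
        Client client = new Client();

        server.topologyId = 9;              // topology id is bumped first
        client.request(server);             // client adopts id 9 with the old member list
        server.addressCache.add("node02");  // address cache catches up, id stays at 9
        client.request(server);             // ids match, so no update is sent
        return client.members;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [node01, node03, node04]
    }
}
```

In this model the client never sees `node02`, because nothing changes the topology id after the address cache is finally updated.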

Comment 7 JBoss JIRA Server 2013-01-23 12:03:29 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-2738

In my fix for ISPN-2632 I replaced the interceptor with a check to not send the topology update unless all the consistent hash members also exist in the address cache. Unfortunately I only added the check for distributed caches (see AbstractEncoder1x/AbstractTopologyAwareEncoder1x.writeHashTopologyHeader).

The fix is to add the same check on all the code paths that write a topology update.

Comment 8 JBoss JIRA Server 2013-01-23 15:56:51 UTC

Dan Berindei <dberinde> updated the status of jira ISPN-2738 to Open

Comment 9 JBoss JIRA Server 2013-01-23 15:56:59 UTC

Dan Berindei <dberinde> updated the status of jira ISPN-2738 to Coding In Progress

Comment 10 JBoss JIRA Server 2013-01-24 10:23:13 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-2738

Skip the topology update if the cache members aren't all in the address
cache. Do the check in AbstractEncoder1x.generateTopologyResponse, so that
it works for all topology types (i.e. also for replicated caches).
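
The check described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the actual AbstractEncoder1x code: the `TopologyGuard` class and `canSendTopologyUpdate` method are hypothetical names, and the address cache is modelled as a plain map from member name to endpoint. The `node02/default` key and its `172.18.1.3:11222` value are taken from the log snippet in the earlier comment; the `node01/default` endpoint is made up.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopologyGuard {

    /**
     * Returns true only if every cache member already has an entry in the
     * address cache, i.e. it is safe to send the new topology to clients.
     */
    static boolean canSendTopologyUpdate(Collection<String> cacheMembers,
                                         Map<String, String> addressCache) {
        return addressCache.keySet().containsAll(cacheMembers);
    }

    public static void main(String[] args) {
        List<String> cacheMembers = List.of("node01/default", "node02/default");

        Map<String, String> addressCache = new HashMap<>();
        addressCache.put("node01/default", "172.18.1.1:11222"); // hypothetical endpoint

        // node02 has joined the cache but is not yet in the address cache: skip the update.
        System.out.println(canSendTopologyUpdate(cacheMembers, addressCache));

        // Once the address cache has the new node, the update can be sent.
        addressCache.put("node02/default", "172.18.1.3:11222");
        System.out.println(canSendTopologyUpdate(cacheMembers, addressCache));
    }
}
```

Because the check looks only at cache members and address cache entries, the same guard applies to replicated and distributed caches alike.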

I added a new replicated-mode test, but it still doesn't cover this case.

Comment 11 Michal Linhard 2013-01-30 12:37:57 UTC

Verified for JDG 6.1.0.ER10

Comment 16 Red Hat Bugzilla 2025-02-10 03:27:20 UTC

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.