805238 – Elasticity tests in REPL mode don't finish

Summary: Elasticity tests in REPL mode don't finish

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Tristan Tarrant
QA Contact:	Michal Linhard
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-03-20 17:25 UTC by Michal Linhard
Modified:	2014-03-17 04:02 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-03-26 10:55:55 UTC
Type:	Bug
Embargoed:

Attachments
view installation times (824 bytes, text/html) 2012-03-20 17:25 UTC, Michal Linhard
view installation times after framework fix (1.31 KB, text/html) 2012-03-26 10:57 UTC, Michal Linhard

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-1933	0	None	Closed	State transfer in REPL mode takes more than 10 min	2012-03-28 03:58:15 UTC

Description Michal Linhard 2012-03-20 17:25:20 UTC

Created attachment 571487
view installation times

Elasticity test with JDG 6.0.0.ER4 in REPL clustering mode:

http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/3

Starting from 1 node and going to 3 nodes, everything is all right,
but when the 4th node joins (viewId=4), the view installation takes more than 10 min.

Client load is 500
Data load is 5% of total heap.

Comment 1 Michal Linhard 2012-03-20 18:15:57 UTC

Of course, this is not easy to reproduce with a small load...
Again, this will be tough to capture with TRACE logging.

Run with 10 clients and 10K data load:
http://www.qa.jboss.com/~mlinhard/hyperion/run34-elasticity4-repl/report/stats-throughput.png

Comment 2 Michal Linhard 2012-03-20 19:14:48 UTC

Run with 500 clients and 5% data load, TRACE log:
http://www.qa.jboss.com/~mlinhard/hyperion/run35-elasticity4-repl-trace/
It generated 7.8G of logs and didn't reproduce the issue.

Comment 3 Michal Linhard 2012-03-20 19:28:26 UTC

View installation times are around 40 sec:
http://www.qa.jboss.com/~mlinhard/hyperion/run35-elasticity4-repl-trace/table.html

Comment 4 Michal Linhard 2012-03-20 20:14:23 UTC

Another TRACE run that doesn't reproduce the problem:
http://www.qa.jboss.com/~mlinhard/hyperion/run37-elasticity4-repl-trace/report/stats-throughput.png

Another non-trace run that doesn't reproduce the problem:
http://www.qa.jboss.com/~mlinhard/hyperion/run38-elasticity4-repl/report/stats-throughput.png
http://www.qa.jboss.com/~mlinhard/hyperion/run38-elasticity4-repl/table.html

Comment 5 JBoss JIRA Server 2012-03-20 20:47:10 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-1933

Michal, it looks like your test is using standalone.xml from the master branch, which is outdated, and not the latest config in the prod-6.0.0 branch.

Comment 6 Michal Linhard 2012-03-22 12:08:33 UTC

I updated the config we're using for the REPL tests:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/configs/comparison/stress-repl-sync.xml

In hyperion I ran the 4 node elasticity test 3 times and couldn't reproduce this anymore.

In the edg lab this problem is still there:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/4/
and still only in the REPL case; DIST works with the same JGroups config:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-dist-basic/46/

Comment 7 Michal Linhard 2012-03-22 15:14:26 UTC

The second run in the edg lab didn't even get to a 3-node cluster:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/5/artifact/report/stats-throughput.png

I'll try to run with trace logging

Comment 8 Michal Linhard 2012-03-22 15:59:15 UTC

Tradaaaaa!
Reproduced with TRACE log:
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/6/artifact/report/stats-throughput.png

Comment 9 Michal Linhard 2012-03-22 18:28:06 UTC

This might be a framework problem. The test might have ended early because it didn't ignore some expected exceptions during join/leave.
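
A minimal sketch of what "ignoring expected exceptions during join/leave" could look like in a test driver. Everything here is hypothetical (the exception class, the function name, and the topology_changing callback); it is not the actual framework code, which is not shown in this report.

```python
class ExpectedDuringTopologyChange(Exception):
    """Hypothetical marker for errors a client may see while a node joins or leaves."""


def run_requests(requests, topology_changing):
    """Run each request; tolerate expected errors only while the topology is changing.

    requests: iterable of zero-argument callables (hypothetical client operations).
    topology_changing: zero-argument callable returning True during a join/leave.
    Returns the number of tolerated errors. Any other exception, or an expected
    one raised outside a join/leave window, propagates and ends the test.
    """
    tolerated = 0
    for request in requests:
        try:
            request()
        except ExpectedDuringTopologyChange:
            if not topology_changing():
                raise
            tolerated += 1
    return tolerated
```

A driver that lacks the except branch would abort on the first such error, which would look from the outside like a test that never finishes the join.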

Comment 10 Michal Linhard 2012-03-22 18:55:15 UTC

Consider this a false alarm until further notice

Comment 11 Dan Berindei 2012-03-23 07:02:45 UTC

Michal, I don't think it's a false alarm, but it's a different problem. I'm seeing these errors in your TRACE log:

08:10:29,004 ERROR [org.infinispan.statetransfer.StateTransferLockImpl] Trying to release state transfer shared lock without acquiring it first: java.lang.Exception
08:10:29,005 ERROR [org.infinispan.statetransfer.StateTransferLockImpl] Trying to release state transfer shared lock without acquiring it first: java.lang.Exception

They are certainly not expected, so I'm looking into it.

Comment 12 Michal Linhard 2012-03-23 07:57:21 UTC

Oh, it's a new one, you're right. I thought it was
https://issues.jboss.org/browse/ISPN-1754
so I didn't report it, but the stack trace there is different.

Did you create a JIRA for that?

It occurs only after I start stopping the servers (or killing them).
In the log you can see this event marked by "Test will now stop the server"; that's written to the log output file right before I send the kill signal to JBoss.
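
A minimal sketch for checking whether the StateTransferLockImpl errors occur only after the stop marker. It assumes the marker text appears verbatim in the server log and that the error lines look like the ones quoted in comment 11; the function name is hypothetical.

```python
import re

# Marker text quoted in this comment; the format of the rest of that log line is not known.
STOP_MARKER = "Test will now stop the server"

# Matches the ERROR lines quoted in comment 11.
LOCK_ERROR = re.compile(
    r"ERROR \[org\.infinispan\.statetransfer\.StateTransferLockImpl\] "
    r"Trying to release state transfer shared lock without acquiring it first"
)


def count_lock_errors(lines):
    """Return (before, after) counts of the lock error relative to the first stop marker."""
    before = after = 0
    stopped = False
    for line in lines:
        if STOP_MARKER in line:
            stopped = True
        elif LOCK_ERROR.search(line):
            if stopped:
                after += 1
            else:
                before += 1
    return before, after
```

Passing an open log file to count_lock_errors and getting a zero "before" count would be consistent with the error appearing only once servers are being stopped.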

Comment 13 Michal Linhard 2012-03-23 08:01:17 UTC

I think I meant this one: https://issues.jboss.org/browse/ISPN-1704
So now I don't know why I didn't reopen it :-)
For some reason I thought it was something expected/insignificant; it's good that you spotted it.

Comment 14 Dan Berindei 2012-03-23 08:04:31 UTC

It looks very much like ISPN-1704, but that one happened on the surviving nodes - this one seems to happen on the nodes that you're killing. So I think it might be worth a separate bug after all.

Comment 15 Michal Linhard 2012-03-26 10:55:55 UTC

After fixing a problem in the test framework, the tests run to the end, and all state transfers complete in under 5 sec.

http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/12/artifact/report/stats-throughput.png

Comment 16 Michal Linhard 2012-03-26 10:57:15 UTC

Created attachment 572732
view installation times after framework fix

adding new view installation times after framework fix

Comment 17 JBoss JIRA Server 2012-03-26 10:59:48 UTC

Michal Linhard <mlinhard> updated the status of jira ISPN-1933 to Closed

Comment 18 JBoss JIRA Server 2012-03-26 10:59:48 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-1933

This was a problem in the test itself. The state transfer didn't really last more than 10 min.
