Bug 745872 (EDG-68)

Summary: stuck hotrod operations
Product: [JBoss] JBoss Data Grid 5 Reporter: Michal Linhard <mlinhard>
Component: Infinispan    Assignee: Default User <jbpapp-maint>
Status: CLOSED NEXTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: EAP 5.1.0 EDG TP    CC: galder.zamarreno, mlinhard, nobody, rachmato, rhusar
Target Milestone: ---   
Target Release: EAP 5.1.0 EDG TP   
Hardware: Unspecified   
OS: Unspecified   
URL: http://jira.jboss.org/jira/browse/EDG-68
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-04-12 07:02:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments: jstack.zip (flags: none)

Description Michal Linhard 2011-03-24 13:07:12 UTC
project_key: EDG

We had a stuck job during an EDG client stress test:
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/

The hang situation looked like this:
- the test ended because there were sampling errors
- perf01: the SmartFrog controller started terminating the load driver nodes and waited for their termination
- perf02-perf10 (load driver nodes): some of them didn't terminate because they still had "Runner" threads stuck in PUT and GET operations
- the stuck threads didn't time out because the hotrod client doesn't support timeouts

We need help analysing what happened on the server side (perf17-perf22).
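As an aside, the kind of monitor-deadlock check that jstack performs on a dump can also be done in-process via the JDK's ThreadMXBean. This is only an illustrative sketch of that analysis technique, not anything that was run during the test (the class name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: ask the running JVM for deadlocked threads, the same question
// one answers by reading lock-cycle sections in a jstack dump.
public class DeadlockCheck {
    public static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return "no deadlocked threads";
        }
        StringBuilder sb = new StringBuilder("deadlocked:");
        for (ThreadInfo ti : mx.getThreadInfo(ids)) {
            sb.append(' ').append(ti.getThreadName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```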

Comment 1 Michal Linhard 2011-03-24 13:08:00 UTC
Attached jstack outputs for all relevant machines.

Comment 2 Michal Linhard 2011-03-24 13:08:00 UTC
Attachment: Added: jstack.zip


Comment 3 Michal Linhard 2011-03-24 13:08:18 UTC
Galder, can you please take a brief look at this?

Comment 4 Michal Linhard 2011-03-24 13:13:07 UTC
The ends of the server logs
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf18/consoleText
...
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf22/consoleText
might be misleading. Some of the behaviour at the end was caused by me trying to undeploy datagrid.sar from EDG to kill the hotrod server and break the connections to the clients, thus freeing the clients from their waiting.
It was an attempt to end the test properly so we could get some proper graphs out of the SmartFrog components.

Comment 5 Radoslav Husar 2011-03-24 13:19:40 UTC
> they didn't timeout because hotrod client doesn't support timeout.


Actually, the version we used does support it, but we didn't configure it, so it used the default, which means disabled, I reckon. See JBPAPP-6048.

Comment 6 Michal Linhard 2011-03-24 13:53:26 UTC
In the client stress tests we used 4.2.1.CR4 on the client side, so the code for the timeout wasn't there.
In jobs where we use the snapshot version, the property infinispan.client.hotrod.socket_timeout defaults to 60 seconds.
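For reference, a minimal sketch of how that timeout could be set in a hotrod-client.properties file; the hostnames and port below are examples only, not taken from the test setup:

```properties
# Hot Rod client configuration sketch (server list values are hypothetical)
infinispan.client.hotrod.server_list = perf17:11222;perf18:11222
# socket timeout in milliseconds; 60000 corresponds to the 60 s default above
infinispan.client.hotrod.socket_timeout = 60000
```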


Comment 7 Richard Achmatowicz 2011-03-24 14:00:54 UTC
Should we not get dev to create a CR5 to incorporate this (critical) fix, rather than use snapshots for further testing?


Comment 8 Radoslav Husar 2011-03-24 14:03:26 UTC
@Richard 
I agree very much; it has been causing a lot of pain.

Comment 9 Richard Achmatowicz 2011-03-24 14:10:23 UTC
Galder, wdyt? What is the relative state of 4.2.1.CR4 vs 4.2.1-SNAPSHOT? Would it be possible to cut a CR5 to give us some stability in testing?

Comment 10 Galder Zamarreño 2011-03-28 09:52:27 UTC
We released 4.2.1.FINAL last Friday; it should contain configurable hotrod client timeouts.

Comment 11 Michal Linhard 2011-03-28 10:00:52 UTC
Yeah, we are already using it (and were also using a SNAPSHOT with the timeout feature before, to work around this), so it's not blocking us anymore.

I created this JIRA only out of interest in what happened on the server side, because I wasn't able to figure it out from the jstack reports.

Comment 12 Galder Zamarreño 2011-04-07 11:51:08 UTC
Michal, do you still need me to have a look at the jstack files? Or can this be closed?

Comment 13 Michal Linhard 2011-04-07 12:00:54 UTC
It would be nice to see what happened there, because we solved the problem by introducing hotrod client timeouts, but we didn't get to the root of the "deadlock" that happened in the system on the server side.

But it's a very low priority thing, and it's not blocking the tests anymore. I wouldn't object if we closed this for now.


Comment 14 Galder Zamarreño 2011-04-12 07:01:21 UTC
Michal, I had a quick look at the stacks and don't see anything peculiar. Let's close this and reopen at a later stage if necessary.