Bug 745872 (EDG-68) - stuck hotrod operations
Summary: stuck hotrod operations
Keywords:
Status: CLOSED NEXTRELEASE
Alias: EDG-68
Product: JBoss Data Grid 5
Classification: JBoss
Component: Infinispan
Version: EAP 5.1.0 EDG TP
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: EAP 5.1.0 EDG TP
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG-68
Whiteboard:
Depends On:
Blocks:
Reported: 2011-03-24 13:07 UTC by Michal Linhard
Modified: 2014-03-17 04:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-12 07:02:08 UTC
Type: Bug


Attachments
jstack.zip (77.36 KB, application/zip)
2011-03-24 13:08 UTC, Michal Linhard


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker EDG-68 0 None None None Never

Description Michal Linhard 2011-03-24 13:07:12 UTC
project_key: EDG

we had a stuck job during edg client stress test:
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/

the hang situation looked like this:
test ended because there were sampling errors
perf01: the smartfrog controller started to terminate the load driver nodes and waited for their termination
perf02-perf10 (load driver nodes): some of them didn't terminate because they still had "Runner" threads stuck in PUT and GET operations.
they didn't time out because the hotrod client doesn't support timeouts.
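
For a client library without built-in timeouts, one generic workaround is to run each blocking call on a worker thread and bound the wait from the caller. The following is only a sketch of that idea, not what the stress test did: the class name `TimeoutGuard` and its `call` method are hypothetical, and the guarded operations in `main` are stand-ins for the real PUT and GET calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long the caller waits for a blocking operation.
final class TimeoutGuard {
    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "guarded-op");
        t.setDaemon(true);
        return t;
    });

    <T> T call(Callable<T> op, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(op);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption; a thread blocked in a plain socket read
            // may ignore it, so the worker can stay stuck in the background.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        TimeoutGuard guard = new TimeoutGuard();
        // Stand-in for a fast cache operation.
        System.out.println(guard.call(() -> "value", 1, TimeUnit.SECONDS));
        try {
            // Stand-in for an operation that never gets a server response.
            guard.call(() -> { Thread.sleep(10_000); return "never"; },
                       100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

This only unblocks the caller. The underlying socket wait is not released, which is why a timeout inside the client itself is the better fix.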

we need help analysing what happened on the server side (perf17-perf22)

Comment 1 Michal Linhard 2011-03-24 13:08:00 UTC
attached jstack outputs for all relevant machines

Comment 2 Michal Linhard 2011-03-24 13:08:00 UTC
Attachment: Added: jstack.zip


Comment 3 Michal Linhard 2011-03-24 13:08:18 UTC
Galder, can you please take a brief look at this?

Comment 4 Michal Linhard 2011-03-24 13:13:07 UTC
the ends of the server logs
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf18/consoleText
...
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf22/consoleText
might be misleading. some of the behaviour at the end was caused by me trying to undeploy datagrid.sar from EDG to kill hotrodserver and break the connections to the clients, thus freeing the clients from waiting.
it was an attempt to end the test properly so we could get some proper graphs out of the smartfrog components.

Comment 5 Radoslav Husar 2011-03-24 13:19:40 UTC
> they didn't timeout because hotrod client doesn't support timeout.


Actually, the version we used does support it, but we didn't configure it, so it used the default, which I reckon means disabled (JBPAPP-6048).

Comment 6 Michal Linhard 2011-03-24 13:53:26 UTC
in client stress tests we used 4.2.1.CR4 on the client side, so the code for the timeout wasn't there.
in jobs where we use the snapshot version, the property infinispan.client.hotrod.socket_timeout defaults to 60 secs.
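
As a rough illustration of setting that property explicitly rather than relying on the default, the sketch below builds a `java.util.Properties` object for the client. Only the `socket_timeout` property name comes from this thread. The class and method names are hypothetical, the host and port values are placeholders, and the unit is assumed to be milliseconds (so 60 secs would be 60000); check this against the client version in use.

```java
import java.util.Properties;

// Hypothetical helper that assembles Hot Rod client properties.
final class HotRodClientProps {
    static Properties withSocketTimeout(String serverList, long timeoutMillis) {
        Properties p = new Properties();
        // Server addresses; the value passed in below is a placeholder.
        p.setProperty("infinispan.client.hotrod.server_list", serverList);
        // Assumption: the value is interpreted as milliseconds.
        p.setProperty("infinispan.client.hotrod.socket_timeout",
                      Long.toString(timeoutMillis));
        return p;
    }

    public static void main(String[] args) {
        Properties p = withSocketTimeout("perf17:11222;perf18:11222", 60_000L);
        // The properties would then be handed to the client, for example via
        // a RemoteCacheManager constructor that accepts Properties (not shown,
        // since it needs the Infinispan client jar on the classpath).
        System.out.println(p.getProperty("infinispan.client.hotrod.socket_timeout"));
    }
}
```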


Comment 7 Richard Achmatowicz 2011-03-24 14:00:54 UTC
Should we not get dev to create a CR5 to incorporate this (critical) fix, rather than use snapshots for further testing?


Comment 8 Radoslav Husar 2011-03-24 14:03:26 UTC
@Richard 
I agree very much, it has been causing a lot of pain.

Comment 9 Richard Achmatowicz 2011-03-24 14:10:23 UTC
Galder, wdyt? What is the relative state of 4.2.1.CR4 vs 4.2.1-SNAPSHOT? Would it be possible to cut a CR5 to give us some stability in testing?

Comment 10 Galder Zamarreño 2011-03-28 09:52:27 UTC
We released 4.2.1.FINAL last Friday; it should contain configurable hotrod client timeouts.

Comment 11 Michal Linhard 2011-03-28 10:00:52 UTC
Yeah, we are already using it (and before that we were using the SNAPSHOT with the timeout feature to work around this), so it's not blocking us anymore.

I created this JIRA only out of interest in what happened on the server side, because I wasn't able to figure it out from the jstack reports.

Comment 12 Galder Zamarreño 2011-04-07 11:51:08 UTC
Michal, do you still need me to have a look at the jstack files? Or can this be closed?

Comment 13 Michal Linhard 2011-04-07 12:00:54 UTC
It would be nice to see what happened there, because we solved the problem by introducing hotrod client timeouts, but didn't get to the core of the "deadlock" that happened in the system on the server side.
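
If the server-side hang were a true lock cycle, the JVM can report it directly. The sketch below uses the standard `java.lang.management` API to check for deadlocked threads from inside a process. The class name `DeadlockCheck` is hypothetical and this is not how the jstack files in this issue were analysed. It finds only cycles on monitors and ownable synchronizers, so threads that are merely waiting on network I/O or on a response that never arrives would not show up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reports lock cycles the JVM can detect by itself.
final class DeadlockCheck {
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no deadlocked threads are found.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlocked threads found";
        }
        StringBuilder sb = new StringBuilder();
        // Include locked monitors and synchronizers in the per-thread info.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) { // a thread may have died in the meantime
                sb.append(info);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```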

But it's a very low priority thing, and it's not blocking the tests anymore. I wouldn't object if we closed this for now.


Comment 14 Galder Zamarreño 2011-04-12 07:01:21 UTC
Michal, I had a quick look at the stacks and don't see anything peculiar. Let's close this and reopen at a later stage if necessary.

