project_key: EDG
We had a stuck job during the EDG client stress test: http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/
The hang situation looked like this:
The test ended because there were sampling errors.
perf01: the SmartFrog controller started to terminate the load driver nodes and waited for their termination.
perf02-perf10 (load driver nodes): some of them didn't terminate because they still had "Runner" threads stuck in PUT and GET operations. They didn't time out because the Hot Rod client doesn't support a timeout.
We need help analysing what happened on the server side (perf17-perf22).
attached jstack outputs for all relevant machines
Attachment: Added: jstack.zip
Galder, can you please take a brief look at this?
The ends of the server logs http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf18/consoleText ... http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf22/consoleText might be misleading. Some of the behaviour at the end was caused by me trying to undeploy datagrid.sar from EDG to kill the Hot Rod server and break the connections to the clients, thereby releasing the clients that were stuck waiting. It was an attempt to end the test properly so that we could get some proper graphs out of the SmartFrog components.
> they didn't timeout because hotrod client doesn't support timeout.
Actually, the version we used does support it, but we didn't configure it, so it used the default, which means disabled, I reckon (JBPAPP-6048).
In the client stress tests we used 4.2.1.CR4 on the client side, so the code for the timeout wasn't there. In jobs where we use the snapshot version, the property infinispan.client.hotrod.socket_timeout defaults to 60 secs.
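For reference, a minimal sketch of how the timeout could be set explicitly on the client side rather than relying on the default. The property name infinispan.client.hotrod.socket_timeout comes from the comment above; the class and method names, the server list, port 11222, and the assumption that the value is given in milliseconds are illustrative and should be checked against the client version in use.

```java
import java.util.Properties;

// Hypothetical helper: builds the Hot Rod client properties with an explicit socket timeout.
public class HotRodTimeoutConfig {

    static Properties clientProperties(String serverList, int socketTimeoutMillis) {
        Properties props = new Properties();
        // Assumed format: semicolon-separated host:port pairs.
        props.setProperty("infinispan.client.hotrod.server_list", serverList);
        // Assumed unit: milliseconds (60000 would correspond to the 60 secs default mentioned above).
        props.setProperty("infinispan.client.hotrod.socket_timeout",
                Integer.toString(socketTimeoutMillis));
        return props;
    }

    public static void main(String[] args) {
        // Host names are taken from this report; the port is an assumption.
        Properties props = clientProperties("perf17:11222;perf18:11222", 60000);
        System.out.println(props.getProperty("infinispan.client.hotrod.socket_timeout"));
        // The resulting Properties object would then be handed to the Hot Rod client,
        // for example via the RemoteCacheManager constructor that accepts Properties.
    }
}
```

With a timeout configured this way, a "Runner" thread blocked in a PUT or GET should fail with an exception after the timeout expires instead of waiting indefinitely.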
Should we not get dev to create a CR5 to incorporate this (critical) fix, rather than use snapshots for further testing?
@Richard I agree very much, it has been causing a lot of pain.
Galder, wdyt? What is the relative state of 4.2.1.CR4 vs 4.2.1-SNAPSHOT? Would it be possible to cut a CR5 to give us some stability in testing?
We released 4.2.1.FINAL last Friday, that should contain configurable hotrod client timeouts.
Yeah, we are already using it (and were also using the SNAPSHOT with the timeout feature before, to solve this), so it's not blocking us anymore. I created this JIRA only out of interest in what happened on the server side, because I wasn't able to figure it out from the jstack reports.
Michal, do you still need me to have a look at the jstack files? Or can this be closed?
It would be nice to see what happened there, because we solved the problem by introducing Hot Rod client timeouts but didn't get to the core of the "deadlock" that happened on the server side. But it's a very low priority thing, and it's not blocking the tests anymore. I wouldn't object if we closed this for now.
Michal, I had a quick look at the stacks and don't see anything peculiar. Let's close this and reopen at a later stage if necessary.