Bug 745872 (EDG-68)

Summary: stuck hotrod operations
Product: [JBoss] JBoss Data Grid 5 Reporter: Michal Linhard <mlinhard>
Component: Infinispan    Assignee: Default User <jbpapp-maint>
Status: CLOSED NEXTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: EAP 5.1.0 EDG TP    CC: galder.zamarreno, mlinhard, nobody, rachmato, rhusar
Target Milestone: ---   
Target Release: EAP 5.1.0 EDG TP   
Hardware: Unspecified   
OS: Unspecified   
URL: http://jira.jboss.org/jira/browse/EDG-68
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-04-12 07:02:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments: jstack.zip (flags: none)

Description Michal Linhard 2011-03-24 13:07:12 UTC
project_key: EDG

We had a stuck job during an EDG client stress test:
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/

The hang situation looked like this:
- the test ended because there were sampling errors
- perf01: the SmartFrog controller started terminating the load driver nodes and waited for their termination
- perf02-perf10 (load driver nodes): some of them didn't terminate because they still had "Runner" threads stuck in PUT and GET operations
- the stuck threads didn't time out because the hotrod client doesn't support timeouts

We need help analysing what happened on the server side (perf17-perf22).
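As an aside, the kind of monitor-deadlock check that jstack performs on a dump can also be done in-process via the JDK's ThreadMXBean. This is only an illustrative sketch of that analysis technique, not anything that was run during the test (the class name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: ask the running JVM for deadlocked threads, the same question
// one answers by reading lock-cycle sections in a jstack dump.
public class DeadlockCheck {
    public static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return "no deadlocked threads";
        }
        StringBuilder sb = new StringBuilder("deadlocked:");
        for (ThreadInfo ti : mx.getThreadInfo(ids)) {
            sb.append(' ').append(ti.getThreadName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```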

Comment 1 Michal Linhard 2011-03-24 13:08:00 UTC
Attached jstack outputs for all relevant machines.

Comment 2 Michal Linhard 2011-03-24 13:08:00 UTC
Attachment: Added: jstack.zip


Comment 3 Michal Linhard 2011-03-24 13:08:18 UTC
Galder, can you please take a brief look at this?

Comment 4 Michal Linhard 2011-03-24 13:13:07 UTC
The ends of the server logs
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf18/consoleText
...
http://hudson.qa.jboss.com/hudson/job/edg-51x-stress-client-size6-hotrod/4/console-perf22/consoleText
might be misleading. Some of the behaviour at the end was caused by me trying to undeploy datagrid.sar from EDG to kill the hotrod server and break the connections to the clients, thus freeing the clients from their waiting.
It was an attempt to end the test properly so we could get some proper graphs out of the SmartFrog components.

Comment 5 Radoslav Husar 2011-03-24 13:19:40 UTC
> they didn't timeout because hotrod client doesn't support timeout.


Actually, the version we used does support it, but we didn't configure it, so it used the default, which means disabled, I reckon. See JBPAPP-6048.

Comment 6 Michal Linhard 2011-03-24 13:53:26 UTC
In the client stress tests we used 4.2.1.CR4 on the client side, so the code for the timeout wasn't there.
In jobs where we use the snapshot version, the property infinispan.client.hotrod.socket_timeout defaults to 60 seconds.
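For reference, a minimal sketch of how that timeout could be set in a hotrod-client.properties file; the hostnames and port below are examples only, not taken from the test setup:

```properties
# Hot Rod client configuration sketch (server list values are hypothetical)
infinispan.client.hotrod.server_list = perf17:11222;perf18:11222
# socket timeout in milliseconds; 60000 corresponds to the 60 s default above
infinispan.client.hotrod.socket_timeout = 60000
```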


Comment 7 Richard Achmatowicz 2011-03-24 14:00:54 UTC
Should we not get dev to create a CR5 to incorporate this (critical) fix, rather than use snapshots for further testing?


Comment 8 Radoslav Husar 2011-03-24 14:03:26 UTC
@Richard 
I agree very much; it has been causing a lot of pain.

Comment 9 Richard Achmatowicz 2011-03-24 14:10:23 UTC
Galder, wdyt? What is the relative state of 4.2.1.CR4 vs 4.2.1-SNAPSHOT? Would it be possible to cut a CR5 to give us some stability in testing?

Comment 10 Galder Zamarreño 2011-03-28 09:52:27 UTC
We released 4.2.1.FINAL last Friday; it should contain configurable hotrod client timeouts.

Comment 11 Michal Linhard 2011-03-28 10:00:52 UTC
Yeah, we are already using it (and were also using a SNAPSHOT with the timeout feature before, to work around this), so it's not blocking us anymore.

I created this JIRA only out of interest in what happened on the server side, because I wasn't able to figure it out from the jstack reports.

Comment 12 Galder Zamarreño 2011-04-07 11:51:08 UTC
Michal, do you still need me to have a look at the jstack files? Or can this be closed?

Comment 13 Michal Linhard 2011-04-07 12:00:54 UTC
It would be nice to see what happened there, because we solved the problem by introducing hotrod client timeouts, but we didn't get to the root of the "deadlock" that happened in the system on the server side.

But it's a very low priority thing, and it's not blocking the tests anymore. I wouldn't object if we closed this for now.


Comment 14 Galder Zamarreño 2011-04-12 07:01:21 UTC
Michal, I had a quick look at the stacks and don't see anything peculiar. Let's close this and reopen at a later stage if necessary.