Bug 1027691

Summary: Start servers in the group ends with read timed out
Product: [JBoss] JBoss Operations Network Reporter: Radim Hatlapatka <rhatlapa>
Component: Plugin -- JBoss EAP 6Assignee: Thomas Segismont <tsegismo>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: unspecified    
Version: JON 3.2CC: hrupp, rhatlapa, theute
Target Milestone: ER07   
Target Release: JON 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Radim Hatlapatka 2013-11-07 09:57:24 UTC
Description of problem:
Very often it happens that when starting servers in server group ends with:

java.lang.Exception: Read timed out, rolled-back=false, rolled-back=false at org.rhq.core.pc.operation.OperationInvocation.run(OperationInvocation.java:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 

Version-Release number of selected component (if applicable): EAP 6.1.1 vs JON 3.2.0.ER4


How reproducible: 30%


Steps to Reproduce in JON:
1. import eap 6 in domain mode
2. stop servers in servers main group
3. start servers in servers main group

Actual results:
start operation fails with the exception shown in JON UI


Expected results:
servers in domain mode succeeds


Additional info:
In my case the servers were first restarted, then stopped and then started.

I haven't found any relevant information neither in agents log and neither in the JON server log

Comment 1 Heiko W. Rupp 2013-11-08 14:07:36 UTC
Radim,
can you please give us a bit more context: how many servers in the group are you starting? If you start them through cli or whatever, how long does it take until they are up?
Do you see more information in the agent log? (Could you perhaps paste a much longer snippet of agent log when this happens?)

 Thanks
   Heiko

Comment 2 Radim Hatlapatka 2013-11-12 10:02:27 UTC
There are four servers managed by the domain controller, where three of them are in the group from which servers are being started.

According to EAP 6.1.1 host-controller log the servers were started in less than 30 seconds as you can see in [1].

There are no messages visible in both JON agent logs and in JON server logs related to the operation of starting servers in server group. I am using the default logging level which logs all INFO and more severe messages.

BTW: the operation took via our testsuite 0s see [2]


[1]
14:31:32,214 INFO  [org.jboss.as.host.controller] (HttpManagementService-threads - 7) JBAS010922: Starting server server-one
14:31:35,371 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "dev84.mw.lab.eng.bos.redhat.com:MANAGEMENT" task-2) JBAS010920: Server [Server:server-one] registered using connection [Channel ID 042b3dff (inbound) of Remoting connection 6a120629 to /10.16.95.131:49226]
14:31:35,428 INFO  [org.jboss.as.host.controller] (HttpManagementService-threads - 7) JBAS010922: Starting server server-two
14:31:35,498 INFO  [org.jboss.as.host.controller] (server-registration-threads - 1) JBAS010919: Registering server server-one
14:31:42,590 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "dev84.mw.lab.eng.bos.redhat.com:MANAGEMENT" task-2) JBAS010920: Server [Server:server-two] registered using connection [Channel ID 48df4e5c (inbound) of Remoting connection 3bf7c906 to /10.16.95.131:38136]
14:31:42,706 INFO  [org.jboss.as.host.controller] (HttpManagementService-threads - 7) JBAS010922: Starting server server-four
14:31:42,834 INFO  [org.jboss.as.host.controller] (server-registration-threads - 1) JBAS010919: Registering server server-two
14:31:50,809 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "dev84.mw.lab.eng.bos.redhat.com:MANAGEMENT" task-3) JBAS010920: Server [Server:server-four] registered using connection [Channel ID 3d831085 (inbound) of Remoting connection 2f2ed387 to /10.16.95.131:58631]
14:31:50,986 INFO  [org.jboss.as.host.controller] (server-registration-threads - 1) JBAS010919: Registering server server-four


[2]
[ant:testng] 201311111431:59.724 - INFO: Operation [Start Servers] finished after 0s (com.redhat.qe.jon.sahi.base.inventory.Operations.assertOperationResult)
[ant:testng] 201311111432:01.461 - INFO: Retrieving the error message as it was expected that operation ends successfully but it didn't (com.redhat.qe.jon.sahi.base.inventory.Operations.assertOperationResult)
[ant:testng] 201311111432:08.234 - SEVERE: Test Failed: startServers (com.redhat.qe.auto.testng.TestNGListener.onTestFailure)
[ant:testng] java.lang.AssertionError: Operation [Start Servers] result: Success errorMessage:
[ant:testng] java.lang.Exception: Read timed out, rolled-back=false, rolled-back=false at org.rhq.core.pc.operation.OperationInvocation.run(OperationInvocation.java:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) expected [true] but found [false]

Comment 3 Radim Hatlapatka 2013-11-12 11:23:36 UTC
I have tried it one more time (now in the server group were only two servers) and it failed.

From time submitted to time completed it took 11s and ended with read-timeout issue. It seems there is hardcoded timeout for 10s which should be increased the same way as it was done for restarting servers in server group, see https://bugzilla.redhat.com/show_bug.cgi?id=911327

Comment 4 Thomas Segismont 2013-11-13 10:52:57 UTC
Fixed in master

commit cce4661454763dbd1cab0f2abca6f3309577959f
Author: Thomas Segismont <tsegismo>
Date:   Wed Nov 13 11:51:35 2013 +0100

Applied the fix for BZ911327 to all operations (start/stop/restart)

Comment 5 Thomas Segismont 2013-11-13 10:55:01 UTC
Cherry-picked to release/jon3.2.x

commit ff2242931db30ea3e724572950f0b099ca56e6a6
Author: Thomas Segismont <tsegismo>
Date:   Wed Nov 13 11:51:35 2013 +0100

Comment 6 Heiko W. Rupp 2013-11-13 12:00:41 UTC
Radim, excellent analysis in #c3 Thank you.

Comment 7 Thomas Segismont 2013-11-13 13:03:20 UTC
(In reply to Heiko W. Rupp from comment #6)
> Radim, excellent analysis in #c3 Thank you.

+1

Comment 8 Simeon Pinder 2013-11-19 15:47:47 UTC
Moving to ON_QA as available for testing with new brew build.

Comment 9 Simeon Pinder 2013-11-22 05:13:21 UTC
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.

Comment 10 Radim Hatlapatka 2013-11-27 16:30:57 UTC
I am not able to reproduce the issue any more with JON 3.2.0.ER7