Description of problem: Very often it happens that when starting servers in server group ends with: java.lang.Exception: Read timed out, rolled-back=false, rolled-back=false at org.rhq.core.pc.operation.OperationInvocation.run(OperationInvocation.java:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Version-Release number of selected component (if applicable): EAP 6.1.1 vs JON 3.2.0.ER4 How reproducible: 30% Steps to Reproduce in JON: 1. import eap 6 in domain mode 2. stop servers in servers main group 3. start servers in servers main group Actual results: start operation fails with the exception shown in JON UI Expected results: servers in domain mode succeeds Additional info: In my case the servers were first restarted, then stopped and then started. I haven't found any relevant information neither in agents log and neither in the JON server log
Radim, can you please give us a bit more context: how many servers in the group are you starting? If you start them through cli or whatever, how long does it take until they are up? Do you see more information in the agent log? (Could you perhaps paste a much longer snippet of agent log when this happens?) Thanks Heiko
There are four servers managed by the domain controller, where three of them are in the group from which servers are being started. According to EAP 6.1.1 host-controller log the servers were started in less than 30 seconds as you can see in [1]. There are no messages visible in both JON agent logs and in JON server logs related to the operation of starting servers in server group. I am using the default logging level which logs all INFO and more severe messages. BTW: the operation took via our testsuite 0s see [2] [1] 14:31:32,214 INFO [org.jboss.as.host.controller] (HttpManagementService-threads - 7) JBAS010922: Starting server server-one 14:31:35,371 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "dev84.mw.lab.eng.bos.redhat.com:MANAGEMENT" task-2) JBAS010920: Server [Server:server-one] registered using connection [Channel ID 042b3dff (inbound) of Remoting connection 6a120629 to /10.16.95.131:49226] 14:31:35,428 INFO [org.jboss.as.host.controller] (HttpManagementService-threads - 7) JBAS010922: Starting server server-two 14:31:35,498 INFO [org.jboss.as.host.controller] (server-registration-threads - 1) JBAS010919: Registering server server-one 14:31:42,590 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "dev84.mw.lab.eng.bos.redhat.com:MANAGEMENT" task-2) JBAS010920: Server [Server:server-two] registered using connection [Channel ID 48df4e5c (inbound) of Remoting connection 3bf7c906 to /10.16.95.131:38136] 14:31:42,706 INFO [org.jboss.as.host.controller] (HttpManagementService-threads - 7) JBAS010922: Starting server server-four 14:31:42,834 INFO [org.jboss.as.host.controller] (server-registration-threads - 1) JBAS010919: Registering server server-two 14:31:50,809 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "dev84.mw.lab.eng.bos.redhat.com:MANAGEMENT" task-3) JBAS010920: Server [Server:server-four] registered using connection [Channel ID 3d831085 (inbound) of Remoting connection 2f2ed387 to /10.16.95.131:58631] 14:31:50,986 INFO [org.jboss.as.host.controller] (server-registration-threads - 1) JBAS010919: Registering server server-four [2] [ant:testng] 201311111431:59.724 - INFO: Operation [Start Servers] finished after 0s (com.redhat.qe.jon.sahi.base.inventory.Operations.assertOperationResult) [ant:testng] 201311111432:01.461 - INFO: Retrieving the error message as it was expected that operation ends successfully but it didn't (com.redhat.qe.jon.sahi.base.inventory.Operations.assertOperationResult) [ant:testng] 201311111432:08.234 - SEVERE: Test Failed: startServers (com.redhat.qe.auto.testng.TestNGListener.onTestFailure) [ant:testng] java.lang.AssertionError: Operation [Start Servers] result: Success errorMessage: [ant:testng] java.lang.Exception: Read timed out, rolled-back=false, rolled-back=false at org.rhq.core.pc.operation.OperationInvocation.run(OperationInvocation.java:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) expected [true] but found [false]
I have tried it one more time (now in the server group were only two servers) and it failed. From time submitted to time completed it took 11s and ended with read-timeout issue. It seems there is hardcoded timeout for 10s which should be increased the same way as it was done for restarting servers in server group, see https://bugzilla.redhat.com/show_bug.cgi?id=911327
Fixed in master commit cce4661454763dbd1cab0f2abca6f3309577959f Author: Thomas Segismont <tsegismo> Date: Wed Nov 13 11:51:35 2013 +0100 Applied the fix for BZ911327 to all operations (start/stop/restart)
Cherry-picked to release/jon3.2.x commit ff2242931db30ea3e724572950f0b099ca56e6a6 Author: Thomas Segismont <tsegismo> Date: Wed Nov 13 11:51:35 2013 +0100
Radim, excellent analysis in #c3 Thank you.
(In reply to Heiko W. Rupp from comment #6) > Radim, excellent analysis in #c3 Thank you. +1
Moving to ON_QA as available for testing with new brew build.
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
I am not able to reproduce the issue any more with JON 3.2.0.ER7