Bug 995013
Summary: Failures in clustering testsuite due to servers running from previous tests

Product: [JBoss] JBoss Enterprise Application Platform 6
Component: Clustering, Testsuite
Version: 6.1.1, 6.2.0
Target Milestone: ER2
Target Release: EAP 6.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Whiteboard: Clustering testsuite
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-06-28 15:28:11 UTC

Reporter: Richard Janík <rjanik>
Assignee: Radoslav Husar <rhusar>
QA Contact: Richard Janík <rjanik>
Docs Contact: Russell Dickenson <rdickens>
CC: akostadi, jkudrnac, kkhan, paul.ferraro, pjelinek, rhusar
Description (Richard Janík, 2013-08-08 12:06:24 UTC)
Please look more closely through the testsuite logs to find out why the server shutdown failed. One possibility is that the timeout was too low for shutdown; try increasing it if that is the case.

Update: present in 6.2.0.ER1 as well. I haven't had a look at the timeouts yet, though.

Please share some links with the results running on 6.2. Thanks.

Here are the links, tested against 6.2.0.ER1:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/63/jdk=java17_default,label=RHEL6_x86_64/testReport/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/63/jdk=openjdk-1.6.0-local,label=RHEL5_x86/testReport/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/63/jdk=openjdk-1.7.0-local,label=RHEL5_x86_64/testReport/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-SOLaris-matrix/lastCompletedBuild/jdk=java16_default,label=solaris11-x64/testReport/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-SOLaris-matrix/lastCompletedBuild/jdk=java16_default,label=sol11_sparc64/testReport/

Just a status update: seen in EAP 6.2.0.ER3. Test reports from the RHEL matrix:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix/65/testReport/
Sample log from an affected configuration:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix/65/jdk=ibm16,label=RHEL5_x86/consoleFull

This is no longer addressable in the time remaining for the release.

My observations confirm that the server is not stopped in time. It is eventually stopped correctly, but the tests that run before the shutdown completes do fail.
Below is what I observe.

node0 tcp async:
> 07:21:00,039 INFO [org.jboss.as] (MSC service thread 1-1) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-14) stopped in 42ms

node1 tcp async:
> 07:21:00,238 INFO [org.jboss.as] (MSC service thread 1-7) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-14) stopped in 58ms

Subsequent failed test cases:
> 07:20:16,816 ERROR [org.jboss.as.test.clustering.NodeUtil] (main) org.jboss.arquillian.container.spi.client.container.LifecycleException: The server is already running! Managed containers do not support connecting to running server instances due to the possible harmful effect of connecting to the wrong server. Please stop server before running or change to another type of container.
> 07:20:44,933 ERROR [org.jboss.as.test.clustering.NodeUtil] (main) org.jboss.arquillian.container.spi.client.container.LifecycleException: The server is already running! Managed containers do not support connecting to running server instances due to the possible harmful effect of connecting to the wrong server. Please stop server before running or change to another type of container.

Comparing the logs of a server that stopped before the next tests with those of the problematic server, I see some delays in the logs of the bad server. I have no idea why this happens, but one thing I'd try is increasing 'waitForPortsTimeoutInSeconds'. If that does not help, I may try adding more ports to the list, because currently it seems only the management port is used.

Created attachment 847673 [details] org.jboss.as.test.clustering.cluster.sso.ClusteredSingleSignOnTestCase-ASYNC-tcp-output.txt

I may not be able to work on this issue anymore, but I want to record some more notes here. I'm attaching org.jboss.as.test.clustering.cluster.sso.ClusteredSingleSignOnTestCase-ASYNC-tcp-output.txt, where you can see the errors thrown by Arquillian while trying to shut down the ASYNC/tcp servers.
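The 'waitForPortsTimeoutInSeconds' idea above amounts to polling the server's listening ports until none of them accept connections anymore. A minimal sketch of that kind of check, using only the JDK (the host, port list, and timeouts here are illustrative, not taken from the actual Arquillian container configuration):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortWait {

    /**
     * Polls the given ports until none of them accept a TCP connection.
     * Returns true once every port is closed, or false on timeout.
     */
    static boolean waitForPortsClosed(String host, int[] ports, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            boolean anyOpen = false;
            for (int port : ports) {
                try (Socket socket = new Socket()) {
                    // Short per-port connect timeout so one hung port cannot stall the loop.
                    socket.connect(new InetSocketAddress(host, port), 200);
                    anyOpen = true; // connection succeeded, so the server is still listening
                } catch (IOException portClosedOrRefusing) {
                    // Expected once the server has released the port.
                }
            }
            if (!anyOpen) {
                return true;
            }
            Thread.sleep(100);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Port 1 is almost never listening locally, so this should normally print true quickly.
        System.out.println(waitForPortsClosed("127.0.0.1", new int[] {1}, 2000));
    }
}
```

Widening the port list beyond the management port, as suggested above, would mean adding the HTTP and clustering ports to the `ports` array so the wait covers everything the next test's server needs to bind.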
Basically, that is an NPE on this line:
> getManagementClient().getControllerClient().executeAsync(op, null);

I don't have deep insight into how Arquillian works, but this looks to me like the Arquillian service cannot be reached on the server. Looking at the server log, I see:
> 16:18:33,409 INFO [org.jboss.as.server] (management-handler-thread - 1) JBAS018558: Undeployed "arquillian-service" (runtime-name: "arquillian-service")

But according to the error messages in the attached log file, Arquillian is trying to stop the server at 16:22:17. So the question is: why is the arquillian-service undeployed before Arquillian tries to stop the servers? I see in sso/ClusteredSingleSignOnTestCase.java that it deals with starting and stopping servers, but things look legitimate there. I'll also attach the log file from one of the failed servers.

By the way, I have only ever seen the ASYNC-TCP servers be the problem, so it could be something specific to them and to how Arquillian handles them.

Created attachment 847674 [details] ASYNC TCP node0 server log

Attaching the node0 tcp async server.log from the same test run as attachment 847673 [details].

What I forgot to stress in my last comment is that, because of the reported NPE, this cannot be a simple timeout issue; rather, it is a problem in Arquillian or perhaps in the clustering test case.

OK, now I realize that ManagedDeployableContainer.java upstream [1] has additional logic to eventually kill the server if such a condition occurs (see line 321). Perhaps we need to backport that, which should work around the issue. Alternatively, inserting time between the Maven goals could also help, but that would not be reliable and is simply ugly. It would still be nice to understand and fix the misbehavior explained in the previous two comments.

It is also very bad that ClusteredSingleSignOnTestCase is *not* reported as failed, so I had to grep through the files to get to the error messages (not to mention that I first had to know what to grep for).
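The NPE above is consistent with one of the chained getters returning null once the arquillian-service is gone. A defensive variant of that call site would at least surface a diagnosable message instead of an NPE deep inside the stop path. The interfaces below are hypothetical stand-ins mirroring the quoted chain, not the real Arquillian types:

```java
public class ShutdownGuard {

    // Hypothetical stand-ins for the types in the quoted call chain.
    interface ControllerClient { void executeAsync(String op); }
    interface ManagementClient { ControllerClient getControllerClient(); }

    /** Issues the shutdown operation only when the full client chain is available. */
    static String tryShutdown(ManagementClient client, String op) {
        if (client == null || client.getControllerClient() == null) {
            // Without this guard the original code throws a bare NPE here,
            // which is exactly the symptom described in the report.
            return "management client unavailable; server may need to be killed";
        }
        client.getControllerClient().executeAsync(op);
        return "shutdown requested";
    }

    public static void main(String[] args) {
        System.out.println(tryShutdown(null, ":shutdown"));
    }
}
```

A guard like this does not fix the underlying ordering problem (service undeployed before stop), but it turns a confusing NPE into a message that points at the real cause.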
In summary, there are three issues I see:
1. ManagedDeployableContainer.java needs the changes backported from WildFly to increase reliability.
2. The arquillian-service is undeployed before Arquillian tries to send the shutdown command.
3. A test case that fails to stop servers properly is *not* reported as failed, which makes finding the culprit of the errors and diagnostic messages hard.

[1] https://github.com/wildfly/wildfly/blob/master/arquillian/container-managed/src/main/java/org/jboss/as/arquillian/container/managed/ManagedDeployableContainer.java

Sorry for the multiple posts. Just to be complete, the exact backport from upstream that I believe will work around the issue is adding:
> try {
>     if (process != null) {
>         process.destroy();
>         process.waitFor();
>     }
> } catch (Exception ignore) {}
just before line 301 in ManagedDeployableContainer.java:
> throw new LifecycleException("Could not stop container", e);

I've created a job [1] with the workaround from comment 15 and couldn't reproduce the bug on EAP (I just patched the testsuite before running it). Apart from 2 configurations, the build looks much better. I've also created a tcpgossip job for WildFly [2] and I've caught the bug there [3]: RHEL6 64bit, openjdk1.7. The testsuite as a whole seems quite unstable.
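The upstream fallback quoted above boils down to: if the graceful stop path fails, destroy the launched server process and wait for it to exit, so no orphaned server leaks into the next test. A self-contained sketch of that pattern with a plain `java.lang.Process` (outside Arquillian, so the process being launched and the failing stop action are simulated):

```java
import java.io.IOException;

public class KillFallback {

    /** Tries a graceful stop first; on failure, forcibly destroys the process before rethrowing. */
    static void stopOrKill(Process process, Runnable gracefulStop) throws InterruptedException {
        try {
            gracefulStop.run(); // e.g. issue the management shutdown operation
        } catch (RuntimeException e) {
            // Graceful path failed (the LifecycleException case in the report):
            // make sure the server process does not linger into the next test.
            if (process != null) {
                process.destroy();
                process.waitFor();
            }
            throw e; // still report the stop failure, but with no orphaned server left behind
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // "sleep 60" stands in for a server that would otherwise outlive the test (Unix-only demo).
        Process p = new ProcessBuilder("sleep", "60").start();
        try {
            stopOrKill(p, () -> { throw new RuntimeException("simulated stop failure"); });
        } catch (RuntimeException expected) {
            System.out.println("killed, alive=" + p.isAlive());
        }
    }
}
```

This mirrors why the backport works around the bug: the next test's "The server is already running!" check no longer trips, because the previous process is guaranteed dead even when the shutdown operation itself failed.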
[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/
    config: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/configure
    build 7 (last): https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/7/
    test results: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/7/testReport/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/
    config: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/configure
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/jdk=openjdk-1.7.0-local,label=RHEL6_x86_64/lastCompletedBuild/testReport/org.jboss.as.test.clustering.cluster.singleton/SingletonTestCase%28SYNC-tcp%29/testSingletonService/
    all tests of that configuration: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/jdk=openjdk-1.7.0-local,label=RHEL6_x86_64/lastCompletedBuild/testReport/

Haven't seen this in ER2.