Bug 995013

Summary:

Failures in clustering testsuite due to servers running from previous tests

Product:

[JBoss] JBoss Enterprise Application Platform 6

Reporter:

Richard Janík <rjanik>

Component:

Clustering, Testsuite

Assignee:

Radoslav Husar <rhusar>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Richard Janík <rjanik>

Severity:

high

Docs Contact:

Russell Dickenson <rdickens>

Priority:

unspecified

Version:

6.1.1, 6.2.0

CC:

akostadi, jkudrnac, kkhan, paul.ferraro, pjelinek, rhusar

Target Milestone:

ER2

Target Release:

EAP 6.3.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Clustering testsuite

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-06-28 15:28:11 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
org.jboss.as.test.clustering.cluster.sso.ClusteredSingleSignOnTestCase-ASYNC-tcp-output.txt	none
ASYNC TCP node0 server log	none

Description Richard Janík 2013-08-08 12:06:24 UTC

Description of problem:

EAP 6.1.1.ER4

I have observed many tests failing on some configurations (again, seems random... sorry) after their respective setUp() tests fail with all kinds of errors (could not deploy deployment,...), the underlying issue is in all cases this:

17:01:59,375 ERROR [org.jboss.as.test.clustering.NodeUtil] (main) org.jboss.arquillian.container.spi.client.container.LifecycleException: The server is already running! Managed containers do not support connecting to running server instances due to the possible harmful effect of connecting to the wrong server. Please stop server before running or change to another type of container.
To disable this check and allow Arquillian to connect to a running server, set allowConnectingToRunningServer to true in the container configuration

The servers are presumably still running from some previous test. This was caught when testing with TCPGOSSIP[1], but it is unrelated. I have run the testsuite with different options and have identified this to occur only with -DextendedTests[2], testsuite without -DextendedTests[3] is clean. I've only seen this happen on RHEL and Solaris machines, though I think it is because they have the largest matrices.

This is most probably an issue of clustering testsuite, clustering itself is fine. I've heard there were some changes recently in the way the clustering testsuite is run, perhaps this has something to do with them?

[1] TCPGOSSIP, -DextendedTests

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/55/jdk=ibm17,label=RHEL5_x86/testReport/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-solaris-matrix/30/jdk=java17_default,label=sol11_sparc64/testReport/

[2] -DextendedTests, no TCPGOSSIP

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6.1.1.ER4-clustering-testsuite-extended-rjanik-tmp/1/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6.1.1.ER4-clustering-testsuite-extended-rjanik-tmp/1/jdk=java17_default,label=RHEL5_x86_64/testReport/

[3] nice clean clustering testsuite

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6.1.1.ER4-clustering-testsuite-rjanik-tmp/1/

sample error output

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6.1.1.ER4-clustering-testsuite-extended-rjanik-tmp/jdk=java17_default,label=RHEL5_x86_64/lastCompletedBuild/testReport/org.jboss.as.test.clustering.cluster.ejb3.security/FailoverWithSecurityTestCase%28SYNC-udp%29/testSetup/

Comment 1 Radoslav Husar 2013-08-08 12:31:00 UTC

Please look more closely through the testsuite logs why the server shutdown failed. One possibility is that the timeout was too low for shutdown. Try increasing it if that's the case.

Comment 2 Richard Janík 2013-09-19 10:46:01 UTC

Update: present in 6.2.0.ER1 as well. I haven't had a look at the timeouts yet, though.

Comment 3 Radoslav Husar 2013-09-23 12:36:55 UTC

Please share some links with the results running on 6.2. Thanks

Comment 4 Richard Janík 2013-09-24 05:32:04 UTC

Here are the links, tested against 6.2.0.ER1:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/63/jdk=java17_default,label=RHEL6_x86_64/testReport/

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/63/jdk=openjdk-1.6.0-local,label=RHEL5_x86/testReport/

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-rhel-matrix/63/jdk=openjdk-1.7.0-local,label=RHEL5_x86_64/testReport/

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-SOLaris-matrix/lastCompletedBuild/jdk=java16_default,label=solaris11-x64/testReport/

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-JGroups/job/eap-6x-jgroups-tcpgossip-SOLaris-matrix/lastCompletedBuild/jdk=java16_default,label=sol11_sparc64/testReport/

Comment 5 Richard Janík 2013-10-01 08:10:01 UTC

Just a status update: seen in EAP 6.2.0.ER3.

Comment 6 Richard Janík 2013-10-01 08:21:40 UTC

Test reports from RHEL matrix: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix/65/testReport/

Sample log from affected configuration: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix/65/jdk=ibm16,label=RHEL5_x86/consoleFull

Comment 10 Paul Ferraro 2013-10-16 12:54:24 UTC

This is no longer addressable in the time remaining for release.

Comment 11 Aleksandar Kostadinov 2014-01-08 20:25:15 UTC

My observations confirm that server is not stopped in time. It is actually correctly stopped but before that completes, the tests running do fail. Below is what I observe:

node0 tcp async:
> 07:21:00,039 INFO  [org.jboss.as] (MSC service thread 1-1) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-14) stopped in 42ms
node1 tcp async:
> 07:21:00,238 INFO  [org.jboss.as] (MSC service thread 1-7) JBAS015950: JBoss EAP 6.2.0.GA (AS 7.3.0.Final-redhat-14) stopped in 58ms

subsequent failed test cases:
> 07:20:16,816 ERROR [org.jboss.as.test.clustering.NodeUtil] (main) org.jboss.arquillian.container.spi.client.container.LifecycleException: The server is already running! Managed containers do not support connecting to running server instances due to the possible harmful effect of connecting to the wrong server. Please stop server before running or change to another type of container.
> 07:20:44,933 ERROR [org.jboss.as.test.clustering.NodeUtil] (main) org.jboss.arquillian.container.spi.client.container.LifecycleException: The server is already running! Managed containers do not support connecting to running server instances due to the possible harmful effect of connecting to the wrong server. Please stop server before running or change to another type of container.

Comparing logs of one server that stopped before next tests and the problematic server, I see some delays in the logs of the bad server. I've no idea why does this happen but one thing I'd try is increasing 'waitForPortsTimeoutInSeconds'. If this does not help, maybe I'll try adding more ports to the list because currently it seems only the management port is used.

Comment 12 Aleksandar Kostadinov 2014-01-09 14:01:07 UTC

Created attachment 847673 [details]
org.jboss.as.test.clustering.cluster.sso.ClusteredSingleSignOnTestCase-ASYNC-tcp-output.txt

I may not be able to work anymore on this issue. But I want to put here some more notes. I'm attaching org.jboss.as.test.clustering.cluster.sso.ClusteredSingleSignOnTestCase-ASYNC-tcp-output.txt where you can see the errors thrown by Arquillian trying to shutdown the ASYNC/tcp servers. Basically that is a NPE on this line:
> getManagementClient().getControllerClient().executeAsync(op, null);

I don't have a big insight on how arquillian works but this sounds to me like the arquillian service cannot be reached on the server. Now looking at server log I see:
> 16:18:33,409 INFO  [org.jboss.as.server] (management-handler-thread - 1) JBAS018558: Undeployed "arquillian-service" (runtime-name: "arquillian-service")
But arquillian is trying to stop server (according to the attached log file error messages) at 16:22:17

So the question is why arquillian service is undeployed before arquillian tries to stop servers? I see in sso/ClusteredSingleSignOnTestCase.java that it is dealing with starting/stopping servers but things look legitimate.

I'll also attach the log file from one of the failed servers. btw I've always seen ASYNC-TCP servers to be the problem so maybe it could be something specific to them and how arquillian handles them.

Comment 13 Aleksandar Kostadinov 2014-01-09 14:08:23 UTC

Created attachment 847674 [details]
ASYNC TCP node0 server log

attaching the node0 tcp async server.log from same test run as attachment 847673 [details]

What I forgot to stress in last comment is that because of the NPE reported, the issue cannot be a simple timeout issue rather it is a problem of arquillian or maybe the clustering test case.

Comment 14 Aleksandar Kostadinov 2014-01-09 14:27:32 UTC

Ok, now I realize that the ManagedDeployableContainer.java upstream [1] has additional logic to eventually kill the server if such condition occurs (see line 321). So perhaps we need to backport that and that should workaround the issue.
Alternatively inserting time between the maven goals can also help but wouldn't be that reliable and is simply ugly.

It would be nice though to understand and fix the misbehavior occurring as explained in the previous 2 comments. It is also very bad that ClusteredSingleSignOnTestCase is *not* reported as failed so I had to grep through the files to get to the error messages (not to say I had to first know what to grep for).

In summary there are 3 issues I see:
1. ManagedDeployableContainer.java needs to have changes backported from wildfly to increase reliability
2. arquillian service undeployed before arquillian tries to send shutdown command
3. test case that fails to stop servers properly is *not* reported as failed and that makes finding out culprit of errors and diagnostic messages hard.

[1] https://github.com/wildfly/wildfly/blob/master/arquillian/container-managed/src/main/java/org/jboss/as/arquillian/container/managed/ManagedDeployableContainer.java

Comment 15 Aleksandar Kostadinov 2014-01-09 14:40:15 UTC

Sorry for multiple posts. Just to be complete, the exact backport from upstream that I believe will workaround the issue is adding:
>            try {
>                if(process != null) {
>                    process.destroy();
>                    process.waitFor();
>                }
>            }catch (Exception ignore) {}

just before line 301 in ManagedDeployableContainer.java:
> throw new LifecycleException("Could not stop container", e);

Comment 17 Richard Janík 2014-01-15 08:54:56 UTC

I've created a job[1] with the workaround from comment 15 and couldn't reproduce the bug on EAP (I just patched the testsuite before running it). Apart from 2 configurations, the build looks much better.

I've also created a tcpgossip job for WildFly[2] and I've caught it there[3], RHEL6 64bit, openjdk1.7. The testsuite as a whole seems quite unstable.


[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/

config: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/configure

build 7 (last): https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/7/

test results: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jgroups-tcpgossip-rhel-matrix-bz995013/7/testReport/


[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/

config: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/configure


[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/jdk=openjdk-1.7.0-local,label=RHEL6_x86_64/lastCompletedBuild/testReport/org.jboss.as.test.clustering.cluster.singleton/SingletonTestCase%28SYNC-tcp%29/testSingletonService/

all tests of that configuration: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/wildfly-8x-jgroups-tcpgossip-rhel-matrix/jdk=openjdk-1.7.0-local,label=RHEL6_x86_64/lastCompletedBuild/testReport/

Comment 19 Richard Janík 2014-05-02 12:15:24 UTC

Haven't seen this in ER2.