Bug 1012454 - Intermittent failures of WSDiscoveryTestCase for WS Discovery
Status: ASSIGNED
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Web Services
Version: 6.2.0
Hardware: Unspecified OS: Unspecified
Priority: unspecified Severity: medium
Target Milestone: ---
Target Release: EAP 6.4.0
Assigned To: Rebecca Searls
QA Contact: Rostislav Svoboda
Reported: 2013-09-26 09:44 EDT by Petr Sakař
Modified: 2017-10-09 20:18 EDT (History)
Doc Type: Bug Fix
Type: Bug

Attachments
patch files (1.11 MB, application/zip)
2015-01-29 11:57 EST, Rebecca Searls
Patch for wFly900-cxf_500-SNAPSHOT (10.01 KB, patch)
2015-02-01 17:21 EST, Rebecca Searls
Patch for eap640 cxf_432-final.patch (10.06 KB, patch)
2015-02-01 17:24 EST, Rebecca Searls


External Trackers
Tracker ID Priority Status Summary Last Updated
Apache JIRA CXF-6172 None None None Never

Description Petr Sakař 2013-09-26 09:44:56 EDT
Description of problem:

WSDiscoveryTestCase fails; see http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jbossws-testsuite-hpux/jdk=jdk16_hpux,label_exp=hpux11v3/lastCompletedBuild/testReport/org.jboss.test.ws.jaxws.samples.wsdd/WSDiscoveryTestCase/testProbeAndResolve/
Error Message
	expected:<3> but was:<4>
Stacktrace
	junit.framework.AssertionFailedError: expected:<3> but was:<4>
	at junit.framework.Assert.fail(Assert.java:50)
	at org.jboss.test.ws.jaxws.samples.wsdd.WSDiscoveryTestCase.testProbeAndResolve(WSDiscoveryTestCase.java:69)
Standard Error
	WSDiscoveryTestCase ProbeMatchType address http://localhost:8080/jaxws-samples-wsdd/WSDDService
	WSDiscoveryTestCase ProbeMatchType address http://localhost:8080/jaxws-samples-wsdd2/AnotherWSDDService
	WSDiscoveryTestCase ProbeMatchType address http://localhost:8080/jaxws-samples-wsdd2/WSDDService
	WSDiscoveryTestCase ProbeMatchType address http://localhost:8080/jaxws-samples-wsdd/WSDDService


Version-Release number of selected component (if applicable):
6.2.0.ER1
6.2.0.ER2
6.2.0.ER3


How reproducible:
intermittent

Additional info:
Test coverage of the implementation in the upstream project should be inspected.

The WS Discovery feature was not approved for EAP 6.2.0; it was imported from upstream.
Comment 1 Rebecca Searls 2014-03-26 10:01:47 EDT
I did a quick evaluation of the 5 test runs containing WSDiscoveryTestCase
test failures.  Report data taken from this web site
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-62x-patched-jbossws-testsuite-matrix
Runs #16,#15,#11 (i.e. "EAP 6.2.2.CP.CR3") and #12, #10 (i.e. "EAP 6.3.0.DR4") 
were evaluated


3 FAILURE PATTERNS

1. Of the 5 test reports evaluated, the "x86_64" platform configurations failed 
   the most often.  These platforms failed in 4 of the 5 test runs evaluated.

      9 platforms X 5 test runs = 45 tests
      9 platforms X 4 test run failures = 36 failed tests
      36/45 = 80% failure rate

     jdk=ibm16,              label=RHEL5 &&      x86_64
     jdk=ibm17,              label=RHEL5 &&      x86_64
     jdk=ibm17,              label=RHEL6 &&      x86_64
     jdk=java16_default,     label=RHEL6 &&      x86
     jdk=java16_default,     label=solaris10 &&  sparc
     jdk=java16_default,     label=solaris11 &&  x86_64
     jdk=java17_default,     label=RHEL6 &&      x86_64
     jdk=openjdk-1.6.0-local,label=RHEL6 &&      x86_64
     jdk=openjdk-1.7.0-local,label=RHEL6 &&      x86_64


    - The "6.2.2.CP.CR3 -fn" tests showed a 47% failure rate
        65 platforms  X 3 test runs = 195 total tests run
        195 total tests run - 91 test failures = 104 passing tests
        91/195 = 47% failure rate

    - The "EAP 6.3.0.DR4" tests showed a 25% failure rate
        65 platforms X 2 test runs = 130 total tests run
        130 total tests run - 33 test failures = 97 passing tests
        33/130 = 25% failure rate


2. The most common failure cause was too many endpoint services with the same
   "targetname" being found on the network. 

      For test run "Mar 24 (#16) 6.2.2.CP.CR3 -fn" a total of 18 failures were due to
      this.   Here are examples of the JUnit failure statements.
         expected:<1> but was:<4>
         expected:<1> but was:<3>
         expected:<1> but was:<2>
         expected:<1> but was:<0>

      For test run "Feb 20 (#10) EAP 6.3.0.DR4" a total of 17 failures were due to
      this.  There appears to have been a code change to the test between
      "EAP 6.3.0.DR4" and "6.2.2.CP.CR3 -fn", but the behavior is still the
      same: too many endpoint services found.
          expected:<3> but was:<7>
          expected:<3> but was:<4>
          expected:<3> but was:<6>
          expected:<3> but was:<5>

3. "Mar 24 (#16) 6.2.2.CP.CR3 -fn" is suffering a 2nd intermittent error.
    The following msg was generated for 15 tests.  This appears to have
    been introduced into the code starting with build (#16).

      Could not resolve (timeout = 2000 ms) reference: 
      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <EndpointReference xmlns="http://www.w3.org/2005/08/addressing">
      <Address>urn:uuid:73645339-8571-4e3b-b3d3-8b77e4125d84</Address>
      <ReferenceParameters/></EndpointReference>
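
The failure rates quoted under pattern 1 above can be rechecked with a few lines of arithmetic. This is only a sketch; the class and method names are made up for illustration and are not part of the testsuite.

```java
// Recomputes the failure rates quoted in failure pattern 1.
// FailureRates and percent() are hypothetical names used only for this check.
final class FailureRates {

    /** Returns failed/total as a percentage rounded to the nearest integer. */
    static long percent(int failed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return Math.round(100.0 * failed / total);
    }

    public static void main(String[] args) {
        // 9 platforms x 5 runs = 45 tests; 9 platforms x 4 failing runs = 36 failures
        System.out.println(percent(9 * 4, 9 * 5) + "%");
        // "6.2.2.CP.CR3 -fn": 65 platforms x 3 runs = 195 tests, 91 failures
        System.out.println(percent(91, 65 * 3) + "%");
        // "EAP 6.3.0.DR4": 65 platforms x 2 runs = 130 tests, 33 failures
        System.out.println(percent(33, 65 * 2) + "%");
    }
}
```

The three calls reproduce the 80%, 47% and 25% figures given in the comment.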
Comment 2 Rebecca Searls 2014-03-26 12:11:16 EDT
Failures appear to be occurring because multiple test suites are being run
in parallel on the same network.  It is possible that this is being run as a "matrix job" on Jenkins, which would explain why the behavior is not reproducible outside Jenkins.

I have not found any way to retrieve any unique identifying information about
returned W3CEndpointReference objects.

The only solution I see is to change the test to check that the size of 
the array of matching endpoints is greater than zero.
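
To make the proposed change concrete, here is a minimal sketch of the two styles of check. It does not use the real test code: the class name, the method names and the use of plain strings for matches are all assumptions made for illustration.

```java
import java.util.List;

// Hypothetical helper contrasting the strict and the relaxed match-count checks.
final class MatchCountChecks {

    /** Strict check: fails as soon as any extra responder on the network answers. */
    static void assertExactly(int expected, List<String> matches) {
        if (matches.size() != expected) {
            throw new AssertionError(
                "expected:<" + expected + "> but was:<" + matches.size() + ">");
        }
    }

    /** Relaxed check: passes as long as at least one match came back. */
    static void assertAtLeastOne(List<String> matches) {
        if (matches.isEmpty()) {
            throw new AssertionError("expected at least one match but found none");
        }
    }
}
```

The relaxed form tolerates extra responders, but it also stops detecting duplicate or foreign matches, so it trades test strictness for stability.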
Comment 3 Petr Sakař 2014-03-26 12:19:20 EDT
Are you sure there is not a fundamental mistake in the implementation? We bind EAP in our tests to the loopback interface only (127.0.0.1). Thus any announced WS has meaning only for the host where it is running, not for other hosts on the network (the address is 127.0.0.1, so they cannot access it).
Comment 4 Rebecca Searls 2014-03-26 12:56:59 EDT
org.apache.cxf.ws.discovery.WSDiscoveryClient
397  disp.getRequestContext().put("udp.multi.response.timeout", timeout);

The code above, which the test is using, makes a UDP multicast call. ...
"UDP is different than the other CXF transports in that it allows multiple 
 responses to be received for a single request. For example, if you send 
 out a request via a multicast or broadcast, several servers could respond 
 to that request.   ..."
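
The "several responses to one request" behavior can be illustrated without CXF. The sketch below is not the CXF implementation; it only assumes that a response timeout bounds how long the client keeps reading replies. The class name, the method names and the 500 ms window are invented for the demo, which runs entirely on the loopback interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: one UDP client socket, many replies, one time window.
final class MultiResponseCollector {

    /** Reads every datagram arriving on the socket until timeoutMs has elapsed. */
    static List<String> collect(DatagramSocket socket, int timeoutMs) {
        List<String> responses = new ArrayList<>();
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        byte[] buffer = new byte[64 * 1024];
        try {
            while (true) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    break; // window is over (also avoids setSoTimeout(0) = wait forever)
                }
                socket.setSoTimeout((int) remainingMs);
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                try {
                    socket.receive(packet);
                } catch (SocketTimeoutException e) {
                    break; // no further replies within the window
                }
                responses.add(new String(packet.getData(), packet.getOffset(),
                        packet.getLength(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return responses;
    }

    /** Loopback demo: each stand-in responder sends one reply to the same client socket. */
    static int demoOnLoopback(int responders) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket client = new DatagramSocket(0, loopback)) {
            for (int i = 0; i < responders; i++) {
                try (DatagramSocket responder = new DatagramSocket(0, loopback)) {
                    byte[] body = ("reply from responder " + i).getBytes(StandardCharsets.UTF_8);
                    responder.send(new DatagramPacket(body, body.length,
                            loopback, client.getLocalPort()));
                }
            }
            return collect(client, 500).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a multicast request, every host that answers inside the window adds one entry to the list, which is how a test expecting a fixed count can end up with more matches than it deployed.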
Comment 5 Rebecca Searls 2014-03-26 12:58:51 EDT
Committed revision 18542
      [BZ-1012454] check that the number of matches is GT 0.
Comment 6 Rebecca Searls 2014-03-26 13:32:33 EDT
The above bug fix does not address this error.

Could not resolve (timeout = 2000 ms) reference: 
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<EndpointReference xmlns="http://www.w3.org/2005/08/addressing">
<Address>urn:uuid:73645339-8571-4e3b-b3d3-8b77e4125d84</Address>
<ReferenceParameters/></EndpointReference>

Every endpoint is assigned a unique uuid, by
     org.apache.cxf.ws.discovery.WSDiscoveryClient
     438     builder.address(ContextUtils.generateUUID());
however the value is a private property of javax.xml.ws.wsaddressing.W3CEndpointReference and not accessible.
Comment 7 Rebecca Searls 2014-03-26 15:17:14 EDT
I found a means to print out the uuid for each endpoint.  I've checked in this temporary code to help debug Jenkins runs.  It will be removed in the near future.
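
The comment does not say which means was used. One possibility, shown here only as a sketch, is to serialize the reference (W3CEndpointReference inherits the public toString() and writeTo(Result) methods from javax.xml.ws.EndpointReference) and read the Address element from the resulting XML. The class and method names below are hypothetical, and the code works on the XML string so that it does not need the JAX-WS API on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: extracts the wsa:Address text from a serialized endpoint reference.
final class EprAddressReader {

    private static final String WSA_NS = "http://www.w3.org/2005/08/addressing";

    /** Returns the text of the first Address element in the WS-Addressing namespace. */
    static String addressOf(String eprXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-based lookup below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(eprXml)));
            NodeList addresses = doc.getElementsByTagNameNS(WSA_NS, "Address");
            if (addresses.getLength() == 0) {
                throw new IllegalArgumentException("no Address element found");
            }
            return addresses.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("not a parsable endpoint reference", e);
        }
    }
}
```

Passing the XML from the "Could not resolve" message above to addressOf would yield the urn:uuid value of that endpoint.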
Comment 9 Rebecca Searls 2014-08-18 10:36:34 EDT
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-rhel/44/testReport
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-rhel/45/testReport/

The evaluation of WSDiscoveryTestCase failures for runs 44 and 45 on eap-6x-jbossws-testsuite-rhel shows that multiple machines are responding to the UDP multicast call made by this code.  Between the 2 runs there is some small overlap in the test platforms that show the failure, but it is not consistent enough to declare it a platform-specific issue.

The bug report notes that this failure only occurs on Jenkins and not when the
test is run by individuals.  Is this test suite on Jenkins being run in parallel 
on the same network?
Comment 10 Jan Blizňák 2014-08-28 16:31:55 EDT
Yes, as you can see in the linked Jenkins jobs, these are matrix jobs that could be executing many testsuites in parallel.

And yes, those Jenkins nodes are in the same network.

Is that a problem for testing WSDiscovery? Can we do something about it?
Is it possible to rewrite the test to handle such a situation?
Comment 11 Jan Blizňák 2014-09-10 07:58:19 EDT
I hit the failure in a single run with 6.3.1.CP.CR2 when there were no other WS testsuites running concurrently.
The odd thing here is that three services with the same uuid were discovered.


https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-jbossws-testsuite-prepare/84/testReport/junit/org.jboss.test.ws.jaxws.samples.wsdd/WSDiscoveryTestCase/testProbeAndResolve/

Error Message

 http://localhost:8080/jaxws-samples-wsdd/WSDDService  urn:uuid:9ecf3c4d-eea6-4ec9-a067-4898edf8e0b8
 http://localhost:8080/jaxws-samples-wsdd/WSDDService  urn:uuid:9ecf3c4d-eea6-4ec9-a067-4898edf8e0b8
 http://localhost:8080/jaxws-samples-wsdd/WSDDService  urn:uuid:9ecf3c4d-eea6-4ec9-a067-4898edf8e0b8
 expected:<1> but was:<3>

Stacktrace

junit.framework.AssertionFailedError: 
http://localhost:8080/jaxws-samples-wsdd/WSDDService  urn:uuid:9ecf3c4d-eea6-4ec9-a067-4898edf8e0b8
http://localhost:8080/jaxws-samples-wsdd/WSDDService  urn:uuid:9ecf3c4d-eea6-4ec9-a067-4898edf8e0b8
http://localhost:8080/jaxws-samples-wsdd/WSDDService  urn:uuid:9ecf3c4d-eea6-4ec9-a067-4898edf8e0b8
 expected:<1> but was:<3>
	at junit.framework.Assert.fail(Assert.java:50)
	at junit.framework.Assert.failNotEquals(Assert.java:287)
	at junit.framework.Assert.assertEquals(Assert.java:67)
	at junit.framework.Assert.assertEquals(Assert.java:199)
	at org.jboss.test.ws.jaxws.samples.wsdd.WSDiscoveryTestCase.checkResolveMatches(WSDiscoveryTestCase.java:156)
	at org.jboss.test.ws.jaxws.samples.wsdd.WSDiscoveryTestCase.testProbeAndResolve(WSDiscoveryTestCase.java:80)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:243)
	at junit.framework.TestSuite.run(TestSuite.java:238)
	at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
	at org.jboss.wsf.test.JBossWSTestSetup$1.protect(JBossWSTestSetup.java:142)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at org.jboss.wsf.test.JBossWSTestSetup.run(JBossWSTestSetup.java:149)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:234)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:133)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:188)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:166)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:101)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)
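
The three entries in the failure message above are identical in both address and uuid, which suggests one service reported three times rather than three services. A small sketch of how a test could tell the two cases apart is given below. Whether duplicates should be tolerated at all is an open question in this bug; the class and method names are hypothetical, and matches are modeled as the "address uuid" strings printed in the message.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: collapses repeated reports of the same "address uuid" pair.
final class DuplicateMatches {

    /** Returns the distinct matches in first-seen order, ignoring whitespace differences. */
    static Set<String> distinct(List<String> matches) {
        Set<String> unique = new LinkedHashSet<>();
        for (String match : matches) {
            unique.add(match.trim().replaceAll("\\s+", " "));
        }
        return unique;
    }
}
```

If distinct(...) has size 1 while the raw list has size 3, the extra entries are repeats of a single endpoint rather than answers from other deployments.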
Comment 12 Jan Blizňák 2014-12-02 10:14:02 EST
There are actually more issues here:

1) In a pure IPv6 environment no WS is found during the PROBE phase, which is why org.jboss.test.ws.jaxws.samples.wsdd.WSDiscoveryTestCase.testProbeAndResolve always fails with:

Error Message
  expected:<1> but was:<0>

This might be related to https://issues.jboss.org/browse/JBWS-3721 https://issues.jboss.org/browse/JBWS-3778

Note: the interesting thing here is that this applies only to RHEL; in pure IPv6 environments on Windows this test has never failed, from what I can see in Jenkins:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-rhel-ipv6-pure/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-windows-ipv6-pure/



2) Failures caused by concurrent execution of the testsuite on the same network:

a) Although each server is bound to localhost only, when a WS-Discovery-enabled service is deployed, it starts listening on UDP port 3702 even for remote (non-loopback interface) requests.
b) As a result, web services that are not hosted locally are also discovered in the PROBE phase of #testProbeAndResolve and #testInvocation. 
c) All discovered WS are filtered by #filterProbeMatchesForHost, but since all these web services are deployed on localhost, it passes them all on for further processing.
d) The resolving phase is then where the test most probably fails on timeout, because a web service hosted elsewhere was discovered in the previous step and has been undeployed since then.
e1) In #testInvocation there is a hidden issue: because we get the port by address from getXAddrs (which is always localhost), we actually execute the web service hosted on the current machine (even when the remote web service is still active).
e2) #testProbeAndResolve checks that each web service is discovered only once, which isn't always true because of steps b) and c).  


3) A special case of #testProbeAndResolve failing because it discovers three services with the same uuid (even when no other concurrent executions are running).
I managed to isolate the issue to a specific (beaker) machine (there might be more of them); it fails every time on it. I haven't found any other indications yet.

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-smoke/178/testReport/junit/org.jboss.test.ws.jaxws.samples.wsdd/WSDiscoveryTestCase/testProbeAndResolve/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-smoke/179/testReport/junit/org.jboss.test.ws.jaxws.samples.wsdd/WSDiscoveryTestCase/testProbeAndResolve/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-WS/job/eap-6x-jbossws-testsuite-smoke/180/testReport/junit/org.jboss.test.ws.jaxws.samples.wsdd/WSDiscoveryTestCase/testProbeAndResolve/
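
Step 2c) can be illustrated with a small sketch. The real #filterProbeMatchesForHost is not quoted in this report, so the method below is only a hypothetical stand-in that keeps the addresses whose host equals the given name. It shows why such a filter cannot separate local from remote services when every server advertises "localhost".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a host-based filter over advertised XAddrs.
final class HostFilterSketch {

    /** Keeps the addresses whose URI host equals the given host name. */
    static List<String> filterForHost(List<String> xAddrs, String host) {
        List<String> kept = new ArrayList<>();
        for (String addr : xAddrs) {
            // getHost() may return null for unusual URIs; equalsIgnoreCase(null) is false
            if (host.equalsIgnoreCase(URI.create(addr).getHost())) {
                kept.add(addr);
            }
        }
        return kept;
    }
}
```

A match announced by another machine still carries http://localhost:8080/... as its address, so filtering for "localhost" keeps it together with the local one.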



My questions are:

1) Is responding to UDP multicast on a public interface correct even when the server is accessible only on the loopback interface? Or is this just a special case that we should deal with by configuring the firewall?
2) Why is an address with "localhost" sent back in ProbeMatch and ResolveMatch for a remotely hosted service when it is obvious we cannot access it? The XAddrs specification says: "Transport address(es) that MAY be used to communicate with the Target Service (or Discovery Proxy). Contained URIs MUST NOT contain whitespaces. If a Target Service (or Discovery Proxy) has transport addresses (see Section 2.1 Endpoint References) at least one transport address MUST be included. If omitted or empty, no implied value." http://docs.oasis-open.org/ws-dd/discovery/1.1/os/wsdd-discovery-1.1-spec-os.html


If both of the above are expected, then the problematic part for testing is the filtering in 2c). A different approach might be needed. Or maybe this can be solved by binding the server and testsuite to the real IP address of the current machine (but I don't know whether it would break anything else in the TS). Another attempt might be to use dynamic names (with some pseudorandom element) for the web services in each deployment.
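
The "dynamic names" idea could look roughly like the sketch below. It is only an illustration of the naming scheme: how a generated name would be wired into the deployments and the probe is not covered here, and the class and method names are hypothetical.

```java
import java.util.UUID;

// Hypothetical helper for per-run service names with a random element.
final class UniqueServiceNames {

    /** Appends a random UUID so each test run advertises a name no other run uses. */
    static String uniqueName(String baseName) {
        return baseName + "-" + UUID.randomUUID();
    }

    /** True when a discovered name is exactly the one this run generated. */
    static boolean isOwn(String discoveredName, String ownName) {
        return ownName.equals(discoveredName);
    }
}
```

A run would generate its name once, deploy under it, and then ignore every discovered service for which isOwn returns false, so matches from parallel runs on the same network would no longer be counted.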
Comment 13 Rebecca Searls 2015-01-29 11:57:24 EST
Created attachment 985684 [details]
patch files
Comment 14 Rebecca Searls 2015-01-29 11:59:25 EST
Based upon an email discussion "Re: bz-1012454 next steps" 12/22/2014,
I have modified our modules/addons/transports/udp/src/main/java/org/jboss/wsf/stack/cxf/addons/transports/udp/*
files to use the same implementation as CXF.  Here is the file list.

  u modules/addons/transports/udp/pom.xml
  u modules/addons/transports/udp/src/main/java/org/jboss/wsf/stack/cxf/addons/transports/udp/UDPConduit.java
  u modules/addons/transports/udp/src/main/java/org/jboss/wsf/stack/cxf/addons/transports/udp/UDPDestination.java
  a modules/addons/transports/udp/src/main/java/org/jboss/wsf/stack/cxf/addons/transports/udp/IoSessionInputStream.java
  a modules/addons/transports/udp/src/main/java/org/jboss/wsf/stack/cxf/addons/transports/udp/IoSessionOutputStream.java

This code change required the addition of the archive
    <groupId>org.apache.mina</groupId>
    <artifactId>mina-core</artifactId>
to the pom.xml. This archive must also be added as a module to the JBoss server, and a reference
to it added to the module modules/system/layers/base/org/jboss/ws/cxf/jbossws-cxf-transports-udp/main/

I tested these code changes in a single-machine environment with JBossWS CXF stack (4.3.2.Final), 
the version used by EAP 6.4.0, and with JBossWS CXF stack (5.0.0-SNAPSHOT).

I have attached patch and zip files for both versions (bz1012454.zip).
Comment 17 Rebecca Searls 2015-02-01 17:21:55 EST
Created attachment 986852 [details]
Patch for wFly900-cxf_500-SNAPSHOT
Comment 18 Rebecca Searls 2015-02-01 17:24:40 EST
Created attachment 986853 [details]
Patch for eap640 cxf_432-final.patch
Comment 25 Jan Blizňák 2015-12-09 09:51:35 EST
Regarding comment #12, the first issue (discovery not working in an IPv6 network) was fixed in https://issues.apache.org/jira/browse/CXF-6172. 
The other two issues are still relevant.
