1310739 – [QA] (6.2.z)Performance degradation when using REST API on WebSphere

Bug 1310739 - [QA] (6.2.z)Performance degradation when using REST API on WebSphere

Summary: [QA] (6.2.z)Performance degradation when using REST API on WebSphere

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	JBoss BPMS Platform 6
Classification:	Retired
Component:	Business Central
Sub Component:
Version:	6.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	CR1
Target Release:	6.2.3
Assignee:	Shelly McGowan
QA Contact:	Lukáš Petrovický
Docs Contact:
URL:
Whiteboard:
Depends On:	1299549
Blocks:	1327763 1327765
TreeView+	depends on / blocked

Reported:	2016-02-22 15:15 UTC by Alessandro Lazarotti
Modified:	2020-03-27 19:37 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:	1299549
Environment:
Last Closed:	2020-03-27 19:37:00 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)
Contents of the system out log showing (un/marshaller) pool growth (197.70 KB, text/plain) 2016-03-25 17:47 UTC, Marco Rietveld	no flags	Details
javacore thread stack dump excerpt (11.23 KB, text/plain) 2016-03-25 17:58 UTC, Marco Rietveld	no flags	Details
View All

Description Alessandro Lazarotti 2016-02-22 15:15:03 UTC

+++ This bug was initially created as a clone of Bug #1299549 +++

Description of problem:
REST API calls are becoming significantly slower over time when Business Central is run on WebSphere.

Version-Release number of selected component (if applicable):
6.2.0 GA

Steps to Reproduce:
1. Run WasRestPerformanceTest from jBPM Integration test suite [1]
2. See how test duration time is raising.

Actual results:
The same process run many times is becoming slower and slower. Here are the times to run 100 processes (in ms):
9937
20448
23608
29905
45294
60033

Expected results:
The execution should always take almost the same amount of time.

Additional info:
This is reproducible using both direct REST calls and provided remote client classes. The issue affects only REST interface. When you run the same processes using SOAP or JMS, you will see that there is no performance degradation over time.


[1] https://gitlab.mw.lab.eng.bos.redhat.com/bxms/brms/blob/master/test-jbpm-integration/src/test/java/org/jboss/qa/bpms/jbpm/integration/remote/WasRestPerformanceTest.java

--- Additional comment from Tomas Livora on 2016-02-11 05:48:39 EST ---

This issue causes many problems when running tests from our test suite. The tests run 3-4 hours instead of 20 minutes on other containers. And there are also many failures caused by timeouts.

Comment 3 Marco Rietveld 2016-03-02 13:32:08 UTC

Tomas, I'm starting to suspect that there are some threads that are not being closed. Also, unfortunately, a .phd (portable heap dump) file misses some of the details that (non-ibm) heap dumps contain. 

Would you be able to provide an "after-tests" system dump of websphere? 
See http://wiki.eclipse.org/MemoryAnalyzer#System_Dumps_and_Heap_Dumps_from_IBM_Virtual_Machines for more info. 

The best would be if you could get "after-tests" system, phd and javacore dumps (all 3). 

Thanks!

Comment 5 Marco Rietveld 2016-03-10 14:24:45 UTC

Tomas, 

If possible, I need at least one of the following: 

1. A PHD file that's created at the same time as a javacore file. 
2. A system dump. 

If you add the following parameter to the (command-line) startup of websphere, then the websphere instance will create these dumps.  

-Xdump:system+heap+java:events=systhrow+user

In short, this argument says: 
a. please create system and heap and javacore dumps
b. Do that if there is an system generated exception or if the user submits control break. 

If the server is being started in the background, you can do "kill -INT <server-pid> in order to send a control-break (INT) signal to the process. 


More information:  

Dump types:

    system - a system process dump. Process system dump files with jextract before loading them into Memory Analyzer. Do not give them a .sdff extension as that is only used for Java 1.4.2 system dumps
    heap - a Portable Heap Dump (PHD) file. Contains all objects and classes, but no thread details
    javacore - a readable file which contains information about class loaders and which can be used by Memory Analyzer when reading PHD files

Events:

    systhrow - when a system generated exception is thrown
    user - the user has typed control-break

Comment 6 Marco Rietveld 2016-03-14 10:51:24 UTC

Status 14-03: 

This bug remains very similar to looking for a needle in a haystack, because there's no real clue besides "a memory leak in Websphere with REST operations".

Last week I spent most of the time either using the (Eclipse) MAT tool to go through the dumps or otherwise generate new dumps. Unfortunately, that has not really helped. Generating a system dump remains problematic, although I'm not sure how much it would help to have one. 


At the moment, I've just installed other IBM tools (Support Assistant, MDD4j, TMDA, etc) and I'm also seeing if I can generate the leak in other ways (kie-server REST tests, etc).

Comment 7 Marco Rietveld 2016-03-25 17:47:47 UTC

Created attachment 1140347 [details]
Contents of the system out log showing (un/marshaller) pool growth

Comment 8 Marco Rietveld 2016-03-25 17:55:51 UTC

Explanation: 

In short, the problem seems to be caused by a combination of the apache wink code[1] used in Websphere 8.5.5.x and the deployment-specific serialization mechanisms (org.kie.remote.services.rest.jaxb.DynamicJaxbContext) used in kie-remote-services module.

The Apache Wink (jax-rs) JAXB code pools JAXB marshallers and unmarshallers[2] and when these marshaller/unmarshaller pools reach a certain size[3], the pool is halved: every other entry in the pool is removed from the pool. 

A ConcurrentHashMap<JAXBContext, ArrayList<V>> is used as the collection for the pooling algorithm, where V is either a javax.xml.bind.Marshaller or javax.xml.bind.Unmarshaller. 

The problem is that whether the pool is made smaller[3] or added to[7], the underlying java.util.concurrent.ConcurrentHashMap(.Segment) code (in both the Oracle 6 JDK and the IBM 6 JDK) ends up calling the .equals(Object) method on the key used in the HashMap, namely a org.kie.remote.services.rest.jaxb.DynamicJaxbContext bean instance[8].

The DynamicJaxbContext instance used in the Workbench/Business Central is an @ApplicationScoped bean, which means that the instance used by the Apache Wink JAXB code is (javassist) proxy instance created by the Apache OpenWebBeans framework[6], which is apparently used as the CDI implemention in Websphere.

Calling the .equals(Object) on the proxy/bean instance of the DynamicJaxbContext means that the OpenWebBeans code has to search for the singleton implementation of the bean, which is not a trivial operation. this .equals(Object) seems to *not* return true when used on the identical instance.  Implementing (overriding) an .equals(Object) method in the DynamicJaxbContext class did not help which makes me believe that the OpenWebBeans(/javassist) bean uses/has it's own implementation of .equals(Object). 

In short, not only are multiple instances of the same DynamicJaxbContect bean instance being added to the Pool, every time that a DynamicJaxbContext bean instance is added or removed, the resulting call to the .equals(Object) method results in an inefficient call to the OpenWebBeans code making it search for the original Singleton. 

The following comment explains further evidence for this as well as how the problem was partially confirmed. 

[1] Based on the line-number analysis, the code used in websphere 8.5.5.9 appears to be based on Wink
1.1.3-incubating ( https://github.com/apache/wink/tree/wink-1.1.3-incubating ). 
[2] https://issues.apache.org/jira/browse/WINK-222 "Pool the JAXB Unmarshallers/Marshallers for a performance improvement" 
[3] https://github.com/apache/wink/blob/wink-1.1.3-incubating/wink-common/src/main/java/org/apache/wink/common/internal/providers/entity/xml/AbstractJAXBProvider.java#L887-L890
[4] https://github.com/apache/wink/blob/wink-1.1.3-incubating/wink-common/src/main/java/org/apache/wink/common/internal/providers/entity/xml/AbstractJAXBProvider.java#L957-L977
[5] https://github.com/apache/wink/blob/wink-1.1.3-incubating/wink-common/src/main/java/org/apache/wink/common/internal/providers/entity/xml/AbstractJAXBProvider.java#L113-L116
    https://github.com/apache/wink/blob/wink-1.1.3-incubating/wink-common/src/main/java/org/apache/wink/common/internal/providers/entity/xml/AbstractJAXBProvider.java#L883-L886
[6] Based on line-number comparison/analysis, the code in websphere 8.5.5.9 appears to be base on
OpenWebBeans 1.1.0 ( https://github.com/apache/openwebbeans/tree/openwebbeans-1.1.0 )
[7] https://github.com/apache/wink/blob/wink-1.1.3-incubating/wink-common/src/main/java/org/apache/wink/common/internal/providers/entity/xml/AbstractJAXBProvider.java#L932-L955
[8] Confirmed using jd-gui on the jdk rt.jar (and java.util.jar for the IBM JDK). 
    jd-gui: http://jd.benow.ca/

Comment 9 Marco Rietveld 2016-03-25 17:58:53 UTC

Created attachment 1140362 [details]
javacore thread stack dump excerpt

Comment 10 Marco Rietveld 2016-03-25 19:07:04 UTC

Proof: 

There are two files attached that contain additional info used to make the concliusions above: 

A. javacore-jaxb-pooling-thread.log ( https://bugzilla.redhat.com/attachment.cgi?id=1140362 ) 
B. pool-add-system-out.log ( https://bugzilla.redhat.com/attachment.cgi?id=1140347 )


A. javacore dump excerpt

The javacore file is a "dump" of the current stacks for (java) threads at the moment of the dump. In this case, this was achieved  by sending a SIGQUIT signal to the websphere process while a REST call was being processed (after configuring Websphere with the IBM JDK -Xdump option). 

Of note are the following: 
- the CPU time for the thread is unusually high ( ~361 secs )
- the "  at java/util/concurrent/ConcurrentHashMap$Segment.remove(ConcurrentHashMap.java:587(Compiled Code))" corresponds to the following line of code in the IBM JDK used with Websphere: 

  while ((localHashEntry2 != null) && ((localHashEntry2.hash != paramInt) || (!paramObject1.equals(localHashEntry2.key)))) {

  (note the .equals(Object) call on paramObject1). This code is always used regardless of how an
  Entry is removed from the ConcurrentHashMap. 

B. an excerpt from the system.out log containing the following debug statement: 

   System.out.println( "Adding " + key.getClass().getSimpleName() + ": " + System.identityHashCode(key) + "[" + map.size() +"]");

   In this case the "key" variable is a DynamicJaxbContext (OpenWebBeans @ApplicationScoped) bean instance and the "map" variable is the ConcurrentHashMap instance that underlies the Apache Wink AbstractJAXBProvider.Pool class. 

   The content of the attachment shows that the same instance of the DynamicJaxbContext (proxy/javassist) bean instance is being continually added to the ConcurrentHashMap. This would seem to indicate that the .equals(Object) is returning false even when it's being called on the same instance. 

   (As noted in the previous comment, I also tried to override the .equals(Object) method in the DynamicJaxbContext, however this did not produce a significant enough change in performance). 


   This debug output was acheived by doing the following: 

   a. 'grep'ping through a locally installed websphere instance in order to find the jars containing the Apache Wink AbstractJAXBProvider compiled class.
   b. Checking out the Apache Wink repository and, based on line-number analysis, identifying the version of the Apache Wink code being used in the Websphere 8.5.5.9 instance installed locally on my laptop. 
   c. Opening the Apache Wink module (wink-common) in the wink git repo containing the AbstractJAXBProvider class, modifying it (see [1] for the modifications), and letting Eclipse compile the class. 
   d. Replacing the class in the original jars identified in (a) with the compiled class created in (c). (Unzip the original jar, replace the extraced files, and "rezip" the jar to create a modified jar)
   e. Starting websphere and running the tests. 

   Furthermore, while the tests initially would take 25seconds and then plateau somewhere between 7 and 10 minutes, the tests run with the modified code (which avoids the use of the .equals(Object) method in the code used to make the pool smaller) never lasted longer than 20 seconds. 

[1] The modified code used to confirm the problem: 

926:    /**
927:     * Get or create a list of the values for the key
928:     *
929:     * @param key
930:     * @return list of values (never null)
931:     */
932:    private List<V> getValues(K key) {
933:        ConcurrentHashMap<K, ArrayList<V>> map = softMap.get();
934:        ArrayList<V> values = null;
935:        if (map != null) {
936:            values = map.get(key);
937:            if (values != null) {
938:                return values;
939:            }
940:        }
941:        synchronized (this) {
942:            if (map != null) {
943:                values = map.get(key);
944:            }
945:            if (values == null) {
946:                if (map == null) {
947:                    map = new ConcurrentHashMap<K, ArrayList<V>>();
948:                    softMap = new SoftReference<ConcurrentHashMap<K, ArrayList<V>>>(map);
949:                }
950:                values = new ArrayList<V>();
951:                System.out.println( "Adding " + key.getClass().getSimpleName() + ": " + System.identityHashCode(key) + "[" + map.size() +"]");
952:                map.put(key, values);
953:            }
954:            return values;
955:        }
956:    }
957:
958:    /**
959:     * When the number of keys exceeds the maximum load, half of the entries
960:     * are deleted. The assumption is that the JAXBContexts, UnMarshallers,
961:     * Marshallers, etc. require a large footprint.
962:     */
963:    private void adjustSize() {
964:        ConcurrentHashMap<K, ArrayList<V>> map = softMap.get();
965:        if (map != null && map.size() > MAX_LOAD_FACTOR) {
966:            // Remove all entries
967:            map.clear();
968:        }
969:    }

There are 2 changes here: 

- line 951 includes a (System.out.println) debug statement
- line 967 replaces the original "halving" code with a simple clear() in order to acheive making the map smaller without .equals(Object) being called on a key of the map. 

The original "halving" code can be found here: 
https://github.com/apache/wink/blob/wink-1.1.3-incubating/wink-common/src/main/java/org/apache/wink/common/internal/providers/entity/xml/AbstractJAXBProvider.java#L965-L974

Comment 11 Marco Rietveld 2016-03-25 19:20:49 UTC

Lastly, the following information is important to have: 

The DynamicJaxbContext logic is a solution to the following: 

1. Provide a JAXBcontext that contains the relevant classes from a deployment to be used when deserializing a REST request
2. Which (deployment) JAXBContext should be used is based on the URL or HTTP headers used in a request. 
3. The only information provided to the (JAX-RS) ContextResolver (that produces the JAXBContext) is the (java) type -- and of course, multiple deployments can contain classes with identical java type names. 
4. Make sure that this JAXBContext logic reacts to (CDI) deployment and undeployment events. 

I'm not sure yet whether it's possible to create a solution the above problem while avoiding the use of a JAXBContext instance that is also a CDI scoped bean.

Comment 12 Marco Rietveld 2016-04-06 11:49:09 UTC

Fixed on 6.4.x. Commits: 

6.4.x: 
https://github.com/droolsjbpm/droolsjbpm-integration/commit/e8c4332f

Pull request for 6.3.x: 
https://github.com/droolsjbpm/droolsjbpm-integration/pull/403

Comment 13 Marco Rietveld 2016-04-11 13:20:24 UTC

Unfortunately, while the performance problem is fixed, it looks like custom-type serialization is now broken. 

Currently working on fixing that..

Comment 14 Marco Rietveld 2016-04-11 13:42:29 UTC

(custom-type serialization is only broken on websphere, btw).

Comment 15 Marco Rietveld 2016-04-13 14:07:29 UTC

The above problem (custom-type serialization problems after the fix) can be fixed by changing the *configuration* of the websphere instance. 

In the custom properties of the JVM instance (reachable via the server configuration), add the following custom property: 

Property name:  org.apache.wink.jaxbcontextcache
Property value: off

This fixes the custom-type serialization problem. 

Also, obviously, it turns off caching for the JAXBContext instances. This makes performance a little bit slower, but not significantly -- especially since for the kie-remote-* services, caching isn't needed in the first place. Furthermore, it's also not needed for the kie-server REST API nor the guvnor REST API.

Comment 16 Tomas Livora 2016-05-09 09:44:20 UTC

Verified on BPM Suite 6.2.3 CR1

Note You need to log in before you can comment on or make changes to this bug.