Bug 1250140

Summary: [scale] - org.jboss.resteasy.spi.ResteasyProviderFactory potential leak
Product: Red Hat Enterprise Virtualization Manager
Reporter: Eldad Marciano <emarcian>
Component: ovirt-engine
Assignee: Juan Hernández <juan.hernandez>
Status: CLOSED CURRENTRELEASE
QA Contact: Eldad Marciano <emarcian>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.5.4
CC: bazulay, bmcclain, gklein, juan.hernandez, lpeer, lsurette, ncredi, oourfali, pstehlik, rbalakri, rgolan, Rhev-m-bugs, s.kieske, srevivo, ykaul
Target Milestone: ovirt-3.6.0-rc
Keywords: Automation, AutomationBlocker, ZStream
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: 3.6.0-10
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1255767 (view as bug list)
Environment:
Last Closed: 2016-04-20 01:26:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1255767

Description Eldad Marciano 2015-08-04 15:12:24 UTC
Description of problem:
The engine runs out of memory (OOM) after 1835 REST actions.
There seems to be a class loader leak around:
 org.jboss.resteasy.spi.ResteasyProviderFactory

which is run by 'Worker' threads and 'org.ovirt.thread.pool' threads.

The use case is driven by a QE engine that serves many Jenkins jobs.

Reproducing the bug on top of a synthetic engine is required.

On the other hand, QE Automation will double-check their automation code for risky areas such as (api.disconnect).


link for heap file:
http://file.tlv.redhat.com/gklein/heap-20635-2015-08-02_13-01-02.bin.gz

I will update the bug with further information ASAP.

Version-Release number of selected component (if applicable):
3.5.4

How reproducible:
100%

Steps to Reproduce:
1. Engine with 2 GB of RAM.
2. Run 734 tests 2.5 times per day until the problem happens (1835 REST actions).

Actual results:
OOM after a while.

Expected results:
Continuous hours of operation.

Additional info:

Comment 1 Eldad Marciano 2015-08-05 11:18:07 UTC
Nelly, 
How many connections does Jenkins handle?
Is it one connection per test, one per team, or one connection for all?

Comment 2 Juan Hernández 2015-08-17 11:59:15 UTC
What that heap dump shows is that the server has created 44 instances of the "com.sun.xml.bind.v2.runtime.JAXBContextImpl" class. Instances of this class store all the information required to convert any of the objects used by the RESTAPI to/from XML, and each of them consumes approx 22 MiB, for a total of approx 115 MiB.

All these "JAXBContextImpl" instances are created by the Resteasy builtin JAXB provider "org.jboss.resteasy.plugins.providers.jaxb.JAXBContextWrapper" and stored in a cache indexed by type of object:

  DataCenter -> First instance
  Cluster -> Second instance
  VM -> Third instance
  ...

In general these instances may contain different information, but in our case all of them are identical, so one would be enough. However, this isn't how the builtin JAXB provider works.

If we want to improve this we need to backport the following change, which introduces a custom message body writer that creates only one JAXB context implementation:

  restapi: JAXB provider
  https://gerrit.ovirt.org/29789

Oved, please set the 3.5.z flag and acks if you want this backported.

Comment 3 Oved Ourfali 2015-08-17 12:10:09 UTC
Sounds like we should. 
I've set flags and target release accordingly.

Comment 4 Nelly Credi 2015-08-17 12:43:23 UTC
one per execution

Comment 5 Juan Hernández 2015-08-17 13:53:58 UTC
The two backported patches should fix this issue. To verify that the issue is fixed, check that the number of instances of the JAXBContextImpl class doesn't increase when new types of objects are requested via the RESTAPI:

  # ps -u ovirt | grep java
  22143 ?        00:00:24 java

  # su - ovirt -s /bin/sh

  $ jmap -histo 22143 | grep 'JAXBContextImpl$'
  1057: 4 352 com.sun.xml.bind.v2.runtime.JAXBContextImpl

Before the fix the number of instances will increase when new types of objects are requested. After the fix the number of instances (the second column, 4 in the example above) should stay constant.
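
The check above can be made repeatable with a small helper. The following is only a sketch, not part of oVirt: the function name jaxb_context_count is hypothetical, and it assumes the usual "jmap -histo" row layout of rank, instance count, bytes and class name. It reads a histogram on standard input and prints the instance count for classes whose name ends in JAXBContextImpl.

```shell
#!/bin/sh
# Hypothetical helper (not part of oVirt): read `jmap -histo` output on stdin
# and print the total instance count (second column) of every class whose
# name ends with the given suffix (default: JAXBContextImpl).
jaxb_context_count() {
    awk -v cls="${1:-JAXBContextImpl}" '
        # Data rows look like: "<rank>: <instances> <bytes> <class name>".
        # Header and separator rows do not start with "<number>:" and are skipped.
        $1 ~ /^[0-9]+:$/ && $4 ~ (cls "$") { sum += $2 }
        END { print sum + 0 }
    '
}

# Example using the sample row from this comment; prints 4.
printf '%s\n' '1057: 4 352 com.sun.xml.bind.v2.runtime.JAXBContextImpl' | jaxb_context_count

# Against a live engine (the PID 22143 is the one from the example above):
#   before=$(jmap -histo 22143 | jaxb_context_count)
#   ... request new types of objects via the RESTAPI ...
#   after=$(jmap -histo 22143 | jaxb_context_count)
#   [ "$after" -gt "$before" ] && echo "instance count grew: $before -> $after"
```

Taking one sample before and one after requesting new types of objects gives two numbers that can be compared directly, instead of reading the histogram by eye.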

Comment 7 Sven Kieske 2015-09-17 13:09:17 UTC
Sorry, but in which oVirt version was this bug introduced? Was it always there?

I can't use "jmap" because I can't install it easily in my production environment.


Thanks

Comment 8 Juan Hernández 2015-09-17 13:33:05 UTC
I think that the bug was always there, but I'm not 100% sure because it depends on the version of JBoss and I didn't check all the versions of JBoss, only WildFly 8.2 and JBoss EAP 6.3.

Note that although it is described as a "leak", it actually isn't, because there is a limit to the number of instances of "JAXBContextImpl" that are created, approx 50, because there are approx 50 types of objects in the RESTAPI. If you send 10 million requests to get VMs, for example, it won't create 10 million instances of "JAXBContextImpl", only 50, at most.

The "jmap" tool is part of the "java-1.7.0-openjdk-devel" package. Installing it won't hurt your production environment. It is strange that you don't have it already, though, because it is installed when the "ovirt-engine" package is installed, at least for oVirt 3.5 and later. Alternatively you may take the "jmap" binary from another machine (with the same version of java-1.7.0-openjdk), copy it to your production environment, execute it, and then remove it.

Comment 9 Eldad Marciano 2015-12-29 16:48:07 UTC
Verified on top of 
rhevm - 3.6.1.1-0.1.el6
JBoss Enterprise Application Platform - Version 6.4.5.GA
The number of instances stayed constant during the load test and after it.