Bug 1570104 - [Infra] OpenDaylight is consuming too much heap memory
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 13.0 (Queens)
Assigned To: Tim Rozet
QA Contact: Itzik Brown
Whiteboard: odl_infra
Keywords: Triaged
Reported: 2018-04-20 12:06 EDT by Sai Sindhur Malleni
Modified: 2018-10-18 03:19 EDT
Fixed In Version: puppet-opendaylight-8.1.0-0.20180321182557
Last Closed: 2018-06-27 09:52:02 EDT
Type: Bug

Attachments
/bin/inc (9.76 KB, text/plain)
2018-04-23 15:20 EDT, Sai Sindhur Malleni
bin/setenv (2.17 KB, text/plain)
2018-04-23 15:20 EDT, Sai Sindhur Malleni


External Trackers
Tracker                  ID               Priority  Status  Summary  Last Updated
OpenDaylight Bug         INTPAK-163       None      None    None     2018-04-24 11:32 EDT
OpenDaylight gerrit      71267            None      None    None     2018-04-24 13:51 EDT
Red Hat Product Errata   RHEA-2018:2086   None      None    None     2018-06-27 09:53 EDT

Description Sai Sindhur Malleni 2018-04-20 12:06:15 EDT
Description of problem:
I noticed something strange and want to make sure I'm not missing something very obvious. Looking at the karaf process in the ODL container, I see:

[root@overcloud-odl-0 heat-admin]# ps aux | grep karaf | grep server
42462      19245 27.5  6.4 60158752 8518440 ?    Sl   13:50  34:55 /usr/bin/java -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/endorsed:/opt/opendaylight/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/ext:/opt/opendaylight/lib/ext -Dkaraf.instances=/opt/opendaylight/instances -Dkaraf.home=/opt/opendaylight -Dkaraf.base=/opt/opendaylight -Dkaraf.data=/opt/opendaylight/data -Dkaraf.etc=/opt/opendaylight/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/opt/opendaylight/data/tmp -Djava.util.logging.config.file=/opt/opendaylight/etc/java.util.logging.properties -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /opt/opendaylight/lib/boot/org.apache.karaf.diagnostic.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.jaas.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.main-4.1.3.jar:/opt/opendaylight/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main


I do not see any startup options for the starting heap size or maximum heap size.
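A quick way to confirm this without reading the whole command line by eye is to filter it for heap flags. The following is only a sketch: `heap_flags` is a hypothetical helper name, and the two command lines are shortened stand-ins for the real `ps` output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Karaf or ODL): print any -Xms/-Xmx flags
# found in a Java command line passed as a single string.
heap_flags() {
    # Split on spaces and keep only tokens that start with -Xms or -Xmx.
    # "|| true" keeps the exit status at 0 when nothing matches.
    printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^-Xm[sx]' || true
}

# Shortened command lines in the style of the ps output (assumed examples).
without_limits="/usr/bin/java -Djava.net.preferIPv4Stack=true -Dkaraf.home=/opt/opendaylight org.apache.karaf.main.Main"
with_limits="/usr/bin/java -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions org.apache.karaf.main.Main"

echo "heap flags, first command line:"
heap_flags "$without_limits"
echo "heap flags, second command line:"
heap_flags "$with_limits"
```

On a live system the input could come from something like `ps -o args= -p <pid>`; an empty result means no explicit heap bounds were passed to the JVM.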

Meanwhile, while running perf tests we are seeing ODL hog a lot of memory for heap. Earlier, the default max heap was set to 2G, and if ODL went over it we would just see an OOM and ODL would be killed. However, it now seems to be using as much as it likes.

You can see heap size here[1] (nice sawtooth :-) )


[1]- https://snapshot.raintank.io/dashboard/snapshot/iJPjHdURQ3IIDTL8fqYCZ7kST5T2Ca3r

Version-Release number of selected component (if applicable):
OSP 13
opendaylight-8.0.0-5.el7ost.noarch  
puppet-opendaylight-8.1.0-0.20180321182556.45c4db7.el7ost.noarch 

How reproducible:
100%

Steps to Reproduce:
1. Install OSP13 + ODL
2. Monitor default java startup opts and heap size

Actual results:
ODL seems to be consuming as much heap as it likes

Expected results:
Heap should be set to a max of 2G and trying to go over should result in the JVM getting killed
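To put the numbers in perspective, the RSS column of the `ps` output in the description can be compared with the expected 2G cap. This is a rough sketch only: RSS also covers non-heap memory (metaspace, thread stacks, native buffers), so it is an indicator rather than a heap measurement, and `rss_kib_to_mib` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: ps reports RSS in KiB; convert to whole MiB.
rss_kib_to_mib() {
    echo $(( $1 / 1024 ))
}

rss_kib=8518440      # RSS column of the karaf process in the description
heap_cap_mib=2048    # the 2G maximum heap the reporter expected

rss_mib=$(rss_kib_to_mib "$rss_kib")
echo "RSS: ${rss_mib} MiB, expected heap cap: ${heap_cap_mib} MiB"

if [ "$rss_mib" -gt "$heap_cap_mib" ]; then
    echo "resident memory is well above the expected heap cap"
fi
```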

Additional info:
Comment 1 Michael Vorburger 2018-04-23 10:52:08 EDT
This new "memory problem" here is very different from the ones we chased earlier, and hopefully much easier to solve: what seems to be happening here is that when we moved ODL into a container, we somehow lost the JVM options we had earlier!

Or perhaps it's not even related to going into a container, but to some other upstream (Karaf version bump?) or downstream (RPM? TripleO?) change. Whatever the culprit, we can see from the /usr/bin/java ... line above that there are no JVM memory settings like -Xmx anymore.

It's actually about more than just -Xmx; if we compare the line above with what I see on an upstream netvirt/karaf (master Fluorine, but it should be the same on Oxygen), we also lost Xms, UnlockDiagnosticVMOptions and HeapDumpOnOutOfMemoryError, which is curious:

/usr/bin/java -Djava.security.properties=/home/vorburger/dev/ODL/git/netvirt/karaf/target/assembly/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.security.egd=file:/dev/./urandom 

Someone needs to figure out why all of these options, which are there for a good reason upstream ;), seem to have been lost in the RPM that runs inside the container downstream.

Now from what I understand about containers, they can but don't have to have memory limits (I'm more familiar with how this works for application containers in OpenShift than with how an ODL container in OSP is configured in detail).

It would seem that this ODL container has neither the required JVM options for memory management, nor does its enclosing container have any limit. Therefore I suspect it just keeps growing and grabbing GB after GB from the underlying host node on which the container runs.
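Whether the container actually has a memory limit can be checked from inside it. The sketch below assumes cgroup v1 (the `memory.limit_in_bytes` file); the helper name `describe_cgroup_limit` and the 2^62 threshold used to decide "unlimited" are assumptions made for illustration. On cgroup v1 an unset limit typically shows up as a very large number rather than as an empty value.

```shell
#!/bin/sh
# Hypothetical helper: classify a cgroup v1 memory limit given in bytes.
describe_cgroup_limit() {
    limit="$1"
    # Assumption for this sketch: anything at or above 2^62 bytes is
    # treated as "no limit configured".
    if [ "$limit" -ge 4611686018427387904 ]; then
        echo "unlimited"
    else
        echo "limited to $(( limit / 1024 / 1024 )) MiB"
    fi
}

# cgroup v1 location of the limit (assumed to apply to this container).
limit_file=/sys/fs/cgroup/memory/memory.limit_in_bytes
if [ -r "$limit_file" ]; then
    describe_cgroup_limit "$(cat "$limit_file")"
else
    echo "no cgroup v1 memory limit file at $limit_file"
fi
```

If the result is "unlimited", a JVM option that derives the heap size from the cgroup limit has nothing to work with, so explicit heap bounds are still needed.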

We may also want to add container-specific JVM options; in addition to the usual (non-container-related) Xmx and Xms, I would recommend we also consider adding this magic:

    -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap

We should inherit as much as possible from upstream (so the fix for this issue should NOT contain Xmx & Xms, because those are already available upstream), and pass additional options like the above as parameters or set ENV VARs (JAVA_OPTS?) instead of patching existing upstream Karaf launch scripts such as /opt/opendaylight/bin/karaf directly.
Comment 2 Sai Sindhur Malleni 2018-04-23 11:12:05 EDT
I think this bug's priority should be higher than "medium". In cases where we have ODL collocated with OpenStack controllers, this could cause severe performance and even functional issues, probably taking down other services in OpenStack.
Comment 3 Sai Sindhur Malleni 2018-04-23 11:13:09 EDT
Also, +1 to Michael's point. Pretty sure this is a startup option thing. In OSP12, I remember we had the Java-specific startup args for heap even in the container. So something changed with how TripleO sets up ODL.
Comment 4 Tim Rozet 2018-04-23 12:23:23 EDT
We do not set a cgroup mem limit on the container, so inheriting that for the JVM process won't do anything (for now, but it is good to remember for the future).  The JVM process itself should have a heap limit, but it is possible that it is being overwritten.  We used to pass JAVA_OPTS into the systemd env; however, with the move to containers that no longer works (because we do not use systemd in the container).  We had to make a change to be able to pass JAVA_OPTS by modifying the start script used by karaf:

https://git.opendaylight.org/gerrit/#/c/68783/

My hunch here is that we are accidentally overwriting the final opts in the start script, and with them the other arguments.  Can you please provide your karaf script from the container so we can see how it has been modified by puppet?
Comment 5 Sai Sindhur Malleni 2018-04-23 12:31:47 EDT
Attaching /opt/opendaylight/bin/karaf
https://gist.github.com/smalleni/b23c22d6f91229f3cd609f70fe29c58c
Comment 6 Michael Vorburger 2018-04-23 13:21:27 EDT
Could I please also ask for the "bin/inc" and "bin/setenv" scripts from your end, not just "bin/karaf"?  In the latest upstream Oxygen maintenance branch, I can see that the "standard" Xmx I've referred to above comes from here, and then we can understand how we lose it downstream:

bin/setenv:    export JAVA_MAX_MEM="2048m"

bin/inc:    DEFAULT_JAVA_OPTS="-Xms${JAVA_MIN_MEM} -Xmx${JAVA_MAX_MEM} -XX:+UnlockDiagnosticVMOptions "
Comment 7 Sai Sindhur Malleni 2018-04-23 15:20 EDT
Created attachment 1425722
/bin/inc
Comment 8 Sai Sindhur Malleni 2018-04-23 15:20 EDT
Created attachment 1425723
bin/setenv
Comment 9 Michael Vorburger 2018-04-24 07:06:14 EDT
> comes from here, and then we can understand how we lose it downstream

The attached bin/setenv and bin/inc are correct, so we probably start differently.
Comment 10 Tim Rozet 2018-04-24 09:09:31 EDT
We start with:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/opendaylight-api.yaml#L111


One easy way to fix this is to just specify the heap size as part of JAVA_OPTS in puppet-opendaylight, and any other arguments we are also missing.
Comment 11 Tim Rozet 2018-04-24 10:54:58 EDT
The reason nothing is set is that there is a check in the setupDebugOptions function in bin/inc:

setupDebugOptions() {
    if [ "x${JAVA_OPTS}" = "x" ]; then
        JAVA_OPTS="${DEFAULT_JAVA_OPTS}"
    fi

Since we already set JAVA_OPTS, it never applies the default opts, which include the memory settings.
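The effect of that check can be reproduced in isolation. This is a minimal sketch, not the real bin/inc: `select_java_opts` is a hypothetical wrapper around the quoted `if`, and `JAVA_MIN_MEM="128M"` is inferred from the `-Xms128M` seen on the upstream command line rather than taken from the attached scripts.

```shell
#!/bin/sh
# Defaults in the style of bin/setenv and bin/inc (JAVA_MIN_MEM is assumed).
JAVA_MIN_MEM="128M"
JAVA_MAX_MEM="2048m"
DEFAULT_JAVA_OPTS="-Xms${JAVA_MIN_MEM} -Xmx${JAVA_MAX_MEM} -XX:+UnlockDiagnosticVMOptions"

# Hypothetical wrapper around the check quoted from setupDebugOptions:
# the defaults are used only when JAVA_OPTS is empty.
select_java_opts() {
    JAVA_OPTS="$1"
    if [ "x${JAVA_OPTS}" = "x" ]; then
        JAVA_OPTS="${DEFAULT_JAVA_OPTS}"
    fi
    printf '%s\n' "${JAVA_OPTS}"
}

# Nothing preset: the defaults, including the heap bounds, are used.
select_java_opts ""

# JAVA_OPTS preset by the deployment: the defaults are silently skipped.
select_java_opts "-Djava.net.preferIPv4Stack=true"
```

The second call prints only the preset option, which matches the command line in the description: no `-Xms`, no `-Xmx`.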
Comment 12 Tim Rozet 2018-04-24 11:02:33 EDT
As Stephen mentioned, we should use EXTRA_JAVA_OPTS here so that we are able to include the DEFAULT_JAVA_OPTS.
Comment 13 Michael Vorburger 2018-04-24 11:05:06 EDT
Skitt on IRC pointed out we could just use EXTRA_JAVA_OPTS instead of JAVA_OPTS.  That way, the standard upstream JVM args (Xmx etc.) are preserved, but we can augment them with additional options, such as (apparently) the -Djava.net.preferIPv4Stack=true thing.
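A rough sketch of why EXTRA_JAVA_OPTS helps, assuming the launch script appends EXTRA_JAVA_OPTS to whichever JAVA_OPTS value it ended up with. `build_java_opts` is a hypothetical function written for illustration, and the default string is abbreviated from the upstream command line.

```shell
#!/bin/sh
# Abbreviated upstream defaults (assumed for this sketch).
DEFAULT_JAVA_OPTS="-Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions"

# Hypothetical model of the launch logic:
#   $1 = JAVA_OPTS as supplied by the deployment (may be empty)
#   $2 = EXTRA_JAVA_OPTS as supplied by the deployment (may be empty)
build_java_opts() {
    java_opts="$1"
    extra_java_opts="$2"
    # Fall back to the defaults only when JAVA_OPTS is empty.
    if [ "x${java_opts}" = "x" ]; then
        java_opts="${DEFAULT_JAVA_OPTS}"
    fi
    # Extra options are appended and never replace the defaults.
    if [ "x${extra_java_opts}" != "x" ]; then
        java_opts="${java_opts} ${extra_java_opts}"
    fi
    printf '%s\n' "${java_opts}"
}

# Passing the IPv4 preference as an extra option keeps the heap bounds.
build_java_opts "" "-Djava.net.preferIPv4Stack=true"
```

With this arrangement the deployment only owns the options it adds, and any later change to the upstream defaults is picked up automatically.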
Comment 20 Itzik Brown 2018-05-22 03:22:36 EDT
Checked with:
puppet-opendaylight-8.1.2-1.38977efgit.el7ost.noarch

Seems like it's right:

[root@controller-1 heat-admin]# ps -ef |grep java
42462      37818   37069 27 May21 ?        06:35:06 /usr/bin/java -Djava.security.properties=/opt/opendaylight/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/endorsed:/opt/opendaylight/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/ext:/opt/opendaylight/lib/ext -Dkaraf.instances=/opt/opendaylight/instances -Dkaraf.home=/opt/opendaylight -Dkaraf.base=/opt/opendaylight -Dkaraf.data=/opt/opendaylight/data -Dkaraf.etc=/opt/opendaylight/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/opt/opendaylight/data/tmp -Djava.util.logging.config.file=/opt/opendaylight/etc/java.util.logging.properties -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /opt/opendaylight/lib/boot/org.apache.karaf.diagnostic.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.jaas.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.main-4.1.3.jar:/opt/opendaylight/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main
Comment 22 errata-xmlrpc 2018-06-27 09:52:02 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
