Description of problem:

I noticed something strange and want to make sure I'm not missing something very obvious. Looking at the karaf process in the ODL container I see:

[root@overcloud-odl-0 heat-admin]# ps aux | grep karaf | grep server
42462 19245 27.5 6.4 60158752 8518440 ? Sl 13:50 34:55 /usr/bin/java -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/endorsed:/opt/opendaylight/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/ext:/opt/opendaylight/lib/ext -Dkaraf.instances=/opt/opendaylight/instances -Dkaraf.home=/opt/opendaylight -Dkaraf.base=/opt/opendaylight -Dkaraf.data=/opt/opendaylight/data -Dkaraf.etc=/opt/opendaylight/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/opt/opendaylight/data/tmp -Djava.util.logging.config.file=/opt/opendaylight/etc/java.util.logging.properties -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /opt/opendaylight/lib/boot/org.apache.karaf.diagnostic.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.jaas.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.main-4.1.3.jar:/opt/opendaylight/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main

I do not see any startup options for the initial or maximum heap size. Meanwhile, running perf tests we are seeing ODL hog a lot of memory for heap. Earlier the default heap max was set to 2G, and if ODL went over it we would just see an OOM and ODL would be killed. However, it now seems to be using as much as it likes. You can see the heap size here [1] (nice sawtooth :-) )

[1] https://snapshot.raintank.io/dashboard/snapshot/iJPjHdURQ3IIDTL8fqYCZ7kST5T2Ca3r

Version-Release number of selected component (if applicable):
OSP 13
opendaylight-8.0.0-5.el7ost.noarch
puppet-opendaylight-8.1.0-0.20180321182556.45c4db7.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install OSP13 + ODL
2. Monitor default java startup opts and heap size

Actual results:
ODL seems to be consuming as much heap as it likes

Expected results:
Heap should be set to a max of 2G, and trying to go over should result in the JVM getting killed

Additional info:
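For reference, a quick way to double-check what heap settings the running JVM actually has (a rough sketch; it assumes shell access inside the ODL container, and jcmd from the JDK being available there - otherwise the ps check alone already shows whether any -Xms/-Xmx were passed):

# Grab the PID of the karaf JVM (the same process shown in the ps output above)
PID=$(pgrep -f 'org.apache.karaf.main.Main')

# 1) What was passed on the command line? (no -Xms/-Xmx printed = the problem described above)
ps -o args= -p "$PID" | tr ' ' '\n' | grep -E '^-Xm[sx]' || echo "no explicit heap options"

# 2) What the JVM actually resolved them to (requires JDK tools such as jcmd)
jcmd "$PID" VM.flags | tr ' ' '\n' | grep -E 'InitialHeapSize|MaxHeapSize'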
This new "memory problem" is very different from the ones we chased earlier, and hopefully much easier to solve: what seems to be happening is that when we moved ODL into a container, we somehow lost the JVM options we had earlier! Or perhaps it's not even related to going into a container, but some other upstream (Karaf version bump?) or downstream (RPM? TripleO?) change. Whatever the culprit, we can see from the /usr/bin/java ... line above that there are no JVM memory settings like -Xmx anymore.

It's actually about more than just -Xmx; if we compare the line above with what I see on a netvirt/karaf upstream (master Fluorine, but it should be the same on Oxygen), we also lost -Xms, -XX:+UnlockDiagnosticVMOptions and -XX:+HeapDumpOnOutOfMemoryError, which is curious:

/usr/bin/java -Djava.security.properties=/home/vorburger/dev/ODL/git/netvirt/karaf/target/assembly/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.security.egd=file:/dev/./urandom

Someone needs to figure out why all of these options, which are there for a good reason upstream ;), seem to have been lost in the RPM that runs inside the container downstream.

Now, from what I understand about containers, they can but don't have to have memory limits (I'm more familiar with how this works for application containers in OpenShift than with how an ODL container in OSP is configured in detail). It would seem that this ODL container has neither the required JVM memory options for memory management, nor does its enclosing container have any limit. Therefore I suspect it just keeps growing and grabbing GB after GB from the underlying host node on which the container runs.

We may also want to add additional container-specific JVM options; in addition to the usual (non-container-related) -Xmx and -Xms, I would recommend we also consider adding this magic: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap

We should inherit as much as possible from upstream (so the fix for this issue should NOT contain -Xmx & -Xms, because those are already set upstream), and pass additional options like the above as parameters or set env vars (JAVA_OPTS?) instead of patching existing upstream Karaf launch scripts such as /opt/opendaylight/bin/karaf directly.
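To make that suggestion concrete, a rough sketch only (the variable name and exact wiring are assumptions about how the fix would be delivered, not a confirmed mechanism): keep the upstream defaults untouched and layer the extra, container-aware flags on top via an environment variable the Karaf scripts append (EXTRA_JAVA_OPTS, which comes up later in this bug):

# Hypothetical: extra options only, no -Xms/-Xmx, so the upstream defaults survive
export EXTRA_JAVA_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"

# The resulting java command line should then contain roughly:
#   -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError \
#   -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap ...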
I think this bug's priority should be higher than "medium". In cases where we have ODL collocated with OpenStack controllers, this could cause severe performance and even functional issues, probably taking down other services in OpenStack.
Also, +1 to Michael's point. Pretty sure this is a startup options thing. In OSP12, I remember we had the Java-specific startup args for heap even in the container. So something changed with how TripleO sets up ODL.
We do not set a cgroup mem limit on the container, so inheriting that for the JVM process won't do anything (for now, but it is good to remember for the future). The JVM process itself should have a heap limit, but it is possible that it is being overwritten.

We used to pass JAVA_OPTS into the systemd env; however, with the move to containers that no longer works (because we do not use systemd in the container). We had to make a change to be able to pass JAVA_OPTS by modifying the start script used by karaf: https://git.opendaylight.org/gerrit/#/c/68783/

My hunch is that we are accidentally overwriting the final opts in the start script and losing the other arguments. Can you please provide your karaf script from the container so we can see how it has been modified by puppet?
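For illustration of the systemd-vs-container difference (a minimal sketch; the container name, image placeholder and exact TripleO wiring are assumptions): with systemd we could drop JAVA_OPTS into the unit's environment, whereas in the containerized case the variable has to reach the karaf start script through the container's environment, roughly like:

# Hypothetical docker invocation, not the actual TripleO-generated one:
docker run -d --name opendaylight_api \
  -e JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom" \
  <odl-image> /opt/opendaylight/bin/karaf server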
Attaching /opt/opendaylight/bin/karaf https://gist.github.com/smalleni/b23c22d6f91229f3cd609f70fe29c58c
Could I please also ask for the "bin/inc" and "bin/setenv" scripts from your end, not just "bin/karaf"? In the latest upstream Oxygen maintenance branch, I can see that the "standard" -Xmx I referred to above comes from here, and then we can understand how we lose it downstream:

bin/setenv:
export JAVA_MAX_MEM="2048m"

bin/inc:
DEFAULT_JAVA_OPTS="-Xms${JAVA_MIN_MEM} -Xmx${JAVA_MAX_MEM} -XX:+UnlockDiagnosticVMOptions "
Created attachment 1425722 [details] /bin/inc
Created attachment 1425723 [details] bin/setenv
> comes from here, and then we can understand how we lose it downstream

The attached bin/setenv and bin/inc are correct, so we probably start differently.
We start with: https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/opendaylight-api.yaml#L111

One easy way to fix this is to just specify the heap size as part of JAVA_OPTS in puppet-opendaylight, along with any other arguments we are missing.
The reason nothing is set is that there is a check in the inc function setupDebugOptions:

setupDebugOptions() {
    if [ "x${JAVA_OPTS}" = "x" ]; then
        JAVA_OPTS="${DEFAULT_JAVA_OPTS}"
    fi
    ...
}

Since we already set JAVA_OPTS, it never falls back to the default opts, which include the memory settings.
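A minimal sketch reproducing that check in isolation (paraphrasing the relevant lines from bin/inc rather than running the real script; JAVA_MIN_MEM=128M is an assumption matching the -Xms128M seen upstream), which shows why a pre-set JAVA_OPTS silently discards the heap defaults:

# Values as in bin/setenv / bin/inc above
JAVA_MIN_MEM="128M"; JAVA_MAX_MEM="2048m"
DEFAULT_JAVA_OPTS="-Xms${JAVA_MIN_MEM} -Xmx${JAVA_MAX_MEM} -XX:+UnlockDiagnosticVMOptions"

# Case 1: JAVA_OPTS already set downstream -> defaults (and heap limits) are skipped
JAVA_OPTS="-Djava.net.preferIPv4Stack=true"
if [ "x${JAVA_OPTS}" = "x" ]; then JAVA_OPTS="${DEFAULT_JAVA_OPTS}"; fi
echo "$JAVA_OPTS"   # -Djava.net.preferIPv4Stack=true   (no -Xmx!)

# Case 2: JAVA_OPTS left empty -> defaults apply
JAVA_OPTS=""
if [ "x${JAVA_OPTS}" = "x" ]; then JAVA_OPTS="${DEFAULT_JAVA_OPTS}"; fi
echo "$JAVA_OPTS"   # -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions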
As Stephen mentioned, we should use EXTRA_JAVA_OPTS here so that we are able to include the DEFAULT_JAVA_OPTS.
Skitt on IRC pointed out we could just use EXTRA_JAVA_OPTS instead of JAVA_OPTS. That way, the standard upstream JVM args (-Xmx etc.) are preserved, but we can augment them with additional options, such as (apparently) the -Djava.net.preferIPv4Stack=true thing.
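For context, the reason this works is that the Karaf scripts append EXTRA_JAVA_OPTS onto JAVA_OPTS after the default fallback runs, roughly equivalent to the following (paraphrased, not a verbatim quote; check bin/inc for the exact form):

# after the "if JAVA_OPTS is empty, use DEFAULT_JAVA_OPTS" check:
JAVA_OPTS="${JAVA_OPTS} ${EXTRA_JAVA_OPTS}"

So the upstream defaults and the downstream extras coexist, instead of the extras replacing the defaults.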
Checked with: puppet-opendaylight-8.1.2-1.38977efgit.el7ost.noarch

Seems like it's right:

[root@controller-1 heat-admin]# ps -ef |grep java
42462 37818 37069 27 May21 ? 06:35:06 /usr/bin/java -Djava.security.properties=/opt/opendaylight/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/endorsed:/opt/opendaylight/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/ext:/opt/opendaylight/lib/ext -Dkaraf.instances=/opt/opendaylight/instances -Dkaraf.home=/opt/opendaylight -Dkaraf.base=/opt/opendaylight -Dkaraf.data=/opt/opendaylight/data -Dkaraf.etc=/opt/opendaylight/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/opt/opendaylight/data/tmp -Djava.util.logging.config.file=/opt/opendaylight/etc/java.util.logging.properties -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /opt/opendaylight/lib/boot/org.apache.karaf.diagnostic.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.jaas.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.main-4.1.3.jar:/opt/opendaylight/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086