We are using the JBoss EAP 7 (1.4) image. We passed MAVEN_OPTS=-Xmx512m and MAVEN_OPTS=-Xmx1024m, and neither resulted in a successful build. I ran the entire build with the build log level set to 5 and debug enabled in Maven, and have attached the build log for your review.
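(For reference, heap flags alone don't bound the JVM's total footprint; a hedged sketch of a MAVEN_OPTS value that also caps the main non-heap pools on JDK 8 — the values here are illustrative, not something tried in this report:)

```shell
# Illustrative values only: cap heap plus metaspace, code cache and
# thread stacks so the JVM's total footprint tracks the container limit.
export MAVEN_OPTS="-Xmx512m -XX:MaxMetaspaceSize=256m -XX:ReservedCodeCacheSize=64m -Xss1m"
echo "$MAVEN_OPTS"
```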
It seems pretty clear the compilation is hitting this issue: http://stackoverflow.com/questions/24989653/jenkins-maven-build-137-error

That is a native memory exhaustion problem, not heap exhaustion (which would explain why the heap settings had no effect). I'm not aware of a way to enforce limits on the native memory consumed by the compilation process; it may simply be a characteristic of the project being compiled. I'm going to assign this over to the team that owns the EAP image in case they have more Maven expertise to bring to bear.

btw, you indicate you looked at the docker stats for the build container... was that the origin-sti-builder container, or the actual EAP container running the assemble/maven process? The former is what shows up in the pod; the latter is launched directly via the docker daemon and is where the actual memory will be used. The assemble/maven container is launched with the same cgroup memory constraints as the origin-sti-builder pod, though.
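For reference, the 137 status in the linked question is just 128 + 9 (SIGKILL), which is what the kernel OOM killer delivers when a cgroup memory limit is hit. A quick sketch demonstrating where that number comes from:

```shell
# 137 = 128 + SIGKILL(9): the same status a cgroup OOM kill produces.
# The subshell kills itself; the parent observes exit status 137.
sh -c 'kill -9 $$' || code=$?
echo "exit code: ${code}"
```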
We are running the container, and I am monitoring docker stats directly for the sti-builder pod, which never runs above about 160MB out of the 1GB I give it. But basically we can make the JBoss build finish if we bump the memory up to 3GB... which is a bit nutty. Plus there is no way for us to give builder pods special limits, which means that everyone's default limits end up being 3GB, which is not sustainable.
> we are running the container and then i am monitoring docker stats directly for the sti builder pod, which never runs above about 160mb out of the 1gb I give it.

Sounds like you're just looking at the origin-sti-builder container, which is sort of an orchestrator for the build; it's not where Maven is running, so that's why the memory usage would be low. Maven is actually running in a secondary container.

> plus there is no way for us to give builder pods special Limits which then means that everyone's default limits end up being 3GB, which is not sustainable.

You can set a different quota for terminating pods, which would include build pods (and deployment pods and job pods): https://docs.openshift.org/latest/admin_guide/quota.html#quota-scopes

So you could theoretically give users a 6gig terminating-pod quota but a 1gig non-terminating-pod quota. That would allow users to run builds with up to 6gigs, while applications would only have up to 1gig, for example.
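As a sketch, the scoped-quota pair described above might look like this (names and sizes are illustrative, taken from the 6gig/1gig example):

```yaml
# Illustrative sketch: separate quotas for terminating pods
# (build/deploy/job) and non-terminating (long-running) pods.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-quota      # hypothetical name
spec:
  hard:
    limits.memory: 6Gi
  scopes:
  - Terminating
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: not-terminating-quota  # hypothetical name
spec:
  hard:
    limits.memory: 1Gi
  scopes:
  - NotTerminating
```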
The quota for terminating pods is already set, and it is not helping:

apiVersion: v1
kind: ResourceQuota
metadata:
  creationTimestamp: null
  name: compute-resources-timebound
spec:
  hard:
    limits.cpu: "70"
    limits.memory: 35Gi
  scopes:
  - Terminating
status: {}

The only thing that gets the build to finish is if we modify the memory limits below to 3GB, which is not sustainable:

apiVersion: v1
kind: LimitRange
metadata:
  creationTimestamp: null
  name: resource-limits
spec:
  limits:
  - max:
      cpu: "1"
      memory: 1Gi
    min:
      cpu: 29m
      memory: 150Mi
    type: Pod
  - default:
      cpu: "1"
      memory: 1Gi
    defaultRequest:
      cpu: "1"
      memory: 1Gi
    max:
      cpu: "1"
      memory: 1Gi
    min:
      cpu: 29m
      memory: 150Mi
    type: Container
Quota is what you're allowed to actually use; limitrange controls what you're allowed to ask for. So yes, just setting the quota higher won't make the build succeed. My point was that you could entitle people to ask for more resources for builds, while not giving them more resources for applications.

It's entirely possible to have a limitrange that's higher than your quota, which would mean (for example) you're allowed to ask for 10gigs but you're never allowed to actually use 10gigs (thus your pod will never be scheduled). In your case, this would mean something like "give users a terminating quota of 3gigs and a non-terminating quota of 1gig, and let them set their requests/limits to whatever they want" (they'll never be able to use more than 3gigs for a build or 1gig for an application, due to their respective quotas). You can also help them along by using the builddefaulter to set default request/limit values. More typically you'd have a quota that's higher than your limitrange, meaning you can't use your entire quota via a single pod, but you might via multiple pods.

Anyway, back to your primary problem at hand: we still need to dig into how much memory your maven operation is actually taking up. I'm pretty sure the memory stats you gathered are from the supervisor container, and not the container running maven.
What is the easiest way to get the container hash of the Maven one, other than trying to eyeball creation times, I guess?
OK, so I figured out how to track the maven build, and you are correct: the maven build gets our default of 1GB and promptly dies when it consumes all of its resources, while the openshift-sti pod is chugging along nice and quietly.

I tried to accomplish what you are explaining with the limitrange and terminating quota but failed: the builder and the resulting maven pods ended up with the higher limitrange, but the deployer failed due to not being able to request the needed resources. This could very well be due to me doing something wrong while setting up the ResourceQuota and LimitRanges, so if you could provide an example that would do what you are describing above, I would be more than willing to test it out.
> the builder and the resulting maven pods ended up with the higher limitrange but the deployer failed due to not being able to request the needed resources.

We're edging out of my area of expertise, but is the issue that your quota was exhausted by the build that was running, so there was no quota left for the deployment to request? (Can you share the exact error the deployment reported?)
I don't believe so. In the cases where the build succeeds (due to the higher limitrange), the deployment will then fail unless those resources are also upped to what was requested by the higher limitrange. The only message we really get is:

dev-test-10: About to stop retrying
dev-test-10: couldn't create deployer pod for testproject/dev-test-10: pods "dev-test-10-deploy" is forbidden: exceeded quota: compute-resources-timebound, requested: limits.memory=3Gi, used: limits.memory=0, limited: limits.memory=1536Mi
It sounds like your deployments are also picking up the default memory limit that you've specified. Can you lay out for me the full set of quota+limit configuration you've done, including project quota/limits, as well as cluster defaults and build defaults?

Alternatively, let me stop and ask which problem we really want to solve here. Is it:

1) how to configure the system so builds can run with 3gigs of memory, but deployments don't require that much and normal applications still only get 1gig (or whatever)? or

2) the original question of "why are my maven builds requiring 3 gigs of ram"? (In which case we should probably just do some profiling against maven running locally, outside the container entirely, to determine if we can reproduce the utilization there first.)
Created attachment 1279063 [details] project default
We are trying to answer both questions, really. What I attached is our defaults for anyone that comes in. The builder pods and their children get treated as good ol' regular pods, which then pick up our default resources.

So we need to answer #1 in order to solve the problem we are seeing in #2. However, in #2, OCP itself is not giving a useful error message when the side container dies; it just tells the end user that the build terminated and failed, with no other indicator as to what happened or how to go about debugging it. If you had not mentioned that s2i spawns off a second container, we would still be scratching our heads as to why the build only consumed 160MB of memory. So #2 hinges less on debugging why the maven build requires that much, and more on OCP reporting an error that gives the end user and operator a message that indicates the problem. As I said, if we can solve #1, then we can get past the immediate problem of having customers that cannot complete builds.
Yeah, the error reporting is going to be tough to solve. All openshift/s2i really knows is "we launched this container to do some work, and it exited non-zero". Beyond that, we're reliant on the logs from the container providing sufficient information to the user about what happened. It's possible the assemble script for the s2i builder image you're using could be augmented to do a better job of recognizing that the maven process got mem-killed, and at least log it (that would be something for the EAP docker image team to do, and this bug is currently assigned to them anyway).

Back to your quota/limits: what I see is that each user has up to 1.5gigs for terminating pods, and terminating pods are allowed to ask for a memory limit between 150m and 512m, and will get a default of 512m if nothing is specified. (I would expect this to apply to deployment pods, and you should be able to verify that by inspecting a deployment pod.)

In addition to this config, to enable users to run builds that need 3 gigs, I would expect you to have to do the following:

1) Increase the pod+container memory limit max to 3gigs.

2) Use the BuildDefaulter to provide a default resource limit definition for builds, with a memory limit of 3gigs (or just update the BuildConfig in your template definition, if that's how users are defining their buildconfigs): https://docs.openshift.org/latest/install_config/build_defaults_overrides.html#manually-setting-global-build-defaults

3) Increase the terminating pod memory quota to at least 3 gigs. (This will allow one build to run, but the deployment won't be able to run until the build pod terminates. I'm not sure whether the deployment will wait for quota to be available, or fail immediately.) 3.5gigs should allow both the build+deployment with your current settings.
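For the per-BuildConfig variant of step 2, the override is just a resources stanza on the build spec. A sketch (the name is hypothetical; the 3Gi value mirrors the numbers discussed above; surrounding fields are elided):

```yaml
# Sketch: give the build (and thus the s2i assemble container) a 3Gi limit.
apiVersion: v1
kind: BuildConfig
metadata:
  name: eap-app        # hypothetical name
spec:
  resources:
    limits:
      memory: 3Gi
```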
This will result in:
- deployment pods will still use the default limit (512meg)
- build pods will use the builddefaulter limit (3gigs)
- quota will be sufficient to allow 3gig builds to be scheduled
- application pods will be able to set a limit of up to 3gigs of memory, but such a pod will never schedule, because the quota for non-terminating pods is still going to be 2.5gigs based on your current compute-resources quota value
FYI, we have an issue for the image that may be related, CLOUD-883, and have a fix which will be released shortly through a new version of the image. Is there an easy way to reproduce this issue so we can verify the update also fixes this one? Thanks.
Yes, simply set a project's default memory limit to, say, 100MB and try to do a maven build; it should die due to memory starvation.
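The reproducer above amounts to a project LimitRange along these lines (a sketch; the name is illustrative):

```yaml
# Sketch: default every container to a 100Mi memory limit so the
# s2i/maven container gets OOM-killed during the build.
apiVersion: v1
kind: LimitRange
metadata:
  name: tiny-limits    # hypothetical name
spec:
  limits:
  - type: Container
    default:
      memory: 100Mi
    defaultRequest:
      memory: 100Mi
```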
We released new images last week as part of a large update, part of which was the changes required to support resource limits on the S2I pods.
Verified.

openshift v3.6.133
kubernetes v1.6.1+5115d708d7

Reproduce steps:

1. Create a limitrange:

# oc describe limitrange
Name:       resource-limits
Namespace:  dyan7
Type        Resource  Min    Max    Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---    ---    ---------------  -------------  -----------------------
Pod         memory    150Mi  512Mi  -                -              -
Pod         cpu       29m    1      -                -              -
Container   cpu       29m    1      60m              1              -
Container   memory    150Mi  512Mi  307Mi            512Mi          -

2. Create a maven build:

# oc new-app eap70-basic-s2i

Actual results: the build completes with no error.

# oc get build
NAME        TYPE      FROM          STATUS     STARTED          DURATION
eap-app-1   Source    Git@d9281fa   Complete   11 minutes ago   3m8s
eap-app-2   Source    Git@d9281fa   Complete   7 minutes ago    2m58s

Images:
1.5: registry.access.redhat.com/jboss-eap-7/eap70-openshift@sha256:3e3c89f43ead790847c9eccaa13009f097b291e677b11f6bb8a1f108f8731b81
1.4: registry.access.redhat.com/jboss-eap-7/eap70-openshift@sha256:2fcea3fcc642cee9e31184e83dbcd4402d6a02710a4ad882669c6db164e4df73