Bug 1924648
| Summary: | graph-builder keeps restarting due to OOM with default config in updateservice container | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
| Component: | OpenShift Update Service | Assignee: | Over the Air Updates <aos-team-ota> |
| OpenShift Update Service sub component: | operand | QA Contact: | liujia <jiajliu> |
| Status: | CLOSED DUPLICATE | Docs Contact: | Kathryn Alexander <kalexand> |
| Severity: | medium | | |
| Priority: | medium | CC: | vrutkovs, wking, yanyang |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-03-03 20:51:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | pod.log, pe.log, gb.log (see comments) | | |
This doesn't show OOMKilled, mind attaching the pod logs w/ default resources?

That said, we should be using current production requests / limits as defaults, as there are several images which balloon RAM consumption (https://issues.redhat.com/browse/ART-1849)

also, if you are pointing Cincinnati at a local repository where you dumped the release image and the images referenced by the release image all in together (like we currently recommend in the official docs [1]), I'm pretty sure Cincinnati will explode unless you give it a crazy-high memory limit. But maybe we can use this bug to get io.openshift.release label filtering on inspected images before we attempt to search layers [2]?

[1]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=14177433&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14177433
[2]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=15533400&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15533400

(In reply to Vadim Rutkovsky from comment #1)
> This doesn't show OOMKilled, mind attaching the pod logs w/ default
> resources?
>
> That said we should be using current production requests / limits as
> defaults, as there are several images which balloon RAM consumption
> (https://issues.redhat.com/browse/ART-1849)

I pasted part of the pod status in the description, which shows the reason for the pod's restarts:

```
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Wed, 03 Feb 2021 11:14:49 +0000
  Finished:     Wed, 03 Feb 2021 11:15:24 +0000
Ready:          False
Restart Count:  7
Limits:
  cpu:     750m
  memory:  512Mi
Requests:
  cpu:     150m
  memory:  64Mi
```

Sure, I just attached the graph-builder, policy-engine, and pod logs. (A jsonpath sketch for reading this termination reason directly follows the attachment list below.)

Created attachment 1754971 [details]
pod.log
Created attachment 1754972 [details]
pe.log
Created attachment 1754973 [details]
gb.log
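As an aside, the OOMKilled reason quoted above can also be read straight from the pod status with a jsonpath query. A minimal sketch, reusing the pod and container names from this report:

```
# Print the last termination reason of the graph-builder container;
# prints "OOMKilled" when the kernel killed it for exceeding its memory limit.
oc -n openshift-update-service get pod updateservice-sample-767fdd75d4-z6c49 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="graph-builder")].lastState.terminated.reason}'
```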
(In reply to W. Trevor King from comment #2)
> also, if you are pointing Cincinnati at a local repository where you dumped
> the release image and the images referenced by the release image all in
> together (like we currently recommend in the official docs [1]), I'm pretty
> sure Cincinnati will explode unless you give it a crazy-high memory limit.
> But maybe we can use this bug to get io.openshift.release label filtering on
> inspected images before we attempt to search layers [2]?
>
> [1]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=14177433&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14177433
> [2]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=15533400&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15533400

Yes, it's a disconnected cluster with a local registry, into which the OCP release image was mirrored for installation via the "oc adm release mirror" command. Both "--to" and "--to-image" pointed to the same repo. I thought mirroring the image to a different repo (doc needed) was the workaround, right?

> I thought mirroring the image to a different repo (doc needed) was the workaround, right?
Right, we don't doc that today, but using a different repo for --to-image, so that the repo Cincinnati indexes isn't polluted with non-release images, should help reduce Cincinnati's memory consumption by orders of magnitude.
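For illustration, a minimal sketch of that workaround, assuming a hypothetical mirror registry host and release tag (neither is taken from this report):

```
# Component images go to one repo (--to), while the release image itself
# lands in a separate, dedicated repo (--to-image). The OSUS graph-builder
# can then index the release-images repo without scanning the layers of
# the much larger set of component images.
oc adm release mirror \
  --from=quay.io/openshift-release-dev/ocp-release:4.6.15-x86_64 \
  --to=mirror-registry.example.com:5000/ocp/release \
  --to-image=mirror-registry.example.com:5000/ocp/release-images:4.6.15-x86_64
```

The UpdateService instance would then be configured to index mirror-registry.example.com:5000/ocp/release-images rather than the shared repo.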
Marking as a dup of the older graph-builder OOM bug 1850781.

*** This bug has been marked as a duplicate of bug 1850781 ***
Description of problem (please be detailed as possible and provide log snippets):

Failed to create an updateservice instance because the graph-builder container cannot start (OOMKilled).

```
# ./oc get po
NAME                                    READY   STATUS             RESTARTS   AGE
updateservice-sample-767fdd75d4-z6c49   1/2     CrashLoopBackOff   7          20m
```

```
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       8h                   default-scheduler  Successfully assigned openshift-update-service/updateservice-sample-767fdd75d4-z6c49 to jliu-a46-m9xlz-w-b-1.c.openshift-qe.internal
  Normal   AddedInterface  8h                   multus             Add eth0 [10.128.2.30/23]
  Normal   Pulled          8h                   kubelet            Container image "jliu-a46.mirror-registry.qe.gcp.devcluster.openshift.com:5000/rh-osbs/cincinnati-graph-data-container:v4.6.0" already present on machine
  Normal   Created         8h                   kubelet            Created container graph-data
  Normal   Started         8h                   kubelet            Started container graph-data
  Normal   Pulled          8h                   kubelet            Container image "jliu-a46.mirror-registry.qe.gcp.devcluster.openshift.com:5000/rh-osbs/openshift-update-service:v4.6.0-4" already present on machine
  Normal   Created         8h                   kubelet            Created container policy-engine
  Normal   Started         8h                   kubelet            Started container policy-engine
  Normal   Pulled          8h (x3 over 8h)      kubelet            Container image "jliu-a46.mirror-registry.qe.gcp.devcluster.openshift.com:5000/rh-osbs/openshift-update-service:v4.6.0-4" already present on machine
  Normal   Created         8h (x3 over 8h)      kubelet            Created container graph-builder
  Normal   Started         8h (x3 over 8h)      kubelet            Started container graph-builder
  Warning  Unhealthy       8h (x23 over 8h)     kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  BackOff         7h58m (x74 over 8h)  kubelet            Back-off restarting failed container
```

```
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Wed, 03 Feb 2021 11:14:49 +0000
  Finished:     Wed, 03 Feb 2021 11:15:24 +0000
Ready:          False
Restart Count:  7
Limits:
  cpu:     750m
  memory:  512Mi
Requests:
  cpu:     150m
  memory:  64Mi
```

Version of all relevant components (if applicable):
- OSUS operator image: v4.6.0-4
- Bundle image: v1.0-10
- OSUS image: v4.6.0-4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
always

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install the OSUS operator.
2. Create an OSUS instance with the above operator (with the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1909928#c3).

Actual results:
Failed to deploy the updateservice instance.

Expected results:
The updateservice instance is created successfully.

Additional info:
Bumping the memory limit to 4Gi lets the deployment succeed (a sketch of how such a bump could be applied follows below):

```
# ./oc get po -o json | jq .items[].spec.containers[0].resources
{
  "limits": {
    "cpu": "750m",
    "memory": "4Gi"
  },
  "requests": {
    "cpu": "150m",
    "memory": "64Mi"
  }
}
# ./oc get po
NAME                                    READY   STATUS    RESTARTS   AGE
updateservice-sample-7b6869f985-2lf92   2/2     Running   0          30m
```
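As an illustration of that interim fix, a minimal sketch, assuming the operand Deployment can be patched directly (the deployment and container names are from this report; the OSUS operator may reconcile such a manual change back to its defaults, so this only demonstrates the effect, not a supported procedure):

```
# Raise the graph-builder container's memory limit on the operand
# Deployment from the default 512Mi to 4Gi.
oc -n openshift-update-service set resources deployment/updateservice-sample \
  -c graph-builder --limits=cpu=750m,memory=4Gi
```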