Bug 1924648

Summary: graph-builder keeps restarting due to OOM with default config in updateservice container
Product: OpenShift Container Platform
Component: OpenShift Update Service
Sub component: operand
Version: 4.6
Reporter: liujia <jiajliu>
Assignee: Over the Air Updates <aos-team-ota>
QA Contact: liujia <jiajliu>
Docs Contact: Kathryn Alexander <kalexand>
CC: vrutkovs, wking, yanyang
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-03-03 20:51:48 UTC
Attachments:
pod.log
pe.log
gb.log

Description liujia 2021-02-03 11:22:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Failed to create an updateservice instance because the graph-builder container cannot start (OOMKilled).

# ./oc get po
NAME                                    READY   STATUS             RESTARTS   AGE
updateservice-sample-767fdd75d4-z6c49   1/2     CrashLoopBackOff   7          20m

Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       8h                   default-scheduler  Successfully assigned openshift-update-service/updateservice-sample-767fdd75d4-z6c49 to jliu-a46-m9xlz-w-b-1.c.openshift-qe.internal
  Normal   AddedInterface  8h                   multus             Add eth0 [10.128.2.30/23]
  Normal   Pulled          8h                   kubelet            Container image "jliu-a46.mirror-registry.qe.gcp.devcluster.openshift.com:5000/rh-osbs/cincinnati-graph-data-container:v4.6.0" already present on machine
  Normal   Created         8h                   kubelet            Created container graph-data
  Normal   Started         8h                   kubelet            Started container graph-data
  Normal   Pulled          8h                   kubelet            Container image "jliu-a46.mirror-registry.qe.gcp.devcluster.openshift.com:5000/rh-osbs/openshift-update-service:v4.6.0-4" already present on machine
  Normal   Created         8h                   kubelet            Created container policy-engine
  Normal   Started         8h                   kubelet            Started container policy-engine
  Normal   Pulled          8h (x3 over 8h)      kubelet            Container image "jliu-a46.mirror-registry.qe.gcp.devcluster.openshift.com:5000/rh-osbs/openshift-update-service:v4.6.0-4" already present on machine
  Normal   Created         8h (x3 over 8h)      kubelet            Created container graph-builder
  Normal   Started         8h (x3 over 8h)      kubelet            Started container graph-builder
  Warning  Unhealthy       8h (x23 over 8h)     kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  BackOff         7h58m (x74 over 8h)  kubelet            Back-off restarting failed container

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 03 Feb 2021 11:14:49 +0000
      Finished:     Wed, 03 Feb 2021 11:15:24 +0000
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:     750m
      memory:  512Mi
    Requests:
      cpu:      150m
      memory:   64Mi
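
The OOMKilled termination reason above can also be pulled straight out of the pod status; a minimal sketch, assuming `jq` is available and reusing the pod name from this report:

```shell
# Pod name taken from this report; adjust for your own instance.
# Prints the last termination reason of the graph-builder container.
oc -n openshift-update-service get pod updateservice-sample-767fdd75d4-z6c49 -o json \
  | jq -r '.status.containerStatuses[] | select(.name == "graph-builder").lastState.terminated.reason'
```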

Version of all relevant components (if applicable):
OSUS operator image: v4.6.0-4
Bundle image: v1.0-10
OSUS image: v4.6.0-4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
always

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install osus operator
2. Create an osus instance with the above operator (with the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1909928#c3)
3.


Actual results:
Failed to deploy updateservice instance.

Expected results:
Updateservice instance created successfully.

Additional info:
After bumping the memory limit to 4Gi, the deployment succeeds:
# ./oc get po -o json|jq .items[].spec.containers[0].resources
{
  "limits": {
    "cpu": "750m",
    "memory": "4Gi"
  },
  "requests": {
    "cpu": "150m",
    "memory": "64Mi"
  }
}
# ./oc get po
NAME                                    READY   STATUS    RESTARTS   AGE
updateservice-sample-7b6869f985-2lf92   2/2     Running   0          30m

Comment 1 Vadim Rutkovsky 2021-02-03 11:31:53 UTC
This doesn't show OOMKilled, mind attaching the pod logs w/ default resources?

That said, we should be using current production requests / limits as defaults, as there are several images which balloon RAM consumption (https://issues.redhat.com/browse/ART-1849)

Comment 2 W. Trevor King 2021-02-04 03:31:14 UTC
also, if you are pointing Cincinnati at a local repository where you dumped the release image and the images referenced by the release image all together (as we currently recommend in the official docs [1]), I'm pretty sure Cincinnati will explode unless you give it a crazy-high memory limit.  But maybe we can use this bug to get io.openshift.release label filtering on inspected images before we attempt to search layers [2]?

[1]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=14177433&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14177433
[2]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=15533400&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15533400

Comment 3 liujia 2021-02-04 05:25:58 UTC
(In reply to Vadim Rutkovsky from comment #1)
> This doesn't show OOMKilled, mind attaching the pod logs w/ default
> resources?
> 
> That said we should be using current production requests / limits as
> defaults, as there are several images which baloon RAM consumption
> (https://issues.redhat.com/browse/ART-1849)

I pasted part of the pod status in the description, which shows the reason for the pod's restarts:
```
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 03 Feb 2021 11:14:49 +0000
      Finished:     Wed, 03 Feb 2021 11:15:24 +0000
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:     750m
      memory:  512Mi
    Requests:
      cpu:      150m
      memory:   64Mi
```
Sure, I just attached the graph-builder, policy-engine, and pod logs.

Comment 4 liujia 2021-02-04 05:26:44 UTC
Created attachment 1754971 [details]
pod.log

Comment 5 liujia 2021-02-04 05:27:04 UTC
Created attachment 1754972 [details]
pe.log

Comment 6 liujia 2021-02-04 05:27:25 UTC
Created attachment 1754973 [details]
gb.log

Comment 7 liujia 2021-02-04 05:45:14 UTC
(In reply to W. Trevor King from comment #2)
> also, if you are pointing Cincinnati at a local repository where you dumped
> the release image and the images referenced by the release image all in
> together (like we currently recommend in the official docs [1]), I'm pretty
> sure Cincinnati will explode unless you give it a crazy-high memory limit. 
> But maybe we can use this bug to get io.openshift.release label filtering on
> inspected images before we attempt to search layers [2]?
> 
> [1]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=14177433&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14177433
> [2]: https://issues.redhat.com/browse/OTA-214?focusedCommentId=15533400&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15533400

Yes, it's a disconnected cluster with a local registry, into which the OCP release image was mirrored for installation via the "oc adm release mirror" command. Both "--to" and "--to-image" pointed to the same repo. I assume mirroring the image to a different repo (doc needed) would be a workaround, right?

Comment 8 W. Trevor King 2021-02-04 05:57:57 UTC
> I thought mirroring the image to different repo(doc needed) was a workaround, right?

Right, we don't document that today, but using a different repo for --to-image, so that the repo Cincinnati indexes isn't polluted with non-release images, should help reduce Cincinnati's memory consumption by orders of magnitude.
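
A sketch of that workaround, with hypothetical registry and repository names (note: the flag referred to above as --to-image is spelled --to-release-image in the oc versions I've used; check `oc adm release mirror --help` for yours):

```shell
# Hypothetical mirror registry; the point is that --to and --to-release-image
# name DIFFERENT repos, so the repo Cincinnati scans holds only release images.
oc adm release mirror \
  --from=quay.io/openshift-release-dev/ocp-release:4.6.0-x86_64 \
  --to=mirror.example.com:5000/ocp4/openshift4 \
  --to-release-image=mirror.example.com:5000/ocp4/release-images:4.6.0-x86_64
```

The UpdateService graph-builder would then be pointed at the release-images repo only.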

Comment 9 W. Trevor King 2021-03-03 20:51:48 UTC
Marking as a dup of the older graph-builder OOM bug 1850781.

*** This bug has been marked as a duplicate of bug 1850781 ***