Bug 2220823 - noobaa-endpoint pod keeps getting oom kills [NEEDINFO]
Summary: noobaa-endpoint pod keeps getting oom kills
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.11
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Romy Ayalon
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-06 07:16 UTC by Anjali
Modified: 2023-08-09 16:49 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
nbecker: needinfo? (amenon)
rayalon: needinfo? (amenon)


Attachments

Description Anjali 2023-07-06 07:16:36 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- The noobaa-endpoint pod keeps getting restarted because it runs out of memory (OOMKilled):

noobaa-core-0                                                     1/1     Running   0               45d     10.195.72.55     demchdc6z4x   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running   0               45d     10.195.72.56     demchdc6z4x   <none>           <none>
noobaa-endpoint-5bc769c4d7-2fx5m                                  1/1     Running   2 (6h16m ago)   45d     10.195.71.24     demchdc6zcx   <none>           <none>
noobaa-endpoint-5bc769c4d7-9p7z6                                  1/1     Running   0               45d     10.195.72.54     demchdc6z4x   <none>           <none>
noobaa-endpoint-5bc769c4d7-fqlpq                                  1/1     Running   0               45d     10.195.70.20     demchdc6z3x   <none>           <none>
noobaa-operator-869857597c-6cd5z                                  1/1     Running   5 (33d ago)     45d     10.195.72.36     demchdc6z4x   <none>           <none>
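
For reference, output like the listing above can be regenerated and the restart cause inspected with standard oc commands (the openshift-storage namespace is an assumption based on a default ODF deployment):

# List the noobaa pods with node placement (matches the listing above).
oc get pods -n openshift-storage -o wide | grep noobaa

# Inspect events and the last termination state of the restarting endpoint pod.
oc describe pod noobaa-endpoint-5bc769c4d7-2fx5m -n openshift-storage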

- From the pod YAML we can see that the pod was OOMKilled:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-04T07:39:19Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-06-18T19:04:30Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-06-18T19:04:30Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-04T07:39:19Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://804993f55685f3ffff431800ddad4e9cb90da2ee8b8df17760ee4c3f4d952f07
    image: registry.redhat.io/odf4/mcg-core-rhel8@sha256:3d4a72d3849fc111f93183983875c2b46059710c44b2430147d83c1f1c6f9782
    imageID: registry.redhat.io/odf4/mcg-core-rhel8@sha256:3d4a72d3849fc111f93183983875c2b46059710c44b2430147d83c1f1c6f9782
    lastState:
      terminated:
        containerID: cri-o://f8ebf5f5cc3dfea8aef05071a071755c2d7ce443b72a96ae1691ecec3830922c
        exitCode: 137
        finishedAt: "2023-06-18T19:04:24Z"
        reason: OOMKilled <------------------------------
        startedAt: "2023-05-29T09:04:10Z"
    name: endpoint
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2023-06-18T19:04:25Z"
  hostIP: 139.25.144.15
  phase: Running
  podIP: 10.195.71.24
  podIPs:
  - ip: 10.195.71.24
  qosClass: Guaranteed
  startTime: "2023-05-04T07:39:19Z"
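
The termination reason shown above can also be pulled directly with jsonpath, e.g.:

# Print the last termination reason of the "endpoint" container ("OOMKilled" here).
oc get pod noobaa-endpoint-5bc769c4d7-2fx5m -n openshift-storage \
  -o jsonpath='{.status.containerStatuses[?(@.name=="endpoint")].lastState.terminated.reason}'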

- The endpoint memory limit is currently configured as 2Gi. We have asked the customer to increase it to 8Gi or 16Gi and check whether the pods still get OOMKilled (see the sketch after the snippet below).

    endpoints:
      resources:
        limits:
          cpu: 999m
          memory: 2Gi
        requests:
          cpu: 999m
          memory: 2Gi
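
A minimal sketch of the requested change, assuming the endpoint resources are honored on the NooBaa CR at spec.endpoints.resources (consistent with the snippet above) and the default CR name noobaa:

# Raise the endpoint memory request and limit to 8Gi in one merge patch.
oc patch noobaa noobaa -n openshift-storage --type merge \
  -p '{"spec":{"endpoints":{"resources":{"requests":{"memory":"8Gi"},"limits":{"memory":"8Gi"}}}}}'

# Watch actual endpoint memory usage afterwards.
oc adm top pods -n openshift-storage | grep noobaa-endpoint

Patching requests and limits together keeps the pods in the Guaranteed QoS class shown in the status above. If the ODF operator reconciles these values from the StorageCluster CR, the change may need to be made there instead.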

- When we check the log entries from the affected noobaa pod, we see duplicate entries in the database, which are not expected, and the SQL UPDATE fails. A log snippet has been pasted in a private comment.
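
As a hypothetical first check for the duplicate IDs (the database name nbcore and the table name objectmds are assumptions; substitute the collection named in the failing SQL UPDATE):

# Hypothetical duplicate-_id scan inside the NooBaa Postgres pod;
# adjust -d nbcore and the table name to match the failing update.
oc rsh -n openshift-storage noobaa-db-pg-0 \
  psql -d nbcore -c "SELECT _id, COUNT(*) FROM objectmds GROUP BY _id HAVING COUNT(*) > 1;"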

- We need help from engineering to look further into these duplicate IDs.

Version of all relevant components (if applicable):

- ODF Version v4.11.7  

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
- No

Is there any workaround available to the best of your knowledge?

- No

Additional info:

- All must-gathers and related info can be found in supportshell under ~/03539418

