Bug 1952121
| Summary: | Inventory container crashes because of OOM condition, causing controller pod restarts | | |
|---|---|---|---|
| Product: | Migration Toolkit for Virtualization | Reporter: | Tzahi Ashkenazi <tashkena> |
| Component: | General | Assignee: | Jeff Ortel <jortel> |
| Status: | CLOSED ERRATA | QA Contact: | Tzahi Ashkenazi <tashkena> |
| Severity: | high | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | urgent | | |
| Version: | 2.0.0 | CC: | apinnick, dagur, dvaanunu, fdupont, istein, jortel |
| Target Milestone: | --- | | |
| Target Release: | 2.0.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-10 17:11:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Tzahi Ashkenazi
2021-04-21 14:42:14 UTC
The fix should be in build 2.0.0-20 / iib:69034.

Reproduced on 2.0.0.20:

```
oc get pods/forklift-controller-86986fd75b-hgr7f -nopenshift-rhmtv -oyaml | less

    imageID: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
    lastState:
      terminated:
        containerID: cri-o://cd7096c6ae240e953d15a35f8a7f6ed7663a6d8c4273961b537b4589bd4a6a77
        exitCode: 137
        finishedAt: "2021-04-22T12:39:56Z"
        reason: OOMKilled
        startedAt: "2021-04-22T07:53:43Z"
    name: controller
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2021-04-22T12:39:58Z"
  - containerID: cri-o://3630fe55b0df45050030159c23a892fc360380582359ec8f5aba204bd25cb7af
    image: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
    imageID: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
    lastState:
      terminated:
        containerID: cri-o://65896916004c95b4d65820dac52359bda5beb6d16e310d27a2a3b947c1527c10
        exitCode: 137
        finishedAt: "2021-04-22T11:32:16Z"
        reason: OOMKilled
        startedAt: "2021-04-22T07:53:44Z"
    name: inventory
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2021-04-22T11:32:18Z"
  hostIP: 192.168.208.15
  phase: Running
  podIP: 10.131.0.162
  podIPs:
  - ip: 10.131.0.162
  qosClass: Burstable
  startTime: "2021-04-22T07:53:33Z"
```

I guess this BZ is also affected by https://bugzilla.redhat.com/show_bug.cgi?id=1952450.

```
oc describe node f02-h18-000-r640.rdu2.scalelab.redhat.com

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            1179m (1%)   300m (0%)
  memory                         3356Mi (0%)  2400Mi (0%)
  ephemeral-storage              0 (0%)       0 (0%)
  hugepages-1Gi                  0 (0%)       0 (0%)
  hugepages-2Mi                  0 (0%)       0 (0%)
  devices.kubevirt.io/kvm        0            0
  devices.kubevirt.io/tun        0            0
  devices.kubevirt.io/vhost-net  0            0
  openshift.io/sriov_nics        0            0

Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  SystemOOM  142m  kubelet  System OOM encountered, victim process: manager, pid: 3564364
  Warning  SystemOOM  75m   kubelet  System OOM encountered, victim process: manager, pid: 3564270
```
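As a side note, the OOMKilled verdict above is read from each container's last terminated state. A minimal client-go sketch of the same check is shown below; the kubeconfig handling and the hard-coded namespace are assumptions for illustration only, not part of MTV.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig; adjust for in-cluster use.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List pods in the MTV namespace and report containers whose last
	// termination reason was OOMKilled.
	pods, err := client.CoreV1().Pods("openshift-rhmtv").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			t := cs.LastTerminationState.Terminated
			if t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("%s/%s restarted %d times, last OOMKilled at %s (exit %d)\n",
					pod.Name, cs.Name, cs.RestartCount, t.FinishedAt, t.ExitCode)
			}
		}
	}
}
```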
Created attachment 1775187 [details]
cloud38 memory profile.
Created attachment 1775188 [details]
kubectl top output.
Created attachment 1775637 [details]
psi4 profile.
Reproduced on my PSI cluster.
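Memory profiles like the ones attached in this thread are typically heap profiles produced with Go's runtime/pprof package. Purely as an illustration (this is not necessarily how these particular profiles were captured; the helper name and output path are assumptions), a Go process can write such a profile into the pod's profiler cache directory like this:

```go
package profiler

import (
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
)

// WriteHeapProfile dumps a heap profile into dir (for example the pod's
// /var/cache/profiler emptyDir), so it can be copied off the pod and
// inspected with `go tool pprof`. Name and location are illustrative.
func WriteHeapProfile(dir string) error {
	f, err := os.Create(filepath.Join(dir, "heap.pprof"))
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC first so the profile reflects live objects only
	return pprof.WriteHeapProfile(f)
}
```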
Created attachment 1775638 [details]
psi4 controller container profile.
Reproduced on my PSI cluster.
This is the "controller" container (not inventory).
The standard Go HTTP library transport defaults to an unlimited number of idle connections in order to support connection reuse. For some reason, an idle connection retains its I/O buffers for that same unlimited duration. The fix is to configure the transport to limit the number and lifespan of idle connections: https://github.com/konveyor/forklift-controller/pull/229 (a sketch of this kind of transport configuration is shown below, after the reproduction details).

Reproduced on MTV 2.0.0.21 on another scale environment (Cloud10):

```
inventory:
  Container ID:  cri-o://60a44db3b8f133849f5999fff8a0420df92f9cb800902a9d88aa6687750249f4
  Image:         registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
  Image ID:      registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
  Port:          8443/TCP
  Host Port:     0/TCP
  Command:
    /usr/local/bin/manager
  State:          Running
    Started:      Wed, 28 Apr 2021 23:50:54 +0000
  Last State:     Terminated
    Reason:       OOMKilled
    Exit Code:    137
    Started:      Wed, 28 Apr 2021 15:17:05 +0000
    Finished:     Wed, 28 Apr 2021 23:50:53 +0000
  Ready:          True
  Restart Count:  2
  Limits:
    cpu:     100m
    memory:  800Mi
  Requests:
    cpu:     100m
    memory:  350Mi
  Environment Variables from:
    forklift-controller-config  ConfigMap  Optional: false
  Environment:
    POD_NAMESPACE:                 openshift-rhmtv (v1:metadata.namespace)
    ROLE:                          inventory
    SECRET_NAME:                   webhook-server-secret
    API_PORT:                      8443
    API_TLS_ENABLED:               true
    API_TLS_CERTIFICATE:           /var/run/secrets/forklift-inventory-serving-cert/tls.crt
    API_TLS_KEY:                   /var/run/secrets/forklift-inventory-serving-cert/tls.key
    METRICS_PORT:                  8081
    POLICY_AGENT_URL:              https://forklift-validation.openshift-rhmtv.svc.cluster.local:8181
    POLICY_AGENT_SEARCH_INTERVAL:  120

oc describe node f01-h26-000-r640.rdu2.scalelab.redhat.com

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            1820m (2%)   500m (0%)
  memory                         7890Mi (2%)  500Mi (0%)
  ephemeral-storage              0 (0%)       0 (0%)
  hugepages-1Gi                  0 (0%)       0 (0%)
  hugepages-2Mi                  0 (0%)       0 (0%)
  devices.kubevirt.io/kvm        1            1
  devices.kubevirt.io/tun        1            1
  devices.kubevirt.io/vhost-net  1            1

Events:
```

The fix should be part of build mtv-operator-bundle-container-2.0.0-4 / iib:72115.

Would it be possible to reproduce and note the amount of memory consumed by the pod when it is killed? The guess is that it is consuming more than the 800Mi limit.

This is possible, but surprising. @jortel, what do you think?
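For reference, here is a minimal sketch of the kind of idle-connection limits described above. The package name, helper name, and specific values are illustrative assumptions; the authoritative change is the linked PR #229.

```go
package web

import (
	"net/http"
	"time"
)

// newLimitedClient builds an HTTP client whose transport caps the number and
// lifespan of idle (keep-alive) connections, so that idle connections and the
// I/O buffers they hold are not retained indefinitely.
// The values below are illustrative, not the ones used in forklift-controller.
func newLimitedClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        10,               // total idle connections kept across all hosts
		MaxIdleConnsPerHost: 2,                // idle connections kept per host
		IdleConnTimeout:     30 * time.Second, // close idle connections after this duration
	}
	return &http.Client{
		Transport: transport,
		Timeout:   60 * time.Second,
	}
}
```

Calling (*http.Transport).CloseIdleConnections() at teardown additionally releases any remaining idle connections, and therefore their buffers, immediately.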
Created attachment 1782406 [details]
controller pod memory - grafana screenshot
Continuing from David's comment: OOM also occurred on cloud38. At the time of the OOM, the environment was idle and had 4 plans in Succeeded status.

cloud38: f02-h07-000-r640.rdu2.scalelab.redhat.com (root ; 100yard-)

```
- containerID: cri-o://d9114299068a7ad9221e42619b1a2e7fb1d69156840a8acf1405c82af106f1b1
  image: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
  imageID: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
  lastState:
    terminated:
      containerID: cri-o://7210a625d5d7a0fd859eb4e81e9284bede921d5da2c61e463eb557a0d3448f1e
      exitCode: 137
      finishedAt: "2021-05-12T02:21:49Z"
      reason: OOMKilled
      startedAt: "2021-05-10T15:03:53Z"
  name: inventory
  ready: true
  restartCount: 1
  started: true
  state:
    running:
      startedAt: "2021-05-12T02:21:51Z"
hostIP: 192.168.208.14
phase: Running
podIP: 10.128.3.141
podIPs:
- ip: 10.128.3.141
qosClass: Burstable
startTime: "2021-05-10T15:03:46Z"

root@f02-h07-000-r640:~$ oc get pods -nopenshift-mtv -owide
NAME                                   READY   STATUS    RESTARTS   AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
forklift-controller-5fd7f96df7-dckj2   2/2     Running   1          46h   10.128.3.141   f02-h17-000-r640.rdu2.scalelab.redhat.com   <none>           <none>
forklift-operator-754dbc46dd-4bvcw     1/1     Running   0          2d    10.128.3.102   f02-h17-000-r640.rdu2.scalelab.redhat.com   <none>           <none>
forklift-ui-f46bbcfd9-tvcr9            1/1     Running   0          2d    10.128.3.105   f02-h17-000-r640.rdu2.scalelab.redhat.com   <none>           <none>
forklift-validation-6687f5954d-rxv2h   1/1     Running   0          2d    10.128.3.104   f02-h17-000-r640.rdu2.scalelab.redhat.com   <none>           <none>
```

cloud38: MTV 2.0.0.12, CNV 2.6.2

According to the graph you shared, it is strange that the container is OOMKilled. Would you mind running the following command and sharing its output?
```
$ oc get -o yaml -n openshift-mtv deployment forklift-controller
```

```
root@f01-h14-000-r640:~$ oc get -o yaml -n openshift-mtv deployment forklift-controller
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-05-10T10:59:04Z"
  generation: 1
  labels:
    app: forklift
    control-plane: controller-manager
    controller-tools.k8s.io: "1.0"
  managedFields:
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
          f:control-plane: {}
          f:controller-tools.k8s.io: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"e0334f7b-0bf5-494b-95b3-215a09222750"}:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:progressDeadlineSeconds: {}
        f:replicas: {}
        f:revisionHistoryLimit: {}
        f:selector: {}
        f:strategy:
          f:rollingUpdate:
            .: {}
            f:maxSurge: {}
            f:maxUnavailable: {}
          f:type: {}
        f:template:
          f:metadata:
            f:annotations:
              .: {}
              f:configHash: {}
            f:labels:
              .: {}
              f:app: {}
              f:control-plane: {}
              f:controller-tools.k8s.io: {}
          f:spec:
            f:containers:
              k:{"name":"controller"}:
                .: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"API_HOST"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"API_PORT"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"API_TLS_ENABLED"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"POD_NAMESPACE"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef:
                        .: {}
                        f:apiVersion: {}
                        f:fieldPath: {}
                  k:{"name":"ROLE"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"SECRET_NAME"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                f:envFrom: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:ports:
                  .: {}
                  k:{"containerPort":9876,"protocol":"TCP"}:
                    .: {}
                    f:containerPort: {}
                    f:name: {}
                    f:protocol: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/tmp/cert"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
                  k:{"mountPath":"/var/cache/profiler"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
              k:{"name":"inventory"}:
                .: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"API_PORT"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"API_TLS_CERTIFICATE"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"API_TLS_ENABLED"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"API_TLS_KEY"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"METRICS_PORT"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"POD_NAMESPACE"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef:
                        .: {}
                        f:apiVersion: {}
                        f:fieldPath: {}
                  k:{"name":"POLICY_AGENT_SEARCH_INTERVAL"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"POLICY_AGENT_URL"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"ROLE"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"SECRET_NAME"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                f:envFrom: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:ports:
                  .: {}
                  k:{"containerPort":8443,"protocol":"TCP"}:
                    .: {}
                    f:containerPort: {}
                    f:name: {}
                    f:protocol: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/var/cache/inventory"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/cache/profiler"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/run/secrets/forklift-inventory-serving-cert"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
            f:dnsPolicy: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext: {}
            f:serviceAccount: {}
            f:serviceAccountName: {}
            f:terminationGracePeriodSeconds: {}
            f:volumes:
              .: {}
              k:{"name":"cert"}:
                .: {}
                f:name: {}
                f:secret:
                  .: {}
                  f:defaultMode: {}
                  f:secretName: {}
              k:{"name":"forklift-inventory-serving-cert"}:
                .: {}
                f:name: {}
                f:secret:
                  .: {}
                  f:defaultMode: {}
                  f:secretName: {}
              k:{"name":"inventory"}:
                .: {}
                f:emptyDir: {}
                f:name: {}
              k:{"name":"profiler"}:
                .: {}
                f:emptyDir: {}
                f:name: {}
    manager: OpenAPI-Generator
    operation: Update
    time: "2021-05-10T10:59:04Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:deployment.kubernetes.io/revision: {}
      f:status:
        f:availableReplicas: {}
        f:conditions:
          .: {}
          k:{"type":"Available"}:
            .: {}
            f:lastTransitionTime: {}
            f:lastUpdateTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Progressing"}:
            .: {}
            f:lastTransitionTime: {}
            f:lastUpdateTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:observedGeneration: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:updatedReplicas: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-05-13T02:49:29Z"
  name: forklift-controller
  namespace: openshift-mtv
  ownerReferences:
  - apiVersion: forklift.konveyor.io/v1beta1
    kind: ForkliftController
    name: forklift-controller
    uid: e0334f7b-0bf5-494b-95b3-215a09222750
  resourceVersion: "120956429"
  selfLink: /apis/apps/v1/namespaces/openshift-mtv/deployments/forklift-controller
  uid: 2b3b14b0-77ab-44ff-b877-585348f7083d
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: forklift
      control-plane: controller-manager
      controller-tools.k8s.io: "1.0"
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        configHash: /var/cache/inventory
      creationTimestamp: null
      labels:
        app: forklift
        control-plane: controller-manager
        controller-tools.k8s.io: "1.0"
    spec:
      containers:
      - command:
        - /usr/local/bin/manager
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ROLE
          value: main
        - name: API_HOST
          value: forklift-inventory.openshift-mtv.svc.cluster.local
        - name: API_PORT
          value: "8443"
        - name: API_TLS_ENABLED
          value: "true"
        - name: SECRET_NAME
          value: webhook-server-secret
        envFrom:
        - configMapRef:
            name: forklift-controller-config
        image: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
        imagePullPolicy: Always
        name: controller
        ports:
        - containerPort: 9876
          name: webhook-server
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 800Mi
          requests:
            cpu: 100m
            memory: 350Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/cert
          name: cert
          readOnly: true
        - mountPath: /var/cache/profiler
          name: profiler
      - command:
        - /usr/local/bin/manager
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ROLE
          value: inventory
        - name: SECRET_NAME
          value: webhook-server-secret
        - name: API_PORT
          value: "8443"
        - name: API_TLS_ENABLED
          value: "true"
        - name: API_TLS_CERTIFICATE
          value: /var/run/secrets/forklift-inventory-serving-cert/tls.crt
        - name: API_TLS_KEY
          value: /var/run/secrets/forklift-inventory-serving-cert/tls.key
        - name: METRICS_PORT
          value: "8081"
        - name: POLICY_AGENT_URL
          value: https://forklift-validation.openshift-mtv.svc.cluster.local:8181
        - name: POLICY_AGENT_SEARCH_INTERVAL
          value: "120"
        envFrom:
        - configMapRef:
            name: forklift-controller-config
        image: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
        imagePullPolicy: Always
        name: inventory
        ports:
        - containerPort: 8443
          name: api
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 800Mi
          requests:
            cpu: 100m
            memory: 350Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/cache/inventory
          name: inventory
        - mountPath: /var/cache/profiler
          name: profiler
        - mountPath: /var/run/secrets/forklift-inventory-serving-cert
          name: forklift-inventory-serving-cert
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: forklift-controller
      serviceAccountName: forklift-controller
      terminationGracePeriodSeconds: 10
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: webhook-server-secret
      - name: forklift-inventory-serving-cert
        secret:
          defaultMode: 420
          secretName: forklift-inventory-serving-cert
      - emptyDir: {}
        name: inventory
      - emptyDir: {}
        name: profiler
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-05-10T10:59:04Z"
    lastUpdateTime: "2021-05-10T11:00:05Z"
    message: ReplicaSet "forklift-controller-5c745fcf7c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-05-13T02:49:29Z"
    lastUpdateTime: "2021-05-13T02:49:29Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
root@f01-h14-000-r640:~$
```

The fix should be part of build mtv-operator-bundle-container-2.0.0-17 / iib:76027. The next step will be to fix Open Policy Agent itself, but that is out of scope for this BZ.

Verified on cloud38: MTV 2.0.0.19, CNV 2.6.3. No OOM messages were found during migration or while the pods/nodes were idle, for 22 hours in total since the last MTV/CNV upgrade.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (MTV 2.0.0 images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:2381