Bug 1952121
| Summary: | Inventory container crashes because of OOM condition, causing controller pod restarts |  |  |
|---|---|---|---|
| Product: | Migration Toolkit for Virtualization | Reporter: | Tzahi Ashkenazi <tashkena> |
| Component: | General | Assignee: | Jeff Ortel <jortel> |
| Status: | CLOSED ERRATA | QA Contact: | Tzahi Ashkenazi <tashkena> |
| Severity: | high | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | urgent |  |  |
| Version: | 2.0.0 | CC: | apinnick, dagur, dvaanunu, fdupont, istein, jortel |
| Target Milestone: | --- |  |  |
| Target Release: | 2.0.0 |  |  |
| Hardware: | Unspecified |  |  |
| OS: | Unspecified |  |  |
| Whiteboard: |  |  |  |
| Fixed In Version: |  | Doc Type: | If docs needed, set a value |
| Doc Text: |  | Story Points: | --- |
| Clone Of: |  | Environment: |  |
| Last Closed: | 2021-06-10 17:11:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: |  |
| Verified Versions: |  | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |  |
| Cloudforms Team: | --- | Target Upstream Version: |  |
| Embargoed: |  |  |  |
| Attachments: |  |  |  |
The fix should be in build 2.0.0-20 / iib:69034.

Reproduced on 2.0.0.20:

oc get pods/forklift-controller-86986fd75b-hgr7f -nopenshift-rhmtv -oyaml | less
imageID: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
lastState:
terminated:
containerID: cri-o://cd7096c6ae240e953d15a35f8a7f6ed7663a6d8c4273961b537b4589bd4a6a77
exitCode: 137
finishedAt: "2021-04-22T12:39:56Z"
reason: OOMKilled
startedAt: "2021-04-22T07:53:43Z"
name: controller
ready: true
restartCount: 1
started: true
state:
running:
startedAt: "2021-04-22T12:39:58Z"
- containerID: cri-o://3630fe55b0df45050030159c23a892fc360380582359ec8f5aba204bd25cb7af
image: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
imageID: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
lastState:
terminated:
containerID: cri-o://65896916004c95b4d65820dac52359bda5beb6d16e310d27a2a3b947c1527c10
exitCode: 137
finishedAt: "2021-04-22T11:32:16Z"
reason: OOMKilled
startedAt: "2021-04-22T07:53:44Z"
name: inventory
ready: true
restartCount: 1
started: true
state:
running:
startedAt: "2021-04-22T11:32:18Z"
hostIP: 192.168.208.15
phase: Running
podIP: 10.131.0.162
podIPs:
- ip: 10.131.0.162
qosClass: Burstable
startTime: "2021-04-22T07:53:33Z"
I guess this current BZ is impacted by this one > https://bugzilla.redhat.com/show_bug.cgi?id=1952450

oc describe node f02-h18-000-r640.rdu2.scalelab.redhat.com

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests     Limits
  --------                       --------     ------
  cpu                            1179m (1%)   300m (0%)
  memory                         3356Mi (0%)  2400Mi (0%)
  ephemeral-storage              0 (0%)       0 (0%)
  hugepages-1Gi                  0 (0%)       0 (0%)
  hugepages-2Mi                  0 (0%)       0 (0%)
  devices.kubevirt.io/kvm        0            0
  devices.kubevirt.io/tun        0            0
  devices.kubevirt.io/vhost-net  0            0
  openshift.io/sriov_nics        0            0
Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  SystemOOM  142m  kubelet  System OOM encountered, victim process: manager, pid: 3564364
  Warning  SystemOOM  75m   kubelet  System OOM encountered, victim process: manager, pid: 3564270

Created attachment 1775187 [details]
cloud38 memory profile.
Created attachment 1775188 [details]
kubectl top output.
Created attachment 1775637 [details]
psi4 profile.
Reproduced on my PSI cluster.
Created attachment 1775638 [details]
psi4 controller container profile.
Reproduced on my PSI cluster.
This is the "controller" container (not inventory).
The standard Go http lib transport defaults to unlimited idle connections to support connection reuse. For some reason, the idle connections retain their IO buffers for the same unlimited duration. The fix is to configure the transport to limit the number and lifespan of idle connections: https://github.com/konveyor/forklift-controller/pull/229 (an illustrative sketch of this kind of transport tuning follows the reproduction output below).

Reproduced on MTV 2.0.0.21 - another scale env (Cloud10):
inventory:
Container ID: cri-o://60a44db3b8f133849f5999fff8a0420df92f9cb800902a9d88aa6687750249f4
Image: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
Image ID: registry.redhat.io/rhmtv/rhmtv-controller@sha256:4a3766c3c467a0d24b34ea1ed692c040c76d14cc13b805fc971089255489a88f
Port: 8443/TCP
Host Port: 0/TCP
Command:
/usr/local/bin/manager
State: Running
Started: Wed, 28 Apr 2021 23:50:54 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 28 Apr 2021 15:17:05 +0000
Finished: Wed, 28 Apr 2021 23:50:53 +0000
Ready: True
Restart Count: 2
Limits:
cpu: 100m
memory: 800Mi
Requests:
cpu: 100m
memory: 350Mi
Environment Variables from:
forklift-controller-config ConfigMap Optional: false
Environment:
POD_NAMESPACE: openshift-rhmtv (v1:metadata.namespace)
ROLE: inventory
SECRET_NAME: webhook-server-secret
API_PORT: 8443
API_TLS_ENABLED: true
API_TLS_CERTIFICATE: /var/run/secrets/forklift-inventory-serving-cert/tls.crt
API_TLS_KEY: /var/run/secrets/forklift-inventory-serving-cert/tls.key
METRICS_PORT: 8081
POLICY_AGENT_URL: https://forklift-validation.openshift-rhmtv.svc.cluster.local:8181
POLICY_AGENT_SEARCH_INTERVAL: 120
oc describe node f01-h26-000-r640.rdu2.scalelab.redhat.com
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1820m (2%) 500m (0%)
memory 7890Mi (2%) 500Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
devices.kubevirt.io/kvm 1 1
devices.kubevirt.io/tun 1 1
devices.kubevirt.io/vhost-net 1 1
Events:
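To make the transport fix described above concrete, here is a minimal, hedged sketch of limiting idle connections on a Go http.Transport. The helper name, the placeholder URL, and the specific values (10 connections, 10-second idle timeout) are illustrative assumptions, not the actual settings in konveyor/forklift-controller PR #229:

```go
package main

import (
	"net/http"
	"time"
)

// newLimitedClient builds an http.Client whose transport caps the number and
// lifespan of idle (keep-alive) connections, so their buffers are released
// instead of accumulating. Values are illustrative only.
func newLimitedClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        10,               // total idle connections across all hosts
		MaxIdleConnsPerHost: 10,               // idle connections kept per host
		IdleConnTimeout:     10 * time.Second, // close idle connections after this long
	}
	return &http.Client{Transport: transport}
}

func main() {
	client := newLimitedClient()
	resp, err := client.Get("https://example.com") // placeholder URL
	if err != nil {
		return
	}
	defer resp.Body.Close()
}
```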
The fix should be part of build mtv-operator-bundle-container-2.0.0-4 / iib:72115.

Would it be possible to reproduce and note the amount of memory consumed by the pod when it is killed? The guess is that it is consuming more than the 800Mi limit.

This is possible, but surprising. @jortel, what do you think?

Created attachment 1782406 [details]
controller pod memory - grafana screenshot
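Regarding the earlier question about how much memory the container is consuming when it is killed: besides kubectl top and the Grafana graph attached above, one hedged way to sample it from inside the pod is to read the container's cgroup memory accounting and compare it with the 800Mi limit. A minimal Go sketch, assuming cgroup v1 paths on the node (cgroup v2 exposes /sys/fs/cgroup/memory.current and memory.max instead):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readBytes reads a single integer byte count from a cgroup accounting file.
func readBytes(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	usage, err := readBytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		fmt.Println("read usage:", err)
		return
	}
	limit, err := readBytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		fmt.Println("read limit:", err)
		return
	}
	fmt.Printf("usage: %d MiB, limit: %d MiB\n", usage/(1<<20), limit/(1<<20))
}
```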
Continuing from David's comment:
OOM occurred on cloud38 as well.
During the OOM, the env was idle and had 4 plans with succeeded status.
cloud38: f02-h07-000-r640.rdu2.scalelab.redhat.com (root ; 100yard-)
- containerID: cri-o://d9114299068a7ad9221e42619b1a2e7fb1d69156840a8acf1405c82af106f1b1
image: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
imageID: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
lastState:
terminated:
containerID: cri-o://7210a625d5d7a0fd859eb4e81e9284bede921d5da2c61e463eb557a0d3448f1e
exitCode: 137
finishedAt: "2021-05-12T02:21:49Z"
reason: OOMKilled
startedAt: "2021-05-10T15:03:53Z"
name: inventory
ready: true
restartCount: 1
started: true
state:
running:
startedAt: "2021-05-12T02:21:51Z"
hostIP: 192.168.208.14
phase: Running
podIP: 10.128.3.141
podIPs:
- ip: 10.128.3.141
qosClass: Burstable
startTime: "2021-05-10T15:03:46Z"
root@f02-h07-000-r640:~$ oc get pods -nopenshift-mtv -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
forklift-controller-5fd7f96df7-dckj2 2/2 Running 1 46h 10.128.3.141 f02-h17-000-r640.rdu2.scalelab.redhat.com <none> <none>
forklift-operator-754dbc46dd-4bvcw 1/1 Running 0 2d 10.128.3.102 f02-h17-000-r640.rdu2.scalelab.redhat.com <none> <none>
forklift-ui-f46bbcfd9-tvcr9 1/1 Running 0 2d 10.128.3.105 f02-h17-000-r640.rdu2.scalelab.redhat.com <none> <none>
forklift-validation-6687f5954d-rxv2h 1/1 Running 0 2d 10.128.3.104 f02-h17-000-r640.rdu2.scalelab.redhat.com <none> <none>
cloud38
MTV:2.0.0.12
CNV:2.6.2
According to the graph you shared, it's weird that the container is OOMKilled. Would you mind running the following command and sharing its output?

$ oc get -o yaml -n openshift-mtv deployment forklift-controller
root@f01-h14-000-r640:~$ oc get -o yaml -n openshift-mtv deployment forklift-controller
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
creationTimestamp: "2021-05-10T10:59:04Z"
generation: 1
labels:
app: forklift
control-plane: controller-manager
controller-tools.k8s.io: "1.0"
managedFields:
- apiVersion: apps/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.: {}
f:app: {}
f:control-plane: {}
f:controller-tools.k8s.io: {}
f:ownerReferences:
.: {}
k:{"uid":"e0334f7b-0bf5-494b-95b3-215a09222750"}:
.: {}
f:apiVersion: {}
f:kind: {}
f:name: {}
f:uid: {}
f:spec:
f:progressDeadlineSeconds: {}
f:replicas: {}
f:revisionHistoryLimit: {}
f:selector: {}
f:strategy:
f:rollingUpdate:
.: {}
f:maxSurge: {}
f:maxUnavailable: {}
f:type: {}
f:template:
f:metadata:
f:annotations:
.: {}
f:configHash: {}
f:labels:
.: {}
f:app: {}
f:control-plane: {}
f:controller-tools.k8s.io: {}
f:spec:
f:containers:
k:{"name":"controller"}:
.: {}
f:command: {}
f:env:
.: {}
k:{"name":"API_HOST"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"API_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"API_TLS_ENABLED"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"POD_NAMESPACE"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef:
.: {}
f:apiVersion: {}
f:fieldPath: {}
k:{"name":"ROLE"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"SECRET_NAME"}:
.: {}
f:name: {}
f:value: {}
f:envFrom: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":9876,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:name: {}
f:protocol: {}
f:resources:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/tmp/cert"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
k:{"mountPath":"/var/cache/profiler"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"name":"inventory"}:
.: {}
f:command: {}
f:env:
.: {}
k:{"name":"API_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"API_TLS_CERTIFICATE"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"API_TLS_ENABLED"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"API_TLS_KEY"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"METRICS_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"POD_NAMESPACE"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef:
.: {}
f:apiVersion: {}
f:fieldPath: {}
k:{"name":"POLICY_AGENT_SEARCH_INTERVAL"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"POLICY_AGENT_URL"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"ROLE"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"SECRET_NAME"}:
.: {}
f:name: {}
f:value: {}
f:envFrom: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":8443,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:name: {}
f:protocol: {}
f:resources:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/var/cache/inventory"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/cache/profiler"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/run/secrets/forklift-inventory-serving-cert"}:
.: {}
f:mountPath: {}
f:name: {}
f:dnsPolicy: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext: {}
f:serviceAccount: {}
f:serviceAccountName: {}
f:terminationGracePeriodSeconds: {}
f:volumes:
.: {}
k:{"name":"cert"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:secretName: {}
k:{"name":"forklift-inventory-serving-cert"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:secretName: {}
k:{"name":"inventory"}:
.: {}
f:emptyDir: {}
f:name: {}
k:{"name":"profiler"}:
.: {}
f:emptyDir: {}
f:name: {}
manager: OpenAPI-Generator
operation: Update
time: "2021-05-10T10:59:04Z"
- apiVersion: apps/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:deployment.kubernetes.io/revision: {}
f:status:
f:availableReplicas: {}
f:conditions:
.: {}
k:{"type":"Available"}:
.: {}
f:lastTransitionTime: {}
f:lastUpdateTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
k:{"type":"Progressing"}:
.: {}
f:lastTransitionTime: {}
f:lastUpdateTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
f:observedGeneration: {}
f:readyReplicas: {}
f:replicas: {}
f:updatedReplicas: {}
manager: kube-controller-manager
operation: Update
time: "2021-05-13T02:49:29Z"
name: forklift-controller
namespace: openshift-mtv
ownerReferences:
- apiVersion: forklift.konveyor.io/v1beta1
kind: ForkliftController
name: forklift-controller
uid: e0334f7b-0bf5-494b-95b3-215a09222750
resourceVersion: "120956429"
selfLink: /apis/apps/v1/namespaces/openshift-mtv/deployments/forklift-controller
uid: 2b3b14b0-77ab-44ff-b877-585348f7083d
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: forklift
control-plane: controller-manager
controller-tools.k8s.io: "1.0"
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
annotations:
configHash: /var/cache/inventory
creationTimestamp: null
labels:
app: forklift
control-plane: controller-manager
controller-tools.k8s.io: "1.0"
spec:
containers:
- command:
- /usr/local/bin/manager
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: ROLE
value: main
- name: API_HOST
value: forklift-inventory.openshift-mtv.svc.cluster.local
- name: API_PORT
value: "8443"
- name: API_TLS_ENABLED
value: "true"
- name: SECRET_NAME
value: webhook-server-secret
envFrom:
- configMapRef:
name: forklift-controller-config
image: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
imagePullPolicy: Always
name: controller
ports:
- containerPort: 9876
name: webhook-server
protocol: TCP
resources:
limits:
cpu: 100m
memory: 800Mi
requests:
cpu: 100m
memory: 350Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp/cert
name: cert
readOnly: true
- mountPath: /var/cache/profiler
name: profiler
- command:
- /usr/local/bin/manager
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: ROLE
value: inventory
- name: SECRET_NAME
value: webhook-server-secret
- name: API_PORT
value: "8443"
- name: API_TLS_ENABLED
value: "true"
- name: API_TLS_CERTIFICATE
value: /var/run/secrets/forklift-inventory-serving-cert/tls.crt
- name: API_TLS_KEY
value: /var/run/secrets/forklift-inventory-serving-cert/tls.key
- name: METRICS_PORT
value: "8081"
- name: POLICY_AGENT_URL
value: https://forklift-validation.openshift-mtv.svc.cluster.local:8181
- name: POLICY_AGENT_SEARCH_INTERVAL
value: "120"
envFrom:
- configMapRef:
name: forklift-controller-config
image: registry.redhat.io/mtv/mtv-controller@sha256:666e415b74f7d93e5b91faba038b191da65619bed3f1ead7ab5fdb56873c61f7
imagePullPolicy: Always
name: inventory
ports:
- containerPort: 8443
name: api
protocol: TCP
resources:
limits:
cpu: 100m
memory: 800Mi
requests:
cpu: 100m
memory: 350Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/cache/inventory
name: inventory
- mountPath: /var/cache/profiler
name: profiler
- mountPath: /var/run/secrets/forklift-inventory-serving-cert
name: forklift-inventory-serving-cert
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: forklift-controller
serviceAccountName: forklift-controller
terminationGracePeriodSeconds: 10
volumes:
- name: cert
secret:
defaultMode: 420
secretName: webhook-server-secret
- name: forklift-inventory-serving-cert
secret:
defaultMode: 420
secretName: forklift-inventory-serving-cert
- emptyDir: {}
name: inventory
- emptyDir: {}
name: profiler
status:
availableReplicas: 1
conditions:
- lastTransitionTime: "2021-05-10T10:59:04Z"
lastUpdateTime: "2021-05-10T11:00:05Z"
message: ReplicaSet "forklift-controller-5c745fcf7c" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
- lastTransitionTime: "2021-05-13T02:49:29Z"
lastUpdateTime: "2021-05-13T02:49:29Z"
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
observedGeneration: 1
readyReplicas: 1
replicas: 1
updatedReplicas: 1
root@f01-h14-000-r640:~$
The fix should be part of build mtv-operator-bundle-container-2.0.0-17 / iib:76027. The next step will be to fix Open Policy Agent itself, but it's out of scope of this BZ.

Verified on cloud38:
MTV: 2.0.0.19
CNV: 2.6.3
No OOM messages were found during migration or while the pods/nodes were idle, for 22 hours in total since the last MTV/CNV upgrade.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (MTV 2.0.0 images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:2381
Description of problem:
On cloud38 (BM, 6 nodes) in the idle state (all pods deleted and created from scratch), the controller pod restarted 5 times during 180 min.

root@f02-h07-000-r640:/home/kni/scripts/iperf$ oc get pods
NAME                                   READY  STATUS   RESTARTS  AGE
forklift-controller-64585c555b-t2bkl   2/2    Running  4         175m
forklift-operator-847f9d45d7-pgnzx     1/1    Running  0         175m
forklift-ui-7fc8495999-6xhk2           1/1    Running  0         175m
forklift-validation-7977854bdd-xfqm9   1/1    Running  0         175m
iperf-client-h15                       1/1    Running  0         95m

From the controller pod:
    State:          Running
      Started:      Wed, 21 Apr 2021 14:24:16 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 21 Apr 2021 13:41:14 +0000
      Finished:     Wed, 21 Apr 2021 14:24:13 +0000
    Ready:          True
    Restart Count:  4
    Limits:
      cpu:     100m
      memory:  800Mi
    Requests:
      cpu:     100m
      memory:  350Mi
    Environment Variables from:
      forklift-controller-config  ConfigMap  Optional: false
    Environment:
      POD_NAMESPACE:                 openshift-rhmtv (v1:metadata.namespace)
      ROLE:                          inventory
      SECRET_NAME:                   webhook-server-secret
      API_PORT:                      8443
      API_TLS_ENABLED:               true
      API_TLS_CERTIFICATE:           /var/run/secrets/forklift-inventory-serving-cert/tls.crt
      API_TLS_KEY:                   /var/run/secrets/forklift-inventory-serving-cert/tls.key
      METRICS_PORT:                  8081
      POLICY_AGENT_URL:              https://forklift-validation.openshift-rhmtv.svc.cluster.local:8181
      POLICY_AGENT_SEARCH_INTERVAL:  120

Version-Release number of selected component (if applicable):
MTV 2.0.0.18
CNV 2.6.1