Bug 1809593
| Summary: | openshift-kube-apiserver, openshift-etcd and openshift-kube-scheduler pods in OOMKilled status, after OCP deployment | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alex Kalenyuk <akalenyu> |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | aos-bugs, jokerman, mfojtik, pehunt, rlopez, rphillips, zyu |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1856418 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-13 22:00:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1856418 | | |

This is a known issue, and there's a fix in progress! If this is the result of what I suspect, it is fixed in the latest CRI-O. My guess is that CRI-O was incorrectly reporting an OOM kill of conmon, which could result in CRI-O spoofing the OOM kill of a container. I have since dropped the code that did this from CRI-O (https://github.com/cri-o/cri-o/commit/56ab421fda8380f2aa1dcb18265c7aaa588c8a9e). This change should prevent false reports of container OOM kills.

This version of CRI-O has recently been tagged into 4.4 nightlies.

*** Bug 1810636 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
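
To illustrate the failure mode described above, here is a minimal, hypothetical Go sketch. It is not the CRI-O code that was removed; the cgroup path and the parent-scope heuristic are assumptions for illustration only. The point is that a runtime which infers "container was OOM killed" from an oom_kill counter on a cgroup scope that also contains conmon cannot tell the two processes apart.

```go
// Hypothetical sketch of the failure mode, not the actual CRI-O code:
// if "container OOM killed" is inferred from an oom_kill counter on a
// cgroup that also holds conmon, an OOM kill of conmon looks identical
// to an OOM kill of the container.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// oomKillCount parses the oom_kill counter from a cgroup v1
// memory.oom_control file.
func oomKillCount(cgroupDir string) (int64, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "memory.oom_control"))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("oom_kill counter not found in %s", cgroupDir)
}

func main() {
	// parentScope is a made-up cgroup that holds both the container and
	// its conmon monitor. A non-zero counter here only proves that
	// *something* in the scope was OOM-killed -- possibly just conmon --
	// so marking the container OOMKilled from this value alone can
	// produce the spurious status seen in this bug.
	parentScope := "/sys/fs/cgroup/memory/system.slice/crio-conmon-<id>.scope"
	n, err := oomKillCount(parentScope)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("oom_kill events in %s: %d\n", parentScope, n)
}
```

With a heuristic like this, conmon being OOM-killed increments the same counter the container would, which matches the spoofed OOMKilled status described above; per the comment, the linked commit dropped the CRI-O code that made this kind of inference.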

Description of problem:
openshift-kube-apiserver, openshift-etcd and openshift-kube-scheduler pods in OOMKilled status, after OCP deployment

Version-Release number of selected component (if applicable):
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.4.0-0.nightly-2020-03-02-011520
Kubernetes Version: v1.17.1

How reproducible:
Occurred in last 2 deployments

Steps to Reproduce:
1. Deploy environment on PSI IPI
2. run: oc get pods -A | grep -v Running | grep -v Completed

Actual results:
Several pods are in "OOMKilled" status

Expected results:
All pods are running

Additional info:

[cloud-user@ocp-psi-executor bug_1794050]$ oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE                   NAME                                        READY   STATUS      RESTARTS   AGE
openshift-etcd              revision-pruner-3-akalenyu-96t9h-master-2   0/1     OOMKilled   0          149m
openshift-kube-apiserver    revision-pruner-3-akalenyu-96t9h-master-0   0/1     OOMKilled   0          147m
openshift-kube-scheduler    revision-pruner-5-akalenyu-96t9h-master-1   0/1     OOMKilled   0          143m

[cloud-user@ocp-psi-executor bug_1794050]$ oc logs -f revision-pruner-3-akalenyu-96t9h-master-2 -n openshift-etcd
I0303 10:05:16.896154 1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc000649c20 protected-revisions:0xc000649cc0 resource-dir:0xc000649d60 static-pod-name:0xc000649e00 v:0xc00014f9a0] [0xc00014f9a0 0xc000649c20 0xc000649cc0 0xc000649d60 0xc000649e00] [] map[add-dir-header:0xc00014f360 alsologtostderr:0xc00014f400 help:0xc000324320 log-backtrace-at:0xc00014f4a0 log-dir:0xc00014f540 log-file:0xc00014f5e0 log-file-max-size:0xc00014f680 log-flush-frequency:0xc00024c0a0 logtostderr:0xc00014f720 max-eligible-revision:0xc000649c20 protected-revisions:0xc000649cc0 resource-dir:0xc000649d60 skip-headers:0xc00014f7c0 skip-log-headers:0xc00014f860 static-pod-name:0xc000649e00 stderrthreshold:0xc00014f900 v:0xc00014f9a0 vmodule:0xc00014fa40] [0xc000649c20 0xc000649cc0 0xc000649d60 0xc000649e00 0xc00014f360 0xc00014f400 0xc00014f4a0 0xc00014f540 0xc00014f5e0 0xc00014f680 0xc00024c0a0 0xc00014f720 0xc00014f7c0 0xc00014f860 0xc00014f900 0xc00014f9a0 0xc00014fa40 0xc000324320] [0xc00014f360 0xc00014f400 0xc000324320 0xc00014f4a0 0xc00014f540 0xc00014f5e0 0xc00014f680 0xc00024c0a0 0xc00014f720 0xc000649c20 0xc000649cc0 0xc000649d60 0xc00014f7c0 0xc00014f860 0xc000649e00 0xc00014f900 0xc00014f9a0 0xc00014fa40] map[104:0xc000324320 118:0xc00014f9a0] [] -1 0 0xc000665a70 true <nil> []}
I0303 10:05:16.896435 1 cmd.go:39] (*prune.PruneOptions)(0xc00086aa80)({ MaxEligibleRevision: (int) 3, ProtectedRevisions: ([]int) (len=3 cap=3) { (int) 1, (int) 2, (int) 3 }, ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources", StaticPodName: (string) (len=8) "etcd-pod" })

[cloud-user@ocp-psi-executor bug_1794050]$ oc logs -f revision-pruner-3-akalenyu-96t9h-master-0 -n openshift-kube-apiserver
I0303 10:07:39.186186 1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc0005df680 protected-revisions:0xc0005df720 resource-dir:0xc0005df7c0 static-pod-name:0xc0005df860 v:0xc000389d60] [0xc000389d60 0xc0005df680 0xc0005df720 0xc0005df7c0 0xc0005df860] [] map[add-dir-header:0xc000389720 alsologtostderr:0xc0003897c0 help:0xc0006040a0 log-backtrace-at:0xc000389860 log-dir:0xc000389900 log-file:0xc0003899a0 log-file-max-size:0xc000389a40 log-flush-frequency:0xc0000d4b40 logtostderr:0xc000389ae0 max-eligible-revision:0xc0005df680 protected-revisions:0xc0005df720 resource-dir:0xc0005df7c0 skip-headers:0xc000389b80 skip-log-headers:0xc000389c20 static-pod-name:0xc0005df860 stderrthreshold:0xc000389cc0 v:0xc000389d60 vmodule:0xc000389e00] [0xc0005df680 0xc0005df720 0xc0005df7c0 0xc0005df860 0xc000389720 0xc0003897c0 0xc000389860 0xc000389900 0xc0003899a0 0xc000389a40 0xc0000d4b40 0xc000389ae0 0xc000389b80 0xc000389c20 0xc000389cc0 0xc000389d60 0xc000389e00 0xc0006040a0] [0xc000389720 0xc0003897c0 0xc0006040a0 0xc000389860 0xc000389900 0xc0003899a0 0xc000389a40 0xc0000d4b40 0xc000389ae0 0xc0005df680 0xc0005df720 0xc0005df7c0 0xc000389b80 0xc000389c20 0xc0005df860 0xc000389cc0 0xc000389d60 0xc000389e00] map[104:0xc0006040a0 118:0xc000389d60] [] -1 0 0xc00079d290 true <nil> []}
I0303 10:07:39.186411 1 cmd.go:39] (*prune.PruneOptions)(0xc0004e5880)({ MaxEligibleRevision: (int) 4, ProtectedRevisions: ([]int) (len=4 cap=4) { (int) 1, (int) 2, (int) 3, (int) 4 }, ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources", StaticPodName: (string) (len=18) "kube-apiserver-pod" })

[cloud-user@ocp-psi-executor bug_1794050]$ oc logs -f revision-pruner-5-akalenyu-96t9h-master-1 -n openshift-kube-scheduler
I0303 10:12:05.481067 1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc00056a640 protected-revisions:0xc00056a6e0 resource-dir:0xc00056a780 static-pod-name:0xc00056a820 v:0xc000715d60] [0xc000715d60 0xc00056a640 0xc00056a6e0 0xc00056a780 0xc00056a820] [] map[add-dir-header:0xc000715720 alsologtostderr:0xc0007157c0 help:0xc00056a8c0 log-backtrace-at:0xc000715860 log-dir:0xc000715900 log-file:0xc0007159a0 log-file-max-size:0xc000715a40 log-flush-frequency:0xc0000eabe0 logtostderr:0xc000715ae0 max-eligible-revision:0xc00056a640 protected-revisions:0xc00056a6e0 resource-dir:0xc00056a780 skip-headers:0xc000715b80 skip-log-headers:0xc000715c20 static-pod-name:0xc00056a820 stderrthreshold:0xc000715cc0 v:0xc000715d60 vmodule:0xc000715e00] [0xc00056a640 0xc00056a6e0 0xc00056a780 0xc00056a820 0xc000715720 0xc0007157c0 0xc000715860 0xc000715900 0xc0007159a0 0xc000715a40 0xc0000eabe0 0xc000715ae0 0xc000715b80 0xc000715c20 0xc000715cc0 0xc000715d60 0xc000715e00 0xc00056a8c0] [0xc000715720 0xc0007157c0 0xc00056a8c0 0xc000715860 0xc000715900 0xc0007159a0 0xc000715a40 0xc0000eabe0 0xc000715ae0 0xc00056a640 0xc00056a6e0 0xc00056a780 0xc000715b80 0xc000715c20 0xc00056a820 0xc000715cc0 0xc000715d60 0xc000715e00] map[104:0xc00056a8c0 118:0xc000715d60] [] -1 0 0xc0005fabd0 true <nil> []}
I0303 10:12:05.481253 1 cmd.go:39] (*prune.PruneOptions)(0xc00040e600)({ MaxEligibleRevision: (int) 5, ProtectedRevisions: ([]int) (len=5 cap=5) { (int) 1, (int) 2, (int) 3, (int) 4, (int) 5 }, ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources", StaticPodName: (string) (len=18) "kube-scheduler-pod" })

[cloud-user@ocp-psi-executor bug_1794050]$ oc describe pod revision-pruner-5-akalenyu-96t9h-master-1 -n openshift-kube-scheduler
Name:                 revision-pruner-5-akalenyu-96t9h-master-1
Namespace:            openshift-kube-scheduler
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 akalenyu-96t9h-master-1/192.168.0.13
Start Time:           Tue, 03 Mar 2020 05:12:00 -0500
Labels:               app=pruner
Annotations:          k8s.v1.cni.cncf.io/networks-status:
Status:               Succeeded
IP:                   10.129.0.33
IPs:
  IP:  10.129.0.33
Containers:
  pruner:
    Container ID:  cri-o://a6192cb463d2f996b52abd40131f62f08b9b2567b94f32b8a250b6302450de84
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75e8ff11b60c71089972c6bc0bd01d619a4fd8e011e213c9855b76fce0c11e7d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75e8ff11b60c71089972c6bc0bd01d619a4fd8e011e213c9855b76fce0c11e7d
    Port:          <none>
    Host Port:     <none>
    Command:
      cluster-kube-scheduler-operator
      prune
    Args:
      -v=4
      --max-eligible-revision=5
      --protected-revisions=1,2,3,4,5
      --resource-dir=/etc/kubernetes/static-pod-resources
      --static-pod-name=kube-scheduler-pod
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 03 Mar 2020 05:12:05 -0500
      Finished:     Tue, 03 Mar 2020 05:12:05 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     150m
      memory:  100M
    Requests:
      cpu:     150m
      memory:  100M
    Environment:  <none>
    Mounts:
      /etc/kubernetes/ from kubelet-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from installer-sa-token-mk8qd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/
    HostPathType:
  installer-sa-token-mk8qd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  installer-sa-token-mk8qd
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:
Events:
  Type    Reason   Age   From                              Message
  ----    ------   ----  ----                              -------
  Normal  Pulled   144m  kubelet, akalenyu-96t9h-master-1  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75e8ff11b60c71089972c6bc0bd01d619a4fd8e011e213c9855b76fce0c11e7d" already present on machine
  Normal  Created  144m  kubelet, akalenyu-96t9h-master-1  Created container pruner
  Normal  Started  144m  kubelet, akalenyu-96t9h-master-1  Started container pruner
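
Note the combination in the output above: the container's state is Terminated with Reason OOMKilled but Exit Code 0, and the pod's overall Status is Succeeded. As a rough way to spot other pods showing the same contradictory signature, here is a minimal client-go sketch; it is an illustration, not part of the original report, and it assumes a standard local kubeconfig.

```go
// Sketch: list pods across all namespaces and flag containers whose
// termination state is Reason "OOMKilled" with exit code 0 -- the
// contradictory combination shown in the `oc describe` output above.
// Assumes $HOME/.kube/config points at the cluster; illustrative only.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			t := cs.State.Terminated
			if t == nil {
				t = cs.LastTerminationState.Terminated
			}
			// Reason "OOMKilled" with exit code 0 suggests the status was
			// attributed to the container rather than caused by it.
			if t != nil && t.Reason == "OOMKilled" && t.ExitCode == 0 {
				fmt.Printf("%s/%s container %s: OOMKilled with exit code 0\n",
					pod.Namespace, pod.Name, cs.Name)
			}
		}
	}
}
```

Checking LastTerminationState as well as State catches containers that have restarted since the suspicious termination.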