Bug 1809593

Summary: openshift-kube-apiserver, openshift-etcd and openshift-kube-scheduler pods in OOMKilled status, after OCP deployment
Product: OpenShift Container Platform
Reporter: Alex Kalenyuk <akalenyu>
Component: Node
Assignee: Peter Hunt <pehunt>
Status: CLOSED ERRATA
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Priority: unspecified
Version: 4.4
CC: aos-bugs, jokerman, mfojtik, pehunt, rlopez, rphillips, zyu
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clones: 1856418 (view as bug list)
Last Closed: 2020-05-13 22:00:17 UTC
Type: Bug
Regression: ---
Bug Depends On:    
Bug Blocks: 1856418    

Description Alex Kalenyuk 2020-03-03 12:58:53 UTC
Description of problem:
openshift-kube-apiserver, openshift-etcd and openshift-kube-scheduler pods in OOMKilled status, after OCP deployment

Version-Release number of selected component (if applicable):
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.4.0-0.nightly-2020-03-02-011520
Kubernetes Version: v1.17.1

How reproducible:
Occurred in the last 2 deployments

Steps to Reproduce:
1. Deploy environment on PSI IPI
2. Run: oc get pods -A | grep -v Running | grep -v Completed

Actual results:
Several pods are in "OOMKilled" status

Expected results:
All pods are running

Additional info:

[cloud-user@ocp-psi-executor bug_1794050]$ oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE                                               NAME                                                              READY   STATUS      RESTARTS   AGE
openshift-etcd                                          revision-pruner-3-akalenyu-96t9h-master-2                         0/1     OOMKilled   0          149m
openshift-kube-apiserver                                revision-pruner-3-akalenyu-96t9h-master-0                         0/1     OOMKilled   0          147m
openshift-kube-scheduler                                revision-pruner-5-akalenyu-96t9h-master-1                         0/1     OOMKilled   0          143m
[cloud-user@ocp-psi-executor bug_1794050]$ oc logs -f revision-pruner-3-akalenyu-96t9h-master-2 -n openshift-etcd
I0303 10:05:16.896154       1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc000649c20 protected-revisions:0xc000649cc0 resource-dir:0xc000649d60 static-pod-name:0xc000649e00 v:0xc00014f9a0] [0xc00014f9a0 0xc000649c20 0xc000649cc0 0xc000649d60 0xc000649e00] [] map[add-dir-header:0xc00014f360 alsologtostderr:0xc00014f400 help:0xc000324320 log-backtrace-at:0xc00014f4a0 log-dir:0xc00014f540 log-file:0xc00014f5e0 log-file-max-size:0xc00014f680 log-flush-frequency:0xc00024c0a0 logtostderr:0xc00014f720 max-eligible-revision:0xc000649c20 protected-revisions:0xc000649cc0 resource-dir:0xc000649d60 skip-headers:0xc00014f7c0 skip-log-headers:0xc00014f860 static-pod-name:0xc000649e00 stderrthreshold:0xc00014f900 v:0xc00014f9a0 vmodule:0xc00014fa40] [0xc000649c20 0xc000649cc0 0xc000649d60 0xc000649e00 0xc00014f360 0xc00014f400 0xc00014f4a0 0xc00014f540 0xc00014f5e0 0xc00014f680 0xc00024c0a0 0xc00014f720 0xc00014f7c0 0xc00014f860 0xc00014f900 0xc00014f9a0 0xc00014fa40 0xc000324320] [0xc00014f360 0xc00014f400 0xc000324320 0xc00014f4a0 0xc00014f540 0xc00014f5e0 0xc00014f680 0xc00024c0a0 0xc00014f720 0xc000649c20 0xc000649cc0 0xc000649d60 0xc00014f7c0 0xc00014f860 0xc000649e00 0xc00014f900 0xc00014f9a0 0xc00014fa40] map[104:0xc000324320 118:0xc00014f9a0] [] -1 0 0xc000665a70 true <nil> []}
I0303 10:05:16.896435       1 cmd.go:39] (*prune.PruneOptions)(0xc00086aa80)({
 MaxEligibleRevision: (int) 3,
 ProtectedRevisions: ([]int) (len=3 cap=3) {
  (int) 1,
  (int) 2,
  (int) 3
 },
 ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
 StaticPodName: (string) (len=8) "etcd-pod"
})
[cloud-user@ocp-psi-executor bug_1794050]$ oc logs -f revision-pruner-3-akalenyu-96t9h-master-0 -n openshift-kube-apiserver
I0303 10:07:39.186186       1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc0005df680 protected-revisions:0xc0005df720 resource-dir:0xc0005df7c0 static-pod-name:0xc0005df860 v:0xc000389d60] [0xc000389d60 0xc0005df680 0xc0005df720 0xc0005df7c0 0xc0005df860] [] map[add-dir-header:0xc000389720 alsologtostderr:0xc0003897c0 help:0xc0006040a0 log-backtrace-at:0xc000389860 log-dir:0xc000389900 log-file:0xc0003899a0 log-file-max-size:0xc000389a40 log-flush-frequency:0xc0000d4b40 logtostderr:0xc000389ae0 max-eligible-revision:0xc0005df680 protected-revisions:0xc0005df720 resource-dir:0xc0005df7c0 skip-headers:0xc000389b80 skip-log-headers:0xc000389c20 static-pod-name:0xc0005df860 stderrthreshold:0xc000389cc0 v:0xc000389d60 vmodule:0xc000389e00] [0xc0005df680 0xc0005df720 0xc0005df7c0 0xc0005df860 0xc000389720 0xc0003897c0 0xc000389860 0xc000389900 0xc0003899a0 0xc000389a40 0xc0000d4b40 0xc000389ae0 0xc000389b80 0xc000389c20 0xc000389cc0 0xc000389d60 0xc000389e00 0xc0006040a0] [0xc000389720 0xc0003897c0 0xc0006040a0 0xc000389860 0xc000389900 0xc0003899a0 0xc000389a40 0xc0000d4b40 0xc000389ae0 0xc0005df680 0xc0005df720 0xc0005df7c0 0xc000389b80 0xc000389c20 0xc0005df860 0xc000389cc0 0xc000389d60 0xc000389e00] map[104:0xc0006040a0 118:0xc000389d60] [] -1 0 0xc00079d290 true <nil> []}
I0303 10:07:39.186411       1 cmd.go:39] (*prune.PruneOptions)(0xc0004e5880)({
 MaxEligibleRevision: (int) 4,
 ProtectedRevisions: ([]int) (len=4 cap=4) {
  (int) 1,
  (int) 2,
  (int) 3,
  (int) 4
 },
 ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
 StaticPodName: (string) (len=18) "kube-apiserver-pod"
})
[cloud-user@ocp-psi-executor bug_1794050]$ oc logs -f revision-pruner-5-akalenyu-96t9h-master-1 -n openshift-kube-scheduler
I0303 10:12:05.481067       1 cmd.go:38] &{<nil> true {false} prune true map[max-eligible-revision:0xc00056a640 protected-revisions:0xc00056a6e0 resource-dir:0xc00056a780 static-pod-name:0xc00056a820 v:0xc000715d60] [0xc000715d60 0xc00056a640 0xc00056a6e0 0xc00056a780 0xc00056a820] [] map[add-dir-header:0xc000715720 alsologtostderr:0xc0007157c0 help:0xc00056a8c0 log-backtrace-at:0xc000715860 log-dir:0xc000715900 log-file:0xc0007159a0 log-file-max-size:0xc000715a40 log-flush-frequency:0xc0000eabe0 logtostderr:0xc000715ae0 max-eligible-revision:0xc00056a640 protected-revisions:0xc00056a6e0 resource-dir:0xc00056a780 skip-headers:0xc000715b80 skip-log-headers:0xc000715c20 static-pod-name:0xc00056a820 stderrthreshold:0xc000715cc0 v:0xc000715d60 vmodule:0xc000715e00] [0xc00056a640 0xc00056a6e0 0xc00056a780 0xc00056a820 0xc000715720 0xc0007157c0 0xc000715860 0xc000715900 0xc0007159a0 0xc000715a40 0xc0000eabe0 0xc000715ae0 0xc000715b80 0xc000715c20 0xc000715cc0 0xc000715d60 0xc000715e00 0xc00056a8c0] [0xc000715720 0xc0007157c0 0xc00056a8c0 0xc000715860 0xc000715900 0xc0007159a0 0xc000715a40 0xc0000eabe0 0xc000715ae0 0xc00056a640 0xc00056a6e0 0xc00056a780 0xc000715b80 0xc000715c20 0xc00056a820 0xc000715cc0 0xc000715d60 0xc000715e00] map[104:0xc00056a8c0 118:0xc000715d60] [] -1 0 0xc0005fabd0 true <nil> []}
I0303 10:12:05.481253       1 cmd.go:39] (*prune.PruneOptions)(0xc00040e600)({
 MaxEligibleRevision: (int) 5,
 ProtectedRevisions: ([]int) (len=5 cap=5) {
  (int) 1,
  (int) 2,
  (int) 3,
  (int) 4,
  (int) 5
 },
 ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
 StaticPodName: (string) (len=18) "kube-scheduler-pod"
})
[cloud-user@ocp-psi-executor bug_1794050]$ oc describe pod revision-pruner-5-akalenyu-96t9h-master-1 -n openshift-kube-scheduler
Name:                 revision-pruner-5-akalenyu-96t9h-master-1
Namespace:            openshift-kube-scheduler
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 akalenyu-96t9h-master-1/192.168.0.13
Start Time:           Tue, 03 Mar 2020 05:12:00 -0500
Labels:               app=pruner
Annotations:          k8s.v1.cni.cncf.io/networks-status: 
Status:               Succeeded
IP:                   10.129.0.33
IPs:
  IP:  10.129.0.33
Containers:
  pruner:
    Container ID:  cri-o://a6192cb463d2f996b52abd40131f62f08b9b2567b94f32b8a250b6302450de84
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75e8ff11b60c71089972c6bc0bd01d619a4fd8e011e213c9855b76fce0c11e7d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75e8ff11b60c71089972c6bc0bd01d619a4fd8e011e213c9855b76fce0c11e7d
    Port:          <none>
    Host Port:     <none>
    Command:
      cluster-kube-scheduler-operator
      prune
    Args:
      -v=4
      --max-eligible-revision=5
      --protected-revisions=1,2,3,4,5
      --resource-dir=/etc/kubernetes/static-pod-resources
      --static-pod-name=kube-scheduler-pod
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Tue, 03 Mar 2020 05:12:05 -0500
      Finished:     Tue, 03 Mar 2020 05:12:05 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     150m
      memory:  100M
    Requests:
      cpu:        150m
      memory:     100M
    Environment:  <none>
    Mounts:
      /etc/kubernetes/ from kubelet-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from installer-sa-token-mk8qd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/
    HostPathType:  
  installer-sa-token-mk8qd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  installer-sa-token-mk8qd
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     
Events:
  Type    Reason   Age   From                              Message
  ----    ------   ----  ----                              -------
  Normal  Pulled   144m  kubelet, akalenyu-96t9h-master-1  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:75e8ff11b60c71089972c6bc0bd01d619a4fd8e011e213c9855b76fce0c11e7d" already present on machine
  Normal  Created  144m  kubelet, akalenyu-96t9h-master-1  Created container pruner
  Normal  Started  144m  kubelet, akalenyu-96t9h-master-1  Started container pruner

Comment 2 Peter Hunt 2020-03-03 19:32:20 UTC
This is a known issue, and there's a fix in progress!

Comment 3 Peter Hunt 2020-03-11 13:17:43 UTC
If this is the result of what I suspect, it is fixed in the latest CRI-O.

My guess is that CRI-O was incorrectly reporting an OOM kill of conmon, which could result in CRI-O spoofing the OOM kill of a container. I have since dropped the code that did this from CRI-O (https://github.com/cri-o/cri-o/commit/56ab421fda8380f2aa1dcb18265c7aaa588c8a9e).

This should prevent containers from being falsely reported as OOM killed.

This version of CRI-O has recently been tagged into the 4.4 nightlies.
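
To illustrate the idea (a minimal sketch, not the actual CRI-O change): rather than inferring an OOM kill from the fate of a monitor process such as conmon, a runtime should only report a container as OOMKilled when the container's own memory cgroup has recorded an oom_kill event. The file path and the helper name containerWasOOMKilled below are illustrative assumptions, not CRI-O code.

// oomcheck.go: sketch of checking the container's own cgroup for a recorded
// oom_kill event instead of attributing a monitor process's OOM to the container.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// containerWasOOMKilled parses a cgroup memory events file. For cgroup v2 this
// is <cgroup>/memory.events; on cgroup v1 with recent kernels an "oom_kill"
// counter appears in <cgroup>/memory.oom_control. Both use "key value" lines.
func containerWasOOMKilled(eventsPath string) (bool, error) {
	f, err := os.Open(eventsPath)
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 || fields[0] != "oom_kill" {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return false, err
		}
		// Only a non-zero counter in the container's own cgroup means the
		// kernel actually OOM-killed a process inside this container.
		return n > 0, nil
	}
	return false, scanner.Err()
}

func main() {
	// Hypothetical path for illustration; real container cgroup paths depend
	// on the cgroup driver and runtime layout.
	path := "/sys/fs/cgroup/kubepods.slice/container-scope/memory.events"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	killed, err := containerWasOOMKilled(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("container oom_kill recorded:", killed)
}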

Comment 4 Peter Hunt 2020-03-11 13:34:44 UTC
*** Bug 1810636 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2020-05-13 22:00:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581