Bug 1938580

Summary: machine-api-operator in CrashLoopBackOff state
Product: OpenShift Container Platform
Component: Cloud Compute
Sub-component: Other Providers
Version: 4.8
Hardware: ppc64le
OS: Linux
Reporter: pdsilva
Assignee: Alberto <agarcial>
QA Contact: sunzhaohua <zhsun>
CC: lmcfadde
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: ---
Last Closed: 2021-03-15 09:54:32 UTC
Type: Bug

Description pdsilva 2021-03-14 17:51:02 UTC
Description of problem:
The machine-api-operator pod is in a CrashLoopBackOff state after installation on Power.
The pod description shows an OOMKilled error.
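(For context: the OOMKilled reason surfaces in each container's `lastState.terminated` field of the pod status. A minimal sketch, not part of the original report, that scans pod JSON as returned by `oc get pod <name> -o json` for OOM-killed containers — the sample data below is illustrative, not captured from this cluster:)

```python
def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated", {})
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names

# Illustrative sample mirroring the pod described later in this report:
# kube-rbac-proxy was OOMKilled (exit 137), machine-api-operator exited
# with a generic Error (exit 2).
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "kube-rbac-proxy",
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "machine-api-operator",
             "lastState": {"terminated": {"reason": "Error", "exitCode": 2}}},
        ]
    }
}

print(oom_killed_containers(sample_pod))  # ['kube-rbac-proxy']
```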

# oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   4h18m   v1.20.0+e1bc274
master-1   Ready    master   4h18m   v1.20.0+e1bc274
master-2   Ready    master   4h18m   v1.20.0+e1bc274
worker-0   Ready    worker   4h4m    v1.20.0+e1bc274
worker-1   Ready    worker   4h4m    v1.20.0+e1bc274


# oc get co
NAME                                       VERSION                                     AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      42m
baremetal                                  4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
cloud-credential                           4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      73m
cluster-autoscaler                         4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
config-operator                            4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m
console                                    4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      49m
csi-snapshot-controller                    4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m
dns                                        4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
etcd                                       4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
image-registry                             4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      41m
ingress                                    4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      53m
insights                                   4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      60m
kube-apiserver                             4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      64m
kube-controller-manager                    4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      64m
kube-scheduler                             4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      63m
kube-storage-version-migrator              4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      52m
machine-api                                4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
machine-approver                           4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m
machine-config                             4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      64m
marketplace                                4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
monitoring                                 4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      52m
network                                    4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m
node-tuning                                4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m
openshift-apiserver                        4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      60m
openshift-controller-manager               4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      40m
openshift-samples                          4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      60m
operator-lifecycle-manager                 4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      65m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      61m
service-ca                                 4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m
storage                                    4.8.0-0.nightly-ppc64le-2021-03-14-051438   True        False         False      66m

# oc get pods -A | grep  openshift-machine-api
openshift-machine-api                              cluster-autoscaler-operator-689586d58c-jbp6l              2/2     Running            1          4h54m
openshift-machine-api                              cluster-baremetal-operator-8b948876-wcprh                 2/2     Running            0          4h54m
openshift-machine-api                              machine-api-operator-664cfb7d45-fmbjp                     1/2     CrashLoopBackOff   21         27m


# oc describe pod machine-api-operator-664cfb7d45-4v299 -n openshift-machine-api
Name:                 machine-api-operator-664cfb7d45-4v299
Namespace:            openshift-machine-api
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 master-2/192.168.26.126
Start Time:           Sun, 14 Mar 2021 09:05:12 -0400
Labels:               k8s-app=machine-api-operator
                      pod-template-hash=664cfb7d45
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "10.129.0.68"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "",
                            "interface": "eth0",
                            "ips": [
                                "10.129.0.68"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: restricted
Status:               Running
IP:                   10.129.0.68
IPs:
  IP:           10.129.0.68
Controlled By:  ReplicaSet/machine-api-operator-664cfb7d45
Containers:
  kube-rbac-proxy:
    Container ID:  cri-o://600cd3f48e6378c622f7e0b5aba926b866754b8e4967369468e51bc2fba2f4ad
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e13f54418ac0779b58b73b3dc392609ac7731d47a1ca7cf493446eaef10024ed
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e13f54418ac0779b58b73b3dc392609ac7731d47a1ca7cf493446eaef10024ed
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://localhost:8080/
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --config-file=/etc/kube-rbac-proxy/config-file.yaml
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      --logtostderr=true
      --v=3
    State:          Running
      Started:      Sun, 14 Mar 2021 10:10:16 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 14 Mar 2021 10:06:21 -0400
      Finished:     Sun, 14 Mar 2021 10:07:28 -0400
    Ready:          True
    Restart Count:  21
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from config (rw)
      /etc/tls/private from machine-api-operator-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from machine-api-operator-token-qgrz6 (ro)
  machine-api-operator:
    Container ID:  cri-o://fa3ab32a2dd8f40f4c54575b985a68730daa469fa8697f113b1adb255df95cb2
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed79a307757581cbaab976b95dde902c4724a7eb4ef7fee7991cf1b63205fe0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed79a307757581cbaab976b95dde902c4724a7eb4ef7fee7991cf1b63205fe0
    Port:          <none>
    Host Port:     <none>
    Command:
      /machine-api-operator
    Args:
      start
      --images-json=/etc/machine-api-operator-config/images/images.json
      --alsologtostderr
      --v=3
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sun, 14 Mar 2021 10:07:29 -0400
      Finished:     Sun, 14 Mar 2021 10:07:30 -0400
    Ready:          False
    Restart Count:  20
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      RELEASE_VERSION:      4.8.0-0.nightly-ppc64le-2021-03-14-051438
      COMPONENT_NAMESPACE:  openshift-machine-api (v1:metadata.namespace)
      METRICS_PORT:         8080
    Mounts:
      /etc/machine-api-operator-config/images from images (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from machine-api-operator-token-qgrz6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-rbac-proxy
    Optional:  false
  images:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      machine-api-operator-images
    Optional:  false
  machine-api-operator-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-api-operator-tls
    Optional:    false
  machine-api-operator-token-qgrz6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-api-operator-token-qgrz6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  73m                    default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  67m                    default-scheduler  0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Warning  FailedScheduling  67m                    default-scheduler  0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   Scheduled         66m                    default-scheduler  Successfully assigned openshift-machine-api/machine-api-operator-664cfb7d45-4v299 to master-2
  Warning  FailedScheduling  73m                    default-scheduler  no nodes available to schedule pods
  Warning  FailedMount       66m                    kubelet            MountVolume.SetUp failed for volume "machine-api-operator-tls" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount       66m (x6 over 66m)      kubelet            MountVolume.SetUp failed for volume "machine-api-operator-tls" : secret "machine-api-operator-tls" not found
  Normal   AddedInterface    65m                    multus             Add eth0 [10.129.0.10/23]
  Normal   Pulling           65m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed79a307757581cbaab976b95dde902c4724a7eb4ef7fee7991cf1b63205fe0"
  Normal   Pulled            65m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed79a307757581cbaab976b95dde902c4724a7eb4ef7fee7991cf1b63205fe0" in 13.894041473s
  Normal   AddedInterface    65m                    multus             Add eth0 [10.129.0.14/23]
  Normal   Started           65m (x2 over 65m)      kubelet            Started container kube-rbac-proxy
  Normal   Created           65m (x2 over 65m)      kubelet            Created container kube-rbac-proxy
  Normal   Pulled            65m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed79a307757581cbaab976b95dde902c4724a7eb4ef7fee7991cf1b63205fe0" already present on machine
  Normal   Created           64m (x2 over 65m)      kubelet            Created container machine-api-operator
  Normal   Started           64m (x2 over 65m)      kubelet            Started container machine-api-operator
  Normal   SandboxChanged    64m (x3 over 65m)      kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Killing           64m (x2 over 65m)      kubelet            Stopping container machine-api-operator
  Normal   AddedInterface    64m                    multus             Add eth0 [10.129.0.15/23]
  Normal   AddedInterface    64m                    multus             Add eth0 [10.129.0.20/23]
  Normal   AddedInterface    64m                    multus             Add eth0 [10.129.0.21/23]
  Normal   AddedInterface    63m                    multus             Add eth0 [10.129.0.25/23]
  Normal   AddedInterface    61m                    multus             Add eth0 [10.129.0.27/23]
  Normal   AddedInterface    58m                    multus             Add eth0 [10.129.0.33/23]
  Normal   AddedInterface    54m                    multus             Add eth0 [10.129.0.40/23]
  Normal   AddedInterface    49m                    multus             Add eth0 [10.129.0.47/23]
  Normal   AddedInterface    43m                    multus             Add eth0 [10.129.0.49/23]
  Normal   AddedInterface    38m                    multus             Add eth0 [10.129.0.53/23]
  Normal   AddedInterface    38m                    multus             Add eth0 [10.129.0.54/23]
  Normal   AddedInterface    37m                    multus             Add eth0 [10.129.0.57/23]
  Normal   AddedInterface    36m                    multus             Add eth0 [10.129.0.58/23]
  Normal   AddedInterface    33m                    multus             Add eth0 [10.129.0.59/23]
  Normal   AddedInterface    30m                    multus             Add eth0 [10.129.0.60/23]
  Normal   AddedInterface    25m                    multus             Add eth0 [10.129.0.61/23]
  Normal   AddedInterface    20m                    multus             Add eth0 [10.129.0.62/23]
  Warning  BackOff           16m (x162 over 64m)    kubelet            Back-off restarting failed container
  Normal   AddedInterface    14m                    multus             Add eth0 [10.129.0.63/23]
  Normal   AddedInterface    9m44s                  multus             Add eth0 [10.129.0.64/23]
  Normal   AddedInterface    9m23s                  multus             Add eth0 [10.129.0.65/23]
  Normal   AddedInterface    8m37s                  multus             Add eth0 [10.129.0.66/23]
  Normal   AddedInterface    7m5s                   multus             Add eth0 [10.129.0.67/23]
  Warning  BackOff           6m37s (x207 over 64m)  kubelet            Back-off restarting failed container
  Normal   AddedInterface    4m22s                  multus             Add eth0 [10.129.0.68/23]
  Normal   Pulled            99s (x25 over 65m)     kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e13f54418ac0779b58b73b3dc392609ac7731d47a1ca7cf493446eaef10024ed" already present on machine
  
  # oc logs  machine-api-operator-664cfb7d45-fmbjp -n openshift-machine-api machine-api-operator
I0314 17:27:51.040810       1 start.go:62] Version: 4.8.0-202103140432.p0-dirty
I0314 17:27:51.137782       1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-api/machine-api-operator...


How reproducible:
Always

Steps to Reproduce:
1. Install the nightly build of 4.8 on Power

Actual results:
machine-api-operator pod is in CrashLoopBackOff state.

Additional info:
# oc version
Client Version: 4.8.0-0.nightly-ppc64le-2021-03-14-051438
Server Version: 4.8.0-0.nightly-ppc64le-2021-03-14-051438
Kubernetes Version: v1.20.0+e1bc274

Comment 1 Joel Speed 2021-03-15 09:54:32 UTC
We just added resource limits to pods in this build and have since been told this is not appropriate for OpenShift workloads.
We are reverting that change, which will resolve this issue.
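(For reference, the limits in question are visible in the pod description above: a 50Mi memory limit on both containers. A sketch of the kind of resources stanza being reverted — illustrative, not the actual patch:)

```yaml
# Illustrative only: resources stanza matching the pod description above.
# The limits block is what is being removed; requests are unaffected.
resources:
  requests:
    cpu: 10m
    memory: 50Mi
  limits:          # 50Mi was too low for kube-rbac-proxy, causing the OOMKill
    cpu: 100m
    memory: 50Mi
```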

*** This bug has been marked as a duplicate of bug 1938493 ***