Bug 2064837

Summary: cluster-cloud-controller-manager unable to start and in CrashLoopBackOff during cluster upgrade from OCP 4.8.x to OCP 4.9.21
Product: OpenShift Container Platform Reporter: Nirupma Kashyap <nkashyap>
Component: Cloud ComputeAssignee: dmoiseev
Cloud Compute sub component: Cloud Controller Manager QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: dmoiseev, wking
Version: 4.9   
Target Milestone: ---   
Target Release: 4.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-18 13:20:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2037680    
Bug Blocks:    

Description Nirupma Kashyap 2022-03-16 17:37:12 UTC
Description of problem:
Upgrade to OCP 4.9.21 halted because the "cluster-cloud-controller-manager" operator is unable to start and is in CrashLoopBackOff.


Version-Release number of selected component (if applicable):
4.9.21

How reproducible:


Steps to Reproduce:
1. Try to upgrade a cluster from 4.8.24 to 4.9.21

Actual results:
cluster-cloud-controller-manager is in CrashLoopBackOff and throwing the error below:
~~~
2022-03-15T20:14:16.859746210Z I0315 20:14:16.858913       1 deleg.go:130] CCMOperator/controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"=":8080"
2022-03-15T20:14:16.859746210Z E0315 20:14:16.859117       1 deleg.go:144] CCMOperator/controller-runtime/metrics "msg"="metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts" "error"="error listening on :8080: listen tcp :8080: bind: address already in use"  
2022-03-15T20:14:16.859746210Z E0315 20:14:16.859134       1 deleg.go:144] CCMOperator/setup "msg"="unable to start manager" "error"="error listening on :8080: listen tcp :8080: bind: address already in use"  
~~~


Expected results:
The cluster-cloud-controller-manager pod should start without errors.

Additional info:
The OCP cluster is stuck in the middle of the upgrade from 4.8.24 to 4.9.21 because cluster-cloud-controller-manager is in CrashLoopBackOff. Upon checking, we noticed that the pod is scheduled on the master2 node, where port 8080 is already in use by kube-apiserver.
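The failure mode in the log above can be reproduced in isolation: because the operator pod runs with hostNetwork: true, its metrics server competes with every other host process for its port, and a second bind on an occupied address fails. A minimal Go sketch of the conflict (the helper name bindTwice and the demo port 18080 are illustrative, not taken from the operator's code):

```go
package main

import (
	"fmt"
	"net"
)

// bindTwice binds addr, then tries to bind it again while the first
// listener is still open, reproducing the "bind: address already in use"
// error seen in the operator log.
func bindTwice(addr string) error {
	first, err := net.Listen("tcp", addr)
	if err != nil {
		return err // could not acquire the port even once
	}
	defer first.Close()

	second, err := net.Listen("tcp", addr)
	if err == nil {
		second.Close()
	}
	return err
}

func main() {
	// 18080 is an arbitrary demo port; on the affected master node the
	// conflicting port was :8080, already held by kube-apiserver.
	fmt.Println(bindTwice("127.0.0.1:18080"))
}
```

The fix that was backported moves the operator's listeners off the contended port, as visible in the verified pod spec later in this bug (`--metrics-bind-address=:9258` and health endpoints on 9259/9260).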

Comment 5 Nirupma Kashyap 2022-05-09 08:52:43 UTC
Hi team,

Can we have an update on when this fix will be backported to 4.9?

Regards,
Nirupma

Comment 6 Joel Speed 2022-05-09 09:21:22 UTC
This is in the queue for QE to test; they should get to it soon.

Comment 10 Milind Yadav 2022-05-12 07:41:09 UTC
Upgraded cluster from 4.8.39 to 4.9.0-0.nightly-2022-05-11-100812

...
05-12 12:40:35.091  clusteroperators: 
05-12 12:40:35.091   NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
05-12 12:40:35.091  authentication                             4.9.0-0.nightly-2022-05-11-100812   True        False         False      39m     
05-12 12:40:35.091  baremetal                                  4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  cloud-controller-manager                   4.9.0-0.nightly-2022-05-11-100812   True        False         False      60m     
05-12 12:40:35.091  cloud-credential                           4.9.0-0.nightly-2022-05-11-100812   True        False         False      135m    
05-12 12:40:35.091  cluster-autoscaler                         4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  config-operator                            4.9.0-0.nightly-2022-05-11-100812   True        False         False      127m    
05-12 12:40:35.091  console                                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      38m     
05-12 12:40:35.091  csi-snapshot-controller                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  dns                                        4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  etcd                                       4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  image-registry                             4.9.0-0.nightly-2022-05-11-100812   True        False         False      120m    
05-12 12:40:35.091  ingress                                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      119m    
05-12 12:40:35.091  insights                                   4.9.0-0.nightly-2022-05-11-100812   True        False         False      120m    
05-12 12:40:35.091  kube-apiserver                             4.9.0-0.nightly-2022-05-11-100812   True        False         False      124m    
05-12 12:40:35.091  kube-controller-manager                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      124m    
...

No backoff error:

oc get pod/cluster-cloud-controller-manager-operator-65b77dc777-nkqcs -n openshift-cloud-controller-manager-operator -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-05-12T07:27:26Z"
  generateName: cluster-cloud-controller-manager-operator-65b77dc777-
  labels:
    k8s-app: cloud-manager-operator
    pod-template-hash: 65b77dc777
  name: cluster-cloud-controller-manager-operator-65b77dc777-nkqcs
  namespace: openshift-cloud-controller-manager-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: cluster-cloud-controller-manager-operator-65b77dc777
    uid: 0ff1a37f-f2b1-49c1-9632-2fbb5f88f23b
  resourceVersion: "105115"
  uid: 624ce15c-2d1a-4d4b-b7b6-915f73c7dd89
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      else
        URL_ONLY_KUBECONFIG=/etc/kubernetes/kubeconfig
      fi
      exec /cluster-controller-manager-operator \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager-operator \
      "--images-json=/etc/cloud-controller-manager-config/images.json" \
      --metrics-bind-address=:9258 \
      --health-addr=127.0.0.1:9259
    env:
    - name: RELEASE_VERSION
      value: 4.9.0-0.nightly-2022-05-11-100812
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imagePullPolicy: IfNotPresent
    name: cluster-cloud-controller-manager
    ports:
    - containerPort: 9258
      hostPort: 9258
      name: metrics
      protocol: TCP
    - containerPort: 9259
      hostPort: 9259
      name: healthz
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/cloud-controller-manager-config/
      name: images
    - mountPath: /etc/kubernetes
      name: host-etc-kube
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-9j79z
      readOnly: true
  - command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      else
        URL_ONLY_KUBECONFIG=/etc/kubernetes/kubeconfig
      fi
      exec /config-sync-controllers \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager-operator \
      --health-addr=127.0.0.1:9260
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imagePullPolicy: IfNotPresent
    name: config-sync-controllers
    ports:
    - containerPort: 9260
      hostPort: 9260
      name: healthz
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 25Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: host-etc-kube
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-9j79z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  imagePullSecrets:
  - name: cluster-cloud-controller-manager-dockercfg-ms79t
  nodeName: ip-10-0-52-239.us-east-2.compute.internal
  nodeSelector:
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cluster-cloud-controller-manager
  serviceAccountName: cluster-cloud-controller-manager
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 420
      name: cloud-controller-manager-images
    name: images
  - hostPath:
      path: /etc/kubernetes
      type: Directory
    name: host-etc-kube
  - name: kube-api-access-9j79z
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:26Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:26Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://8934c13b3c99a255ef7bef9bd3a1f91b1efc07bbdefcf030c899994d6575d307
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    lastState: {}
    name: cluster-cloud-controller-manager
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-05-12T07:27:27Z"
  - containerID: cri-o://9dfc15a88d27619ef47aba70ea06f15e4d16f288da09e873b1f36ce5eae8f845
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    lastState: {}
    name: config-sync-controllers
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-05-12T07:27:27Z"
  hostIP: 10.0.52.239
  phase: Running
  podIP: 10.0.52.239
  podIPs:
  - ip: 10.0.52.239
  qosClass: Burstable
  startTime: "2022-05-12T07:27:26Z"

Moving to VERIFIED based on these results.

Comment 12 errata-xmlrpc 2022-05-18 13:20:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.33 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2206