Bug 2064837 - cluster-cloud-controller-manager unable to start and in crash-loop backoff; cluster upgrade from OCP 4.8.x to OCP 4.9.21 halted
Summary: cluster-cloud-controller-manager unable to start and in crash-loop backoff...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: dmoiseev
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On: 2037680
Blocks:
 
Reported: 2022-03-16 17:37 UTC by Nirupma Kashyap
Modified: 2022-05-18 13:20 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-18 13:20:29 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-cloud-controller-manager-operator pull 180 0 None open [release-4.9] Bug 2064837: Fix CCCMO metric ports configuration 2022-03-18 15:37:40 UTC
Red Hat Product Errata RHBA-2022:2206 0 None None None 2022-05-18 13:20:57 UTC

Description Nirupma Kashyap 2022-03-16 17:37:12 UTC
Description of problem:
Upgrade to OCP 4.9.21 halted because the "cluster-cloud-controller-manager" Operator is unable to start and is in crash-loop backoff.


Version-Release number of selected component (if applicable):
4.9.21

How reproducible:


Steps to Reproduce:
1. Upgrade the cluster from 4.8.24 to 4.9.21

Actual results:
cluster-cloud-controller-manager goes into crash-loop backoff, throwing the error below:
~~~
2022-03-15T20:14:16.859746210Z I0315 20:14:16.858913       1 deleg.go:130] CCMOperator/controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"=":8080"
2022-03-15T20:14:16.859746210Z E0315 20:14:16.859117       1 deleg.go:144] CCMOperator/controller-runtime/metrics "msg"="metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts" "error"="error listening on :8080: listen tcp :8080: bind: address already in use"  
2022-03-15T20:14:16.859746210Z E0315 20:14:16.859134       1 deleg.go:144] CCMOperator/setup "msg"="unable to start manager" "error"="error listening on :8080: listen tcp :8080: bind: address already in use"  
~~~


Expected results:
The cluster-cloud-controller-manager pod should start without errors.

Additional info:
The OCP cluster is stuck in the middle of an upgrade from 4.8.24 to 4.9.21 because cluster-cloud-controller-manager is in crash-loop backoff. On inspection, we noticed the pod is scheduled on the master2 node, where port 8080 is already in use by the kube-apiserver.

Comment 5 Nirupma Kashyap 2022-05-09 08:52:43 UTC
Hi team,

Could we have an update on when this fix will be backported to 4.9?

Regards,
Nirupma

Comment 6 Joel Speed 2022-05-09 09:21:22 UTC
This is in the queue for QE to test; they should get to it soon.

Comment 10 Milind Yadav 2022-05-12 07:41:09 UTC
Upgraded the cluster from 4.8.39 to 4.9.0-0.nightly-2022-05-11-100812

[...]
05-12 12:40:35.091  clusteroperators: 
05-12 12:40:35.091   NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
05-12 12:40:35.091  authentication                             4.9.0-0.nightly-2022-05-11-100812   True        False         False      39m     
05-12 12:40:35.091  baremetal                                  4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  cloud-controller-manager                   4.9.0-0.nightly-2022-05-11-100812   True        False         False      60m     
05-12 12:40:35.091  cloud-credential                           4.9.0-0.nightly-2022-05-11-100812   True        False         False      135m    
05-12 12:40:35.091  cluster-autoscaler                         4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  config-operator                            4.9.0-0.nightly-2022-05-11-100812   True        False         False      127m    
05-12 12:40:35.091  console                                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      38m     
05-12 12:40:35.091  csi-snapshot-controller                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  dns                                        4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  etcd                                       4.9.0-0.nightly-2022-05-11-100812   True        False         False      126m    
05-12 12:40:35.091  image-registry                             4.9.0-0.nightly-2022-05-11-100812   True        False         False      120m    
05-12 12:40:35.091  ingress                                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      119m    
05-12 12:40:35.091  insights                                   4.9.0-0.nightly-2022-05-11-100812   True        False         False      120m    
05-12 12:40:35.091  kube-apiserver                             4.9.0-0.nightly-2022-05-11-100812   True        False         False      124m    
05-12 12:40:35.091  kube-controller-manager                    4.9.0-0.nightly-2022-05-11-100812   True        False         False      124m    
[...]

No crash-loop backoff error:

oc get pod/cluster-cloud-controller-manager-operator-65b77dc777-nkqcs -n openshift-cloud-controller-manager-operator -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-05-12T07:27:26Z"
  generateName: cluster-cloud-controller-manager-operator-65b77dc777-
  labels:
    k8s-app: cloud-manager-operator
    pod-template-hash: 65b77dc777
  name: cluster-cloud-controller-manager-operator-65b77dc777-nkqcs
  namespace: openshift-cloud-controller-manager-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: cluster-cloud-controller-manager-operator-65b77dc777
    uid: 0ff1a37f-f2b1-49c1-9632-2fbb5f88f23b
  resourceVersion: "105115"
  uid: 624ce15c-2d1a-4d4b-b7b6-915f73c7dd89
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      else
        URL_ONLY_KUBECONFIG=/etc/kubernetes/kubeconfig
      fi
      exec /cluster-controller-manager-operator \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager-operator \
      "--images-json=/etc/cloud-controller-manager-config/images.json" \
      --metrics-bind-address=:9258 \
      --health-addr=127.0.0.1:9259
    env:
    - name: RELEASE_VERSION
      value: 4.9.0-0.nightly-2022-05-11-100812
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imagePullPolicy: IfNotPresent
    name: cluster-cloud-controller-manager
    ports:
    - containerPort: 9258
      hostPort: 9258
      name: metrics
      protocol: TCP
    - containerPort: 9259
      hostPort: 9259
      name: healthz
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/cloud-controller-manager-config/
      name: images
    - mountPath: /etc/kubernetes
      name: host-etc-kube
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-9j79z
      readOnly: true
  - command:
    - /bin/bash
    - -c
    - |
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      else
        URL_ONLY_KUBECONFIG=/etc/kubernetes/kubeconfig
      fi
      exec /config-sync-controllers \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager-operator \
      --health-addr=127.0.0.1:9260
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imagePullPolicy: IfNotPresent
    name: config-sync-controllers
    ports:
    - containerPort: 9260
      hostPort: 9260
      name: healthz
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 25Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: host-etc-kube
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-9j79z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  imagePullSecrets:
  - name: cluster-cloud-controller-manager-dockercfg-ms79t
  nodeName: ip-10-0-52-239.us-east-2.compute.internal
  nodeSelector:
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cluster-cloud-controller-manager
  serviceAccountName: cluster-cloud-controller-manager
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 420
      name: cloud-controller-manager-images
    name: images
  - hostPath:
      path: /etc/kubernetes
      type: Directory
    name: host-etc-kube
  - name: kube-api-access-9j79z
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:26Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-05-12T07:27:26Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://8934c13b3c99a255ef7bef9bd3a1f91b1efc07bbdefcf030c899994d6575d307
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    lastState: {}
    name: cluster-cloud-controller-manager
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-05-12T07:27:27Z"
  - containerID: cri-o://9dfc15a88d27619ef47aba70ea06f15e4d16f288da09e873b1f36ce5eae8f845
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:375606eb429ffe7ef890295bf55c5122c300ad3879577629827dd9ddbdc191a9
    lastState: {}
    name: config-sync-controllers
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-05-12T07:27:27Z"
  hostIP: 10.0.52.239
  phase: Running
  podIP: 10.0.52.239
  podIPs:
  - ip: 10.0.52.239
  qosClass: Burstable
  startTime: "2022-05-12T07:27:26Z"

Moving to VERIFIED based on these results.

Comment 12 errata-xmlrpc 2022-05-18 13:20:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.33 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2206

