Bug 2028054 - Cloud controller manager operator can't get leader lease when upgrading from 4.8 up to 4.9
Summary: Cloud controller manager operator can't get leader lease when upgrading from 4.8 up to 4.9
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 2034504
 
Reported: 2021-12-01 12:01 UTC by Oleksandr.Khomenko
Modified: 2022-06-16 15:46 UTC
CC: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:31:07 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Github: openshift/cluster-cloud-controller-manager-operator pull 157 (open): Bug 2028054: Ensure old deployment is deleted before new deployment creation (last updated 2021-12-17 10:05:39 UTC)
Red Hat Product Errata: RHSA-2022:0056 (last updated 2022-03-10 16:31:20 UTC)

Description Oleksandr.Khomenko 2021-12-01 12:01:43 UTC
Description of problem:
Cloud controller manager operator can't get leader lease when upgrading from 4.8 up to 4.9

oc adm upgrade:
An upgrade is in progress. Working towards 4.9.0-0.okd-2021-11-28-035710: 142 of 744 done (19% complete), waiting on cloud-controller-manager

Version-Release number of selected component (if applicable):


How reproducible:

Start upgrading from version 4.8 up to 4.9

Link to must-gather archive:
https://drive.google.com/file/d/1JIqzEIKhDkY97eHzolzfez27ylyWnATP/view?usp=sharing

Comment 1 Damiano Donati 2021-12-10 13:04:33 UTC
Joel Speed and I had a debugging session together on this issue.

This is what we found out by looking into the must-gather:
At first we looked at the logs of the cloud-controller-manager and we observed it was failing to acquire the lease.
This led us to look at its Deployment to verify its lease options. Here we noticed that no options were set for leader election (lease locking). We also found that the version on the Deployment was still 4.8.0-0.okd-2021-11-14-052418 rather than the newer 4.9.0-0.okd-2021-11-28-035710.
We then checked the CVO logs to better understand what was happening to the Deployment during the upgrade. We noticed that the version being applied for cloud-controller-manager was still 4.8.0-0.okd-2021-11-14-052418, even after the CVO reported that it had applied the correct upgraded Deployment version.
We checked the payload for 4.9.0-0.okd-2021-11-28-035710 (registry.ci.openshift.org/origin/release@sha256:7d8356245fc3a75fe11d1832ce9fef17f3dd0f2ea6f38271319c95918416b9d9) and it's carrying the right cloud-controller-manager version (4.9.0) as expected.
We are not sure what's happening with the CVO and thus we would like to hear the opinion of the CVO team.
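
(For anyone repeating this triage on a live cluster rather than from the must-gather, a rough equivalent of the checks above, assuming oc access; on 4.8 the deployment is named cluster-cloud-controller-manager, on 4.9 it is cluster-cloud-controller-manager-operator:)

  $ oc -n openshift-cloud-controller-manager-operator get deployments
  # check which release version the deployment was rendered from, and whether any leader-election flags are set
  $ oc -n openshift-cloud-controller-manager-operator get deployment cluster-cloud-controller-manager-operator -o yaml \
      | grep -A1 RELEASE_VERSION
  $ oc -n openshift-cloud-controller-manager-operator get deployment cluster-cloud-controller-manager-operator -o yaml \
      | grep -i leader
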
Thanks.

Comment 2 Joel Speed 2021-12-10 13:09:18 UTC
Based on Damiano's comment above, as far as we can tell the CVO has not applied the deployment correctly, since we can see the older deployment still present. Could the CVO team please take a look and see if there is anything suspicious that might hint at why the deployment was not applied correctly?

Comment 3 Jack Ottofaro 2021-12-10 15:03:52 UTC
I think this is the issue but not sure of the cause:

2021-11-29T19:00:55.483985662Z E1129 19:00:55.483928       1 task.go:112] error running apply for clusteroperator "cloud-controller-manager" (143 of 744): Cluster operator cloud-controller-manager is updating versions

Here's a suspicious log that came before:

2021-11-29T19:00:32.071114112Z W1129 19:00:32.071049       1 apps.go:136] deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator is not setting any expected conditions, and is therefore in an unknown state

And I wonder about these. Did the prior version not have these resources or was something amiss on the cluster before the upgrade?

2021-11-29T19:00:30.003051400Z I1129 19:00:30.003018       1 sync_worker.go:753] Running sync for role "openshift-config/cluster-cloud-controller-manager" (125 of 744)
2021-11-29T19:00:30.042210678Z I1129 19:00:30.042030       1 rbac.go:107] Role openshift-config/cluster-cloud-controller-manager not found, creating
2021-11-29T19:00:30.124886381Z I1129 19:00:30.124790       1 sync_worker.go:765] Done syncing for role "openshift-config/cluster-cloud-controller-manager" (125 of 744)
2021-11-29T19:00:30.124886381Z I1129 19:00:30.124860       1 sync_worker.go:753] Running sync for rolebinding "openshift-config/cluster-cloud-controller-manager" (126 of 744)
2021-11-29T19:00:30.140980767Z I1129 19:00:30.140905       1 rbac.go:78] RoleBinding openshift-config/cluster-cloud-controller-manager not found, creating
2021-11-29T19:00:30.226483944Z I1129 19:00:30.226388       1 sync_worker.go:765] Done syncing for rolebinding "openshift-config/cluster-cloud-controller-manager" (126 of 744)
2021-11-29T19:00:30.226483944Z I1129 19:00:30.226441       1 sync_worker.go:753] Running sync for role "openshift-config-managed/cluster-cloud-controller-manager" (127 of 744)
2021-11-29T19:00:30.243570660Z I1129 19:00:30.243130       1 rbac.go:107] Role openshift-config-managed/cluster-cloud-controller-manager not found, creating
2021-11-29T19:00:30.341797658Z I1129 19:00:30.341731       1 sync_worker.go:765] Done syncing for role "openshift-config-managed/cluster-cloud-controller-manager" (127 of 744)
2021-11-29T19:00:30.341797658Z I1129 19:00:30.341784       1 sync_worker.go:753] Running sync for rolebinding "openshift-config-managed/cluster-cloud-controller-manager" (128 of 744)
2021-11-29T19:00:30.346762331Z I1129 19:00:30.346720       1 rbac.go:78] RoleBinding openshift-config-managed/cluster-cloud-controller-manager not found, creating
2021-11-29T19:00:30.424789309Z I1129 19:00:30.424721       1 sync_worker.go:765] Done syncing for rolebinding "openshift-config-managed/cluster-cloud-controller-manager" (128 of 744)
...
2021-11-29T19:00:31.309464223Z I1129 19:00:31.309453       1 sync_worker.go:753] Running sync for serviceaccount "openshift-cloud-controller-manager/cloud-node-manager" (138 of 744)
2021-11-29T19:00:31.451222341Z I1129 19:00:31.451157       1 core.go:89] ServiceAccount openshift-cloud-controller-manager/cloud-node-manager not found, creating
2021-11-29T19:00:31.490959962Z I1129 19:00:31.490891       1 sync_worker.go:765] Done syncing for serviceaccount "openshift-cloud-controller-manager/cloud-node-manager" (138 of 744)

Comment 4 W. Trevor King 2021-12-10 22:52:31 UTC
Poking at the comment 0 gather:

  $ tar xz --strip-components 2 <gather.tar.gz

By gather-time:

  $ yaml2json <namespaces/openshift-cloud-controller-manager-operator/apps/deployments.yaml | jq '.items[] | {generation: .metadata.generation, env: [.spec.template.spec.containers[] | (.env // [])[]], status}'
  {
    "generation": 1,
    "env": [
      {
        "name": "RELEASE_VERSION",
        "value": "4.9.0-0.okd-2021-11-28-035710"
      }
    ],
    "status": {
      "availableReplicas": 1,
      "conditions": [
        {
          "lastTransitionTime": "2021-11-29T19:00:44Z",
          "lastUpdateTime": "2021-11-29T19:00:44Z",
          "message": "Deployment has minimum availability.",
          "reason": "MinimumReplicasAvailable",
          "status": "True",
          "type": "Available"
        },
        {
          "lastTransitionTime": "2021-11-29T19:00:32Z",
          "lastUpdateTime": "2021-11-29T19:00:44Z",
          "message": "ReplicaSet \"cluster-cloud-controller-manager-operator-65fc66598\" has successfully progressed.",
          "reason": "NewReplicaSetAvailable",
          "status": "True",
          "type": "Progressing"
        }
      ],
      "observedGeneration": 1,
      "readyReplicas": 1,
      "replicas": 1,
      "updatedReplicas": 1
    }
  }

So the deployment controller is caught up (observedGeneration == generation), the deployment is happy (.status.conditions), and the release version is 4.9.  As you point out, it's failing to get the leader lease:

  $ tail -n1 namespaces/openshift-cloud-controller-manager-operator/pods/cluster-cloud-controller-manager-operator-65fc66598-dlttr/cluster-cloud-controller-manager/cluster-cloud-controller-manager/logs/current.log 
  2021-11-29T19:00:45.465242225Z I1129 19:00:45.465100       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-leader...

Checking the lease:

  $ yaml2json <namespaces/openshift-cloud-controller-manager-operator/core/configmaps.yaml | jq '.items[] | select(.metadata.name == "cluster-cloud-controller-manager-leader").metadata.annotations["control-plane.alpha.kubernetes.io/leader"] | fromjson'
  {
    "holderIdentity": "cluster-cloud-controller-manager-fd8df4bbb-2c8qh_0932a044-1def-404e-9db0-b7b3d22be98d",
    "leaseDurationSeconds": 120,
    "acquireTime": "2021-11-29T16:20:09Z",
    "renewTime": "2021-11-30T16:27:03Z",
    "leaderTransitions": 14
  }

Ah.

  $ yaml2json <namespaces/openshift-cloud-controller-manager-operator/core/pods.yaml | jq -r '.items[].metadata.name'
  cluster-cloud-controller-manager-fd8df4bbb-2c8qh
  cluster-cloud-controller-manager-operator-65fc66598-dlttr

I'm not clear on the expected semantics for those, but you seem to have both going in parallel and competing for the same lease.
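
For reference, the same check against a live cluster (a sketch, assuming oc access; the configmap-based lock and annotation key are the ones shown above):

  # who currently holds the lock, and which pods are competing for it
  $ oc -n openshift-cloud-controller-manager-operator get configmap cluster-cloud-controller-manager-leader \
      -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}{"\n"}'
  $ oc -n openshift-cloud-controller-manager-operator get pods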

Comment 5 Damiano Donati 2021-12-15 09:50:58 UTC
Thanks both for your input on this.

As Trevor pointed out there are two components going in parallel and competing for the same lease:
cluster-cloud-controller-manager-fd8df4bbb-2c8qh
cluster-cloud-controller-manager-operator-65fc66598-dlttr

An important thing to note here, which gives us some context, is that in OCP 4.9 the `cluster-cloud-controller-manager` Deployment was renamed to `cluster-cloud-controller-manager-operator`, and the former was tombstoned in favour of the latter here: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/87

This means that the `cluster-cloud-controller-manager` Deployment should be deleted by the CVO, and `cluster-cloud-controller-manager-operator`, with its ReplicaSet and Pod, should take its place.

However, at gather time we can see that there are still two Ready Pods, an old one and a new one, belonging to the tombstoned and to the new Deployment respectively.
This can be seen by looking at the pods running in the `openshift-cloud-controller-manager-operator` namespace:

[namespaces/openshift-cloud-controller-manager-operator] $ tree
.
├── pods
│   ├── cluster-cloud-controller-manager-fd8df4bbb-2c8qh
│   │   ├── cluster-cloud-controller-manager
│   │   │   └── cluster-cloud-controller-manager
│   │   │       └── logs
│   │   │           ├── current.log <---------------------------- see #### 1
│   │   │           ├── previous.insecure.log
│   │   │           └── previous.log
│   │   └── cluster-cloud-controller-manager-fd8df4bbb-2c8qh.yaml
│   │
│   │
│   └── cluster-cloud-controller-manager-operator-65fc66598-dlttr
│       ├── cloud-config-sync-controller
│       │   └── cloud-config-sync-controller
│       │       └── logs
│       │           ├── current.log
│       │           ├── previous.insecure.log
│       │           └── previous.log
│       ├── cluster-cloud-controller-manager
│       │   └── cluster-cloud-controller-manager
│       │       └── logs
│       │           ├── current.log <---------------------------- see #### 2
│       │           ├── previous.insecure.log
│       │           └── previous.log
│       └── cluster-cloud-controller-manager-operator-65fc66598-dlttr.yaml

#### 1
[namespaces/openshift-cloud-controller-manager-operator] $ tail -f pods/cluster-cloud-controller-manager-fd8df4bbb-2c8qh/cluster-cloud-controller-manager/cluster-cloud-controller-manager/logs/current.log
2021-11-30T15:56:37.264242172Z I1130 15:56:37.264157       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:03:01.768943173Z I1130 16:03:01.768545       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:05:34.901658128Z I1130 16:05:34.901523       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:06:29.790813584Z I1130 16:06:29.790724       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:12:52.725530168Z I1130 16:12:52.725350       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:15:54.649188999Z I1130 16:15:54.649116       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:16:22.317529401Z I1130 16:16:22.317412       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:22:43.682771538Z I1130 16:22:43.682676       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:26:14.397560066Z I1130 16:26:14.397475       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
2021-11-30T16:26:14.843193443Z I1130 16:26:14.843129       1 clusteroperator_controller.go:127] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
^C

#### 2
[namespaces/openshift-cloud-controller-manager-operator] $ tail -f pods/cluster-cloud-controller-manager-operator-65fc66598-dlttr/cluster-cloud-controller-manager/cluster-cloud-controller-manager/logs/current.log
2021-11-29T19:00:43.900827246Z I1129 19:00:43.900582       1 request.go:665] Waited for 1.032320368s due to client-side throttling, not priority and fairness, request: GET:https://api-int.dev.god.de:6443/apis/autoscaling.openshift.io/v1?timeout=32s
2021-11-29T19:00:45.464022114Z I1129 19:00:45.463959       1 deleg.go:130] CCMOperator/controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"=":8080"
2021-11-29T19:00:45.464725363Z I1129 19:00:45.464690       1 deleg.go:130] CCMOperator/setup "msg"="starting manager"
2021-11-29T19:00:45.465144530Z I1129 19:00:45.465081       1 internal.go:383] CCMOperator/controller-runtime/manager "msg"="starting metrics server"  "path"="/metrics"
2021-11-29T19:00:45.465242225Z I1129 19:00:45.465100       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-leader...
^C

So this would explain why there are two processes from two completely different Deployments competing for the same lease.

But why is there still a Pod belonging to the tombstoned (`cluster-cloud-controller-manager`) Deployment?

Has the Deployment been deleted? Yes, there is only the new one!

[namespaces/openshift-cloud-controller-manager-operator/apps] $ yaml2json deployments.yaml | jq '.items[] | .metadata.name,.status'
"cluster-cloud-controller-manager-operator"
{
  "availableReplicas": 1,
  "conditions": [
    {
      "lastTransitionTime": "2021-11-29T19:00:44Z",
      "lastUpdateTime": "2021-11-29T19:00:44Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2021-11-29T19:00:32Z",
      "lastUpdateTime": "2021-11-29T19:00:44Z",
      "message": "ReplicaSet \"cluster-cloud-controller-manager-operator-65fc66598\" has successfully progressed.",
      "reason": "NewReplicaSetAvailable",
      "status": "True",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1,
  "readyReplicas": 1,
  "replicas": 1,
  "updatedReplicas": 1
}

What about their ReplicaSets? No, the old Deployment's ReplicaSets have not been removed; they are all still there, alongside the new one.

[namespaces/openshift-cloud-controller-manager-operator/apps] $ yaml2json replicasets.yaml | jq '.items[] | .metadata.name,.status,.metadata.ownerReferences'
"cluster-cloud-controller-manager-66bbd89dcb"
{
  "observedGeneration": 2,
  "replicas": 0
}
[
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "Deployment",
    "name": "cluster-cloud-controller-manager",
    "uid": "49fa8259-f438-4148-b2af-9a7c56a5a033"
  }
]
"cluster-cloud-controller-manager-74cd748f5"
{
  "observedGeneration": 2,
  "replicas": 0
}
[
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "Deployment",
    "name": "cluster-cloud-controller-manager",
    "uid": "49fa8259-f438-4148-b2af-9a7c56a5a033"
  }
]
"cluster-cloud-controller-manager-74d459f9f6"
{
  "observedGeneration": 2,
  "replicas": 0
}
[
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "Deployment",
    "name": "cluster-cloud-controller-manager",
    "uid": "49fa8259-f438-4148-b2af-9a7c56a5a033"
  }
]
"cluster-cloud-controller-manager-fd8df4bbb"
{
  "availableReplicas": 1,
  "fullyLabeledReplicas": 1,
  "observedGeneration": 1,
  "readyReplicas": 1,
  "replicas": 1
}
[
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "Deployment",
    "name": "cluster-cloud-controller-manager",
    "uid": "49fa8259-f438-4148-b2af-9a7c56a5a033"
  }
]
"cluster-cloud-controller-manager-operator-65fc66598"
{
  "availableReplicas": 1,
  "fullyLabeledReplicas": 1,
  "observedGeneration": 1,
  "readyReplicas": 1,
  "replicas": 1
}
[
  {
    "apiVersion": "apps/v1",
    "blockOwnerDeletion": true,
    "controller": true,
    "kind": "Deployment",
    "name": "cluster-cloud-controller-manager-operator",
    "uid": "5ee4862f-84f5-4842-9964-fcf1b34e8847"
  }
]

Why isn't the ReplicaSet being garbage collected?

We can find a clue in the `kube-controller-manager` logs:
[namespaces/openshift-kube-controller-manager/pods] $ cat kube-controller-manager-okd4-master-01/kube-controller-manager/kube-controller-manager/logs/current.log
...
2021-11-29T19:00:32.122399034Z I1129 19:00:32.122296       1 deployment_controller.go:583] "Deployment has been deleted" deployment="openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager"
2021-11-29T19:00:32.150231936Z I1129 19:00:32.149963       1 replica_set.go:563] "Too few replicas" replicaSet="openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-65fc66598" need=1 creating=1
2021-11-29T19:00:32.150329866Z I1129 19:00:32.150276       1 event.go:291] "Event occurred" object="openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set cluster-cloud-controller-manager-operator-65fc66598 to 1"
2021-11-29T19:00:32.218331594Z I1129 19:00:32.218241       1 deployment_controller.go:490] "Error syncing deployment" deployment="openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator" err="Operation cannot be fulfilled on deployments.apps \"cluster-cloud-controller-manager-operator\": the object has been modified; please apply your changes to the latest version and try again"
2021-11-29T19:00:32.355719392Z I1129 19:00:32.355613       1 event.go:291] "Event occurred" object="openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-65fc66598" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: cluster-cloud-controller-manager-operator-65fc66598-dlttr"
2021-11-29T19:00:36.028156581Z I1129 19:00:36.028046       1 shared_informer.go:266] stop requested
2021-11-29T19:00:36.028156581Z E1129 19:00:36.028124       1 shared_informer.go:243] unable to sync caches for resource quota
2021-11-29T19:00:36.028241842Z E1129 19:00:36.028149       1 resource_quota_controller.go:452] timed out waiting for quota monitor sync
2021-11-29T19:00:36.952001965Z I1129 19:00:36.951869       1 shared_informer.go:266] stop requested
2021-11-29T19:00:36.952001965Z E1129 19:00:36.951925       1 shared_informer.go:243] unable to sync caches for garbage collector
2021-11-29T19:00:36.952001965Z E1129 19:00:36.951947       1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 9)
...

Comment 6 Jack Ottofaro 2021-12-15 14:30:16 UTC
I think the GC would work in the other direction: if the ReplicaSet is deleted, the pods it controls may be deleted. However, it would probably be best to explicitly delete everything using delete manifests, in an explicit order that you control and can test out. See the enhancement [1] for more details.

[1] https://github.com/openshift/enhancements/blob/master/enhancements/update/object-removal-manifest-annotation.md
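
For context, if I'm reading that enhancement correctly, removal is requested by shipping the old object's manifest with a delete annotation rather than by any imperative delete step; roughly like this (a sketch from memory, the file name is invented, double-check the annotation against the enhancement):

  $ cat manifests/0000_26_cloud-controller-manager-operator_xx_old-deployment-removal.yaml   # hypothetical file name
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cluster-cloud-controller-manager
    namespace: openshift-cloud-controller-manager-operator
    annotations:
      release.openshift.io/delete: "true"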

Comment 7 Joel Speed 2021-12-15 14:38:30 UTC
I don't think we can specify a manifest to delete a ReplicaSet. The Deployment we create has the annotation and was deleted. In turn, the K8s GC should remove any child objects of this Deployment (an owner reference points to the UID of an object; if that object no longer exists, the GC should remove the owner reference, and once the last owner reference is gone, the object is deleted, IIRC).

We need to work out what prevented the GC from removing the ReplicaSet and, in turn, the pods owned by the ReplicaSet.
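
To make that concrete, the orphan check amounts to comparing each ReplicaSet's ownerReferences UID with the UIDs of the Deployments that still exist; a rough live-cluster version (assuming oc and jq are available):

  $ oc -n openshift-cloud-controller-manager-operator get replicasets -o json \
      | jq -r '.items[] | [.metadata.name, (.metadata.ownerReferences[0].uid // "none")] | @tsv'
  $ oc -n openshift-cloud-controller-manager-operator get deployments -o json \
      | jq -r '.items[] | [.metadata.name, .metadata.uid] | @tsv'
  # a ReplicaSet whose owner UID matches no existing Deployment should be removed by the GC once the GC is healthy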

Comment 8 Jack Ottofaro 2021-12-15 15:08:28 UTC
(In reply to Joel Speed from comment #7)
> I don't think we can specify a manifest to delete a replicaset. The
> Deployment we create has the annotation and was deleted. In turn, the K8s GC
> should remove any child objects of this Deployment (owner reference points
> to a UUID of an object, if that object no longer exists, GC should remove
> the OR, if last OR, object is deleted IIRC).
> 
> We need to work out what prevented the GC from removing the replicaset and
> in turn the pods owned by the replicaset

That's true, you're deleting the Deployment. I was confusing this with piece-parts that don't exist as manifests.

Comment 9 W. Trevor King 2021-12-17 09:45:50 UTC
[1] seems a bit racy, with the deletion manifest landing after the new Deployment's entry.  If the new deployment blocks on the lease, and the CVO blocks on the new deployment, it may never get around to requesting the old deployment's deletion.  Back to comment 0's must-gather:

$ grep 'deployment.*cluster-cloud-controller-manager' namespaces/openshift-cluster-version/pods/cluster-version-operator-576d994c74-7sjqz/cluster-version-operator/cluster-version-operator/logs/current.log | head -n5
2021-11-29T19:00:31.616313241Z I1129 19:00:31.616295       1 sync_worker.go:753] Running sync for deployment "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator" (141 of 744)
2021-11-29T19:00:32.071114112Z W1129 19:00:32.071049       1 apps.go:136] deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator is not setting any expected conditions, and is therefore in an unknown state
2021-11-29T19:00:32.071114112Z I1129 19:00:32.071079       1 sync_worker.go:765] Done syncing for deployment "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator" (141 of 744)
2021-11-29T19:00:32.071166297Z I1129 19:00:32.071109       1 sync_worker.go:753] Running sync for deployment "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager" (142 of 744)
2021-11-29T19:00:32.119322801Z I1129 19:00:32.119175       1 helper.go:65] Delete requested for deployment "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager".

That's the "unknown state" loophole Jack pointed out in comment 3, where the CVO says "I dunno, the Deployment controller doesn't have an opinion, I guess that's ok :shrug:?" and moves on to the next manifest. Although in this case your new deployment seems to be Available=True and Progressing=True with NewReplicaSetAvailable, so I guess you aren't requiring a successful leader acquisition before you declare yourself ready. As long as the CVO retains the "unknown state" loophole, or the operator retains the "I'm ready even if I don't have the lease" loophole, we don't have to worry about the delete manifest coming second. But I don't see any reason to have the delete manifest second either. The CVO won't wait for the deletion to finalize at manifest-reconciliation time (although it will not declare the update as a whole complete until the deleted manifest is finalized), so you'd get basically the same behavior if you put the delete first, and then you wouldn't have to rely on assumptions about CVO and incoming-operator behavior. And in the sad case where the deletion goes through but for some reason the new deployment doesn't come up (or comes up but sticks before the pod goes ready), the Deployment will be Available=False and/or ProgressDeadlineExceeded, which the CVO will alert on and complain about.
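
For completeness, the ordering can be checked straight from the payload (a sketch; the pullspec is the 4.9 payload from comment 1, and the CVO applies manifests roughly in lexical filename order):

  $ oc adm release extract --to=/tmp/manifests \
      registry.ci.openshift.org/origin/release@sha256:7d8356245fc3a75fe11d1832ce9fef17f3dd0f2ea6f38271319c95918416b9d9
  $ ls /tmp/manifests | grep cloud-controller-manager | sort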

Back to the stuck ReplicaSet/Pod, no action on the pod around 19:00 on the 29th:

$ grep -hr cluster-cloud-controller-manager-fd8df4bbb host_service_logs/masters | sort | tail -n2
Nov 29 16:17:26.331855 okd4-master-01 hyperkube[1542]: I1129 16:17:26.330210    1542 kubelet.go:1982] "SyncLoop (PLEG): event for pod" pod="openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-fd8df4bbb-2c8qh" event=&{ID:be61bb6f-febe-45d0-a16d-82c130697a6a Type:ContainerStarted Data:acfc6928bba1f78351ed17184b97419870163819774aac1b74c2ba14b5f3d8bf}
Nov 30 16:27:43.342180 okd4-master-01 hyperkube[1542]: I1130 16:27:43.342104    1542 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-cloud-controller-manager-operator_cluster-cloud-controller-manager-fd8df4bbb-2c8qh_be61bb6f-febe-45d0-a16d-82c130697a6a/cluster-cloud-controller-manager/0.log"

Looking for the garbage collector:

$ grep -lr garbagecollector
namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-okd4-master-01/kube-controller-manager/kube-controller-manager/logs/current.log

Looks like maybe it hung trying to build its cache?

$ grep garbagecollector namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-okd4-master-01/kube-controller-manager/kube-controller-manager/logs/current.log
2021-11-29T18:56:05.462870970Z I1129 18:56:05.462776       1 controllermanager.go:616] Starting "garbagecollector"
2021-11-29T18:56:05.490707132Z I1129 18:56:05.490582       1 garbagecollector.go:142] Starting garbage collector controller
...
2021-11-30T16:30:58.230770272Z E1130 16:30:58.230686       1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 2581)
2021-11-30T16:30:58.344267189Z I1130 16:30:58.344182       1 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 2582): added: [/v1, Resource=configmap...

Possibly because of some issue with whatever v1.PartialObjectMetadata is:

$ grep -o 'Failed to watch.*' namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-okd4-master-01/kube-controller-manager/kube-controller-manager/logs/current.log  | sort | uniq -c
   1732 Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource

I dunno what that's about; no other hits in the must-gather:

$ grep -ri PartialObjectMetadata | grep -v kube-controller-manager/logs/current.log
...no hits...
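
One guess, not verified against this must-gather: the GC's dependency graph builder uses metadata-only (PartialObjectMetadata) informers for every discovered resource, so "error resolving resource" may point at an API group that is advertised in discovery but can't actually be served (e.g. a stale APIService). On a live cluster that would show up as:

  # list APIServices whose Available condition is anything other than True
  $ oc get apiservices | grep -v True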

[1]: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/87

Comment 10 Joel Speed 2021-12-17 10:03:07 UTC
Thanks for the analysis and suggestion, Trevor. I've raised a PR to switch the ordering of the manifests and would appreciate your review of that PR as a CVO expert.

Comment 11 Joel Speed 2021-12-17 10:04:46 UTC
Setting blocker- on this BZ, as this bug won't affect 4.9 to 4.10 upgrades (the rename causing this issue was introduced in 4.9, so it will affect only 4.8 to 4.9 upgrades).

Comment 12 W. Trevor King 2021-12-18 07:14:37 UTC
#157 looked good to me.  But with that attached here, should we create a new bug for whatever the garbage-collector issue was (possibly something to do with PartialObjectMetadata?)?  Or are we going to write that off as a strange artifact of whatever test cluster this must-gather came from?

Comment 13 Joel Speed 2021-12-20 08:59:00 UTC
> Or are we going to write that off as a strange artifact of whatever test cluster this must-gather came from?

It does seem like a random occurrence, which has only been reported once so far (and via an OKD cluster). I'm happy to clone this, but I don't know to whom I'd send the cloned bug to get the correct eyes on it. Any ideas?

Comment 15 W. Trevor King 2021-12-20 20:05:52 UTC
Garbage collector is in the openshift-kube-controller-manager namespace, and PartialObjectMetadata is an upstream Kube resource type, so... hrm, no component listed in [1]?  Seems like it should be the kube-controller-manager component, regardless of the lack of a 'component' declaration in OWNERS.

[1]: https://github.com/openshift/kubernetes-kube-controller-manager/blob/309eb2cada807e5d5c8f471f652ad6ede462caff/OWNERS

Comment 17 Bruce Link 2021-12-21 03:49:55 UTC
With respect to comment 13, I have the same problem on another OKD cluster, trying to upgrade from 4.8.0-0.okd-2021-10-24-061736 to 4.9.0-0.okd-2021-12-12-025847 on vSphere UPI; the upgrade hangs at:
error running apply for clusteroperator "cloud-controller-manager" (143 of 744): Cluster operator cloud-controller-manager is updating versions

Same symptoms as the original poster: both cluster-cloud-controller-manager and cluster-cloud-controller-manager-operator pods are present.
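
In case it helps anyone else stuck mid-upgrade, clearing the leftover objects by hand lets the new operator pod take the lock; a minimal sketch (not necessarily the exact workaround from the okd issue, and the object names below are the ones from comment 0's must-gather, so they will differ per cluster):

  $ oc -n openshift-cloud-controller-manager-operator delete replicaset cluster-cloud-controller-manager-fd8df4bbb
  $ oc -n openshift-cloud-controller-manager-operator delete pod cluster-cloud-controller-manager-fd8df4bbb-2c8qh
  # once the old pod stops renewing the configmap lock, the new operator pod should acquire it
  # after the 120s lease expires (see the lease annotation in comment 4)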

Comment 18 sunzhaohua 2021-12-24 07:38:33 UTC
Moving to verified, as this won't affect 4.9 to 4.10 upgrades. I checked the QE upgrade CI results; no upgrades hang at cloud-controller-manager.

Comment 21 Kai-Uwe Rommel 2022-02-24 17:34:49 UTC
We just did an upgrade of an OKD cluster from 4.8-2021-10-24 to 4.9-2022-01-14 and saw the same problem.
According to https://github.com/openshift/okd/issues/996 this should have been solved in this OKD 4.9 release.
Yet the problem still occurred for us.
The OKD team advised us to report it here.

If a must-gather from this cluster is needed: 
https://domino.ars.de/file/must-gather.local.3738823612859762202.tar.gz

Comment 22 Joel Speed 2022-02-28 13:49:54 UTC
@kai-uwe.rommel I've just looked into the must-gather that you attached and can't see anything wrong there. The cloud controller manager operator pod is working as expected and has acquired the lease, and the upgrade has gone smoothly. Did you apply a workaround to resolve the issue before posting the must-gather?

Comment 23 Kai-Uwe Rommel 2022-02-28 20:48:50 UTC
Yes, we applied the workaround mentioned in https://github.com/openshift/okd/issues/996 by bdlink on Dec 21st.
We saw other problems after the upgrade but these were unrelated.
After I reported there that we also saw this problem with the CCM, Vadim Rutkowsky responded that the OKD 4.9 version we used should already have a fix.
Since we still saw the problem even though 4.9 should have had the fix, he asked me to report it here as well.

Comment 25 errata-xmlrpc 2022-03-10 16:31:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

