Seen in the wild, a cluster admin stuck an update by inserting:

$ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].spec.template.spec.containers[].envFrom[]'
{
  "secretRef": {
    "name": "openshift-reg"
  }
}

This broke the deploy:

$ yaml2json <namespaces/openshift-marketplace/pods/marketplace-operator-f7cc88d59-hhh75/marketplace-operator-f7cc88d59-hhh75.yaml | jq -r '.status.containerStatuses[].state'
{
  "waiting": {
    "message": "secret \"openshift-reg\" not found",
    "reason": "CreateContainerConfigError"
  }
}

We should reconcile the environment settings even if the manifest has no opinions on their content.

Also, the CVO should have complained about the Deployment failing to reconcile, but it does not:

$ grep 'deployment.*marketplace' namespaces/openshift-cluster-version/pods/cluster-version-operator-5d8d8bf546-qgbsd/cluster-version-operator/cluster-version-operator/logs/current.log | grep -v Throttling
2021-04-19T17:23:26.881236307Z I0419 17:23:26.881220 1 sync_worker.go:769] Running sync for deployment "openshift-marketplace/marketplace-operator" (512 of 668)
2021-04-19T17:23:29.926595332Z I0419 17:23:29.926544 1 sync_worker.go:781] Done syncing for deployment "openshift-marketplace/marketplace-operator" (512 of 668)
...
2021-04-19T20:37:18.117408461Z I0419 20:37:18.117372 1 sync_worker.go:769] Running sync for deployment "openshift-marketplace/marketplace-operator" (512 of 668)
2021-04-19T20:37:21.312143536Z I0419 20:37:21.312085 1 sync_worker.go:781] Done syncing for deployment "openshift-marketplace/marketplace-operator" (512 of 668)

Checking the Deployment status:

$ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-12T22:04:41Z",
    "lastUpdateTime": "2021-04-12T22:04:41Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-19T17:33:30Z",
    "lastUpdateTime": "2021-04-19T17:33:30Z",
    "message": "ReplicaSet \"marketplace-operator-f7cc88d59\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]

We should be complaining about that ProgressDeadlineExceeded directly as well.
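For context, a minimal sketch of how to spot this class of problem on a live cluster (assuming jq is available; the commands above were run against a must-gather, so the paths differ): list every Deployment whose Progressing condition currently reports ProgressDeadlineExceeded.

$ oc get deployments --all-namespaces -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type == "Progressing" and .reason == "ProgressDeadlineExceeded")) | .metadata.namespace + "/" + .metadata.name'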
Reproducing it with 4.8.0-fc.0. Steps to reproduce:

1. Install a cluster with 4.8.0-fc.0.
2. Update the marketplace deployment by injecting an env variable for the containers (a hedged patch command for this step follows this comment):

   envFrom:
   - secretRef:
       name: openshift-reg

3. Check the results: marketplace-operator is recreated with a config error.

# oc get all
NAME                                        READY   STATUS                       RESTARTS   AGE
pod/certified-operators-lddkh               1/1     Running                      0          155m
pod/community-operators-82cdb               1/1     Running                      0          16m
pod/marketplace-operator-587b7bf94d-2sg5n   1/1     Running                      0          153m
pod/marketplace-operator-7686ffb476-84x6s   0/1     CreateContainerConfigError   0          47s
pod/qe-app-registry-24l67                   1/1     Running                      0          156m
pod/redhat-marketplace-xx8jl                1/1     Running                      0          156m
pod/redhat-operators-rccr2                  1/1     Running                      0          156m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.236.173   <none>        50051/TCP           179m
service/community-operators            ClusterIP   172.30.65.242    <none>        50051/TCP           179m
service/marketplace-operator-metrics   ClusterIP   172.30.177.155   <none>        8383/TCP,8081/TCP   3h8m
service/qe-app-registry                ClusterIP   172.30.163.173   <none>        50051/TCP           160m
service/redhat-marketplace             ClusterIP   172.30.32.12     <none>        50051/TCP           179m
service/redhat-operators               ClusterIP   172.30.131.49    <none>        50051/TCP           179m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           3h8m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-587b7bf94d   1         1         1       3h8m
replicaset.apps/marketplace-operator-7686ffb476   1         1         0       47s

The deployment is broken by:

# oc describe pod/marketplace-operator-7686ffb476-84x6s
  Warning  Failed  <invalid> (x10 over 75s)  kubelet  Error: secret "openshift-reg" not found

The Deployment rollout does not time out immediately.

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T04:07:30Z",
    "lastUpdateTime": "2021-04-23T04:07:30Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T03:31:14Z",
    "lastUpdateTime": "2021-04-23T06:38:54Z",
    "message": "ReplicaSet \"marketplace-operator-7686ffb476\" is progressing.",
    "reason": "ReplicaSetUpdated",
    "status": "True",
    "type": "Progressing"
  }
]

The CVO does not log the errors.

# grep 'deployment.*marketplace' cvo.log | grep -v Throttling
I0423 06:46:33.540152 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:46:33.954966 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.461871 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.851594 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
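One way to perform step 2 above, only a sketch and assuming the marketplace-operator container is the first (index 0) entry in the pod template:

$ oc -n openshift-marketplace patch deployment marketplace-operator --type json -p '[{"op": "add", "path": "/spec/template/spec/containers/0/envFrom", "value": [{"secretRef": {"name": "openshift-reg"}}]}]'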
> # oc get deployment -ojson | jq -r '.items[].status.conditions'
> [
>   {
>     "lastTransitionTime": "2021-04-23T04:07:30Z",
>     "lastUpdateTime": "2021-04-23T04:07:30Z",
>     "message": "Deployment has minimum availability.",
>     "reason": "MinimumReplicasAvailable",
>     "status": "True",
>     "type": "Available"
>   },
>   {
>     "lastTransitionTime": "2021-04-23T03:31:14Z",
>     "lastUpdateTime": "2021-04-23T06:38:54Z",
>     "message": "ReplicaSet \"marketplace-operator-7686ffb476\" is progressing.",
>     "reason": "ReplicaSetUpdated",
>     "status": "True",
>     "type": "Progressing"
>   }
> ]

The Deployment does time out eventually.

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T09:35:15Z",
    "lastUpdateTime": "2021-04-23T09:35:15Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T10:33:25Z",
    "lastUpdateTime": "2021-04-23T10:33:25Z",
    "message": "ReplicaSet \"marketplace-operator-7686ffb476\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]
Verifying with 4.8.0-0.nightly-2021-04-22-225832.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-225832   True        False         13m     Cluster version is 4.8.0-0.nightly-2021-04-22-225832

Update the marketplace deployment by injecting the env variables:

envFrom:
- secretRef:
    name: openshift-reg

The deployment has an error.

# oc get all
NAME                                        READY   STATUS                       RESTARTS   AGE
pod/certified-operators-59phh               1/1     Running                      0          15m
pod/community-operators-6pfwq               1/1     Running                      0          16m
pod/marketplace-operator-548df64d-5nctr     1/1     Running                      0          16m
pod/marketplace-operator-649ccff7f4-lfvnm   0/1     CreateContainerConfigError   0          7s
pod/qe-app-registry-vmrr7                   1/1     Running                      0          2m13s
pod/redhat-marketplace-hxhfn                1/1     Running                      0          15m
pod/redhat-operators-882cr                  1/1     Running                      0          15m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.30.200    <none>        50051/TCP           39m
service/community-operators            ClusterIP   172.30.102.1     <none>        50051/TCP           39m
service/marketplace-operator-metrics   ClusterIP   172.30.208.23    <none>        8383/TCP,8081/TCP   47m
service/qe-app-registry                ClusterIP   172.30.219.78    <none>        50051/TCP           19m
service/redhat-marketplace             ClusterIP   172.30.126.70    <none>        50051/TCP           39m
service/redhat-operators               ClusterIP   172.30.113.251   <none>        50051/TCP           39m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           47m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-548df64d     1         1         1       47m
replicaset.apps/marketplace-operator-649ccff7f4   1         1         0       8s

# oc get pod/marketplace-operator-649ccff7f4-lfvnm -ojson | jq -r '.status.containerStatuses[].state'
{
  "waiting": {
    "message": "secret \"openshift-reg\" not found",
    "reason": "CreateContainerConfigError"
  }
}

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T08:27:09Z",
    "lastUpdateTime": "2021-04-23T08:27:09Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T07:55:53Z",
    "lastUpdateTime": "2021-04-23T08:42:59Z",
    "message": "ReplicaSet \"marketplace-operator-649ccff7f4\" is progressing.",
    "reason": "ReplicaSetUpdated",
    "status": "True",
    "type": "Progressing"
  }
]

After a while, the operator is reconciled.
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T08:27:09Z",
    "lastUpdateTime": "2021-04-23T08:27:09Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T07:55:53Z",
    "lastUpdateTime": "2021-04-23T08:46:14Z",
    "message": "ReplicaSet \"marketplace-operator-548df64d\" has successfully progressed.",
    "reason": "NewReplicaSetAvailable",
    "status": "True",
    "type": "Progressing"
  }
]

# oc get all
NAME                                      READY   STATUS    RESTARTS   AGE
pod/certified-operators-59phh             1/1     Running   0          21m
pod/community-operators-6pfwq             1/1     Running   0          22m
pod/marketplace-operator-548df64d-5nctr   1/1     Running   0          22m
pod/qe-app-registry-vmrr7                 1/1     Running   0          8m18s
pod/redhat-marketplace-hxhfn              1/1     Running   0          21m
pod/redhat-operators-882cr                1/1     Running   0          21m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.30.200    <none>        50051/TCP           45m
service/community-operators            ClusterIP   172.30.102.1     <none>        50051/TCP           45m
service/marketplace-operator-metrics   ClusterIP   172.30.208.23    <none>        8383/TCP,8081/TCP   53m
service/qe-app-registry                ClusterIP   172.30.219.78    <none>        50051/TCP           25m
service/redhat-marketplace             ClusterIP   172.30.126.70    <none>        50051/TCP           45m
service/redhat-operators               ClusterIP   172.30.113.251   <none>        50051/TCP           45m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           53m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-548df64d     1         1         1       53m
replicaset.apps/marketplace-operator-649ccff7f4   0         0         0       6m13s
To verify commit#8132d6a, I changed the image digest to an invalid one and changed progressDeadlineSeconds to 5, but I was unlucky and did not get ProgressDeadlineExceeded. @Trevor, could you give me a hint on how to reach the ProgressDeadlineExceeded scenario? Thanks.
Possibly cordoning all the nodes and then killing something with replicas >= 2 that is managed by the CVO? Looking for candidates:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1385644983909879808/artifacts/e2e-aws/gather-extra/artifacts/deployments.json | jq -r '.items[] | select(.spec.replicas >= 2).metadata | .namespace + " " + .name'
...
openshift-console downloads
...

So something like:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"

and then wait 10 minutes [1] for the default progressDeadlineSeconds to expire.

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds
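A possible way to watch for the timeout while waiting it out, just a sketch using the same jq pattern as above:

$ while sleep 60; do oc -n openshift-console get -o json deployment downloads | jq -r '.status.conditions[] | select(.type == "Progressing") | .reason + ": " + .message'; done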
Verifying commit#8132d6a.

Cordon all of the nodes:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done

Delete a deployment:

$ oc delete deployment.apps/downloads -n openshift-console

Wait 600s for ProgressDeadlineExceeded:

$ oc get deployment.apps/downloads -ojson | jq -r '.status.conditions'
[
  {
    "lastTransitionTime": "2021-04-25T09:49:57Z",
    "lastUpdateTime": "2021-04-25T09:49:57Z",
    "message": "Deployment does not have minimum availability.",
    "reason": "MinimumReplicasUnavailable",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-25T09:59:58Z",
    "lastUpdateTime": "2021-04-25T09:59:58Z",
    "message": "ReplicaSet \"downloads-7fdddb48b4\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]

Describe the downloads pod:

$ oc describe pod/downloads-7fdddb48b4-wnljj
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12m   default-scheduler  0/6 nodes are available: 6 node(s) were unschedulable.
  Warning  FailedScheduling  12m   default-scheduler  0/6 nodes are available: 6 node(s) were unschedulable.

The ProgressDeadlineExceeded error does not show up in the pod events or in the CVO log.
> Delete a deployment...

If you delete the Deployment, it will remove both pods. The CVO will create a replacement Deployment, but with 0 of 2 pods ready, that Deployment will be, as you saw, Available=False as well as ProgressDeadlineExceeded. Comment 6 called for leaving the Deployment alone, but deleting a single pod. I expect that will leave the Deployment Available=True while still getting us ProgressDeadlineExceeded.

But in either case, Available=False and ProgressDeadlineExceeded should both be sufficient to get the CVO to mention the issue in ClusterVersion conditions. If it doesn't, can you include the ClusterVersion YAML?
The approach in comment #6 did not produce ProgressDeadlineExceeded, so I deleted the Deployment.

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-04-23T07:55:27Z"
    generation: 1
    name: version
    resourceVersion: "1694483"
    uid: 98e037f9-ed02-491d-894b-1f5965638c00
  spec:
    channel: stable-4.7
    clusterID: 93f31b0e-ef15-4b2f-b441-fe5c6c786231
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: Done applying 4.8.0-0.nightly-2021-04-22-225832
      status: "True"
      type: Available
    - lastTransitionTime: "2021-04-26T01:33:35Z"
      message: 'Some cluster operators are still updating: machine-config, monitoring'
      reason: ClusterOperatorsNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: 'Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: some cluster operators have not yet rolled out'
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-04-23T07:55:27Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the "stable-4.7" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-04-25T09:50:05Z"
      message: 'Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
      reason: PoolUpdating
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      version: 4.8.0-0.nightly-2021-04-22-225832
    history:
    - completionTime: "2021-04-23T08:22:33Z"
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      startedTime: "2021-04-23T07:55:27Z"
      state: Completed
      verified: false
      version: 4.8.0-0.nightly-2021-04-22-225832
    observedGeneration: 1
    versionHash: 14fG0Q-Un8E=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc -n openshift-console get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/console-b5f64d9c7-lncw9      1/1     Running   0          18h
pod/console-b5f64d9c7-mk89l      1/1     Running   0          18h
pod/downloads-7fdddb48b4-wnljj   0/1     Pending   0          16h
pod/downloads-7fdddb48b4-zl2g2   0/1     Pending   0          16h

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/console     ClusterIP   172.30.87.18     <none>        443/TCP   2d18h
service/downloads   ClusterIP   172.30.119.245   <none>        80/TCP    2d18h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console     2/2     2            2           2d18h
deployment.apps/downloads   0/2     2            0           16h

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/console-585b4f86d6     0         0         0       2d17h
replicaset.apps/console-5dcbfdc595     0         0         0       18h
replicaset.apps/console-6d466d4bcc     0         0         0       2d18h
replicaset.apps/console-b5f64d9c7      2         2         2       18h
replicaset.apps/downloads-7fdddb48b4   2         2         0       16h

NAME                                 HOST/PORT                                                                          PATH   SERVICES    PORT    TERMINATION          WILDCARD
route.route.openshift.io/console     console-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com             console     https   reencrypt/Redirect   None
route.route.openshift.io/downloads   downloads-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com          downloads   http    edge/Redirect        None
Hmm. Poking around in a cluster-bot cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.nightly-2021-04-25-195440
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"
$ oc -n openshift-console get -o json deployment downloads | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:38:50Z Progressing=True NewReplicaSetAvailable: ReplicaSet "downloads-8dcc96bf7" has successfully progressed.
2021-04-26T03:59:55Z Available=False MinimumReplicasUnavailable: Deployment does not have minimum availability.

Hrm, I didn't even have to wait 10m. ClusterVersion doesn't say anything about this issue:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Failing=False :
2021-04-26T03:56:13Z Progressing=False : Cluster version is 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T04:00:48Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

I wonder what the deal is with machine-config?

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-555bdd56b2ffb5c9b65c53df4805605b   False     True       False      3              0                   3                     0                      36m
worker   rendered-worker-5d9d6269bbbb1561d7ae1b34c447f4c9   False     True       False      3              0                   3                     0                      36m

3 of 3 in each pool are updated, but Updated=False and Updating=True because they are cordoned, and therefore not ready. That doesn't feel right. But anyway, back to the CVO:

$ POD="$(oc -n openshift-cluster-version get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods)"
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|downloads\|Result of work' | grep -v 'Throttling request'
...
I0426 04:08:49.958758 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:08:56.715714 1 sync_worker.go:767] Running sync for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:56.816355 1 sync_worker.go:779] Done syncing for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:57.115298 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:09:15.087322 1 sync_worker.go:767] Running sync for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122404 1 sync_worker.go:779] Done syncing for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122460 1 sync_worker.go:767] Running sync for service "openshift-console/downloads" (449 of 677)
I0426 04:09:15.321107 1 sync_worker.go:779] Done syncing for service "openshift-console/downloads" (449 of 677)
I0426 04:09:18.161172 1 task_graph.go:555] Result of work: []
...
Hrm, no downloads deployment. Ah, because the console folks moved it out from under the CVO to control it directly in [1]. So sorry for the stale choice of deployment. Uncordon so the downloads deployment can recover:

$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done

And go looking for another deployment:

$ oc adm release extract --to manifest
$ grep -ho 'replicas: .*' $(grep -rl 'kind: Deployment' manifest) | sort | uniq -c
     40 replicas: 1

So how to trigger Available=True with ProgressDeadlineExceeded... Maybe if I cordon everyone during an update, we'll be unable to schedule... But I don't want the old pod to go away first, because then we'd be Available=False. I see the default maxSurge is 25% [2], and maxUnavailable rounds down [3]. So maybe our RollingUpdate deployments are surging.

$ grep RollingUpdate manifest/*deployment*
manifest/0000_50_olm_07-olm-operator.deployment.ibm-cloud-managed.yaml:    type: RollingUpdate
manifest/0000_50_olm_07-olm-operator.deployment.yaml:    type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.ibm-cloud-managed.yaml:    type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.yaml:    type: RollingUpdate
$ grep name manifest/0000_50_olm_07-olm-operator.deployment.yaml | head -n2
  name: olm-operator
  namespace: openshift-operator-lifecycle-manager

Ok, so I'm going to cordon all the nodes again, and then touch OLM to trigger a rolling update, which will hopefully surge a new pod, and then complain about that pod being stuck:

$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq .spec.template.metadata.annotations
{
  "workload.openshift.io/management": "{\"effect\": \"PreferredDuringScheduling\"}"
}
$ oc -n openshift-operator-lifecycle-manager patch deployment olm-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:23:18Z Progressing=True ReplicaSetUpdated: ReplicaSet "olm-operator-6c99f456cf" is progressing.
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
$ oc -n openshift-operator-lifecycle-manager get pods
NAME                            READY   STATUS    RESTARTS   AGE
...
olm-operator-589957f7bd-z4tlv   1/1     Running   0          88m
olm-operator-6c99f456cf-fr8t6   0/1     Pending   0          54s
...

Perfect :). Waiting for 10m... Now:

$ oc -n openshift-operator-lifecycle-manager get pods | grep olm-operator
olm-operator-589957f7bd-z4tlv   1/1     Running   0          97m
olm-operator-6c99f456cf-fr8t6   0/1     Pending   0          10m
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
2021-04-26T05:00:31Z Progressing=False ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
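A quick way to double-check the surge assumption above, just a sketch, and assuming the API server has filled in the RollingUpdate defaults (maxSurge/maxUnavailable of 25%) on the live object:

$ oc -n openshift-operator-lifecycle-manager get deployment olm-operator -o jsonpath='{.spec.strategy}{"\n"}'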
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:53:35.383334 1 task_graph.go:555] Result of work: []
I0426 04:56:26.340740 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:56:33.347692 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:59:52.619062 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)

The CVO sticks there for around five minutes waiting for machine-config to happy up, which never happens, because we're still cordoned. Once that times out, we wait out the cool-down between sync cycles, start in again, and notice the stuck deployment:

$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:05:27.417819 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available]
I0426 05:08:44.045069 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:09:11.750990 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
E0426 05:09:11.901500 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.

It takes us another 5m to give up on the stuck Deployment (we should probably fail-fast on ProgressDeadlineExceeded to avoid stacking timeouts, but we can adjust to that in follow-up work).
But eventually we do give up:

$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
E0426 05:12:19.268254 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:37.640209 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:57.047203 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:22.513803 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:45.891628 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:00.301317 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:23.508240 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
I0426 05:14:25.958826 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.]
I0426 05:14:25.958900 1 sync_worker.go:939] Update error 501 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.)
E0426 05:14:25.958922 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Progressing=False WorkloadNotProgressing: Error while reconciling 4.8.0-0.nightly-2021-04-25-195440: the workload openshift-operator-lifecycle-manager/olm-operator cannot roll out
2021-04-26T04:48:57Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
2021-04-26T05:05:28Z Failing=True WorkloadNotProgressing: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.

Hooray, there we go :)

[1]: https://github.com/openshift/console-operator/pull/508
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable
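For cleanup after this experiment, one might uncordon the nodes and re-check the Failing condition once the olm-operator rollout completes. This is only a sketch, not a step from the run above:

$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "Failing") | .status + " " + .reason + ": " + .message'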
Trevor, appreciate your help and detailed information. Verifying commit#8132d6a.

Cordon all of the nodes:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done

Update the catalog-operator deployment:

$ oc patch deployment catalog-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'

Check that the new catalog-operator pod is pending:

$ oc get po
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-675bc6d7cb-kx5bm   1/1     Running   0          3d1h
catalog-operator-7b46fdf5-txzhx     0/1     Pending   0          15m
olm-operator-65b47b65f5-sj7xl       1/1     Running   0          3d1h
packageserver-676bc98c9-kgv5m       1/1     Running   0          3d1h
packageserver-676bc98c9-xqvsp       1/1     Running   0          3d1h

Wait 10m, then check the deployment state:

$ oc get deployment catalog-operator -ojson | jq -r '.status.conditions[]'
{
  "lastTransitionTime": "2021-04-23T08:26:40Z",
  "lastUpdateTime": "2021-04-23T08:26:40Z",
  "message": "Deployment has minimum availability.",
  "reason": "MinimumReplicasAvailable",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-04-26T10:04:38Z",
  "lastUpdateTime": "2021-04-26T10:04:38Z",
  "message": "ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "False",
  "type": "Progressing"
}

After a while, check the CVO conditions:

$ oc get clusterversion -o json | jq -r '.items[].status.conditions[]'
{
  "lastTransitionTime": "2021-04-23T08:22:33Z",
  "message": "Done applying 4.8.0-0.nightly-2021-04-22-225832",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-04-26T10:14:20Z",
  "message": "deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
  "reason": "WorkloadNotProgressing",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2021-04-23T08:22:33Z",
  "message": "Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: the workload openshift-operator-lifecycle-manager/catalog-operator cannot roll out",
  "reason": "WorkloadNotProgressing",
  "status": "False",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2021-04-23T07:55:27Z",
  "message": "Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the \"stable-4.7\" channel",
  "reason": "VersionNotFound",
  "status": "False",
  "type": "RetrievedUpdates"
}
{
  "lastTransitionTime": "2021-04-26T08:05:10Z",
  "message": "Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details",
  "reason": "PoolUpdating",
  "status": "False",
  "type": "Upgradeable"
}

Check the CVO log:

$ oc -n openshift-cluster-version logs cluster-version-operator-6f4c448795-54z97 | grep 'deployment.*catalog\|Result of work' | tail -5
E0426 11:42:19.121497 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
E0426 11:42:41.875235 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
I0426 11:42:45.607570 1 task_graph.go:555] Result of work: [deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. Cluster operator monitoring is not available]
I0426 11:42:45.607688 1 sync_worker.go:939] Update error 502 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.)
E0426 11:42:45.607736 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.

The two commits work as designed. Moving this to the verified state.
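For completeness, a possible cleanup after the verification above, just a sketch (this step is not shown in the verification run): drop the poke annotation and uncordon the nodes so the catalog-operator rollout can finish.

$ oc -n openshift-operator-lifecycle-manager patch deployment catalog-operator --type json -p '[{"op": "remove", "path": "/spec/template/metadata/annotations/poke"}]'
$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done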
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438