Bug 1951339
| Summary: | Cluster-version operator does not manage operand container environments when manifest lacks opinions | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
| Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.z | CC: | aos-bugs, jokerman, mbargenq, yanyang |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:01:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:
Cause: The cluster-version operator did not reconcile env and envFrom for manifests which did not set those properties.
Consequence: Cluster administrators were able to manipulate those properties, potentially adding broken values, without the cluster-version operator stomping their changes.
Fix: The cluster-version operator now clears env and envFrom if they are unset in the manifest.
Result: The cluster will automatically recover from invalid cluster-admin injections to these properties.
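A minimal sketch of how the fixed behavior can be exercised, following the reproduction steps below. The exact commands are not recorded in this bug; the container index (0) and the sync-cycle wait are assumptions:

$ oc -n openshift-marketplace patch deployment marketplace-operator --type json \
    -p '[{"op": "add", "path": "/spec/template/spec/containers/0/envFrom", "value": [{"secretRef": {"name": "openshift-reg"}}]}]'
$ sleep 300    # allow the cluster-version operator at least one reconcile pass
$ oc -n openshift-marketplace get deployment marketplace-operator \
    -o jsonpath='{.spec.template.spec.containers[0].envFrom}{"\n"}'    # empty once the CVO has cleared the injected stanza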
Description
W. Trevor King
2021-04-20 00:00:16 UTC
Reproducing it with 4.8.0-fc.0.

Steps to reproduce:
1. Install a cluster with 4.8.0-fc.0
2. Update the marketplace deployment by injecting an env variable for the containers:
   envFrom:
   - secretRef:
       name: openshift-reg
3. Check the results: marketplace-operator is recreated with a config error

# oc get all
NAME                                        READY   STATUS                       RESTARTS   AGE
pod/certified-operators-lddkh               1/1     Running                      0          155m
pod/community-operators-82cdb               1/1     Running                      0          16m
pod/marketplace-operator-587b7bf94d-2sg5n   1/1     Running                      0          153m
pod/marketplace-operator-7686ffb476-84x6s   0/1     CreateContainerConfigError   0          47s
pod/qe-app-registry-24l67                   1/1     Running                      0          156m
pod/redhat-marketplace-xx8jl                1/1     Running                      0          156m
pod/redhat-operators-rccr2                  1/1     Running                      0          156m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.236.173   <none>        50051/TCP           179m
service/community-operators            ClusterIP   172.30.65.242    <none>        50051/TCP           179m
service/marketplace-operator-metrics   ClusterIP   172.30.177.155   <none>        8383/TCP,8081/TCP   3h8m
service/qe-app-registry                ClusterIP   172.30.163.173   <none>        50051/TCP           160m
service/redhat-marketplace             ClusterIP   172.30.32.12     <none>        50051/TCP           179m
service/redhat-operators               ClusterIP   172.30.131.49    <none>        50051/TCP           179m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           3h8m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-587b7bf94d   1         1         1       3h8m
replicaset.apps/marketplace-operator-7686ffb476   1         1         0       47s

The deployment is broken by:

# oc describe pod/marketplace-operator-7686ffb476-84x6s
  Warning  Failed  <invalid> (x10 over 75s)  kubelet  Error: secret "openshift-reg" not found

The deployment rollout is not terminated immediately.

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T04:07:30Z",
    "lastUpdateTime": "2021-04-23T04:07:30Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T03:31:14Z",
    "lastUpdateTime": "2021-04-23T06:38:54Z",
    "message": "ReplicaSet \"marketplace-operator-7686ffb476\" is progressing.",
    "reason": "ReplicaSetUpdated",
    "status": "True",
    "type": "Progressing"
  }
]

The CVO does not log the errors.

# grep 'deployment.*marketplace' cvo.log | grep -v Throttling
I0423 06:46:33.540152 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:46:33.954966 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.461871 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.851594 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)

> # oc get deployment -ojson | jq -r '.items[].status.conditions'
> [
> {
> "lastTransitionTime": "2021-04-23T04:07:30Z",
> "lastUpdateTime": "2021-04-23T04:07:30Z",
> "message": "Deployment has minimum availability.",
> "reason": "MinimumReplicasAvailable",
> "status": "True",
> "type": "Available"
> },
> {
> "lastTransitionTime": "2021-04-23T03:31:14Z",
> "lastUpdateTime": "2021-04-23T06:38:54Z",
> "message": "ReplicaSet \"marketplace-operator-7686ffb476\" is
> progressing.",
> "reason": "ReplicaSetUpdated",
> "status": "True",
> "type": "Progressing"
> }
> ]
>
The deployment rollout eventually times out.
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-23T09:35:15Z",
"lastUpdateTime": "2021-04-23T09:35:15Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-23T10:33:25Z",
"lastUpdateTime": "2021-04-23T10:33:25Z",
"message": "ReplicaSet \"marketplace-operator-7686ffb476\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
]
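While waiting for the rollout to time out, a query like the following narrows the output to just the Progressing condition (a variant of the jq commands already shown above, not a command captured in this bug):

$ oc -n openshift-marketplace get deployment marketplace-operator -o json \
    | jq -r '.status.conditions[] | select(.type == "Progressing") | .reason + ": " + .message'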
Verifying with 4.8.0-0.nightly-2021-04-22-225832.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-225832   True        False         13m     Cluster version is 4.8.0-0.nightly-2021-04-22-225832

Update the marketplace deployment by injecting the env variables:
envFrom:
- secretRef:
    name: openshift-reg

The deployment has an error.

# oc get all
NAME                                        READY   STATUS                       RESTARTS   AGE
pod/certified-operators-59phh               1/1     Running                      0          15m
pod/community-operators-6pfwq               1/1     Running                      0          16m
pod/marketplace-operator-548df64d-5nctr     1/1     Running                      0          16m
pod/marketplace-operator-649ccff7f4-lfvnm   0/1     CreateContainerConfigError   0          7s
pod/qe-app-registry-vmrr7                   1/1     Running                      0          2m13s
pod/redhat-marketplace-hxhfn                1/1     Running                      0          15m
pod/redhat-operators-882cr                  1/1     Running                      0          15m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.30.200    <none>        50051/TCP           39m
service/community-operators            ClusterIP   172.30.102.1     <none>        50051/TCP           39m
service/marketplace-operator-metrics   ClusterIP   172.30.208.23    <none>        8383/TCP,8081/TCP   47m
service/qe-app-registry                ClusterIP   172.30.219.78    <none>        50051/TCP           19m
service/redhat-marketplace             ClusterIP   172.30.126.70    <none>        50051/TCP           39m
service/redhat-operators               ClusterIP   172.30.113.251   <none>        50051/TCP           39m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           47m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-548df64d     1         1         1       47m
replicaset.apps/marketplace-operator-649ccff7f4   1         1         0       8s

# oc get pod/marketplace-operator-649ccff7f4-lfvnm -ojson | jq -r '.status.containerStatuses[].state'
{
  "waiting": {
    "message": "secret \"openshift-reg\" not found",
    "reason": "CreateContainerConfigError"
  }
}

# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T08:27:09Z",
    "lastUpdateTime": "2021-04-23T08:27:09Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T07:55:53Z",
    "lastUpdateTime": "2021-04-23T08:42:59Z",
    "message": "ReplicaSet \"marketplace-operator-649ccff7f4\" is progressing.",
    "reason": "ReplicaSetUpdated",
    "status": "True",
    "type": "Progressing"
  }
]

After a while, the operator is reconciled.
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-23T08:27:09Z",
    "lastUpdateTime": "2021-04-23T08:27:09Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-23T07:55:53Z",
    "lastUpdateTime": "2021-04-23T08:46:14Z",
    "message": "ReplicaSet \"marketplace-operator-548df64d\" has successfully progressed.",
    "reason": "NewReplicaSetAvailable",
    "status": "True",
    "type": "Progressing"
  }
]

# oc get all
NAME                                      READY   STATUS    RESTARTS   AGE
pod/certified-operators-59phh             1/1     Running   0          21m
pod/community-operators-6pfwq             1/1     Running   0          22m
pod/marketplace-operator-548df64d-5nctr   1/1     Running   0          22m
pod/qe-app-registry-vmrr7                 1/1     Running   0          8m18s
pod/redhat-marketplace-hxhfn              1/1     Running   0          21m
pod/redhat-operators-882cr                1/1     Running   0          21m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/certified-operators            ClusterIP   172.30.30.200    <none>        50051/TCP           45m
service/community-operators            ClusterIP   172.30.102.1     <none>        50051/TCP           45m
service/marketplace-operator-metrics   ClusterIP   172.30.208.23    <none>        8383/TCP,8081/TCP   53m
service/qe-app-registry                ClusterIP   172.30.219.78    <none>        50051/TCP           25m
service/redhat-marketplace             ClusterIP   172.30.126.70    <none>        50051/TCP           45m
service/redhat-operators               ClusterIP   172.30.113.251   <none>        50051/TCP           45m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/marketplace-operator   1/1     1            1           53m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/marketplace-operator-548df64d     1         1         1       53m
replicaset.apps/marketplace-operator-649ccff7f4   0         0         0       6m13s

To verify commit#8132d6a, I changed the image digest to an invalid one and changed progressDeadlineSeconds to 5, but I was unlucky and did not get ProgressDeadlineExceeded.

@Trevor, could you please hint how I can reach the ProgressDeadlineExceeded scenario? Thanks.

Possibly cordoning all the nodes and then killing something with replicas > 2 that is managed by the CVO? Looking for candidates:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1385644983909879808/artifacts/e2e-aws/gather-extra/artifacts/deployments.json | jq -r '.items[] | select(.spec.replicas >= 2).metadata | .namespace + " " + .name'
...
openshift-console downloads
...

So something like:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"

and then wait 10 minutes [1] for the default progressDeadlineSeconds to expire.

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds

Verifying commit#8132d6a.
Cordon all of the nodes:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done

Delete a deployment:

$ oc delete deployment.apps/downloads -n openshift-console

Wait 600s for ProgressDeadlineExceeded:

$ oc get deployment.apps/downloads -ojson | jq -r '.status.conditions'
[
  {
    "lastTransitionTime": "2021-04-25T09:49:57Z",
    "lastUpdateTime": "2021-04-25T09:49:57Z",
    "message": "Deployment does not have minimum availability.",
    "reason": "MinimumReplicasUnavailable",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-25T09:59:58Z",
    "lastUpdateTime": "2021-04-25T09:59:58Z",
    "message": "ReplicaSet \"downloads-7fdddb48b4\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]

Describe the downloads pod:

$ oc describe pod/downloads-7fdddb48b4-wnljj
Events:
  Type     Reason            Age   From                Message
  ----     ------            ----  ----                -------
  Warning  FailedScheduling  12m   default-scheduler   0/6 nodes are available: 6 node(s) were unschedulable.
  Warning  FailedScheduling  12m   default-scheduler   0/6 nodes are available: 6 node(s) were unschedulable.

The ProgressDeadlineExceeded error is not logged in the pod events or in the CVO log file.

> Delete a deployment...

If you delete the Deployment, it will remove both pods. The CVO will create a replacement Deployment, but with 0 of 2 pods ready, that Deployment will be, as you saw, Available=False as well as ProgressDeadlineExceeded. Comment 6 called for leaving the Deployment alone, but deleting a single pod. I expect that will leave the Deployment Available=True while still getting us ProgressDeadlineExceeded. But in either case, Available=False and ProgressDeadlineExceeded should both be sufficient to get the CVO to mention the issue in ClusterVersion conditions. If it doesn't, can you include the ClusterVersion YAML?

The approach in comment #6 did not get ProgressDeadlineExceeded, so I deleted the deployment.
# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-04-23T07:55:27Z"
    generation: 1
    name: version
    resourceVersion: "1694483"
    uid: 98e037f9-ed02-491d-894b-1f5965638c00
  spec:
    channel: stable-4.7
    clusterID: 93f31b0e-ef15-4b2f-b441-fe5c6c786231
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: Done applying 4.8.0-0.nightly-2021-04-22-225832
      status: "True"
      type: Available
    - lastTransitionTime: "2021-04-26T01:33:35Z"
      message: 'Some cluster operators are still updating: machine-config, monitoring'
      reason: ClusterOperatorsNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: 'Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: some cluster operators have not yet rolled out'
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-04-23T07:55:27Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the "stable-4.7" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-04-25T09:50:05Z"
      message: 'Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
      reason: PoolUpdating
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      version: 4.8.0-0.nightly-2021-04-22-225832
    history:
    - completionTime: "2021-04-23T08:22:33Z"
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      startedTime: "2021-04-23T07:55:27Z"
      state: Completed
      verified: false
      version: 4.8.0-0.nightly-2021-04-22-225832
    observedGeneration: 1
    versionHash: 14fG0Q-Un8E=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc -n openshift-console get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/console-b5f64d9c7-lncw9      1/1     Running   0          18h
pod/console-b5f64d9c7-mk89l      1/1     Running   0          18h
pod/downloads-7fdddb48b4-wnljj   0/1     Pending   0          16h
pod/downloads-7fdddb48b4-zl2g2   0/1     Pending   0          16h

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/console     ClusterIP   172.30.87.18     <none>        443/TCP   2d18h
service/downloads   ClusterIP   172.30.119.245   <none>        80/TCP    2d18h

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console     2/2     2            2           2d18h
deployment.apps/downloads   0/2     2            0           16h

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/console-585b4f86d6     0         0         0       2d17h
replicaset.apps/console-5dcbfdc595     0         0         0       18h
replicaset.apps/console-6d466d4bcc     0         0         0       2d18h
replicaset.apps/console-b5f64d9c7      2         2         2       18h
replicaset.apps/downloads-7fdddb48b4   2         2         0       16h

NAME                                 HOST/PORT                                                                          PATH   SERVICES    PORT    TERMINATION          WILDCARD
route.route.openshift.io/console     console-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com             console     https   reencrypt/Redirect   None
route.route.openshift.io/downloads   downloads-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com           downloads   http    edge/Redirect        None

Hmm.
Poking around in a cluster-bot cluster:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.nightly-2021-04-25-195440
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"
$ oc -n openshift-console get -o json deployment downloads | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:38:50Z Progressing=True NewReplicaSetAvailable: ReplicaSet "downloads-8dcc96bf7" has successfully progressed.
2021-04-26T03:59:55Z Available=False MinimumReplicasUnavailable: Deployment does not have minimum availability.

Hrm, I didn't even have to wait 10m. ClusterVersion doesn't say anything about this issue:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Failing=False :
2021-04-26T03:56:13Z Progressing=False : Cluster version is 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T04:00:48Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

I wonder what the deal is with machine-config?

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-555bdd56b2ffb5c9b65c53df4805605b   False     True       False      3              0                   3                     0                      36m
worker   rendered-worker-5d9d6269bbbb1561d7ae1b34c447f4c9   False     True       False      3              0                   3                     0                      36m

3 of 3 in each pool are updated, but Updated=False and Updating=True because they are cordoned, and therefore not ready. That doesn't feel right. But anyway, back to the CVO:

$ POD="$(oc -n openshift-cluster-version get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods)"
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|downloads\|Result of work' | grep -v 'Throttling request'
...
I0426 04:08:49.958758 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:08:56.715714 1 sync_worker.go:767] Running sync for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:56.816355 1 sync_worker.go:779] Done syncing for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:57.115298 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:09:15.087322 1 sync_worker.go:767] Running sync for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122404 1 sync_worker.go:779] Done syncing for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122460 1 sync_worker.go:767] Running sync for service "openshift-console/downloads" (449 of 677)
I0426 04:09:15.321107 1 sync_worker.go:779] Done syncing for service "openshift-console/downloads" (449 of 677)
I0426 04:09:18.161172 1 task_graph.go:555] Result of work: []
...
Hrm, no downloads deployment. Ah, because the console folks moved it out from under the CVO to control it directly in [1]. So sorry for the stale choice of deployment. Uncordon so the downloads deployment can recover:

$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done

And go looking for another deployment:

$ oc adm release extract --to manifest
$ grep -ho 'replicas: .*' $(grep -rl 'kind: Deployment' manifest) | sort | uniq -c
     40 replicas: 1

So how to trigger Available=True with ProgressDeadlineExceeded... Maybe if I cordon everyone during an update, we'll be unable to schedule... But I don't want the old pod to go away first, because then we'd be Available=False. I see the default maxSurge is 25% [2]. And maxUnavailable rounds down [3]. So maybe our RollingUpdate deployments are surging.

$ grep RollingUpdate manifest/*deployment*
manifest/0000_50_olm_07-olm-operator.deployment.ibm-cloud-managed.yaml: type: RollingUpdate
manifest/0000_50_olm_07-olm-operator.deployment.yaml: type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.ibm-cloud-managed.yaml: type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.yaml: type: RollingUpdate
$ grep name manifest/0000_50_olm_07-olm-operator.deployment.yaml | head -n2
  name: olm-operator
  namespace: openshift-operator-lifecycle-manager

Ok, so I'm going to cordon all the nodes again, and then touch OLM to trigger a rolling update, which will hopefully surge a new pod, and then complain about that pod being stuck:

$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq .spec.template.metadata.annotations
{
  "workload.openshift.io/management": "{\"effect\": \"PreferredDuringScheduling\"}"
}
$ oc -n openshift-operator-lifecycle-manager patch deployment olm-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:23:18Z Progressing=True ReplicaSetUpdated: ReplicaSet "olm-operator-6c99f456cf" is progressing.
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
$ oc -n openshift-operator-lifecycle-manager get pods
NAME                            READY   STATUS    RESTARTS   AGE
...
olm-operator-589957f7bd-z4tlv   1/1     Running   0          88m
olm-operator-6c99f456cf-fr8t6   0/1     Pending   0          54s
...

Perfect :). Waiting for 10m...

Now:

$ oc -n openshift-operator-lifecycle-manager get pods | grep olm-operator
olm-operator-589957f7bd-z4tlv   1/1     Running   0          97m
olm-operator-6c99f456cf-fr8t6   0/1     Pending   0          10m
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
2021-04-26T05:00:31Z Progressing=False ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
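The Available=True plus ProgressDeadlineExceeded combination above falls out of the default RollingUpdate rounding; a quick back-of-the-envelope using the Kubernetes defaults (these numbers are not taken from the OLM manifests themselves):

  replicas = 1, maxSurge = 25%, maxUnavailable = 25%
  surge pods       = ceil(1 * 0.25)  = 1   # one extra pod may be created; it stays Pending while the nodes are cordoned
  unavailable pods = floor(1 * 0.25) = 0   # the existing pod must keep running, so the Deployment stays Available=True

The old pod keeps serving while the surge pod sits Pending, and once the default progressDeadlineSeconds (600s) expires, the Progressing condition flips to ProgressDeadlineExceeded.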
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:53:35.383334 1 task_graph.go:555] Result of work: []
I0426 04:56:26.340740 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:56:33.347692 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:59:52.619062 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)

The CVO sticks there for around five minutes waiting for machine-config to become happy, which never happens, because we're still cordoned. Once that times out, we wait out the cool-down between sync cycles, start in again, and notice the stuck deployment:

$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:05:27.417819 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available]
I0426 05:08:44.045069 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:09:11.750990 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
E0426 05:09:11.901500 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.

It takes us another 5m to give up on the stuck Deployment (we should probably fail-fast on ProgressDeadlineExceeded to avoid stacking timeouts, but we can adjust that in follow-up work).
But eventually we do give up:

$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
E0426 05:12:19.268254 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:37.640209 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:57.047203 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:22.513803 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:45.891628 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:00.301317 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:23.508240 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
I0426 05:14:25.958826 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.]
I0426 05:14:25.958900 1 sync_worker.go:939] Update error 501 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.)
E0426 05:14:25.958922 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Progressing=False WorkloadNotProgressing: Error while reconciling 4.8.0-0.nightly-2021-04-25-195440: the workload openshift-operator-lifecycle-manager/olm-operator cannot roll out
2021-04-26T04:48:57Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
2021-04-26T05:05:28Z Failing=True WorkloadNotProgressing: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.

Hooray, there we go :)

[1]: https://github.com/openshift/console-operator/pull/508
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable

Trevor, appreciate your help and detailed information.

Verifying commit#8132d6a.

Cordon all of the nodes:

$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done

Update the catalog-operator deployment:

$ oc patch deployment catalog-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'

Check that the new catalog-operator pod is pending:

$ oc get po
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-675bc6d7cb-kx5bm   1/1     Running   0          3d1h
catalog-operator-7b46fdf5-txzhx     0/1     Pending   0          15m
olm-operator-65b47b65f5-sj7xl       1/1     Running   0          3d1h
packageserver-676bc98c9-kgv5m       1/1     Running   0          3d1h
packageserver-676bc98c9-xqvsp       1/1     Running   0          3d1h

Wait 10m...
Check the deployment state:

$ oc get deployment catalog-operator -ojson | jq -r '.status.conditions[]'
{
  "lastTransitionTime": "2021-04-23T08:26:40Z",
  "lastUpdateTime": "2021-04-23T08:26:40Z",
  "message": "Deployment has minimum availability.",
  "reason": "MinimumReplicasAvailable",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-04-26T10:04:38Z",
  "lastUpdateTime": "2021-04-26T10:04:38Z",
  "message": "ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
  "reason": "ProgressDeadlineExceeded",
  "status": "False",
  "type": "Progressing"
}

After a while, check the CVO conditions:

$ oc get clusterversion -o json | jq -r '.items[].status.conditions[]'
{
  "lastTransitionTime": "2021-04-23T08:22:33Z",
  "message": "Done applying 4.8.0-0.nightly-2021-04-22-225832",
  "status": "True",
  "type": "Available"
}
{
  "lastTransitionTime": "2021-04-26T10:14:20Z",
  "message": "deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
  "reason": "WorkloadNotProgressing",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2021-04-23T08:22:33Z",
  "message": "Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: the workload openshift-operator-lifecycle-manager/catalog-operator cannot roll out",
  "reason": "WorkloadNotProgressing",
  "status": "False",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2021-04-23T07:55:27Z",
  "message": "Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the \"stable-4.7\" channel",
  "reason": "VersionNotFound",
  "status": "False",
  "type": "RetrievedUpdates"
}
{
  "lastTransitionTime": "2021-04-26T08:05:10Z",
  "message": "Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details",
  "reason": "PoolUpdating",
  "status": "False",
  "type": "Upgradeable"
}

Check the CVO log:

$ oc -n openshift-cluster-version logs cluster-version-operator-6f4c448795-54z97 | grep 'deployment.*catalog\|Result of work' | tail -5
E0426 11:42:19.121497 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
E0426 11:42:41.875235 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
I0426 11:42:45.607570 1 task_graph.go:555] Result of work: [deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. Cluster operator monitoring is not available]
I0426 11:42:45.607688 1 sync_worker.go:939] Update error 502 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.)
E0426 11:42:45.607736 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.

The two commits work as designed. Moving this to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438