Bug 1951339
| Summary: | Cluster-version operator does not manage operand container environments when manifest lacks opinions | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
| Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.z | CC: | aos-bugs, jokerman, mbargenq, yanyang |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The cluster-version operator did not reconcile env and envFrom for manifests which did not set those properties. Consequence: Cluster administrators were able to manipulate those properties, potentially adding broken values, without the cluster-version operator stomping their changes. Fix: The cluster-version operator now clears env and envFrom if they are unset in the manifest. Result: The cluster will automatically recover from invalid cluster-admin injections to these properties. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:01:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Reproducing it with 4.8.0-fc.0.
Steps to reproduce:
1. Install a cluster with 4.8.0-fc.0
2. Update the marketplace deployment by injecting an env variable into its containers (a sketch of one possible patch command follows the reproduction output below):
envFrom:
- secretRef:
name: openshift-reg
3. Check the results:
The marketplace-operator pod is recreated with a config error:
# oc get all
NAME READY STATUS RESTARTS AGE
pod/certified-operators-lddkh 1/1 Running 0 155m
pod/community-operators-82cdb 1/1 Running 0 16m
pod/marketplace-operator-587b7bf94d-2sg5n 1/1 Running 0 153m
pod/marketplace-operator-7686ffb476-84x6s 0/1 CreateContainerConfigError 0 47s
pod/qe-app-registry-24l67 1/1 Running 0 156m
pod/redhat-marketplace-xx8jl 1/1 Running 0 156m
pod/redhat-operators-rccr2 1/1 Running 0 156m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/certified-operators ClusterIP 172.30.236.173 <none> 50051/TCP 179m
service/community-operators ClusterIP 172.30.65.242 <none> 50051/TCP 179m
service/marketplace-operator-metrics ClusterIP 172.30.177.155 <none> 8383/TCP,8081/TCP 3h8m
service/qe-app-registry ClusterIP 172.30.163.173 <none> 50051/TCP 160m
service/redhat-marketplace ClusterIP 172.30.32.12 <none> 50051/TCP 179m
service/redhat-operators ClusterIP 172.30.131.49 <none> 50051/TCP 179m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/marketplace-operator 1/1 1 1 3h8m
NAME DESIRED CURRENT READY AGE
replicaset.apps/marketplace-operator-587b7bf94d 1 1 1 3h8m
replicaset.apps/marketplace-operator-7686ffb476 1 1 0 47s
The deployment is broken by the missing secret:
# oc describe pod/marketplace-operator-7686ffb476-84x6s
Warning Failed <invalid> (x10 over 75s) kubelet Error: secret "openshift-reg" not found
The deployment does not time out immediately:
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-23T04:07:30Z",
"lastUpdateTime": "2021-04-23T04:07:30Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-23T03:31:14Z",
"lastUpdateTime": "2021-04-23T06:38:54Z",
"message": "ReplicaSet \"marketplace-operator-7686ffb476\" is progressing.",
"reason": "ReplicaSetUpdated",
"status": "True",
"type": "Progressing"
}
]
The CVO does not log the error:
# grep 'deployment.*marketplace' cvo.log | grep -v Throttling
I0423 06:46:33.540152 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:46:33.954966 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.461871 1 sync_worker.go:767] Running sync for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
I0423 06:50:05.851594 1 sync_worker.go:779] Done syncing for deployment "openshift-marketplace/marketplace-operator" (521 of 678)
The deployment eventually times out:
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-23T09:35:15Z",
"lastUpdateTime": "2021-04-23T09:35:15Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-23T10:33:25Z",
"lastUpdateTime": "2021-04-23T10:33:25Z",
"message": "ReplicaSet \"marketplace-operator-7686ffb476\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
]
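For reference, here is a minimal sketch of one way to apply the step-2 injection (the JSON-patch path and the container index 0 are assumptions about the marketplace-operator pod spec; the secret name openshift-reg is deliberately nonexistent, as in the reproduction):
# oc -n openshift-marketplace patch deployment marketplace-operator --type json -p '[{"op": "add", "path": "/spec/template/spec/containers/0/envFrom", "value": [{"secretRef": {"name": "openshift-reg"}}]}]'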
Verifying with 4.8.0-0.nightly-2021-04-22-225832.
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-04-22-225832 True False 13m Cluster version is 4.8.0-0.nightly-2021-04-22-225832
Update the marketplace deployment by injecting the same envFrom entry:
envFrom:
- secretRef:
name: openshift-reg
The new pod hits the same config error:
# oc get all
NAME READY STATUS RESTARTS AGE
pod/certified-operators-59phh 1/1 Running 0 15m
pod/community-operators-6pfwq 1/1 Running 0 16m
pod/marketplace-operator-548df64d-5nctr 1/1 Running 0 16m
pod/marketplace-operator-649ccff7f4-lfvnm 0/1 CreateContainerConfigError 0 7s
pod/qe-app-registry-vmrr7 1/1 Running 0 2m13s
pod/redhat-marketplace-hxhfn 1/1 Running 0 15m
pod/redhat-operators-882cr 1/1 Running 0 15m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/certified-operators ClusterIP 172.30.30.200 <none> 50051/TCP 39m
service/community-operators ClusterIP 172.30.102.1 <none> 50051/TCP 39m
service/marketplace-operator-metrics ClusterIP 172.30.208.23 <none> 8383/TCP,8081/TCP 47m
service/qe-app-registry ClusterIP 172.30.219.78 <none> 50051/TCP 19m
service/redhat-marketplace ClusterIP 172.30.126.70 <none> 50051/TCP 39m
service/redhat-operators ClusterIP 172.30.113.251 <none> 50051/TCP 39m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/marketplace-operator 1/1 1 1 47m
NAME DESIRED CURRENT READY AGE
replicaset.apps/marketplace-operator-548df64d 1 1 1 47m
replicaset.apps/marketplace-operator-649ccff7f4 1 1 0 8s
# oc get pod/marketplace-operator-649ccff7f4-lfvnm -ojson | jq -r '.status.containerStatuses[].state'
{
"waiting": {
"message": "secret \"openshift-reg\" not found",
"reason": "CreateContainerConfigError"
}
}
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-23T08:27:09Z",
"lastUpdateTime": "2021-04-23T08:27:09Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-23T07:55:53Z",
"lastUpdateTime": "2021-04-23T08:42:59Z",
"message": "ReplicaSet \"marketplace-operator-649ccff7f4\" is progressing.",
"reason": "ReplicaSetUpdated",
"status": "True",
"type": "Progressing"
}
]
After a while, the CVO reconciles the deployment back to the manifest, and the injected ReplicaSet is scaled down:
# oc get deployment -ojson | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-23T08:27:09Z",
"lastUpdateTime": "2021-04-23T08:27:09Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-23T07:55:53Z",
"lastUpdateTime": "2021-04-23T08:46:14Z",
"message": "ReplicaSet \"marketplace-operator-548df64d\" has successfully progressed.",
"reason": "NewReplicaSetAvailable",
"status": "True",
"type": "Progressing"
}
]
# oc get all
NAME READY STATUS RESTARTS AGE
pod/certified-operators-59phh 1/1 Running 0 21m
pod/community-operators-6pfwq 1/1 Running 0 22m
pod/marketplace-operator-548df64d-5nctr 1/1 Running 0 22m
pod/qe-app-registry-vmrr7 1/1 Running 0 8m18s
pod/redhat-marketplace-hxhfn 1/1 Running 0 21m
pod/redhat-operators-882cr 1/1 Running 0 21m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/certified-operators ClusterIP 172.30.30.200 <none> 50051/TCP 45m
service/community-operators ClusterIP 172.30.102.1 <none> 50051/TCP 45m
service/marketplace-operator-metrics ClusterIP 172.30.208.23 <none> 8383/TCP,8081/TCP 53m
service/qe-app-registry ClusterIP 172.30.219.78 <none> 50051/TCP 25m
service/redhat-marketplace ClusterIP 172.30.126.70 <none> 50051/TCP 45m
service/redhat-operators ClusterIP 172.30.113.251 <none> 50051/TCP 45m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/marketplace-operator 1/1 1 1 53m
NAME DESIRED CURRENT READY AGE
replicaset.apps/marketplace-operator-548df64d 1 1 1 53m
replicaset.apps/marketplace-operator-649ccff7f4 0 0 0 6m13s
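As an additional check (not in the original comment), the injected envFrom should now be gone from the live deployment spec; a sketch, assuming the same namespace and deployment name as above:
# oc -n openshift-marketplace get deployment marketplace-operator -o jsonpath='{.spec.template.spec.containers[*].envFrom}{"\n"}'
Empty output here would indicate the CVO has cleared the field back to the manifest's (absent) value.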
To verify commit#8132d6a, I changed the image digest to an invalid one and changed progressDeadlineSeconds to 5, but I was unlucky and did not get ProgressDeadlineExceeded. @Trevor, could you please hint at how I can reach a ProgressDeadlineExceeded scenario? Thanks.

Possibly by cordoning all the nodes and then killing something with replicas >= 2 that is managed by the CVO? Looking for candidates:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws/1385644983909879808/artifacts/e2e-aws/gather-extra/artifacts/deployments.json | jq -r '.items[] | select(.spec.replicas >= 2).metadata | .namespace + " " + .name'
...
openshift-console downloads
...
So something like:
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"
and then wait 10 minutes [1] for the default progressDeadlineSeconds to expire.
[1]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds

Verifying commit#8132d6a.
Cordon all of the nodes:
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done
Delete a deployment:
$ oc delete deployment.apps/downloads -n openshift-console
Wait 600s for ProgressDeadlineExceeded:
$ oc get deployment.apps/downloads -ojson | jq -r '.status.conditions'
[
{
"lastTransitionTime": "2021-04-25T09:49:57Z",
"lastUpdateTime": "2021-04-25T09:49:57Z",
"message": "Deployment does not have minimum availability.",
"reason": "MinimumReplicasUnavailable",
"status": "False",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-25T09:59:58Z",
"lastUpdateTime": "2021-04-25T09:59:58Z",
"message": "ReplicaSet \"downloads-7fdddb48b4\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
]
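Instead of polling manually, something like the following should also block until the Progressing condition flips (a sketch; the 700s timeout is just a cushion over the default 600s progress deadline):
$ oc -n openshift-console wait deployment/downloads --for=condition=Progressing=False --timeout=700s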
Describe the downloads pod:
$ oc describe pod/downloads-7fdddb48b4-wnljj
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12m default-scheduler 0/6 nodes are available: 6 node(s) were unschedulable.
Warning FailedScheduling 12m default-scheduler 0/6 nodes are available: 6 node(s) were unschedulable.
The ProgressDeadlineExceeded error is not logged in the pod events or in the CVO log.
> Delete a deployment...

If you delete the Deployment, it will remove both pods. The CVO will create a replacement Deployment, but with 0 of 2 pods ready, that Deployment will be, as you saw, Available=False as well as ProgressDeadlineExceeded. Comment 6 called for leaving the Deployment alone, but deleting a single pod. I expect that will leave the Deployment Available=True while still getting us ProgressDeadlineExceeded. But in either case, Available=False and ProgressDeadlineExceeded should both be sufficient to get the CVO to mention the issue in ClusterVersion conditions. If it doesn't, can you include the ClusterVersion YAML?

Comment #6 did not get ProgressDeadlineExceeded, so I deleted the deployment.

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2021-04-23T07:55:27Z"
    generation: 1
    name: version
    resourceVersion: "1694483"
    uid: 98e037f9-ed02-491d-894b-1f5965638c00
  spec:
    channel: stable-4.7
    clusterID: 93f31b0e-ef15-4b2f-b441-fe5c6c786231
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: Done applying 4.8.0-0.nightly-2021-04-22-225832
      status: "True"
      type: Available
    - lastTransitionTime: "2021-04-26T01:33:35Z"
      message: 'Some cluster operators are still updating: machine-config, monitoring'
      reason: ClusterOperatorsNotAvailable
      status: "True"
      type: Failing
    - lastTransitionTime: "2021-04-23T08:22:33Z"
      message: 'Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: some cluster operators have not yet rolled out'
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2021-04-23T07:55:27Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the "stable-4.7" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2021-04-25T09:50:05Z"
      message: 'Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
      reason: PoolUpdating
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      version: 4.8.0-0.nightly-2021-04-22-225832
    history:
    - completionTime: "2021-04-23T08:22:33Z"
      image: registry.ci.openshift.org/ocp/release@sha256:018e7acbdde984b80c0ed7eaeea7d3f27dc4442dc15e7f9c6de02ffaf9e9eee9
      startedTime: "2021-04-23T07:55:27Z"
      state: Completed
      verified: false
      version: 4.8.0-0.nightly-2021-04-22-225832
    observedGeneration: 1
    versionHash: 14fG0Q-Un8E=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

# oc -n openshift-console get all
NAME READY STATUS RESTARTS AGE
pod/console-b5f64d9c7-lncw9 1/1 Running 0 18h
pod/console-b5f64d9c7-mk89l 1/1 Running 0 18h
pod/downloads-7fdddb48b4-wnljj 0/1 Pending 0 16h
pod/downloads-7fdddb48b4-zl2g2 0/1 Pending 0 16h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/console ClusterIP 172.30.87.18 <none> 443/TCP 2d18h
service/downloads ClusterIP 172.30.119.245 <none> 80/TCP 2d18h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/console 2/2 2 2 2d18h
deployment.apps/downloads 0/2 2 0 16h
NAME DESIRED CURRENT READY AGE
replicaset.apps/console-585b4f86d6 0 0 0 2d17h
replicaset.apps/console-5dcbfdc595 0 0 0 18h
replicaset.apps/console-6d466d4bcc 0 0 0 2d18h
replicaset.apps/console-b5f64d9c7 2 2 2 18h
replicaset.apps/downloads-7fdddb48b4 2 2 0 16h
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/console console-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com console https reencrypt/Redirect None
route.route.openshift.io/downloads downloads-openshift-console.apps.yangyang0423-1.qe.gcp.devcluster.openshift.com downloads http edge/Redirect None

Hmm. Poking around in a cluster-bot cluster:
$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.nightly-2021-04-25-195440
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ POD="$(oc -n openshift-console get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | grep downloads | head -n1)"
$ oc -n openshift-console delete pod "${POD}"
$ oc -n openshift-console get -o json deployment downloads | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:38:50Z Progressing=True NewReplicaSetAvailable: ReplicaSet "downloads-8dcc96bf7" has successfully progressed.
2021-04-26T03:59:55Z Available=False MinimumReplicasUnavailable: Deployment does not have minimum availability.
Hrm, I didn't even have to wait 10m. ClusterVersion doesn't say anything about this issue:
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Failing=False :
2021-04-26T03:56:13Z Progressing=False : Cluster version is 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T04:00:48Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
I wonder what the deal is with machine-config?
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-555bdd56b2ffb5c9b65c53df4805605b False True False 3 0 3 0 36m
worker rendered-worker-5d9d6269bbbb1561d7ae1b34c447f4c9 False True False 3 0 3 0 36m
3 of 3 in each pool are updated, but Updated=False and Updating=True because they are cordoned, and therefore not ready. That doesn't feel right. But anyway, back to the CVO:
$ POD="$(oc -n openshift-cluster-version get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods)"
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|downloads\|Result of work' | grep -v 'Throttling request'
...
I0426 04:08:49.958758 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:08:56.715714 1 sync_worker.go:767] Running sync for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:56.816355 1 sync_worker.go:779] Done syncing for customresourcedefinition "consoleclidownloads.console.openshift.io" (51 of 677)
I0426 04:08:57.115298 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:09:15.087322 1 sync_worker.go:767] Running sync for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122404 1 sync_worker.go:779] Done syncing for route "openshift-console/downloads" (448 of 677)
I0426 04:09:15.122460 1 sync_worker.go:767] Running sync for service "openshift-console/downloads" (449 of 677)
I0426 04:09:15.321107 1 sync_worker.go:779] Done syncing for service "openshift-console/downloads" (449 of 677)
I0426 04:09:18.161172 1 task_graph.go:555] Result of work: []
...
Hrm, no downloads deployment. Ah, because the console folks moved it out from under the CVO to control it directly in [1]. So sorry for the stale choice of deployment. Uncordon so the downloads deployment can recover:
$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done
And go looking for another deployment:
$ oc adm release extract --to manifest
$ grep -ho 'replicas: .*' $(grep -rl 'kind: Deployment' manifest) | sort | uniq -c
40 replicas: 1
So how to trigger Available=True with ProgressDeadlineExceeded... Maybe if I cordon everyone during an update, we'll be unable to schedule... But I don't want the old pod to go away first, because then we'd be Available=False. I see the default maxSurge is 25% [2]. And maxUnavailable rounds down [3]. So maybe our RollingUpdate deployments are surging.
$ grep RollingUpdate manifest/*deployment*
manifest/0000_50_olm_07-olm-operator.deployment.ibm-cloud-managed.yaml: type: RollingUpdate
manifest/0000_50_olm_07-olm-operator.deployment.yaml: type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.ibm-cloud-managed.yaml: type: RollingUpdate
manifest/0000_50_olm_08-catalog-operator.deployment.yaml: type: RollingUpdate
$ grep name manifest/0000_50_olm_07-olm-operator.deployment.yaml | head -n2
name: olm-operator
namespace: openshift-operator-lifecycle-manager
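To confirm the surge reasoning above against the live object, the deployment's rollout strategy can be dumped directly (a sketch; the exact values depend on the manifest and on Kubernetes defaults):
$ oc -n openshift-operator-lifecycle-manager get deployment olm-operator -o jsonpath='{.spec.strategy.type}{" "}{.spec.strategy.rollingUpdate}{"\n"}'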
Ok, so I'm going to cordon all the nodes again, and then touch OLM to trigger a rolling update, which will hopefully surge a new pod, and then complain about that pod being stuck:
$ for NODE in ${NODES}; do oc adm cordon "${NODE}"; done
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq .spec.template.metadata.annotations
{
"workload.openshift.io/management": "{\"effect\": \"PreferredDuringScheduling\"}"
}
$ oc -n openshift-operator-lifecycle-manager patch deployment olm-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:23:18Z Progressing=True ReplicaSetUpdated: ReplicaSet "olm-operator-6c99f456cf" is progressing.
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
$ oc -n openshift-operator-lifecycle-manager get pods
NAME READY STATUS RESTARTS AGE
...
olm-operator-589957f7bd-z4tlv 1/1 Running 0 88m
olm-operator-6c99f456cf-fr8t6 0/1 Pending 0 54s
...
Perfect :). Waiting for 10m... Now:
$ oc -n openshift-operator-lifecycle-manager get pods | grep olm-operator
olm-operator-589957f7bd-z4tlv 1/1 Running 0 97m
olm-operator-6c99f456cf-fr8t6 0/1 Pending 0 10m
$ oc -n openshift-operator-lifecycle-manager get -o json deployment olm-operator | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:31:28Z Available=True MinimumReplicasAvailable: Deployment has minimum availability.
2021-04-26T05:00:31Z Progressing=False ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:53:35.383334 1 task_graph.go:555] Result of work: []
I0426 04:56:26.340740 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:56:33.347692 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 04:59:52.619062 1 sync_worker.go:767] Running sync for customresourcedefinition "storagestates.migration.k8s.io" (198 of 677)
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
The CVO sticks there for around five minutes waiting for machine-config to happy up, which never happens, because we're still cordoned. Once that times out, we wait out the cool-down between sync cycles, start in again, and notice the stuck deployment:
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
I0426 04:56:53.696766 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:53.996634 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 04:56:54.547242 1 task_graph.go:555] Result of work: []
I0426 04:59:45.504344 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:00:13.211034 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:00:13.363510 1 sync_worker.go:779] Done syncing for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
I0426 05:05:27.417819 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available]
I0426 05:08:44.045069 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-04-25-195440 (force=false) on generation 1 in state Reconciling at attempt 0
I0426 05:09:11.750990 1 sync_worker.go:767] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677)
E0426 05:09:11.901500 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
It takes us another 5m to give up on the stuck Deployment (we should probably fail-fast on ProgressDeadlineExceeded to avoid stacking timeouts, but can adjust to that in follow-up work). But eventually we do give up:
$ oc -n openshift-cluster-version logs "${POD}" | grep 'Running sync.*in state\|deployment.*olm-operator\|Result of work' | grep -v 'Throttling request' | tail
E0426 05:12:19.268254 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:37.640209 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:12:57.047203 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:22.513803 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:13:45.891628 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:00.301317 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
E0426 05:14:23.508240 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/olm-operator" (501 of 677): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
I0426 05:14:25.958826 1 task_graph.go:555] Result of work: [Cluster operator machine-config is not available deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.]
I0426 05:14:25.958900 1 sync_worker.go:939] Update error 501 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.)
E0426 05:14:25.958922 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2021-04-26T03:22:45Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-04-26T03:56:13Z Available=True : Done applying 4.8.0-0.nightly-2021-04-25-195440
2021-04-26T03:56:13Z Progressing=False WorkloadNotProgressing: Error while reconciling 4.8.0-0.nightly-2021-04-25-195440: the workload openshift-operator-lifecycle-manager/olm-operator cannot roll out
2021-04-26T04:48:57Z Upgradeable=False PoolUpdating: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
2021-04-26T05:05:28Z Failing=True WorkloadNotProgressing: deployment openshift-operator-lifecycle-manager/olm-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "olm-operator-6c99f456cf" has timed out progressing.
Hooray, there we go :)
[1]: https://github.com/openshift/console-operator/pull/508
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[3]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable
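Not part of the verification itself, but the cordons and the poke annotation above presumably need to be reverted before the cluster can settle; a minimal cleanup sketch, assuming the NODES variable from earlier is still set:
$ for NODE in ${NODES}; do oc adm uncordon "${NODE}"; done
$ oc -n openshift-operator-lifecycle-manager patch deployment olm-operator --type json -p '[{"op": "remove", "path": "/spec/template/metadata/annotations/poke"}]'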
Trevor, appreciate your help and detailed information.
Verifying commit#8132d6a.
Cordon all of the nodes:
$ NODES="$(oc get -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes)"
$ for NODE in ${NODES}; do oc cordon "${NODE}"; done
Update the catalog-operator deployment:
$ oc patch deployment catalog-operator --type json -p '[{"op": "add", "path": "/spec/template/metadata/annotations/poke", "value": "123"}]'
Check that the new catalog-operator pod is pending:
$ oc get po
NAME READY STATUS RESTARTS AGE
catalog-operator-675bc6d7cb-kx5bm 1/1 Running 0 3d1h
catalog-operator-7b46fdf5-txzhx 0/1 Pending 0 15m
olm-operator-65b47b65f5-sj7xl 1/1 Running 0 3d1h
packageserver-676bc98c9-kgv5m 1/1 Running 0 3d1h
packageserver-676bc98c9-xqvsp 1/1 Running 0 3d1h
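To confirm the pending pod is stuck on scheduling rather than something else, its PodScheduled condition can be inspected directly (a sketch using the pod name from this run; it should report the unschedulable-nodes message seen earlier):
$ oc get pod catalog-operator-7b46fdf5-txzhx -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}{"\n"}'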
Wait 10m... check the deployment state:
$ oc get deployment catalog-operator -ojson | jq -r '.status.conditions[]'
{
"lastTransitionTime": "2021-04-23T08:26:40Z",
"lastUpdateTime": "2021-04-23T08:26:40Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
}
{
"lastTransitionTime": "2021-04-26T10:04:38Z",
"lastUpdateTime": "2021-04-26T10:04:38Z",
"message": "ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
After a while, check the ClusterVersion conditions:
$ oc get clusterversion -o json | jq -r '.items[].status.conditions[]'
{
"lastTransitionTime": "2021-04-23T08:22:33Z",
"message": "Done applying 4.8.0-0.nightly-2021-04-22-225832",
"status": "True",
"type": "Available"
}
{
"lastTransitionTime": "2021-04-26T10:14:20Z",
"message": "deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet \"catalog-operator-7b46fdf5\" has timed out progressing.",
"reason": "WorkloadNotProgressing",
"status": "True",
"type": "Failing"
}
{
"lastTransitionTime": "2021-04-23T08:22:33Z",
"message": "Error while reconciling 4.8.0-0.nightly-2021-04-22-225832: the workload openshift-operator-lifecycle-manager/catalog-operator cannot roll out",
"reason": "WorkloadNotProgressing",
"status": "False",
"type": "Progressing"
}
{
"lastTransitionTime": "2021-04-23T07:55:27Z",
"message": "Unable to retrieve available updates: currently reconciling cluster version 4.8.0-0.nightly-2021-04-22-225832 not found in the \"stable-4.7\" channel",
"reason": "VersionNotFound",
"status": "False",
"type": "RetrievedUpdates"
}
{
"lastTransitionTime": "2021-04-26T08:05:10Z",
"message": "Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details",
"reason": "PoolUpdating",
"status": "False",
"type": "Upgradeable"
}
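For a more compact view, the same jq rendering used earlier in this bug works here as well:
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort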
Check the CVO log:
$ oc -n openshift-cluster-version logs cluster-version-operator-6f4c448795-54z97 | grep 'deployment.*catalog\|Result of work' | tail -5
E0426 11:42:19.121497 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
E0426 11:42:41.875235 1 task.go:112] error running apply for deployment "openshift-operator-lifecycle-manager/catalog-operator" (502 of 677): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
I0426 11:42:45.607570 1 task_graph.go:555] Result of work: [deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. Cluster operator monitoring is not available]
I0426 11:42:45.607688 1 sync_worker.go:939] Update error 502 of 677: WorkloadNotProgressing deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing. (*errors.errorString: deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.)
E0426 11:42:45.607736 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-operator-lifecycle-manager/catalog-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "catalog-operator-7b46fdf5" has timed out progressing.
The two commits work as designed. Moving it to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438
Seen in the wild, a cluster admin stuck an update by inserting:

$ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].spec.template.spec.containers[].envFrom[]'
{
  "secretRef": {
    "name": "openshift-reg"
  }
}

This broke the deploy:

$ yaml2json <namespaces/openshift-marketplace/pods/marketplace-operator-f7cc88d59-hhh75/marketplace-operator-f7cc88d59-hhh75.yaml | jq -r '.status.containerStatuses[].state'
{
  "waiting": {
    "message": "secret \"openshift-reg\" not found",
    "reason": "CreateContainerConfigError"
  }
}

We should reconcile the environment settings even if the manifest has no opinions on their content.

Also, the CVO should have complained about the Deployment failing to reconcile, but it does not:

$ grep 'deployment.*marketplace' namespaces/openshift-cluster-version/pods/cluster-version-operator-5d8d8bf546-qgbsd/cluster-version-operator/cluster-version-operator/logs/current.log | grep -v Throttling
2021-04-19T17:23:26.881236307Z I0419 17:23:26.881220 1 sync_worker.go:769] Running sync for deployment "openshift-marketplace/marketplace-operator" (512 of 668)
2021-04-19T17:23:29.926595332Z I0419 17:23:29.926544 1 sync_worker.go:781] Done syncing for deployment "openshift-marketplace/marketplace-operator" (512 of 668)
...
2021-04-19T20:37:18.117408461Z I0419 20:37:18.117372 1 sync_worker.go:769] Running sync for deployment "openshift-marketplace/marketplace-operator" (512 of 668)
2021-04-19T20:37:21.312143536Z I0419 20:37:21.312085 1 sync_worker.go:781] Done syncing for deployment "openshift-marketplace/marketplace-operator" (512 of 668)

Checking the Deployment status:

$ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-04-12T22:04:41Z",
    "lastUpdateTime": "2021-04-12T22:04:41Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-04-19T17:33:30Z",
    "lastUpdateTime": "2021-04-19T17:33:30Z",
    "message": "ReplicaSet \"marketplace-operator-f7cc88d59\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]

We should be complaining about that ProgressDeadlineExceeded directly as well.
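A side note on tooling: yaml2json above may not be installed everywhere; an equivalent one-liner (assuming Python 3 with PyYAML is available) would be something like:
$ python3 -c 'import json, sys, yaml; json.dump(yaml.safe_load(sys.stdin), sys.stdout)' <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].spec.template.spec.containers[].envFrom[]'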