This bug was initially created as a copy of bug #1862524. I am copying this bug because: currently the CVO marks an upgrade as failed whenever an operator takes longer than 20 minutes to roll out. It's very common on clusters of any size to take more than 20 minutes to roll out operators which have daemonsets running on all hosts, in particular the MCO, network, and dns operators. By moving this to 40 minutes we'll significantly reduce the noise so we can focus on upgrades which have real problems. There's follow-up work to make more significant implementation changes here, but we'll push those out more slowly: https://issues.redhat.com/browse/OTA-247
I'm verifying it by setting up a cluster with 3 master nodes and 20 worker nodes on 4.5.14, then triggering an upgrade to 4.6.0-0.nightly-2020-10-08-210814. The upgrade succeeded; it took 27 minutes for the network operator to upgrade.
Hi Scott, what's the use case that leads to this failure? I'm not able to reproduce it. I set up a cluster consisting of 3 master nodes and 20 worker nodes on 4.5.12 and did an upgrade to 4.6.0-fc.8. The upgrade was successful even though it took more than 20 minutes for network and machine-config to upgrade.
It would be whenever any one operator takes longer than 20 minutes to upgrade. I would certainly have expected a cluster with 20 workers to exceed 20 minutes on those operators, unless maybe you were using a local registry that was exceptionally fast? I was able to exceed 20 minutes in a 12-node cluster in AWS installed from the CI registry. I'll see if I can confirm it when upgrading between those two versions.
The updates will succeed with or without this patch. The distinction is that before this patch, a single operator taking longer than 20m would lead to ClusterVersion reporting Failing=True. With this patch, ClusterVersion will only report Failing=True if an operator takes more than 40m. Otherwise cluster status, including the timing of the eventual successful completion, will be identical. The use case is to reduce false-alarm Failing=True reports for things like the network operator that are expected to take a while during healthy updates.
Thanks Scott and Trevor for the clarification. I attempted to verify it with the following procedure:

1. Install an AWS cluster with 3 master nodes and 20 worker nodes on 4.5.14
2. Do a cluster upgrade to 4.6.0-0.nightly-2020-10-10-041109
3. Check the Failing status

CVO is reporting Failing=True when an operator took around 13m. Attaching the screenshots.
Created attachment 1720840 [details] kubeapi failure
Created attachment 1720841 [details] network failure
Created attachment 1720854 [details] machine-config failure
The CLI command I used to upgrade the cluster is:

# oc adm upgrade --to-image='registry.svc.ci.openshift.org/ocp/release@sha256:a33eff631dd29cfe412d9cdcfaec68844a18b902e657708e16dbd85203da4eed' --force --allow-explicit-upgrade

The must-gather tarball is online: https://drive.google.com/file/d/1XfGhdudsKotzevRHOH9VjKUd5_0OiSrY/view?usp=sharing
From comment 11's must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-10-12T05:52:30Z 2020-10-12T07:19:30Z 4.6.0-0.nightly-2020-10-10-041109 Completed false
2020-10-12T04:53:42Z 2020-10-12T05:21:41Z 4.5.14 Completed false

All nightlies should have acceptable signatures on them, so there should be no need to --force the update. Or maybe ART stops publishing nightly signatures once they begin cutting RCs?

From the screenshots, the timestamps for Failing=True were:

* 06:05:21Z Kube-API-server operator still updating
* 06:36:51Z Network operator still updating
* 07:08:05Z Machine-config operator still updating

There won't be anything in the captured CVO logs about that:

$ head -n1 namespaces/openshift-cluster-version/pods/cluster-version-operator-6fbfc576b9-4szqt/cluster-version-operator/cluster-version-operator/logs/current.log
2020-10-12T07:12:27.038577910Z I1012 07:12:27.022793 1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

Maybe we should grow update CI around these clusters with more compute nodes, so we can get the usual event monitoring and Prometheus captures we collect in CI...
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified. https://coreos.slack.com/archives/CEGKQ43CP/p1602592336386900
*** Bug 1825006 has been marked as a duplicate of this bug. ***
Created attachment 1721660 [details]
Gathered CVO logs from a 4.5.15 -> 4.6.0-rc.4 update

Attempt at a reproducer:

cluster-bot: launch 4.5.15

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.6"}]'
$ oc -n openshift-machine-api get machinesets
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-p3gs8vk-f76d1-75z4s-worker-b   1         1         1       1           37m
ci-ln-p3gs8vk-f76d1-75z4s-worker-c   1         1         1       1           37m
ci-ln-p3gs8vk-f76d1-75z4s-worker-d   1         1         1       1           37m
$ oc -n openshift-machine-api patch machineset ci-ln-p3gs8vk-f76d1-75z4s-worker-b --type json -p '[{"op": "add", "path": "/spec/replicas", "value": 18}]'

Wait for the MachineSet to fill the request...

$ oc -n openshift-machine-api get --watch machineset ci-ln-p3gs8vk-f76d1-75z4s-worker-b
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-p3gs8vk-f76d1-75z4s-worker-b   18        18        1       1           40m
...
ci-ln-p3gs8vk-f76d1-75z4s-worker-b   18        18        18      18          42m

Start tailing the CVO:

$ while sleep 1; do POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log"; done

In another terminal, initiate the update:

$ oc adm upgrade --to 4.6.0-rc.4

A ways in:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2020-10-14T22:58:20Z RetrievedUpdates=True :
2020-10-14T22:53:33Z Available=True : Done applying 4.5.15
2020-10-15T00:10:46Z Failing=True ClusterOperatorNotAvailable: Cluster operator network is still updating
2020-10-14T23:17:33Z Progressing=True ClusterOperatorNotAvailable: Unable to apply 4.6.0-rc.4: the cluster operator network has not yet successfully rolled out

At that point:

$ grep 'Running sync.*in state\|Running sync for clusteroperator.*network\|error running apply for clusteroperator.*network\|Done syncing for clusteroperator.*network' \
2020-10-14T23:29:57+0000-cluster-version-operator-655f8f6749-dmffp.log
I1014 23:18:15.432717 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 0
I1014 23:24:21.153789 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 1
I1014 23:30:50.343295 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 2
I1014 23:38:09.389702 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 3
I1014 23:47:15.730610 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 4
I1014 23:47:54.142905 1 sync_worker.go:701] Running sync for clusteroperator "network" (493 of 618)
E1014 23:52:57.646403 1 task.go:81] error running apply for clusteroperator "network" (493 of 618): Cluster operator network is still updating
I1014 23:56:02.416775 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 5
I1014 23:56:46.828845 1 sync_worker.go:701] Running sync for clusteroperator "network" (493 of 618)
E1015 00:01:44.328551 1 task.go:81] error running apply for clusteroperator "network" (493 of 618): Cluster operator network is still updating
I1015 00:04:53.587571 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 6
I1015 00:05:31.899081 1 sync_worker.go:701] Running sync for clusteroperator "network" (493 of 618)
E1015 00:10:35.500661 1 task.go:81] error running apply for clusteroperator "network" (493 of 618): Cluster operator network is still updating
I1015 00:13:51.624648 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 7

So that is 23:47Z through 00:10Z, which is indeed short of the target 40m.
And later:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2020-10-14T22:58:20Z RetrievedUpdates=True :
2020-10-14T22:53:33Z Available=True : Done applying 4.5.15
2020-10-15T00:19:46Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
2020-10-14T23:17:33Z Progressing=True ClusterOperatorNotAvailable: Unable to apply 4.6.0-rc.4: the cluster operator machine-config has not yet successfully rolled out

$ oc get -o json clusteroperator machine-config | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2020-10-15T00:16:43Z Progressing=True : Working towards 4.6.0-rc.4
2020-10-15T00:16:44Z Degraded=False :
2020-10-15T00:14:45Z Available=False : Cluster not available for 4.5.15
2020-10-14T22:24:10Z Upgradeable=True AsExpected:

$ grep 'Running sync.*in state\|Running sync for clusteroperator.*machine-config\|error running apply for clusteroperator.*machine-config\|Done syncing for clusteroperator.*machine-config' 2020-10-14T23:29:57+0000-cluster-version-operator-655f8f6749-dmffp.log
I1014 23:18:15.432717 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 0
I1014 23:24:21.153789 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 1
I1014 23:30:50.343295 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 2
I1014 23:38:09.389702 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 3
I1014 23:47:15.730610 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 4
I1014 23:56:02.416775 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 5
I1015 00:04:53.587571 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 6
I1015 00:13:51.624648 1 sync_worker.go:494] Running sync 4.6.0-rc.4 (force=false) on generation 3 in state Updating at attempt 7
I1015 00:14:38.875574 1 sync_worker.go:701] Running sync for clusteroperator "machine-config" (522 of 618)

I've attached the resulting logs, after removing empty files and ones that only contained a subset of another (e.g. the first of two files for the same pod, after the first stream was interrupted by an API-server hiccup or whatever).
So:

$ grep 'Running sync for clusteroperator.*machine-config\|error running apply for clusteroperator.*machine-config\|Done syncing for clusteroperator.*machine-config' 2020-10-15T00:17:37+0000-cluster-version-operator-655f8f6749-dmffp.log
I1015 00:14:38.875574 1 sync_worker.go:701] Running sync for clusteroperator "machine-config" (522 of 618)
E1015 00:19:33.537783 1 task.go:81] error running apply for clusteroperator "machine-config" (522 of 618): Cluster operator machine-config is still updating

goes right into:

2020-10-15T00:19:46Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is still updating

without any of the intended grace period. Looking at the linked PR, the grace period is being compared to LastProgress. Looking at the code:

$ git --no-pager log --oneline -1
0fa9b047dc (HEAD -> master, wking/master, origin/release-4.8, origin/release-4.7, origin/master, origin/HEAD) Merge pull request #472 from wking/demote-status-change-logging
$ git grep LastProgress | grep -v test
pkg/cvo/status.go:      if len(history) == 0 || status.Failure == nil || status.Reconciling || status.LastProgress.IsZero() {
pkg/cvo/status.go:      if now.Sub(status.LastProgress) > 40*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour {
pkg/cvo/sync_worker.go: LastProgress time.Time
pkg/cvo/sync_worker.go: status.LastProgress = time.Now()

Ah, because we were over an hour out from the 23:17Z update start time. And there's really nothing in that logic that is specific to ClusterOperator. So... this will take some shuffling...
I didn't have time to refactor to address the issue discussed in comment 17. Hopefully next sprint.
@wking, wondering if this failure is the same as this bug? https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331341959318474752 this log looks like it's hitting the 40m timeout discussed in this bz: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331341959318474752/artifacts/e2e-aws-proxy/ipi-install-install/container-logs/test.log
(In reply to jamo luhrsen from comment #19)
> ... wondering if this failure is the same as this bug?...

Nope. From your job:

level=info msg=Waiting up to 40m0s for the cluster at https://api.ci-op-q4qfig05-2659c.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize...
...
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating

So you are hitting the installer's 40m timeout for going from bootstrap-teardown to install-complete [1].

[1]: https://github.com/openshift/installer/blob/095b6e6e67fd105b53f1ced7bb783aa1eaf1847f/cmd/openshift-install/create.go#L353
I think the proxy vs. machine-config thing is bug 1899979.
(In reply to W. Trevor King from comment #21) > I think the proxy vs. machine-config thing is bug 1899979. Thank you. I'll update 1899979 with the job info.
(In reply to jamo luhrsen from comment #22) > (In reply to W. Trevor King from comment #21) > > I think the proxy vs. machine-config thing is bug 1899979. > > Thank you. I'll update 1899979 with the job info. Actually, I think it looks more like bug 1901034, which I just updated with a few details.
End of sprint; didn't get to this one. Failing=True when things aren't that bad is annoying though, so hopefully next sprint. Comment 17 breaks down the current issue: the logic looks at status.LastProgress and history, but does not distinguish between a long-pending manifest and a failing manifest.
Raising the priority to high as this will help us bring more clarity with respect to actual update failures in telemetry/datahub.
*** Bug 1888754 has been marked as a duplicate of this bug. ***
I'm verifying it by upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352.

1. The network operator does not reach the 40-minute timeout, but Failing=True fired; it finally got ready. Is this expected behaviour?

In the middle of the upgrade, we can see Failing=True fire:

status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: "2021-02-05T02:13:16Z"
    message: Done applying 4.6.16
    status: "True"
    type: Available
  - lastTransitionTime: "2021-02-05T04:55:09Z"
    message: Cluster operator network is not available
    reason: ClusterOperatorNotAvailable
    status: "True"
    type: Failing
  - lastTransitionTime: "2021-02-05T04:19:53Z"
    message: 'Unable to apply 4.7.0-0.nightly-2021-02-04-031352: the cluster operator network has not yet successfully rolled out'
    reason: ClusterOperatorNotAvailable
    status: "True"
    type: Progressing

The upgrade ends up with success:

status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: "2021-02-05T02:13:16Z"
    message: Done applying 4.7.0-0.nightly-2021-02-04-031352
    status: "True"
    type: Available
  - lastTransitionTime: "2021-02-05T05:07:54Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2021-02-05T05:40:37Z"
    message: Cluster version is 4.7.0-0.nightly-2021-02-04-031352
    status: "False"
    type: Progressing

2. While the clusterversion reports Progressing=False and the target version is available, 2 operators still report Progressing=True. Does that deserve a fix?
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-04-031352   True        False         60s     Cluster version is 4.7.0-0.nightly-2021-02-04-031352

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      2m42s
baremetal                                  4.7.0-0.nightly-2021-02-04-031352   True        False         False      60m
cloud-credential                           4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h5m
cluster-autoscaler                         4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h59m
config-operator                            4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
console                                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      9m16s
csi-snapshot-controller                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      23m
dns                                        4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h59m
etcd                                       4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h44m
image-registry                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h24m
ingress                                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h37m
insights                                   4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
kube-apiserver                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h44m
kube-controller-manager                    4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h58m
kube-scheduler                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h58m
kube-storage-version-migrator              4.7.0-0.nightly-2021-02-04-031352   True        False         False      16m
machine-api                                4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h55m
machine-approver                           4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
machine-config                             4.7.0-0.nightly-2021-02-04-031352   True        False         False      2m38s
marketplace                                4.7.0-0.nightly-2021-02-04-031352   True        False         False      15m
monitoring                                 4.7.0-0.nightly-2021-02-04-031352   True        False         False      56m
network                                    4.7.0-0.nightly-2021-02-04-031352   True        True          False      31m
node-tuning                                4.7.0-0.nightly-2021-02-04-031352   True        False         False      58m
openshift-apiserver                        4.7.0-0.nightly-2021-02-04-031352   True        False         False      2m54s
openshift-controller-manager               4.7.0-0.nightly-2021-02-04-031352   True        False         False      3h59m
openshift-samples                          4.7.0-0.nightly-2021-02-04-031352   True        False         False      59m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-02-04-031352   True        False         False      9m3s
service-ca                                 4.7.0-0.nightly-2021-02-04-031352   True        False         False      4h
storage                                    4.7.0-0.nightly-2021-02-04-031352   True        True          False      15m
Please ignore comment#30.

Trying to verify it by upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352.

Install a gcp cluster with 3 workers and then scale out the worker nodes to 20.

While the cluster is upgrading, do:

$ while sleep 2; do oc get clusterversion -oyaml | grep Failing -a5; done
- lastTransitionTime: "2021-02-05T10:34:15Z"
  message: Cluster operator network is not available
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: Failing
- lastTransitionTime: "2021-02-05T09:50:45Z"
  message: 'Unable to apply 4.7.0-0.nightly-2021-02-04-031352: the cluster operator network has not yet successfully rolled out'
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: Progressing

# oc logs pod/cluster-version-operator-595847559-h2jl8 | grep 'Running sync.*in state\|Running sync for clusteroperator.*network\|error running apply for clusteroperator.*network\|Done syncing for clusteroperator.*network'
I0205 09:51:14.513873 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 0
I0205 09:57:18.731310 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 1
I0205 10:03:48.470422 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 2
I0205 10:10:57.148100 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 3
I0205 10:19:46.907951 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 4
I0205 10:28:30.790725 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 5
I0205 10:30:03.104291 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:34:12.704584 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:37:13.280623 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 6
I0205 10:37:55.240904 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:42:55.194336 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:45:47.166156 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 7
I0205 10:46:29.126361 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:51:29.079421 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:54:41.979907 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 8
I0205 10:55:23.939235 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
I0205 10:55:23.939533 1 sync_worker.go:777] Done syncing for clusteroperator "network" (532 of 668)

Every time "error running apply for clusteroperator network" comes out, Failing=True fires.
I'm wondering when the network CO starts to sync — is it at the same time as "Updating at attempt 0"?
(In reply to Yang Yang from comment #31)
> Please ignore comment#30
>
> Trying to verify it by upgrading from 4.6.16 to
> 4.7.0-0.nightly-2021-02-04-031352
>
> Install a gcp cluster with 3 workers and then scale out worker nodes to 20

If a CO is Available=False, as network was in your test, then Failing=True is expected. I'm not sure about your question, but can you provide the CVO log from that run?
Created attachment 1755411 [details] CVO logs for upgrading from 4.6.16 to 4.7.0-0.nightly-2021-02-04-031352
(In reply to Jack Ottofaro from comment #32)
> (In reply to Yang Yang from comment #31)
> > Please ignore comment#30
> >
> > Trying to verify it by upgrading from 4.6.16 to
> > 4.7.0-0.nightly-2021-02-04-031352
> >
> > Install a gcp cluster with 3 workers and then scale out worker nodes to 20
>
> If a CO is Available=False as network was in your test then Failing=True.
> Not sure about your question but can you provide the CVO log from that run?

In my testing, we can see:

I0205 10:30:03.104291 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:34:12.704584 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available

Does that mean the network CO started to sync at 10:30:03.104291 and finished at 10:34:12.704584? If that's true, then when the network operator reported ClusterOperatorNotAvailable it had only been syncing for 4 minutes, but Failing=True fired. From your PR statement, ClusterOperatorNotAvailable will not result in cluster Failing=True unless the sync'ing time has exceeded 40 minutes. Does the sync'ing time refer to a single CO's sync'ing time?
Nit: the attachment from comment 33 has a bunch of empty files; no need to include those. And there are a few sequentially-growing files for cluster-version-operator-595847559-h2jl8, all starting at 09:51:13.692221. The final, largest one of those contains all the lines from the earlier, smaller two, as well as additional lines, so it's the only file from that pod which needs to be included. All the files for cluster-version-operator-b5997dd8c-9xgjx are just "unable to retrieve container logs for cri-o://...", which isn't all that useful. Not clear what was going on with that pod. Events or Pod YAML might say, but it's probably not all that important. So the only two files which impact this bug are:

2021-02-05T09:58:21+0000-cluster-version-operator-595847559-h2jl8.log
2021-02-05T11:03:07+0000-cluster-version-operator-595847559-2sczw.log

Timestamps in the filenames are when the gathers started. Getting times for when the pods started:

$ head -n1 2021-02-05T09:58:21* 2021-02-05T11:03:07*
==> 2021-02-05T09:58:21+0000-cluster-version-operator-595847559-h2jl8.log <==
I0205 09:51:13.692221 1 start.go:21] ClusterVersionOperator 4.7.0-202102032256.p0-9f3368d

==> 2021-02-05T11:03:07+0000-cluster-version-operator-595847559-2sczw.log <==
I0205 11:09:10.872857 1 start.go:21] ClusterVersionOperator 4.7.0-202102032256.p0-9f3368d

So your 10:30 bits in comment 34 are from the h2jl8 log.
[1] will try to explain the CVO's major moves, but I'm digging in by hand, because it's easier to paste text into bugs than to share screenshots. Notes inline:

$ grep 'Running sync.*in state\|Result of work\|clusteroperator "network"' 2021-02-05T09:58:21* | grep -v Precreated
I0205 09:51:14.513873 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 0
I0205 09:51:15.393281 1 task_graph.go:555] Result of work: []
I0205 09:56:56.427293 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is still updating]

Attempt 0 timed out waiting for kube-apiserver. The CVO isn't mad; it's just going to start the sync again to ensure the earlier manifests haven't drifted since it last checked them.

I0205 09:57:18.731310 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 1
I0205 09:57:18.958491 1 task_graph.go:555] Result of work: []
I0205 10:03:00.646547 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is still updating]

Still waiting for kube-apiserver after attempt 1.

I0205 10:03:48.470422 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 2
I0205 10:03:48.642119 1 task_graph.go:555] Result of work: []
I0205 10:09:30.398527 1 task_graph.go:555] Result of work: [Cluster operator config-operator is still updating Cluster operator machine-api is still updating Cluster operator openshift-apiserver is still updating]

Attempt 2 got farther; now we're waiting for the machine-api and openshift-apiserver.
I0205 10:10:57.148100 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 3
I0205 10:10:57.380076 1 task_graph.go:555] Result of work: []
I0205 10:16:39.089117 1 task_graph.go:555] Result of work: [Cluster operator cluster-autoscaler is still updating Cluster operator image-registry is still updating Cluster operator monitoring is still updating Cluster operator console is still updating Cluster operator operator-lifecycle-manager-packageserver is still updating Cluster operator marketplace is still updating]

And farther; now we're waiting for cluster-autoscaler, etc.

I0205 10:19:46.907951 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 4
I0205 10:19:47.140019 1 task_graph.go:555] Result of work: []
I0205 10:25:28.821398 1 task_graph.go:555] Result of work: [Cluster operator insights is degraded]

The CVO starts to get suspicious. Shouldn't be Failing yet, but we will eventually go Failing=True if insights stays degraded for long enough. We probably just got interested in the insights operator in this attempt 4 (having been blocked by other, earlier manifests in previous attempts), so it probably hasn't been long enough for us to actually be mad at it yet.
I0205 10:28:30.790725 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 5
I0205 10:28:31.013985 1 task_graph.go:555] Result of work: []
I0205 10:30:03.104291 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:34:12.704584 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:34:12.704720 1 task_graph.go:555] Result of work: [Cluster operator network is not available]

First round where we got far enough in to be curious about the network operator, which is when the network operator's 40-minute clock starts. But the clock doesn't matter, because Available=False is seriously bad. We should immediately go Failing=True, even though we are only 4m into the 40-minute clock.

I0205 10:37:13.280623 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 6
I0205 10:37:13.576075 1 task_graph.go:555] Result of work: []
I0205 10:37:55.240904 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:42:55.194336 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:42:55.194430 1 task_graph.go:555] Result of work: [Cluster operator network is not available]

Still seriously bad, so still Failing=True, even though we're now 8m into the network operator's 40-minute clock.
I0205 10:45:47.166156 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 7
I0205 10:45:47.460320 1 task_graph.go:555] Result of work: []
I0205 10:46:29.126361 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
E0205 10:51:29.079421 1 task.go:112] error running apply for clusteroperator "network" (532 of 668): Cluster operator network is not available
I0205 10:51:29.079558 1 task_graph.go:555] Result of work: [Cluster operator network is not available]

Still seriously bad, so still Failing=True.

I0205 10:54:41.979907 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 8
I0205 10:54:42.193448 1 task_graph.go:555] Result of work: []
I0205 10:55:23.939235 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
I0205 10:55:23.939533 1 sync_worker.go:777] Done syncing for clusteroperator "network" (532 of 668)
I0205 11:00:23.895880 1 task_graph.go:555] Result of work: [Cluster operator machine-config is still updating]

Ahh, network finally happy again. In fact, not only is it back to Available=True, but it's completed its update. We should clear Failing=False. Now we're waiting on machine-config, but there is nothing concerning about its current state; it just hasn't leveled yet.
I0205 11:03:27.017620 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 9
I0205 11:03:27.382521 1 task_graph.go:555] Result of work: []
I0205 11:08:59.439678 1 task_graph.go:555] Result of work: [Cluster operator authentication is degraded]

Likely the machine-config operator rolling the nodes while it updated freaked out the auth operator. It's been a while since we asked auth to update, so maybe we go straight to Failing=True again here? Jack, we might want to reset the timers for operators any time we successfully reconcile their manifests. I don't think that blocks this particular bug; it's just a follow-up idea. Moving on to the next pod:

$ grep 'Running sync.*in state\|Result of work' 2021-02-05T11:03:07* | head
I0205 11:09:15.852963 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 0
I0205 11:09:17.919567 1 task_graph.go:555] Result of work: []
I0205 11:14:57.767259 1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]

A new operator is degraded, but we don't preserve the operator clocks between CVO pods, so openshift-apiserver is only 5 minutes into its clock. We set Failing=False, because openshift-apiserver is only Degraded=True, and that's not too bad.
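If you want to correlate these timestamps programmatically rather than eyeballing grep output, the filter above can be approximated with a few lines of Python. This is just a convenience sketch for log triage, not part of the CVO; the sample log lines are abbreviated stand-ins for the real ones quoted in this comment:

```python
import re

# Mirror the shell filter used above:
#   grep 'Running sync.*in state\|Result of work' <pod-log>
PATTERN = re.compile(r"Running sync .* in state|Result of work")

def interesting_lines(log_text):
    """Return only the sync-loop lines the analysis above cares about:
    'Running sync ... in state ...' and 'Result of work: [...]'."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]

# Abbreviated sample; real lines carry the full release pullspec.
sample = """\
I0205 11:09:15.852963 1 sync_worker.go:549] Running sync registry.example/release@sha256:abc (force=false) on generation 2 in state Updating at attempt 0
I0205 11:09:17.919567 1 task_graph.go:555] Result of work: []
I0205 11:10:00.000000 1 sync_worker.go:765] Running sync for clusteroperator "network" (532 of 668)
"""

for line in interesting_lines(sample):
    print(line)
```

Note that the per-clusteroperator "Running sync for clusteroperator" lines are deliberately excluded, just as in the grep, because they lack the "in state" phrase.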
I0205 11:15:20.071310 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 1
I0205 11:15:20.265234 1 task_graph.go:555] Result of work: []
I0205 11:21:01.984970 1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]

12 minutes into the openshift-apiserver clock, it's still just Degraded=True, so we stay Failing=False.

I0205 11:21:49.201918 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Updating at attempt 2
I0205 11:21:49.391717 1 task_graph.go:555] Result of work: []
I0205 11:24:16.786081 1 task_graph.go:555] Result of work: []
I0205 11:27:07.743276 1 sync_worker.go:549] Running sync registry.ci.openshift.org/ocp/release@sha256:f024d23bac16c1243025048bfaa27c064be7cba74717d1deafbd60f4c1bc01c4 (force=false) on generation 2 in state Reconciling at attempt 0

Update complete :) It looks like we don't log ClusterVersion condition changes, but my guesses from the above are that we went Failing=True at 10:34, Failing=False at 11:00, Failing=True at 11:08, Failing=False at 11:14, and update-complete at 11:24. If that matches what you saw, it's probably enough to verify this bug. If it diverges, tell me where.

[1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/hack/log-explainer.py
Trevor, thanks a lot for walking me through the log. Really appreciate it. That's exactly what I saw. So my understanding (not sure if it's correct) is:

1) If a CO is in a state of Available=False, then Failing=True fires even if it does not exceed the 40-minute window.
2) If a CO is in a state of Degraded=True, then Failing=True fires only when it goes beyond the 40-minute window.
3) If a CO is still updating, then Failing=True fires only when it goes beyond the 40-minute window.
Based on comment #35, it works as designed. Moving it to the verified state.
> 1) If a CO is in a state of Available=False, then Failing=True fires even if it does not exceed the 40-minute window.
> 2) If a CO is in a state of Degraded=True, then Failing=True fires only when it goes beyond the 40-minute window.

Yes to both of these, with some corner cases like "the CVO does not actually care about your ClusterOperator state until it is deep enough into the update graph [1] to get to that manifest".

> 3) If a CO is still updating, then Failing=True fires only when it goes beyond the 40-minute window.

If a ClusterOperator is Available=True and Degraded=False, we will now happily wait forever for it to level without going Failing=True [2]. This is useful for things like networking's DNS DaemonSet, which can take quite a while to roll out over a cluster with hundreds of compute nodes.

[1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/docs/user/reconciliation.md#manifest-graph
[2]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/cvo/internal/operatorstatus.go#L253
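The decision rules above can be summarized in a toy model. To be clear, this is NOT the real CVO implementation (which is Go, in the operatorstatus.go file linked above); the function name, parameters, and structure here are made up purely to illustrate the behavior described in this thread:

```python
from datetime import timedelta

# Toy model of when the CVO sets ClusterVersion Failing=True for one
# ClusterOperator during an update. Illustrative only; see the linked
# operatorstatus.go for the actual logic.
TOLERANCE = timedelta(minutes=40)  # was 20 minutes before this bug's fix

def failing(available, degraded, waited):
    """available/degraded: the ClusterOperator's current conditions.
    waited: how long the CVO has been waiting on this operator's manifest."""
    if not available:
        return True                # Available=False is seriously bad: fail immediately
    if degraded:
        return waited > TOLERANCE  # Degraded=True fails only past the 40m window
    return False                   # Available=True, Degraded=False: wait forever

# Example: Degraded=True for 45 minutes exceeds the window.
print(failing(True, True, timedelta(minutes=45)))  # -> True
```

This matches the timeline in the log walkthrough: network Available=False at 10:34 trips Failing=True at only 4m in, while openshift-apiserver Degraded=True at 5 and 12 minutes into its clock leaves Failing=False.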
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633