During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is so nice). But those alerts and messages are too frightening. I would like to create a bug for each of them and feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587057027430600?thread_ts=1587056182.429300&cid=CHY2E1BL4

Every 10.0s: oc --context build01 get clusterversions.config.openshift.io    Hongkais-MacBook-Pro: Thu Apr 16 13:10:12 2020

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        True          7m4s    Unable to apply 4.3.0-0.nightly-2020-04-13-190424: an unknown error has occurred

must-gather after upgrade: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
PR has been open over a year, so this can't be very severe ;).
Caught the issue in an old e2e log [1] (4 days ago).

Apr 15 08:05:17.704: INFO: cluster upgrade is Failing: deployment openshift-cluster-version/cluster-version-operator is progressing NewReplicaSetAvailable: ReplicaSet "cluster-version-operator-7fc8965647" has successfully progressed.
Apr 15 08:05:27.780: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 0% complete
...
Apr 15 08:11:07.827: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 69% complete
Apr 15 08:11:17.782: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-15-072621: an unknown error has occurred
Apr 15 08:11:17.782: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
* deployment openshift-authentication-operator/authentication-operator is progressing ReplicaSetUpdated: ReplicaSet "authentication-operator-79d757f78f" is progressing.
* deployment openshift-cluster-samples-operator/cluster-samples-operator is progressing ReplicaSetUpdated: ReplicaSet "cluster-samples-operator-68fb4dbc57" is progressing.
* deployment openshift-controller-manager-operator/openshift-controller-manager-operator is progressing ReplicaSetUpdated: ReplicaSet "openshift-controller-manager-operator-85ffd56f86" is progressing.
...
Apr 15 08:20:37.779: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 84% complete
Apr 15 08:20:47.780: INFO: Completed upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-15-072621

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/25745/build-log.txt
Checked the latest log in the e2e job [1].

Apr 20 00:29:17.596: INFO: cluster upgrade is Progressing: Working towards registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-19-234910: downloading update
Apr 20 00:29:27.599: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-19-234910: 0% complete
...
Apr 20 00:35:17.605: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-19-234910: 69% complete
Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors
...

Tracked it back to the related bug #1824981 [2] and went through PR #185. This bug is about a message enhancement (adding a name/reason to the message "an unknown error has occurred").

@hongkliu WDYT? Is it OK with you if QE verifies the bug against the enhancement in the PR?

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/26376/build-log.txt
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1824981#c3
> Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors

Ugh, that is too coarse. The Failing message unpacks all of those:

Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors
Apr 20 00:35:27.597: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
* could not find the deployment openshift-authentication-operator/authentication-operator during rollout
* could not find the deployment openshift-cluster-samples-operator/cluster-samples-operator during rollout
* could not find the deployment openshift-console/downloads during rollout
* could not find the deployment openshift-controller-manager-operator/openshift-controller-manager-operator during rollout
* could not find the deployment openshift-image-registry/cluster-image-registry-operator during rollout
* could not find the deployment openshift-operator-lifecycle-manager/olm-operator during rollout
* could not find the deployment openshift-service-ca-operator/service-ca-operator during rollout

Maybe the Progressing reason and message in these cases should just echo the Failing reason and message?
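A minimal sketch of that idea in Python, not the CVO's actual Go code: when Failing is True, the Progressing message carries the Failing reason and message instead of the generic "an unknown error has occurred". The function name progressing_message and the dict shape for the condition are assumptions made for illustration only.

```python
def progressing_message(target, failing):
    """Build a Progressing message that echoes the Failing condition.

    `failing` is a hypothetical dict with "status", "reason" and "message"
    keys, mirroring one entry of a clusterversion's status.conditions.
    """
    if failing.get("status") != "True":
        # Nothing is failing, so report plain progress.
        return f"Working towards {target}"
    # Fall back to the generic text only when Failing carries no detail.
    reason = failing.get("reason") or "Unknown"
    message = failing.get("message") or "an unknown error has occurred"
    return f"Unable to apply {target}: {reason}: {message}"


if __name__ == "__main__":
    failing = {
        "status": "True",
        "reason": "MultipleErrors",
        "message": "Multiple errors are preventing progress: ...",
    }
    print(progressing_message("4.5.0-0.nightly-2020-04-19-234910", failing))
```

With this shape, whoever watches only the STATUS column would see the same detail that is currently visible only in the Failing condition.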
> Maybe the Progressing reason and message in these cases should just echo the Failing reason and message?

I'm not sure what the expected result for this bug is. Need @hongkliu to confirm. But going by the bug description, "alerts and messages are too frightening" may be pointing at "an unknown error has occurred". That message hints at an error/failure, but it is actually just an intermediate status.
In general, what should an admin do upon seeing this msg for the cluster version?

If the upgrade process knows that retries are still ongoing, admins need not worry much. In that case, it would be better for the msg to reflect the ongoing retries.

If it is technically hard to know whether retries are ongoing, then give us some hint about how tolerant admins should be of such an error. Or just ignore this completely.

When n% does not increase for some time and such errors occur, I cannot tell whether the upgrade has failed or is still in progress.
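One way an admin-side script could approximate that tolerance, sketched in Python under stated assumptions: poll the status message, pull out the "N% complete" figure when present, and only raise concern once no progress has been seen for longer than a chosen window. The is_stalled helper, the sample format, and the 20-minute window are all hypothetical; nothing here reflects CVO behaviour.

```python
import re
from datetime import datetime, timedelta

PERCENT = re.compile(r"(\d+)% complete")


def is_stalled(samples, window=timedelta(minutes=20)):
    """Return True if the completion percentage has not risen within `window`.

    `samples` is a time-ordered list of (datetime, status message) pairs,
    e.g. collected by polling the clusterversion. Messages without a
    percentage (such as "Unable to apply ...") do not count as progress.
    """
    last_progress_at = None
    best = -1
    for when, message in samples:
        match = PERCENT.search(message)
        if match and int(match.group(1)) > best:
            best = int(match.group(1))
            last_progress_at = when
    if last_progress_at is None:
        return False  # no percentage seen yet, nothing to compare against
    return samples[-1][0] - last_progress_at > window


if __name__ == "__main__":
    samples = [
        (datetime(2020, 4, 15, 8, 11, 7), "Working towards 4.5.0: 69% complete"),
        (datetime(2020, 4, 15, 8, 11, 17), "Unable to apply 4.5.0: an unknown error has occurred"),
    ]
    print(is_stalled(samples))
```

The window is the "how tolerant" knob asked about above; the right value would have to come from whoever knows how long retries normally take.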
@hongkai Thanks for confirming.

@W. Trevor King Given the clarification above, I think we need more enhancements here. Maybe something like "an unknown error has occurred, still retrying: MultipleErrors", which would reflect that the update process is not dead despite the unknown error and is still retrying.

Anyway, assigning the bug back first.
Can I understand it this way? With PROGRESSING = True, I can ignore any error msg from clusterversion?
Still not sure what to add here that's short of "just copy over the Failing message". Adding UpcomingSprint
Comment 13 is still current.
Comment 13 is still current. I've been trying to get some internal consensus we can add to the openshift/api docstrings.
Comment 13 is still current. Did not even have time to try to drive openshift/api consensus this sprint :/
We made some improvements in the CVO that should give more accurate progress messages now. Going to close the bug. Please re-open if you still think we need to fix this.