Bug 1825003
| Summary: | [4.3 upgrade][clusterversion]Unclear message: Unable to apply ...: an unknown error has occurred | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | liujia <jiajliu> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.3.0 | CC: | aos-bugs, bleanhar, ccoleman, eparis, jokerman, lmohanty, sdodson, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-03-16 16:13:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Hongkai Liu
2020-04-16 20:00:22 UTC
PR has been open over a year, so this can't be very severe ;).

Caught the issue in an old e2e log [1] (4 days ago):

Apr 15 08:05:17.704: INFO: cluster upgrade is Failing: deployment openshift-cluster-version/cluster-version-operator is progressing NewReplicaSetAvailable: ReplicaSet "cluster-version-operator-7fc8965647" has successfully progressed.
Apr 15 08:05:27.780: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 0% complete
...
Apr 15 08:11:07.827: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 69% complete
Apr 15 08:11:17.782: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-15-072621: an unknown error has occurred
Apr 15 08:11:17.782: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
* deployment openshift-authentication-operator/authentication-operator is progressing ReplicaSetUpdated: ReplicaSet "authentication-operator-79d757f78f" is progressing.
* deployment openshift-cluster-samples-operator/cluster-samples-operator is progressing ReplicaSetUpdated: ReplicaSet "cluster-samples-operator-68fb4dbc57" is progressing.
* deployment openshift-controller-manager-operator/openshift-controller-manager-operator is progressing ReplicaSetUpdated: ReplicaSet "openshift-controller-manager-operator-85ffd56f86" is progressing.
...
Apr 15 08:20:37.779: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 84% complete
Apr 15 08:20:47.780: INFO: Completed upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-15-072621

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/25745/build-log.txt

Checked the latest log in the e2e job [1]:

Apr 20 00:29:17.596: INFO: cluster upgrade is Progressing: Working towards registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-19-234910: downloading update
Apr 20 00:29:27.599: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-19-234910: 0% complete
...
Apr 20 00:35:17.605: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-19-234910: 69% complete
Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors
...

Tracked back to the related bug #1824981 [2] and went through PR #185. This bug is about a message enhancement (adding a name/reason to the message "an unknown error has occurred").

@hongkliu wdyt? Is it OK for you if QE verifies the bug with the enhancement in the PR?

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/26376/build-log.txt
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1824981#c3
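For reference, the "cluster upgrade is Progressing/Failing: ..." lines above are the messages of the corresponding status conditions on the ClusterVersion object. Below is a minimal sketch of rendering such lines from the conditions, using a hypothetical `Condition` struct and `findCondition` helper rather than the e2e suite's actual code:

```go
package main

import "fmt"

// Simplified stand-in for the conditions carried in ClusterVersion status;
// the real type is ClusterOperatorStatusCondition in
// github.com/openshift/api/config/v1.
type Condition struct {
	Type    string // e.g. "Progressing", "Failing", "Available"
	Status  string // "True", "False", "Unknown"
	Message string
}

// findCondition returns the condition with the given type, or nil if absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

func main() {
	conds := []Condition{
		{Type: "Progressing", Status: "True",
			Message: "Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors"},
		{Type: "Failing", Status: "True",
			Message: "Multiple errors are preventing progress: ..."},
	}
	// Render log lines in the same shape as the e2e output quoted above.
	for _, t := range []string{"Progressing", "Failing"} {
		if c := findCondition(conds, t); c != nil {
			fmt.Printf("cluster upgrade is %s: %s\n", t, c.Message)
		}
	}
}
```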
> Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors

Ugh, that is too coarse. The Failing message unpacks all of those:
Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors
Apr 20 00:35:27.597: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
* could not find the deployment openshift-authentication-operator/authentication-operator during rollout
* could not find the deployment openshift-cluster-samples-operator/cluster-samples-operator during rollout
* could not find the deployment openshift-console/downloads during rollout
* could not find the deployment openshift-controller-manager-operator/openshift-controller-manager-operator during rollout
* could not find the deployment openshift-image-registry/cluster-image-registry-operator during rollout
* could not find the deployment openshift-operator-lifecycle-manager/olm-operator during rollout
* could not find the deployment openshift-service-ca-operator/service-ca-operator during rollout
Maybe the Progressing reason and message in these cases should just echo the Failing reason and message?
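A minimal sketch of that idea, assuming a simplified `Condition` stand-in for the openshift/api condition type and a hypothetical `setProgressing` helper; the actual CVO status-sync code is organized differently:

```go
package main

import "fmt"

// Simplified stand-in for ClusterOperatorStatusCondition
// (github.com/openshift/api/config/v1).
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// setProgressing builds the Progressing condition for an in-flight update.
// When a Failing condition is present, echo its reason and message instead
// of collapsing everything into "an unknown error has occurred: MultipleErrors".
func setProgressing(version string, percent int, failing *Condition) Condition {
	c := Condition{
		Type:    "Progressing",
		Status:  "True",
		Message: fmt.Sprintf("Working towards %s: %d%% complete", version, percent),
	}
	if failing != nil && failing.Status == "True" {
		c.Reason = failing.Reason
		c.Message = fmt.Sprintf("Unable to apply %s: %s", version, failing.Message)
	}
	return c
}

func main() {
	failing := &Condition{
		Type:   "Failing",
		Status: "True",
		Reason: "MultipleErrors",
		Message: "Multiple errors are preventing progress:\n" +
			"* could not find the deployment openshift-console/downloads during rollout",
	}
	fmt.Println(setProgressing("4.5.0-0.nightly-2020-04-19-234910", 69, failing).Message)
}
```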
> Maybe the Progressing reason and message in these cases should just echo the Failing reason and message?
I'm not sure about the expected result for this bug; I need @hongkliu to confirm. But according to the bug description, "alerts and messages are too frightening" probably refers to "an unknown error has occurred". It hints at an error/failure, but it is actually just an intermediate state.
In general, what should an admin do upon seeing this message on the cluster version? If the upgrade process knows that retries are still ongoing and admins need not worry much, it would be better for the message to reflect the ongoing retries. If it is technically hard to know whether retries are ongoing, then give us some hint about how tolerant admins should be of such an error, or whether to ignore it completely. When n% does not increase for some time and such errors occur, I cannot tell whether the upgrade has failed or is still in progress.

@hongkai Thanks for confirming. @W. Trevor King According to the clarification above, I think we need more enhancements here. Maybe something like "an unknown error has occurred, still retrying: MultipleErrors", which would reflect that the update process is not dead even with the unknown error and is still retrying. Anyway, assigning the bug back first.

Can I understand it this way: with PROGRESSING = True, I can ignore any error message from clusterversion?

Still not sure what to add here that's short of "just copy over the Failing message". Adding UpcomingSprint.

Comment 13 is still current.

Comment 13 is still current.

Comment 13 is still current.

Comment 13 is still current.

Comment 13 is still current.

Comment 13 is still current.

I've been trying to get some internal consensus we can add to the openshift/api docstrings.

Comment 13 is still current.

Did not even have time to try to drive openshift/api consensus this sprint :/

We made some improvements in the CVO that give a more accurate progress message now. Going to close the bug. Please re-open if you still think we need to fix this.
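Coming back to the earlier question about PROGRESSING = True: below is a sketch of that interpretation (while Progressing is True, a Failing message means the CVO is still retrying rather than dead), using hypothetical types and helper names, not the CVO's actual condition formatting:

```go
package main

import "fmt"

// Simplified condition stand-in; see github.com/openshift/api/config/v1 for
// the real ClusterOperatorStatusCondition type.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

func get(conds []Condition, t string) *Condition {
	for i := range conds {
		if conds[i].Type == t {
			return &conds[i]
		}
	}
	return nil
}

// describeUpdate turns the Progressing/Failing pair into a single hint for an
// admin: while Progressing is True, errors reported by Failing are being
// retried and the update is not dead.
func describeUpdate(conds []Condition) string {
	progressing := get(conds, "Progressing")
	failing := get(conds, "Failing")
	switch {
	case progressing != nil && progressing.Status == "True" && failing != nil && failing.Status == "True":
		return fmt.Sprintf("an error has occurred, still retrying: %s", failing.Reason)
	case progressing != nil && progressing.Status == "True":
		return progressing.Message
	case failing != nil && failing.Status == "True":
		return fmt.Sprintf("update is stuck: %s", failing.Message)
	default:
		return "no update in progress"
	}
}

func main() {
	fmt.Println(describeUpdate([]Condition{
		{Type: "Progressing", Status: "True", Message: "Working towards 4.5.0-0.nightly-2020-04-19-234910: 69% complete"},
		{Type: "Failing", Status: "True", Reason: "MultipleErrors", Message: "Multiple errors are preventing progress: ..."},
	}))
}
```

The first case mirrors the "an unknown error has occurred, still retrying: MultipleErrors" wording suggested above.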