1825003 – [4.3 upgrade][clusterversion]Unclear message: Unable to apply ...: an unknown error has occurred

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	W. Trevor King
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-16 20:00 UTC by Hongkai Liu
Modified:	2021-03-16 16:13 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-16 16:13:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 185	0	None	closed	Bug 1825003: pkg/payload/task: Include name/reason in "unknown error" message	2021-02-02 17:14:11 UTC

Internal Links: 1824981

Description Hongkai Liu 2020-04-16 20:00:22 UTC

During the upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is so nice).
But those alerts and messages are too frightening.

I would like to create a bug for each of those and feel better for the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587057027430600?thread_ts=1587056182.429300&cid=CHY2E1BL4

Every 10.0s: oc --context build01 get clusterversions.config.openshift.io       Hongkais-MacBook-Pro: Thu Apr 16 13:10:12 2020
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        True          7m4s    Unable to apply 4.3.0-0.nightly-2020-04-13-190424: an unknown error has occurred

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/

Comment 1 W. Trevor King 2020-04-16 22:36:58 UTC

PR has been open over a year, so this can't be very severe ;).

Comment 4 liujia 2020-04-20 08:47:30 UTC

Caught the issue in an old e2e log [1] (4 days ago).

Apr 15 08:05:17.704: INFO: cluster upgrade is Failing: deployment openshift-cluster-version/cluster-version-operator is progressing NewReplicaSetAvailable: ReplicaSet "cluster-version-operator-7fc8965647" has successfully progressed.
Apr 15 08:05:27.780: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 0% complete
...
Apr 15 08:11:07.827: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 69% complete
Apr 15 08:11:17.782: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-15-072621: an unknown error has occurred
Apr 15 08:11:17.782: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
* deployment openshift-authentication-operator/authentication-operator is progressing ReplicaSetUpdated: ReplicaSet "authentication-operator-79d757f78f" is progressing.
* deployment openshift-cluster-samples-operator/cluster-samples-operator is progressing ReplicaSetUpdated: ReplicaSet "cluster-samples-operator-68fb4dbc57" is progressing.
* deployment openshift-controller-manager-operator/openshift-controller-manager-operator is progressing ReplicaSetUpdated: ReplicaSet "openshift-controller-manager-operator-85ffd56f86" is progressing.
...
Apr 15 08:20:37.779: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 84% complete
Apr 15 08:20:47.780: INFO: Completed upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-15-072621


[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/25745/build-log.txt

Comment 5 liujia 2020-04-20 08:59:28 UTC

Checked the latest log in e2e job[1].

Apr 20 00:29:17.596: INFO: cluster upgrade is Progressing: Working towards registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-19-234910: downloading update
Apr 20 00:29:27.599: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-19-234910: 0% complete
...
Apr 20 00:35:17.605: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-19-234910: 69% complete
Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors
...

Tracked this back to the related bug #1824981 and went through PR #185. This bug is for a message enhancement (adding the name/reason to the message "an unknown error has occurred"). @hongkliu, what do you think? Is it OK with you if QE verifies the bug against the enhancement in the PR?

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/26376/build-log.txt
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1824981#c3
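
The message change discussed in comment 5 (appending a reason such as MultipleErrors to "an unknown error has occurred") can be illustrated with a minimal sketch. The helper name `unknown_error_message` is hypothetical and this is not the cluster-version operator's code; it only reproduces the two message shapes quoted in this bug.

```python
def unknown_error_message(version, reason=""):
    """Build a status message shaped like the ones quoted in this bug.

    Hypothetical helper for illustration only; not the operator's code.
    """
    message = f"Unable to apply {version}: an unknown error has occurred"
    if reason:
        # With a reason available, append it so the message is less opaque.
        message += f": {reason}"
    return message


print(unknown_error_message("4.5.0-0.nightly-2020-04-19-234910", "MultipleErrors"))
```

Without a reason the helper falls back to the original, shorter message.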

Comment 6 W. Trevor King 2020-04-20 22:10:14 UTC

> Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors

Ugh, that is too coarse.  The Failing message unpacks all of those:

Apr 20 00:35:27.597: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-19-234910: an unknown error has occurred: MultipleErrors
Apr 20 00:35:27.597: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
* could not find the deployment openshift-authentication-operator/authentication-operator during rollout
* could not find the deployment openshift-cluster-samples-operator/cluster-samples-operator during rollout
* could not find the deployment openshift-console/downloads during rollout
* could not find the deployment openshift-controller-manager-operator/openshift-controller-manager-operator during rollout
* could not find the deployment openshift-image-registry/cluster-image-registry-operator during rollout
* could not find the deployment openshift-operator-lifecycle-manager/olm-operator during rollout
* could not find the deployment openshift-service-ca-operator/service-ca-operator during rollout

Maybe the Progressing reason and message in these cases should just echo the Failing reason and message?
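
The idea in comment 6 could look roughly like the sketch below: when a Failing condition is active, the Progressing message reuses its text instead of "an unknown error has occurred". This assumes conditions are plain dictionaries with `type`, `status`, `reason`, and `message` keys; the function name `progressing_message` and the fallback wording are hypothetical, not the operator's behavior.

```python
def progressing_message(version, conditions):
    """Return a Progressing message that echoes an active Failing condition.

    Hypothetical sketch; conditions are assumed to be dicts with
    "type", "status", "reason" and "message" keys.
    """
    failing = next(
        (c for c in conditions
         if c.get("type") == "Failing" and c.get("status") == "True"),
        None,
    )
    if failing is None:
        return f"Working towards {version}"
    # Fall back to the generic text only when Failing carries no message.
    detail = failing.get("message") or "an unknown error has occurred"
    return f"Unable to apply {version}: {detail}"


conditions = [
    {"type": "Available", "status": "True"},
    {
        "type": "Failing",
        "status": "True",
        "reason": "MultipleErrors",
        "message": "Multiple errors are preventing progress:\n"
                   "* could not find the deployment openshift-console/downloads during rollout",
    },
]
print(progressing_message("4.5.0-0.nightly-2020-04-19-234910", conditions))
```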

Comment 7 liujia 2020-04-21 04:12:10 UTC

> Maybe the Progressing reason and message in these cases should just echo the Failing reason and message?
I'm not sure what the expected result is for this bug. Need @hongkliu to confirm. But according to the bug description, "alerts and messages are too frightening" may point to "an unknown error has occurred". It may hint at an error/failure, but it's actually just an intermediate status.

Comment 8 Hongkai Liu 2020-04-22 18:39:30 UTC

In general, what should an admin do upon seeing this message for the cluster version?

If the upgrade process knows that retries are still ongoing, admins need not worry much.
In that case, it would be better for the message to reflect the ongoing retries.

If it is technically hard to know whether or not retries are ongoing, then give us some hint about how tolerant admins should be of such an error.
Or just ignore this completely.

When n% does not increase for some time and such errors occur, I cannot tell whether the upgrade has failed or is still in progress.
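
One way to make the "n% does not increase" check concrete is to scan the e2e-style log lines quoted in comments 4 and 5. This is a minimal sketch that assumes those two line shapes ("Working towards <version>: N% complete" and "Unable to apply ..."); the name `summarize` is hypothetical. It reads log text only, so it cannot tell whether the operator is still retrying.

```python
import re

# Matches lines such as "... Working towards <version>: 69% complete".
PERCENT = re.compile(r"Working towards \S+: (\d+)% complete")
UNABLE = "Unable to apply"


def summarize(lines):
    """Return (last percent seen, whether an 'Unable to apply' line came after it)."""
    percent = None
    error_since = False
    for line in lines:
        match = PERCENT.search(line)
        if match:
            percent = int(match.group(1))
            error_since = False  # progress was reported again after any error
        elif UNABLE in line:
            error_since = True
    return percent, error_since


log = [
    "Apr 15 08:11:07.827: INFO: cluster upgrade is Progressing: Working towards 4.5.0-0.nightly-2020-04-15-072621: 69% complete",
    "Apr 15 08:11:17.782: INFO: cluster upgrade is Progressing: Unable to apply 4.5.0-0.nightly-2020-04-15-072621: an unknown error has occurred",
]
print(summarize(log))
```

Once a later "N% complete" line appears, the error flag resets, which matches the upgrades above that recovered and completed.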

Comment 9 liujia 2020-04-23 01:49:39 UTC

@hongkai Thanks for confirming.
@W. Trevor King According to the clarification above, I think we need more enhancements here. Maybe something like "an unknown error has occurred, still retrying: MultipleErrors", which would show that the update process is not dead despite the unknown error and is still retrying. Anyway, assigning the bug back first.

Comment 10 Hongkai Liu 2020-04-28 19:21:43 UTC

Can I understand it this way?
With PROGRESSING = True, I can ignore any error message from clusterversion.

Comment 13 W. Trevor King 2020-06-21 14:18:35 UTC

Still not sure what to add here that's short of "just copy over the Failing message".  Adding UpcomingSprint

Comment 16 W. Trevor King 2020-07-10 21:33:19 UTC

Comment 13 is still current.

Comment 17 W. Trevor King 2020-08-01 05:42:28 UTC

Comment 13 is still current.

Comment 18 W. Trevor King 2020-08-21 22:27:08 UTC

Comment 13 is still current.

Comment 19 W. Trevor King 2020-09-12 21:04:24 UTC

Comment 13 is still current.

Comment 20 W. Trevor King 2020-10-02 23:14:30 UTC

Comment 13 is still current.

Comment 21 W. Trevor King 2020-10-25 15:49:33 UTC

Comment 13 is still current.  I've been trying to get some internal consensus we can add to the openshift/api docstrings.

Comment 22 W. Trevor King 2020-12-04 22:31:57 UTC

Comment 13 is still current.  Did not even have time to try to drive openshift/api consensus this sprint :/

Comment 23 Lalatendu Mohanty 2021-03-16 16:13:13 UTC

We made some improvements in the CVO that should now give more accurate progress messages. Going to close the bug. Please re-open if you still think we need to fix this.
