Bug 1768260 - Could not update deployment "openshift-console/downloads" - no error given
Summary: Could not update deployment "openshift-console/downloads" - no error given
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1804854
 
Reported: 2019-11-03 18:17 UTC by Ben Parees
Modified: 2020-07-13 17:12 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A failure to roll out a deployment during an update was logged only in the CVO logs, and only a generic error message was reported to ClusterVersion.
Consequence: The generic error message made it difficult for users and teams to debug the failure without looking at the CVO logs.
Fix: The CVO now exposes the underlying roll-out error in ClusterVersion.
Result: Easier debugging of deployment roll-outs during upgrades.
Clone Of:
: 1804854 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:12:05 UTC
Target Upstream Version:
Embargoed:




Links
* Github openshift/cluster-version-operator pull 316 (closed): Bug 1768260: lib,pkg: provide detailed errors for workload failures (last updated 2021-01-20 18:46:38 UTC)
* Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:12:27 UTC)

Description Ben Parees 2019-11-03 18:17:31 UTC
Description of problem:
During a failed upgrade the CVO appears to be reporting it could not update a resource:

* Could not update deployment "openshift-console/downloads" (290 of 433)

in:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10313

But no information is provided to help understand the nature of the failure (API server error?  unpatchable resource?  something else?)

The message should provide additional information to help teams understand how to resolve the issue.

Comment 1 Abhinav Dahiya 2019-11-04 17:54:36 UTC
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10313/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-879583d2070d7e98b24b73535d19d84c62983f16417fe84d40249f46119f332e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dd9fdbfb7-t8kqd/cluster-version-operator/cluster-version-operator/logs/current.log

```
2019-11-01T01:57:40.248306704Z I1101 01:57:40.248248       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
2019-11-01T01:57:43.248385305Z I1101 01:57:43.248330       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
2019-11-01T01:57:46.248534786Z I1101 01:57:46.248481       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
```

If you look at the CVO logs, the detailed message is there for easy debugging by owners. We have explicitly kept very detailed errors out of ClusterVersion because they don't provide value to customers.

Comment 2 Ben Parees 2019-11-04 18:26:46 UTC
Not sure if I agree w/ that philosophy (it means dev teams answering a support question are going to have to ask for another round of logs before they can dig into the problem, at a minimum), but ok.

Follow-on question: Why does a "not ready" deployment prevent updating that deployment?  Does the CVO refuse to patch a resource that's not healthy?  Am I misunderstanding what "could not update deployment" means in this context?

Comment 3 Abhinav Dahiya 2019-11-04 19:02:56 UTC
> Why does a "not ready" deployment prevent updating that deployment?  Does the CVO refuse to patch a resource that's not healthy?  Am i misunderstanding what "could not update deployment" means in this context?


Updating a deployment does not mean `object updated in the definition`; rather it means the newest/desired version of the deployment has completed rolling out, i.e. for a deployment the criteria is ready pods == required pods == updated pods, with no unavailable pods.
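To illustrate that criteria, here is a minimal sketch in Go of what such a rollout-complete check could look like against a Deployment's status. This is illustrative only; the function name and messages are invented here, not the CVO's actual check (the apps.go code referenced in the logs above).

```
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// deploymentRolledOut sketches the criteria described above: the rollout only
// counts as done once the controller has observed the latest spec and
// ready == updated == desired replicas with nothing unavailable.
func deploymentRolledOut(d *appsv1.Deployment) (bool, string) {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	switch {
	case d.Generation > d.Status.ObservedGeneration:
		return false, "deployment spec not yet observed by the controller"
	case d.Status.UpdatedReplicas < desired:
		return false, fmt.Sprintf("updated %d of %d replicas", d.Status.UpdatedReplicas, desired)
	case d.Status.ReadyReplicas < desired:
		return false, fmt.Sprintf("ready %d of %d replicas", d.Status.ReadyReplicas, desired)
	case d.Status.UnavailableReplicas > 0:
		return false, fmt.Sprintf("%d replicas unavailable", d.Status.UnavailableReplicas)
	}
	return true, ""
}

func main() {
	// Mirror the status from the logs above: 2 replicas, 2 updated, 1 ready, 1 unavailable.
	two := int32(2)
	d := &appsv1.Deployment{}
	d.Spec.Replicas = &two
	d.Status.UpdatedReplicas = 2
	d.Status.ReadyReplicas = 1
	d.Status.UnavailableReplicas = 1

	done, reason := deploymentRolledOut(d)
	fmt.Println(done, reason) // false ready 1 of 2 replicas
}
```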

Comment 4 Ben Parees 2019-11-04 19:49:07 UTC
Seems like it's more of a "waiting for deployment foo to roll out" condition than "could not update" then?  Can we make the message clearer?

Comment 5 Abhinav Dahiya 2019-11-04 20:01:41 UTC
(In reply to Ben Parees from comment #4)
> Seems like it's more of a "waiting for deployment foo to roll out" condition
> than "could not update" then?  Can we make the message clearer?

For the CVO and the user, it doesn't matter that the deployment object was updated to the latest version if the pods are not yet running that version; the required result is still to make sure the latest code is running, and that is what `updated` means for the CVO in this context.
So IMO the CVO is providing the correct error.

Comment 6 Ben Parees 2019-11-04 20:09:09 UTC
It's not clear who/what "could not update" the deployment.  The message makes it sound (to me) like the CVO couldn't update it.

The deployment failed to update.  Nothing failed to update it.

I would not read that error message and think "oh, I should go look at the deployment and see why it's stuck".

Comment 7 Clayton Coleman 2020-02-05 16:58:29 UTC
We should offer more targeted error causes on specific things.  We have specific messages for cluster operators; it would be acceptable to have specific errors for workload types (specifically around "not yet available").

The code here is the core (note the comment!)

```
	// log the errors to assist in debugging future summarization
	if klog.V(4) {
		klog.Infof("Summarizing %d errors", len(errs))
		for _, err := range errs {
			if uErr, ok := err.(*payload.UpdateError); ok {
				if uErr.Task != nil {
					klog.Infof("Update error %d of %d: %s %s (%T: %v)", uErr.Task.Index, uErr.Task.Total, uErr.Reason, uErr.Message, uErr.Nested, uErr.Nested)
				} else {
					klog.Infof("Update error: %s %s (%T: %v)", uErr.Reason, uErr.Message, uErr.Nested, uErr.Nested)
				}
			} else {
				klog.Infof("Update error: %T: %v", err, err)
			}
		}
	}

	// collapse into a set of common errors where necessary
	if len(errs) == 1 {
		return errs[0]
	}
	// hide the generic "not available yet" when there are more specific errors present
	if filtered := filterErrors(errs, isClusterOperatorNotAvailable); len(filtered) > 0 {
		return newMultipleError(filtered)
	}
	// if we're only waiting for operators, condense the error down to a singleton
	if err := newClusterOperatorsNotAvailable(errs); err != nil {
		return err
	}
```

If we have evidence that a large chunk of upgrade failures are related to rollout (which anecdotally I would believe), then this sort of summarization is appropriate. I would say that any deployment/statefulset/daemonset we wait on should report a roughly equivalent reason and error when we can't make progress, something roughly generic to "rollout", and we should perform the same summarization.  I might suggest "WorkloadNotAvailable" or "RolloutNotProgressing" as a reason and have SummaryForReason() provide a similar mapping.
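For illustration, a rough sketch of what such a "WorkloadNotAvailable" reason and summarization could look like, using a local type that mirrors the payload.UpdateError fields visible in the excerpt above. The helper names and messages here are hypothetical, not what any PR actually merged.

```
package main

import (
	"fmt"
	"strings"
)

// updateError mirrors the shape of the payload.UpdateError seen in the excerpt
// above (Reason/Message/Nested); everything below is a sketch of the suggested
// "WorkloadNotAvailable" condensation.
type updateError struct {
	Reason  string
	Message string
	Nested  error
}

func (e *updateError) Error() string { return e.Message }

// newWorkloadError (hypothetical helper) attaches a specific reason to a
// deployment/daemonset/statefulset rollout failure so ClusterVersion can
// surface it instead of a generic "Could not update" message.
func newWorkloadError(kind, namespace, name, detail string) *updateError {
	return &updateError{
		Reason:  "WorkloadNotAvailable",
		Message: fmt.Sprintf("%s %s/%s is not available: %s", kind, namespace, name, detail),
	}
}

// summarizeWorkloadErrors (hypothetical helper) condenses multiple rollout
// failures into one error, analogous to newClusterOperatorsNotAvailable above.
func summarizeWorkloadErrors(errs []*updateError) error {
	if len(errs) == 0 {
		return nil
	}
	if len(errs) == 1 {
		return errs[0]
	}
	msgs := make([]string, 0, len(errs))
	for _, e := range errs {
		msgs = append(msgs, e.Message)
	}
	return &updateError{
		Reason:  "WorkloadNotAvailable",
		Message: "multiple workloads are not available: " + strings.Join(msgs, "; "),
	}
}

func main() {
	err := newWorkloadError("deployment", "openshift-console", "downloads",
		"1 of 2 replicas unavailable (MinimumReplicasUnavailable)")
	fmt.Println(summarizeWorkloadErrors([]*updateError{err}))
}
```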

Comment 10 Johnny Liu 2020-04-01 11:57:52 UTC
In https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23885, "WorkloadNotAvailable" is already shown in the log.

Comment 12 errata-xmlrpc 2020-07-13 17:12:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

