Bug 1768260

Summary: Could not update deployment "openshift-console/downloads" - no error given
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Cluster Version Operator
Assignee: Abhinav Dahiya <adahiya>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.3.0
CC: adahiya, aos-bugs, ccoleman, jialiu, jokerman
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A failure to roll out a deployment during an update was logged only in the CVO logs, and only a general error message was reported to ClusterVersion.
Consequence: The general error message made it difficult for users and teams to debug the failure without looking at the CVO logs.
Fix: The CVO now exposes the underlying roll-out error in ClusterVersion.
Result: Easier debugging of deployment roll-outs during upgrades.
Story Points: ---
Clone Of:
Clones: 1804854 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:12:05 UTC
Type: Bug
Regression: ---
Bug Depends On:    
Bug Blocks: 1804854    

Description Ben Parees 2019-11-03 18:17:31 UTC
Description of problem:
During a failed upgrade the CVO appears to be reporting it could not update a resource:

* Could not update deployment "openshift-console/downloads" (290 of 433)

in:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10313

But no information is provided to help understand the nature of the failure (api server error?  unpatchable resource?  something else?)

The message should provide additional information to help teams understand how to resolve the issue.

Comment 1 Abhinav Dahiya 2019-11-04 17:54:36 UTC
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10313/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-879583d2070d7e98b24b73535d19d84c62983f16417fe84d40249f46119f332e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dd9fdbfb7-t8kqd/cluster-version-operator/cluster-version-operator/logs/current.log

```
2019-11-01T01:57:40.248306704Z I1101 01:57:40.248248       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
2019-11-01T01:57:43.248385305Z I1101 01:57:43.248330       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
2019-11-01T01:57:46.248534786Z I1101 01:57:46.248481       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
```

If you look at the CVO logs, they show the detailed message for easy debugging by owners; we have explicitly kept very detailed errors out of ClusterVersion because they don't provide value to customers.
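
For anyone debugging from the general ClusterVersion error: the detail in those log lines is assembled from standard Deployment status fields. Below is a minimal Go sketch of how such a line could be built from an appsv1.Deployment; `describeDeploymentStatus` is a hypothetical helper for illustration, not the CVO's actual apps.go code.

```
// Minimal sketch, assuming only the standard appsv1 API. describeDeploymentStatus
// is a hypothetical helper; the CVO's real logging code may differ.
package deploymentstatus

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func describeDeploymentStatus(d *appsv1.Deployment) string {
	// The reason/message come from the Available condition, which is where
	// "MinimumReplicasUnavailable" / "Deployment does not have minimum
	// availability." originate.
	var reason, message string
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			reason, message = c.Reason, c.Message
		}
	}
	return fmt.Sprintf(
		"Deployment %s is not ready. status: (replicas: %d, updated: %d, ready: %d, unavailable: %d, reason: %s, message: %s)",
		d.Name, d.Status.Replicas, d.Status.UpdatedReplicas,
		d.Status.ReadyReplicas, d.Status.UnavailableReplicas, reason, message)
}
```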

Comment 2 Ben Parees 2019-11-04 18:26:46 UTC
Not sure I agree with that philosophy (at a minimum, it means dev teams answering a support question will have to ask for another round of logs before they can dig into the problem), but ok.

Follow-on question: Why does a "not ready" deployment prevent updating that deployment? Does the CVO refuse to patch a resource that's not healthy? Am I misunderstanding what "could not update deployment" means in this context?

Comment 3 Abhinav Dahiya 2019-11-04 19:02:56 UTC
> Why does a "not ready" deployment prevent updating that deployment?  Does the CVO refuse to patch a resource that's not healthy?  Am i misunderstanding what "could not update deployment" means in this context?


"Updating" a deployment here does not mean `object updated in the definition`; it means the newest/desired version of the deployment has completed. For a deployment, that criterion is ready pods == required pods == updated pods, with no unavailable pods.
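
A minimal sketch of that rollout-completeness criterion, assuming standard appsv1.Deployment status fields; the CVO's actual check in apps.go may differ in detail.

```
// Sketch only: not the CVO's apps.go implementation.
package deploymentcheck

import appsv1 "k8s.io/api/apps/v1"

// deploymentRolledOut reports whether the desired revision is fully running:
// the controller has observed the latest spec, every replica is updated and
// ready, and none are unavailable.
func deploymentRolledOut(d *appsv1.Deployment) bool {
	desired := int32(1) // Kubernetes defaults spec.replicas to 1 when unset
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == desired &&
		d.Status.ReadyReplicas == desired &&
		d.Status.UnavailableReplicas == 0
}
```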

Comment 4 Ben Parees 2019-11-04 19:49:07 UTC
Seems like it's more of a "waiting for deployment foo to roll out" condition than "could not update" then? Can we make the message clearer?

Comment 5 Abhinav Dahiya 2019-11-04 20:01:41 UTC
(In reply to Ben Parees from comment #4)
> Seems like it's more of a "waiting for deployment foo to roll out" condition
> than "could not update" then? Can we make the message clearer?

For the CVO and the user, it doesn't matter that the deployment object was updated to the latest version if the pods are not yet running that version; the required result is that the latest code is running, and that is what `updated` means for the CVO in this context.
So imo the CVO is providing the correct error.

Comment 6 Ben Parees 2019-11-04 20:09:09 UTC
It's not clear who/what "could not update" the deployment. The message makes it sound (to me) like the CVO couldn't update it.

The deployment failed to update; nothing failed to update it.

I would not read that error message and think "oh, I should go look at the deployment and see why it's stuck".

Comment 7 Clayton Coleman 2020-02-05 16:58:29 UTC
We should offer more targeted error causes for specific things. We have specific messages for cluster operators; it would be acceptable to have specific errors for workload types (specifically around "not yet available").

Here is the core of the code (note the comment!):

```
	// log the errors to assist in debugging future summarization
	if klog.V(4) {
		klog.Infof("Summarizing %d errors", len(errs))
		for _, err := range errs {
			if uErr, ok := err.(*payload.UpdateError); ok {
				if uErr.Task != nil {
					klog.Infof("Update error %d of %d: %s %s (%T: %v)", uErr.Task.Index, uErr.Task.Total, uErr.Reason, uErr.Message, uErr.Nested, uErr.Nested)
				} else {
					klog.Infof("Update error: %s %s (%T: %v)", uErr.Reason, uErr.Message, uErr.Nested, uErr.Nested)
				}
			} else {
				klog.Infof("Update error: %T: %v", err, err)
			}
		}
	}

	// collapse into a set of common errors where necessary
	if len(errs) == 1 {
		return errs[0]
	}
	// hide the generic "not available yet" when there are more specific errors present
	if filtered := filterErrors(errs, isClusterOperatorNotAvailable); len(filtered) > 0 {
		return newMultipleError(filtered)
	}
	// if we're only waiting for operators, condense the error down to a singleton
	if err := newClusterOperatorsNotAvailable(errs); err != nil {
		return err
	}
```

If we have evidence that a large chunk of upgrade failures are related to rollout (which, anecdotally, I would believe), then this sort of summarization is appropriate. I would say that the deployments/statefulsets/daemonsets we wait on should get a roughly equivalent reason and error, generic to "rollout", when they can't make progress, and we should perform the same summarization. I might suggest "WorkloadNotAvailable" or "RolloutNotProgressing" as the reason and have SummaryForReason() provide a similar mapping.
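
For illustration, here is a sketch of what that condensation could look like, modeled on the newClusterOperatorsNotAvailable pattern in the snippet above. The UpdateError type and the newWorkloadsNotAvailable helper are simplified stand-ins, not the CVO's actual payload API.

```
// Sketch of the suggested workload summarization; names and types are
// simplified stand-ins for the CVO's payload package.
package summarize

import (
	"fmt"
	"strings"
)

// UpdateError mirrors (in simplified form) the fields used in the snippet above.
type UpdateError struct {
	Reason  string
	Message string
	Name    string
}

func (e *UpdateError) Error() string { return e.Message }

// newWorkloadsNotAvailable condenses errors that all carry the proposed
// "WorkloadNotAvailable" reason into one summary error; it returns nil when
// any error has a different (more specific) reason, so that error is
// surfaced instead.
func newWorkloadsNotAvailable(errs []error) error {
	names := make([]string, 0, len(errs))
	for _, err := range errs {
		uErr, ok := err.(*UpdateError)
		if !ok || uErr.Reason != "WorkloadNotAvailable" {
			return nil
		}
		names = append(names, uErr.Name)
	}
	if len(names) == 0 {
		return nil
	}
	return &UpdateError{
		Reason: "WorkloadNotAvailable",
		Message: fmt.Sprintf("Some workloads have not yet rolled out: %s",
			strings.Join(names, ", ")),
	}
}
```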

Comment 10 Johnny Liu 2020-04-01 11:57:52 UTC
In https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23885, "WorkloadNotAvailable" is already shown in the log.

Comment 12 errata-xmlrpc 2020-07-13 17:12:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409