Bug 1804854 - Could not update deployment "openshift-console/downloads" - no error given
Summary: Could not update deployment "openshift-console/downloads" - no error given
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Abhinav Dahiya
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1768260
Blocks:
 
Reported: 2020-02-19 18:25 UTC by Scott Dodson
Modified: 2020-05-13 21:59 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1768260
Environment:
Last Closed: 2020-05-13 21:59:21 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/cluster-version-operator pull 329 (closed): [release-4.4] Bug 1804854: lib,pkg: provide detailed errors for workload failures (last updated 2020-05-18 15:12:58 UTC)
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-13 21:59:22 UTC)

Description Scott Dodson 2020-02-19 18:25:11 UTC
+++ This bug was initially created as a clone of Bug #1768260 +++

Description of problem:
During a failed upgrade the CVO appears to be reporting it could not update a resource:

* Could not update deployment "openshift-console/downloads" (290 of 433)

in:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10313

But no information is provided to help understand the nature of the failure (an API server error? An unpatchable resource? Something else?).

The message should provide additional information to help teams understand how to resolve the issue.

--- Additional comment from Abhinav Dahiya on 2019-11-04 12:54:36 EST ---

> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10313/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-879583d2070d7e98b24b73535d19d84c62983f16417fe84d40249f46119f332e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dd9fdbfb7-t8kqd/cluster-version-operator/cluster-version-operator/logs/current.log

```
2019-11-01T01:57:40.248306704Z I1101 01:57:40.248248       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
2019-11-01T01:57:43.248385305Z I1101 01:57:43.248330       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
2019-11-01T01:57:46.248534786Z I1101 01:57:46.248481       1 apps.go:115] Deployment downloads is not ready. status: (replicas: 2, updated: 2, ready: 1, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability.)
```

If you look at the CVO logs, they show the detailed message for easy debugging by component owners. We have explicitly kept very detailed errors out of the ClusterVersion status because they don't provide value to customers.

--- Additional comment from Ben Parees on 2019-11-04 13:26:46 EST ---

Not sure if I agree with that philosophy (at a minimum, it means dev teams answering a support question are going to have to ask for another round of logs before they can dig into the problem), but OK.

Follow-on question: why does a "not ready" deployment prevent updating that deployment? Does the CVO refuse to patch a resource that's not healthy? Am I misunderstanding what "could not update deployment" means in this context?

--- Additional comment from Abhinav Dahiya on 2019-11-04 14:02:56 EST ---

> Why does a "not ready" deployment prevent updating that deployment? Does the CVO refuse to patch a resource that's not healthy? Am I misunderstanding what "could not update deployment" means in this context?


Updating a deployment does not mean `the object definition was updated`; it means the newest/desired version of the deployment has completed its rollout. For a deployment the criterion is ready pods == required pods == updated pods, with no unavailable pods.
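
To make that criterion concrete, below is a minimal, self-contained sketch. It is not the actual CVO code (the real check is the apps.go logic referenced in the log excerpt above); it only illustrates the rollout-completeness criterion described here using the standard apps/v1 DeploymentStatus fields. The sample status mirrors the "downloads" log lines (2 replicas, 2 updated, 1 ready, 1 unavailable), so the check reports the rollout as incomplete.

```
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// rolledOut reports whether the newest/desired version of the Deployment is
// fully running: every replica updated and ready, none unavailable. This is
// the sense in which the CVO considers a deployment "updated"; it is only a
// sketch of the criterion described above, not the CVO's implementation.
func rolledOut(d *appsv1.Deployment) bool {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == desired &&
		d.Status.ReadyReplicas == desired &&
		d.Status.UnavailableReplicas == 0
}

func main() {
	two := int32(2)
	// Matches the "downloads" deployment status from the log excerpt:
	// replicas: 2, updated: 2, ready: 1, unavailable: 1.
	downloads := &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{Replicas: &two},
		Status: appsv1.DeploymentStatus{
			Replicas:            2,
			UpdatedReplicas:     2,
			ReadyReplicas:       1,
			UnavailableReplicas: 1,
		},
	}
	fmt.Println("rollout complete:", rolledOut(downloads)) // prints: rollout complete: false
}
```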

--- Additional comment from Ben Parees on 2019-11-04 14:49:07 EST ---

Seems like it's more of a "waiting for deployment foo to roll out" condition than a "could not update" condition, then? Can we make the message clearer?

--- Additional comment from Abhinav Dahiya on 2019-11-04 15:01:41 EST ---

(In reply to Ben Parees from comment #4)
> Seems like it's more of a "waiting for deployment foo to roll out" condition
> than a "could not update" condition, then? Can we make the message clearer?

For the CVO and the user, it does not matter that the deployment object was updated to the latest version if the pods are not yet running that version; the required result is still to make sure the latest code is running, and that is what `updated` means for the CVO in this context.
So IMO the CVO is providing the correct error.

--- Additional comment from Ben Parees on 2019-11-04 15:09:09 EST ---

It's not clear who or what "could not update" the deployment. The message makes it sound (to me) like the CVO couldn't update it.

The deployment failed to update; nothing failed to update it.

I would not read that error message and think "oh, I should go look at the deployment and see why it's stuck".

--- Additional comment from Clayton Coleman on 2020-02-05 11:58:29 EST ---

We should offer more targeted error causes for specific things. We have specific messages for cluster operators; it would be acceptable to have specific errors for workload types (specifically around "not yet available").

The core of the code is here (note the comment!):

	// log the errors to assist in debugging future summarization
	if klog.V(4) {
		klog.Infof("Summarizing %d errors", len(errs))
		for _, err := range errs {
			if uErr, ok := err.(*payload.UpdateError); ok {
				if uErr.Task != nil {
					klog.Infof("Update error %d of %d: %s %s (%T: %v)", uErr.Task.Index, uErr.Task.Total, uErr.Reason, uErr.Message, uErr.Nested, uErr.Nested)
				} else {
					klog.Infof("Update error: %s %s (%T: %v)", uErr.Reason, uErr.Message, uErr.Nested, uErr.Nested)
				}
			} else {
				klog.Infof("Update error: %T: %v", err, err)
			}
		}
	}

	// collapse into a set of common errors where necessary
	if len(errs) == 1 {
		return errs[0]
	}
	// hide the generic "not available yet" when there are more specific errors present
	if filtered := filterErrors(errs, isClusterOperatorNotAvailable); len(filtered) > 0 {
		return newMultipleError(filtered)
	}
	// if we're only waiting for operators, condense the error down to a singleton
	if err := newClusterOperatorsNotAvailable(errs); err != nil {
		return err
	}

If we have evidence that a large chunk of upgrade failures are related to rollout (which, anecdotally, I would believe), then this sort of summarization is appropriate. I would say that deployments/statefulsets/daemonsets we wait on should get a roughly equivalent reason and error when we can't make progress, generic to "rollout", and we should perform the same summarization. I might suggest "WorkloadNotAvailable" or "RolloutNotProgressing" as a reason and have SummaryForReason() provide a similar mapping.
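
As a rough illustration only (the actual change landed via PR 329 linked above and is not reproduced here), a workload-specific error along those lines might look like the sketch below. The UpdateError type is a simplified local stand-in modeled on the fields visible in the snippet above (Reason, Message, Nested), and newWorkloadNotAvailable is a hypothetical helper name; the message format mirrors what QA later observed in CI (comment 4).

```
package main

import "fmt"

// UpdateError is a simplified local stand-in for the CVO's payload.UpdateError,
// reduced to the fields visible in the code quoted above.
type UpdateError struct {
	Reason  string
	Message string
	Nested  error
}

func (e *UpdateError) Error() string { return e.Message }

// newWorkloadNotAvailable is a hypothetical helper: it folds the detailed
// workload status (already logged at V(4)) into an error whose Reason groups
// all stuck rollouts under one name, so summarization code can condense them
// the same way it condenses ClusterOperatorNotAvailable errors.
func newWorkloadNotAvailable(kind, namespace, name, detail string) *UpdateError {
	return &UpdateError{
		Reason:  "WorkloadNotAvailable",
		Message: fmt.Sprintf("%s %s/%s is not available %s", kind, namespace, name, detail),
	}
}

func main() {
	err := newWorkloadNotAvailable("deployment", "openshift-console", "downloads",
		"MinimumReplicasUnavailable: Deployment does not have minimum availability.")
	fmt.Printf("%s: %s\n", err.Reason, err.Error())
	// Output: WorkloadNotAvailable: deployment openshift-console/downloads is not
	// available MinimumReplicasUnavailable: Deployment does not have minimum availability.
}
```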

Comment 4 liujia 2020-02-26 08:40:58 UTC
4.0-0.ci-2020-02-25-165819 (the latest 4.4 CI build for upgrade)

1. No ReplicaFailure condition for any deployment in the past 2 days of CI jobs, so no WorkloadNotProgressing error was reported.
https://search.svc.ci.openshift.org/?search=WorkloadNotProgressing&maxAge=48h&context=2&type=junit

2. Deployments with the Available condition false, and deployments with the Progressing condition true, both appeared in the past 2 days of CI jobs, and WorkloadNotAvailable errors were reported for them:
https://search.svc.ci.openshift.org/?search=WorkloadNotAvailable&maxAge=48h&context=2&type=junit
```
Feb 25 18:20:28.399 E clusterversion/version changed Failing to True: WorkloadNotAvailable: deployment openshift-kube-apiserver-operator/kube-apiserver-operator is not available MinimumReplicasUnavailable: Deployment does not have minimum availability.
```
```
Feb 25 15:59:54.709 E clusterversion/version changed Failing to True: WorkloadNotAvailable: deployment openshift-console-operator/console-operator is progressing ReplicaSetUpdated: ReplicaSet "console-operator-d6965b957" is progressing.
```

3. No WorkloadNotAvailable errors for daemonsets in the past 2 days of CI jobs.
https://search.svc.ci.openshift.org/?search=WorkloadNotAvailable&maxAge=48h&context=2&type=junit

Comment 6 errata-xmlrpc 2020-05-13 21:59:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

