1711964 – "Error while reconciling" and "the update could not be applied" many hours after upgrade reported complete/successful


Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Abhinav Dahiya
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-05-20 14:07 UTC by Mike Fiedler
Modified:	2023-09-14 05:28 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-09-30 17:17:59 UTC
Target Upstream Version:
Embargoed:

Attachments
4.1.2 listings (9.37 KB, text/plain) 2019-06-17 15:49 UTC, Justin Pierce	no flags

Description Mike Fiedler 2019-05-20 14:07:08 UTC

Description of problem:

After upgrading from 4.1.0-0.nightly-2019-05-17-041605 to 4.1.0-0.nightly-2019-05-18-050636, I am running `oc get clusterversion` every minute. Every 3-4 hours the cluster goes through a period where it reports that the update could not be applied. In between, it reports good status.

Some snippets:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         9h      Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied 

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         12h     Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied


NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         16h     Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied


In between, it reports a healthy status like this:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         16h     Cluster version is 4.1.0-0.nightly-2019-05-18-050636                                  


Version-Release number of selected component (if applicable):  4.1.0-0.nightly-2019-05-18-050636


How reproducible: Unknown - 1/1 so far.


Steps to Reproduce:
1. Install 4.1.0-0.nightly-2019-05-17-041605
2. Upgrade to 4.1.0-0.nightly-2019-05-18-050636
3. Run `oc get clusterversion` every minute and watch the status, especially many hours after the upgrade purportedly succeeds
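
The polling in step 3 could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual setup: the names `status_has_error` and `poll` are hypothetical, and it assumes `oc` is on the PATH and already logged in to the cluster. It only looks for the "Error while reconciling" text in the table output and prints a timestamped line when that state changes.

```python
import subprocess
import time
from datetime import datetime, timezone

# Text that appears in the STATUS column during the bad periods.
ERROR_MARKER = "Error while reconciling"


def status_has_error(table_output: str) -> bool:
    """Return True if any data row of `oc get clusterversion` output
    contains the reconcile error text."""
    rows = table_output.strip().splitlines()[1:]  # skip the header row
    return any(ERROR_MARKER in row for row in rows)


def poll(interval_seconds: int = 60) -> None:
    """Run `oc get clusterversion` in a loop (hypothetical helper) and
    print a timestamped line each time the error state flips."""
    previous = None
    while True:
        result = subprocess.run(
            ["oc", "get", "clusterversion"],
            capture_output=True,
            text=True,
            check=False,
        )
        failing = status_has_error(result.stdout)
        if failing != previous:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} reconcile_error={failing}")
            previous = failing
        time.sleep(interval_seconds)
```

Calling `poll()` starts the loop. Logging only the transitions makes it easier to see how long each bad period lasts and how often it recurs.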

Actual results:

Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied

Expected results:

No `oc get clusterversion` errors

Additional info:

`oc adm must-gather` output will be attached.

Comment 2 Brenton Leanhardt 2019-05-20 18:17:47 UTC

Unfortunately the must-gather logs aren't going to contain any actionable info in this situation.  This is what we need to fix (though in 4.1.z).

At the same time, we'll likely improve the wording in the CVO to make it more clear that the problem is that another operator is flapping.

Comment 3 W. Trevor King 2019-05-20 22:52:50 UTC

> At the same time, we'll likely improve the wording in the CVO to make it more clear that the problem is that another operator is flapping.

Some initial groundwork for this in https://github.com/openshift/cluster-version-operator/pull/194

Comment 5 Justin Pierce 2019-06-17 15:49:12 UTC

Created attachment 1581485
4.1.2 listings

Comment 6 Abhinav Dahiya 2019-06-24 20:33:55 UTC

Based on https://bugzilla.redhat.com/attachment.cgi?id=1581485, the CVO is correctly reporting that it's failing to make progress on reconcile due to cloud-creds-operator.

The summary for `oc get clusterversion version` cannot be all-encompassing. It provides enough detail to point users to the actual object for more information.

I would like to see concrete examples of status updates in the object in contrast to the expected message from users.
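
One way to look at the details in the actual object is to dump every entry in `status.conditions` rather than relying on the one-line summary. The sketch below is only illustrative: `condition_lines` and `main` are hypothetical names, and it assumes the object JSON has been fetched separately, for example with `oc get clusterversion version -o json`.

```python
import json
import sys
from typing import List


def condition_lines(cluster_version: dict) -> List[str]:
    """Format each status condition of a ClusterVersion object as
    'Type=Status: message' (hypothetical helper)."""
    conditions = cluster_version.get("status", {}).get("conditions", [])
    lines = []
    for cond in conditions:
        line = (
            f'{cond.get("type", "?")}={cond.get("status", "?")}: '
            f'{cond.get("message", "")}'
        )
        lines.append(line.rstrip())  # drop the trailing space if no message
    return lines


def main() -> None:
    """Read the ClusterVersion JSON from stdin and print one line per
    condition."""
    for line in condition_lines(json.load(sys.stdin)):
        print(line)
```

If the file is saved as, say, `cv_conditions.py` with a call to `main()` at the end, it could be used as `oc get clusterversion version -o json | python3 cv_conditions.py`.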

Comment 7 Justin Pierce 2019-06-25 13:54:09 UTC

It seems like https://bugzilla.redhat.com/show_bug.cgi?id=1714484 was the root cause of my cloud credential operator failing, so the reconciling error message was valid. One area to consider is the message "Cluster version is quay.io/openshift-release-dev/ocp-release:4.1.2", which reads as success to a typical user and seems to contradict "Error while reconciling 4.1.2: the update could not be applied". Consistently displaying the error message, or concatenating the messages, would have left no room for misunderstanding.

Since there is no ERROR column in the clusterversion output, this message presently serves as an important UX cue for a human operator to sanity-check the CVO's state.

[ec2-user us-east-1 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        False         2d20h   Cluster version is quay.io/openshift-release-dev/ocp-release:4.1.2
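
The "concatenating the messages" idea could look something like the sketch below. This is not how the CVO builds its status. It is a client-side illustration under two assumptions: that the STATUS column comes from the `Progressing` condition's message, and that the object also carries a `Failing` condition. The helper names `find_condition` and `combined_status` are hypothetical.

```python
from typing import Optional


def find_condition(cluster_version: dict, cond_type: str) -> Optional[dict]:
    """Return the first status condition with the given type, or None."""
    for cond in cluster_version.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def combined_status(cluster_version: dict) -> str:
    """Build a one-line status that always includes the failure message
    when the (assumed) Failing condition is True."""
    progressing = find_condition(cluster_version, "Progressing") or {}
    failing = find_condition(cluster_version, "Failing") or {}
    summary = progressing.get("message", "")
    failure = failing.get("message", "")
    # Append the failure text only if it is not already in the summary.
    if failing.get("status") == "True" and failure and failure not in summary:
        return f"{summary} (failing: {failure})".strip()
    return summary
```

With this approach a "Cluster version is ..." summary would still carry the failure text whenever the failing condition is set, so a healthy-looking line could not hide an active error.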

Comment 9 Scott Dodson 2019-09-30 17:17:59 UTC

This is working as intended.

Comment 10 Red Hat Bugzilla 2023-09-14 05:28:53 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
