Bug 1711964 - "Error while reconciling" and "the update could not be applied" many hours after upgrade reported complete/successful [NEEDINFO]
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Abhinav Dahiya
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-05-20 14:07 UTC by Mike Fiedler
Modified: 2019-09-04 19:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
erich: needinfo? (jupierce)


Attachments
4.1.2 listings (9.37 KB, text/plain)
2019-06-17 15:49 UTC, Justin Pierce

Description Mike Fiedler 2019-05-20 14:07:08 UTC
Description of problem:

After upgrading from 4.1.0-0.nightly-2019-05-17-041605 to 4.1.0-0.nightly-2019-05-18-050636, I have been running oc get clusterversion every minute. Every 3-4 hours the cluster goes through a period where it reports that the update could not be applied. In between, it reports a good status.

Some snippets:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         9h      Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied 

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         12h     Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied


NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         16h     Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied


In between it reports like this:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         16h     Cluster version is 4.1.0-0.nightly-2019-05-18-050636                                  


Version-Release number of selected component (if applicable):  4.1.0-0.nightly-2019-05-18-050636


How reproducible: Unknown - 1/1 so far.


Steps to Reproduce:
1. Install 4.1.0-0.nightly-2019-05-17-041605
2. Upgrade to 4.1.0-0.nightly-2019-05-18-050636
3. Run oc get clusterversion every minute and watch the status, especially many hours after the upgrade purportedly succeeds
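
The once-a-minute check in step 3 can be scripted. A minimal sketch, assuming `oc` is on the PATH and already logged in to the cluster; the helper names `is_reconcile_error`, `count_flaps`, and `poll` are hypothetical, and the match string is taken from the STATUS text shown above:

```python
import subprocess
import time

ERROR_PREFIX = "Error while reconciling"  # text taken from the STATUS column above


def is_reconcile_error(table_output: str) -> bool:
    """Return True if any data row of the tabular output reports the error."""
    rows = table_output.strip().splitlines()[1:]  # skip the header row
    return any(ERROR_PREFIX in row for row in rows)


def count_flaps(samples: list) -> int:
    """Count transitions into the error state (a leading error sample counts as one)."""
    flaps = 0
    previous = False
    for current in samples:
        if current and not previous:
            flaps += 1
        previous = current
    return flaps


def poll(interval_seconds: int = 60) -> None:
    """Run `oc get clusterversion` in a loop and print a line whenever the state changes."""
    previous = None
    while True:
        result = subprocess.run(
            ["oc", "get", "clusterversion"],
            capture_output=True, text=True, check=False,
        )
        current = is_reconcile_error(result.stdout)
        if current != previous:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "error" if current else "ok")
            previous = current
        time.sleep(interval_seconds)
```

Calling `poll()` starts the loop; logging only state changes makes the 3-4 hour pattern easier to spot than reading one line per minute.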

Actual results:

Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied

Expected results:

No oc get clusterversion errors

Additional info:

oc adm must-gather will be attached.

Comment 2 Brenton Leanhardt 2019-05-20 18:17:47 UTC
Unfortunately the must-gather logs aren't going to contain any actionable info in this situation.  This is what we need to fix (though in 4.1.z).

At the same time, we'll likely improve the wording in the CVO to make it more clear that the problem is that another operator is flapping.

Comment 3 W. Trevor King 2019-05-20 22:52:50 UTC
> At the same time, we'll likely improve the wording in the CVO to make it more clear that the problem is that another operator is flapping.

Some initial groundwork for this in https://github.com/openshift/cluster-version-operator/pull/194

Comment 5 Justin Pierce 2019-06-17 15:49:12 UTC
Created attachment 1581485
4.1.2 listings

Comment 6 Abhinav Dahiya 2019-06-24 20:33:55 UTC
Based on https://bugzilla.redhat.com/attachment.cgi?id=1581485, the CVO is correctly reporting that it is failing to make progress on reconcile due to cloud-creds-operator.

The summary for `oc get clusterversion version` cannot be all-encompassing. It provides enough detail to point the user to the full details in the actual object.

I would like to see concrete examples of status updates in the object in contrast to the expected message from users.
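
To compare the one-line summary with the object itself, the conditions can be listed from the output of `oc get clusterversion version -o json`. A rough sketch, assuming the object carries a `status.conditions` list whose entries have `type`, `status`, and `message` fields; `format_conditions` and `main` are hypothetical helpers, not part of `oc` or the CVO:

```python
import json
import sys


def format_conditions(cluster_version: dict) -> list:
    """Return one 'Type=Status: message' line per entry in status.conditions."""
    conditions = cluster_version.get("status", {}).get("conditions", [])
    lines = []
    for cond in conditions:
        line = f"{cond.get('type', '?')}={cond.get('status', '?')}"
        message = cond.get("message")
        if message:  # the message field is treated as optional here
            line += f": {message}"
        lines.append(line)
    return lines


def main() -> None:
    """Read the ClusterVersion JSON from stdin and print its conditions."""
    for line in format_conditions(json.load(sys.stdin)):
        print(line)
```

Calling `main()` with the JSON on stdin prints each condition on its own line, which makes it easier to quote the object's status next to the summary message.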

Comment 7 Justin Pierce 2019-06-25 13:54:09 UTC
It seems like https://bugzilla.redhat.com/show_bug.cgi?id=1714484 was the root cause of my cloud credential operator failing, so the reconciling error message was valid. An area to consider is the message "Cluster version is quay.io/openshift-release-dev/ocp-release:4.1.2", which reads as success to a typical user and seems to contradict "Error while reconciling 4.1.2: the update could not be applied". Consistently displaying the error message or concatenating the messages would have left no room for misunderstanding.

Since there is no ERROR column in the clusterversion output, this message presently serves as an important piece of UX for a human operator to sanity check the CVO's state.

[ec2-user@stg-1.bastion us-east-1 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        False         2d20h   Cluster version is quay.io/openshift-release-dev/ocp-release:4.1.2
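
The concatenation suggested above could look roughly like the following. This only illustrates the proposal and is not the CVO's code; `summarize` and its parameters are hypothetical:

```python
def summarize(version: str, error: str = "") -> str:
    """Build a status line that always includes the error when one is present."""
    base = f"Cluster version is {version}"
    if error:
        # Append the error so the line can no longer read as plain success.
        return f"{base}; {error}"
    return base
```

With an empty `error` the line is unchanged from today's success message; with an error set, both halves appear together.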

