Description of problem:

After upgrading from 4.1.0-0.nightly-2019-05-17-041605 to 4.1.0-0.nightly-2019-05-18-050636 I am running oc get clusterversion every minute. Every 3-4 hours the cluster goes through a period where it reports that the update could not be applied. In between it reports good status. Some snippets:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         9h      Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         12h     Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         16h     Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied

In between it reports like this:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         16h     Cluster version is 4.1.0-0.nightly-2019-05-18-050636

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-18-050636

How reproducible:
Unknown - 1/1 so far.

Steps to Reproduce:
1. Install 4.1.0-0.nightly-2019-05-17-041605
2. Upgrade to 4.1.0-0.nightly-2019-05-18-050636
3. Run oc get clusterversion every minute and watch the status - especially many hours after the upgrade purportedly succeeds (a minimal polling loop is sketched below)

Actual results:
Error while reconciling 4.1.0-0.nightly-2019-05-18-050636: the update could not be applied

Expected results:
No oc get clusterversion errors

Additional info:
oc adm must-gather output will be attached.
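For step 3, a minimal polling loop along these lines does the job (the interval and timestamp format here are illustrative, not exactly what I ran):

    # Sample ClusterVersion once a minute, timestamping each sample so the
    # flapping windows can be correlated with operator logs later
    while true; do
      date -u +%FT%TZ
      oc get clusterversion version
      sleep 60
    done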
Unfortunately the must-gather logs aren't going to contain any actionable info in this situation. This is what we need to fix (though in 4.1.z). At the same time, we'll likely improve the wording in the CVO to make it clearer that the problem is that another operator is flapping.
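For anyone hitting this before the wording improves, the flapping operator can usually be identified directly from the ClusterOperator conditions; for example (cloud-credential is just the operator implicated in this particular bug):

    # List all cluster operators; a flapping one shows recent condition transitions
    oc get clusteroperators

    # Inspect the conditions of a suspect operator, e.g. cloud-credential
    oc get clusteroperator cloud-credential -o yaml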
> At the same time, we'll likely improve the wording in the CVO to make it clearer that the problem is that another operator is flapping.

Some initial groundwork for this is in https://github.com/openshift/cluster-version-operator/pull/194
Created attachment 1581485: 4.1.2 listings
Based on https://bugzilla.redhat.com/attachment.cgi?id=1581485, the CVO is correctly reporting that it's failing to make progress on reconcile due to the cloud-credential operator. The summary in `oc get clusterversion version` cannot be all-encompassing; it provides enough detail to send you to the actual object for the specifics. I would like to see concrete examples of status updates in the object, contrasted with the message users expect to see.
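For the record, the details behind the one-line summary live in status.conditions on the object; a couple of illustrative ways to pull them out:

    # Full object, including status.conditions with type/status/reason/message
    oc get clusterversion version -o yaml

    # Just the message on the Failing condition
    oc get clusterversion version \
      -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'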
It seems like https://bugzilla.redhat.com/show_bug.cgi?id=1714484 was the root cause of my cloud credential operator failing, so the reconciling error message was valid.

An area to consider is the message "Cluster version is quay.io/openshift-release-dev/ocp-release:4.1.2", which reads as success to a typical user and seems to contradict "Error while reconciling 4.1.2: the update could not be applied". Consistently displaying the error message, or concatenating the two messages, would have left no room for misunderstanding. Since there is no ERROR column in the clusterversion output, this message presently serves as an important signal for a human operator sanity-checking the CVO's state.

[ec2-user us-east-1 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        False         2d20h   Cluster version is quay.io/openshift-release-dev/ocp-release:4.1.2
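Until the server-side wording changes, a client-side one-liner can approximate the concatenated view suggested above (the output format here is my own, not anything the CVO emits):

    # Print the desired version alongside the Failing condition so the error
    # stays visible even when the STATUS column reads as success
    oc get clusterversion version -o jsonpath='{.status.desired.version}{" Failing="}{.status.conditions[?(@.type=="Failing")].status}{" - "}{.status.conditions[?(@.type=="Failing")].message}{"\n"}'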
This is working as intended.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.