Bug 1763295

Summary:	SyncingFailed and other waitForDeploymentRollout consumers often show only 'timed out waiting for the condition'
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Cloud Compute	Assignee:	Alberto <agarcial>
Cloud Compute sub component:	Other Providers	QA Contact:	Jianwei Hou <jhou>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	unspecified
Priority:	unspecified	CC:	agarcial, jhou, lmohanty, mgugino, wking
Version:	4.2.z	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.2.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1763293	Environment:
Last Closed:	2020-05-19 13:01:28 UTC	Type:	---
Bug Depends On:	1763293, 1763772
Bug Blocks:

Description W. Trevor King 2019-10-18 17:15:15 UTC

+++ This bug was initially created as a clone of Bug #1763293 +++

We currently get errors like [1]:

  Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

But "timed out waiting for the condition" is not as useful as whatever it was that made us decide the last poll wasn't good enough.  We should put that reason in the message instead.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809

Comment 1 W. Trevor King 2020-04-23 02:04:42 UTC

Not sure why I was the assignee here.

Comment 2 Lalatendu Mohanty 2020-04-23 12:34:42 UTC

Saw this in the upgrade tests below, which test upgrading to 4.2.30:

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/669
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/668
[3] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/667

Comment 3 Michael Gugino 2020-05-18 22:26:23 UTC

The operator appeared to be working fine.  machine-controller (container) was in crashloop backoff in the replacement deployment.  This is likely due to some misconfiguration in a build that got promoted at some point.

Unfortunately, in all of the linked cases the must-gather does not include any logs for the pods in crash loop backoff.  This is a bug in must-gather functionality, I think.  I have seen this before with another failed build (semver was not parsable, causing the controller to fail immediately upon startup).

I will open another bug for gathering the failed containers' logs in must-gather.

Comment 4 W. Trevor King 2020-05-18 23:10:59 UTC

> The operator appeared to be working fine.

It may be.  The point of this bug is that "timed out waiting for the condition" is a garbage message.  The operator should at least tell us what it was waiting for that timed out.  And ideally, give some hints about the impact of the degraded condition and suggest some mitigation steps, although I don't think we need to be that good before we can close this bug.

Comment 5 Michael Gugino 2020-05-18 23:14:40 UTC

(In reply to W. Trevor King from comment #4)
> > The operator appeared to be working fine.
> 
> It may be.  The point of this bug is that "timed out waiting for the
> condition" is a garbage message.  The operator should at least tell us what
> it was waiting for that timed out.  And ideally, give some hints about the
> impact of the degraded condition and suggest some mitigation steps, although
> I don't think we need to be that good before we can close this bug.

Okay, I think that is a good assessment.  I changed the target release to 4.6.  I agree that we should have a clearer message here.

Comment 6 Alberto 2020-05-19 07:16:03 UTC

This bug is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1763293 which was addressed by https://github.com/openshift/machine-api-operator/pull/417 which is included in >=4.3.
Moving to 4.2 to backport.

Comment 7 Michael Gugino 2020-05-19 13:01:28 UTC

Ok, if this is just a clone of something already fixed, we don't need to backport to 4.2 at this point.  This is just cleaning up an error message for something that otherwise works.