Bug 1763295

Summary: SyncingFailed and other waitForDeploymentRollout consumers often show only 'timed out waiting for the condition'
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Cloud Compute Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers QA Contact: Jianwei Hou <jhou>
Status: CLOSED WONTFIX Docs Contact:
Severity: unspecified    
Priority: unspecified CC: agarcial, jhou, lmohanty, mgugino, wking
Version: 4.2.z Keywords: Reopened
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1763293 Environment:
Last Closed: 2020-05-19 13:01:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1763293, 1763772    
Bug Blocks:    

Description W. Trevor King 2019-10-18 17:15:15 UTC
+++ This bug was initially created as a clone of Bug #1763293 +++

We currently get errors like [1]:

  Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

But "timed out waiting for the condition" is not as useful as whatever it was that made us decide the last poll wasn't good enough.  We should put that reason in the message instead.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809

Comment 1 W. Trevor King 2020-04-23 02:04:42 UTC
Not sure why I was the assignee here.

Comment 3 Michael Gugino 2020-05-18 22:26:23 UTC
The operator appeared to be working fine.  machine-controller (container) was in crashloop backoff in the replacement deployment.  This is likely due to some misconfiguration in a build that got promoted at some point.

Unfortunately, in all of the linked cases the must-gather does not include any logs for the pods in crash loop backoff.  This is a bug in must-gather functionality, I think.  I have seen this before with another failed build (a semver was not parsable, causing the controller to fail immediately upon startup).

I will open another bug for gathering the failed container logs in must-gather.

Comment 4 W. Trevor King 2020-05-18 23:10:59 UTC
> The operator appeared to be working fine.

It may be.  The point of this bug is that "timed out waiting for the condition" is a garbage message.  The operator should at least tell us what it was waiting for that timed out.  And ideally, give some hints about the impact of the degraded condition and suggest some mitigation steps, although I don't think we need to be that good before we can close this bug.

Comment 5 Michael Gugino 2020-05-18 23:14:40 UTC
(In reply to W. Trevor King from comment #4)
> > The operator appeared to be working fine.
> 
> It may be.  The point of this bug is that "timed out waiting for the
> condition" is a garbage message.  The operator should at least tell us what
> it was waiting for that timed out.  And ideally, give some hints about the
> impact of the degraded condition and suggest some mitigation steps, although
> I don't think we need to be that good before we can close this bug.

Okay, I think that is a good assessment.  I changed the target release to 4.6; I think we should have a clearer message here as well.

Comment 6 Alberto 2020-05-19 07:16:03 UTC
This bug is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1763293 which was addressed by https://github.com/openshift/machine-api-operator/pull/417 which is included in >=4.3.
Moving to 4.2 to backport.

Comment 7 Michael Gugino 2020-05-19 13:01:28 UTC
Ok, if this is just a clone of something already fixed, we don't need to backport to 4.2 at this point.  This is just cleaning up an error message for something that otherwise works.