Bug 1763295 - SyncingFailed and other waitForDeploymentRollout consumers often show only 'timed out waiting for the condition'
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.z
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On: 1763293 1763772
Blocks:
Reported: 2019-10-18 17:15 UTC by W. Trevor King
Modified: 2020-05-19 13:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1763293
Environment:
Last Closed: 2020-05-19 13:01:28 UTC
Target Upstream Version:
Embargoed:


Description W. Trevor King 2019-10-18 17:15:15 UTC
+++ This bug was initially created as a clone of Bug #1763293 +++

We currently get errors like [1]:

  Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

But "timed out waiting for the condition" is not as useful as whatever it was that made us decide the last poll wasn't good enough.  We should put that reason in the message instead.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809
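
To illustrate the request above, here is a minimal sketch of a poll loop that records why the last poll failed and appends that reason to the timeout error. It is not the operator's actual code: pollWithReason, conditionFunc, errTimeout, and the example reason string are all hypothetical names chosen for illustration, and only the Go standard library is used.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout reuses the generic message quoted in this bug (hypothetical variable).
var errTimeout = errors.New("timed out waiting for the condition")

// conditionFunc is a hypothetical check. It reports whether the rollout is
// done and, when it is not, a human-readable reason for the failed poll.
type conditionFunc func() (done bool, reason string)

// pollWithReason calls check every interval until it reports done or the
// timeout elapses. On timeout, the returned error wraps errTimeout and
// appends the reason reported by the last failed poll.
func pollWithReason(interval, timeout time.Duration, check conditionFunc) error {
	deadline := time.Now().Add(timeout)
	for {
		done, reason := check()
		if done {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("%w: %s", errTimeout, reason)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Assumed example condition: a deployment that never becomes available.
	check := func() (bool, string) {
		return false, "deployment example-controllers: 0 of 1 replicas available"
	}
	err := pollWithReason(10*time.Millisecond, 50*time.Millisecond, check)
	fmt.Println(err)
}
```

Because the error wraps errTimeout with %w, callers that only care whether a timeout happened can still use errors.Is(err, errTimeout), while the condition message gains the reason from the last poll.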

Comment 1 W. Trevor King 2020-04-23 02:04:42 UTC
Not sure why I was the assignee here.

Comment 3 Michael Gugino 2020-05-18 22:26:23 UTC
The operator appeared to be working fine.  machine-controller (container) was in crashloop backoff in the replacement deployment.  This is likely due to some misconfiguration in a build that got promoted at some point.

Unfortunately, in all of the linked cases the must-gather does not include any logs for the pods in crash loop backoff.  This is a bug in must-gather functionality, I think.  I have seen this before with another failed build (semver was not parsable, causing the controller to fail immediately upon startup).

I will open another bug for gathering the failed container logs in must-gather.

Comment 4 W. Trevor King 2020-05-18 23:10:59 UTC
> The operator appeared to be working fine.

It may be.  The point of this bug is that "timed out waiting for the condition" is a garbage message.  The operator should at least tell us what it was waiting for that timed out.  And ideally, give some hints about the impact of the degraded condition and suggest some mitigation steps, although I don't think we need to be that good before we can close this bug.

Comment 5 Michael Gugino 2020-05-18 23:14:40 UTC
(In reply to W. Trevor King from comment #4)
> > The operator appeared to be working fine.
> 
> It may be.  The point of this bug is that "timed out waiting for the
> condition" is a garbage message.  The operator should at least tell us what
> it was waiting for that timed out.  And ideally, give some hints about the
> impact of the degraded condition and suggest some mitigation steps, although
> I don't think we need to be that good before we can close this bug.

Okay, I think that is a good assessment.  I changed the target release to 4.6; I think we should have a clearer message here as well.

Comment 6 Alberto 2020-05-19 07:16:03 UTC
This bug is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1763293 which was addressed by https://github.com/openshift/machine-api-operator/pull/417 which is included in >=4.3.
Moving to 4.2 to backport.

Comment 7 Michael Gugino 2020-05-19 13:01:28 UTC
Ok, if this is just a clone of something already fixed, we don't need to backport to 4.2 at this point.  This is just cleaning up an error message for something that otherwise works.

