+++ This bug was initially created as a clone of Bug #1763293 +++
We currently get errors like [1]:
  Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition
But "timed out waiting for the condition" is not as useful as whatever it was that made us decide the last poll wasn't good enough. We should put that reason in the message instead.
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809
Not sure why I was the assignee here.
Saw this in the upgrade tests below while testing an upgrade to 4.2.30:
[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/669
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/668
[3] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/667
The operator appeared to be working fine. The machine-controller container was in crash loop backoff in the replacement deployment. This is likely due to some misconfiguration in a build that got promoted at some point. Unfortunately, in all of the linked cases the must-gather does not include any logs for the pods in crash loop backoff. I think this is a bug in must-gather. I have seen this before with another failed build (the semver was not parsable, which caused the controller to fail immediately on startup). I will open another bug for gathering the failed container logs in must-gather.
> The operator appeared to be working fine.
It may be. The point of this bug is that "timed out waiting for the condition" is a garbage message. The operator should at least tell us what it was waiting for that timed out. And ideally, give some hints about the impact of the degraded condition and suggest some mitigation steps, although I don't think we need to be that good before we can close this bug.
(In reply to W. Trevor King from comment #4)
> > The operator appeared to be working fine.
>
> It may be. The point of this bug is that "timed out waiting for the
> condition" is a garbage message. The operator should at least tell us what
> it was waiting for that timed out. And ideally, give some hints about the
> impact of the degraded condition and suggest some mitigation steps, although
> I don't think we need to be that good before we can close this bug.
Okay, I think that is a good assessment. I changed the target release to 4.6; I think we should have a clearer message here as well.
This bug is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1763293, which was addressed by https://github.com/openshift/machine-api-operator/pull/417; that fix is included in >=4.3. Moving to 4.2 to backport.
OK, if this is just a clone of something already fixed, we don't need to backport to 4.2 at this point. This only cleans up an error message for something that otherwise works.