Failure encountered in conformance test:

fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:49]: Expected error:
    <*errors.errorString | 0xc420327af0>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
not to have occurred

Ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103
Created attachment 1546782
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours. Generated with [1]:

    $ deck-build-log-plot 'failed: .*ALL of 100 namespaces in 150 seconds'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
The apiserver failed; the particular e2e flake is irrelevant. Likely cert rotation.
*** This bug has been marked as a duplicate of bug 1691055 ***
While the bug could be caused by cert rotation or any other API server failure, the test framework should retry, as any decent API server client does. The root cause is that the test framework uses `UntilWithoutRetry` to wait for ServiceAccount creation in a new namespace. As the name suggests, `UntilWithoutRetry` does not retry on errors, so random tests flake during initial namespace creation:

    func waitForServiceAccountInNamespace(c clientset.Interface, ns, serviceAccountName string, timeout time.Duration) error {
        ...
        _, err = watchtools.UntilWithoutRetry(ctx, w, conditions.ServiceAccountHasSecrets)
        return err
    }

The test framework should use something less error-prone, like `UntilWithSync`. The question is why `UntilWithoutRetry` exists at all; there must have been a good reason for it.
This is not the root cause of the error, just a manifestation of an unstable control plane. We wait for a stable control plane before running the suite, and hiccups or rollovers are bugs. The test suite as a whole can't restart watches. We have started fixing this upstream, but at best it needs kube 1.14 for the apimachinery; some PRs will land in later kube releases. Fixing those also doesn't have priority at this point, as it is not the actual failure. In that CI log you can see the control plane hiccup; reporting every watch error when the control plane failed during e2e is not needed. It would be nice to have, but the test would have failed anyway on the operator state.

*** This bug has been marked as a duplicate of bug 1691055 ***
> Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.

It was called `Until`, but the code is the same. I renamed it to `UntilWithoutRetry` so people realize what it actually does. I think I actually fixed this one somewhere around the 1.14 timeframe with `UntilWithSync`.