Bug 1691089 - Unexpected error encountered in Namespaces [Serial] should always delete fast
Summary: Unexpected error encountered in Namespaces [Serial] should always delete fast
Keywords:
Status: CLOSED DUPLICATE of bug 1691055
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Tomáš Nožička
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-20 20:26 UTC by ewolinet
Modified: 2019-04-11 14:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-11 13:55:49 UTC
Target Upstream Version:
Embargoed:


Attachments
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (337.05 KB, image/svg+xml)
2019-03-22 06:07 UTC, W. Trevor King

Description ewolinet 2019-03-20 20:26:25 UTC
Failure encountered in conformance test

fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:49]: Expected error:
    <*errors.errorString | 0xc420327af0>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
not to have occurred


Ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Comment 1 W. Trevor King 2019-03-22 06:07:51 UTC
Created attachment 1546782 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed: .*ALL of 100 namespaces in 150 seconds'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 2 Tomáš Nožička 2019-03-22 11:00:31 UTC
The apiserver failed; the particular e2e flake is irrelevant. Likely cert rotation.

Comment 3 Tomáš Nožička 2019-03-22 12:53:45 UTC

*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 4 Jan Safranek 2019-04-11 09:29:39 UTC
While the bug could be caused by cert rotation or any other API server failure, the test framework should retry (as any decent API server client).

The root cause is that the test framework uses `UntilWithoutRetry` to wait for ServiceAccount creation in a new namespace. As the name suggests, `UntilWithoutRetry` does not retry on errors, so random tests flake during initial namespace creation.

func waitForServiceAccountInNamespace(c clientset.Interface, ns, serviceAccountName string, timeout time.Duration) error {
...
	_, err = watchtools.UntilWithoutRetry(ctx, w, conditions.ServiceAccountHasSecrets)
	return err
}

The test framework should use something less error-prone, like `UntilWithSync`. Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.
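
The difference between failing on the first closed watch and re-establishing it can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses the Go standard library alone, and the names `event`, `untilWithoutRetry`, `untilWithRetry` and `flakyWatcher` are hypothetical stand-ins, not client-go APIs. It is also a plain re-watch loop, not the informer-based mechanism behind `UntilWithSync`; a real client would additionally have to re-list so that events missed between watches are not lost.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a hypothetical stand-in for a watch event; ready reports whether
// the awaited condition (e.g. "ServiceAccount has secrets") is met.
type event struct{ ready bool }

var errWatchClosed = errors.New("watch closed before condition was met")

// untilWithoutRetry consumes a single watch channel. If the channel closes
// before the condition holds, it returns an error instead of re-watching.
func untilWithoutRetry(events <-chan event) error {
	for ev := range events {
		if ev.ready {
			return nil
		}
	}
	return errWatchClosed
}

// untilWithRetry opens a fresh watch up to maxWatches times and succeeds as
// soon as one of them observes the condition.
func untilWithRetry(newWatch func() <-chan event, maxWatches int) error {
	err := errWatchClosed
	for i := 0; i < maxWatches; i++ {
		if err = untilWithoutRetry(newWatch()); err == nil {
			return nil
		}
	}
	return err
}

// flakyWatcher simulates a control plane hiccup: the first watch closes
// without delivering the ready event, later watches deliver it.
func flakyWatcher() func() <-chan event {
	calls := 0
	return func() <-chan event {
		calls++
		ch := make(chan event, 2)
		ch <- event{ready: false}
		if calls > 1 {
			ch <- event{ready: true}
		}
		close(ch)
		return ch
	}
}

func main() {
	fmt.Println("without retry:", untilWithoutRetry(flakyWatcher()()))
	fmt.Println("with retry:", untilWithRetry(flakyWatcher(), 3))
}
```

In this simulation the single-watch variant surfaces the closed watch as an error, while the retrying variant succeeds on its second watch.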

Comment 5 Tomáš Nožička 2019-04-11 13:55:49 UTC
This is not the root cause of the error, just a manifestation of a non-stable control plane.

We wait for a stable control plane before running the suite, and the hiccups or rollovers are bugs. The whole test suite can't restart watches. We have started fixing this upstream, but at best it needs kube 1.14 for the apimachinery. Some PRs will land in later kube. Fixing those also doesn't have priority at this point, as it is not the actual failure.

In that CI log you can see the control plane hiccup; reporting any sort of watch error when the control plane failed during e2e is not needed. It is nice to have, but the test would have failed anyway on the operator state.

*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 6 Tomáš Nožička 2019-04-11 14:00:49 UTC
> Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.

It was called `Until`, but the code is the same. I renamed it to `UntilWithoutRetry` so people realize what it actually does.

I think I actually fixed this one somewhere around the 1.14 timeframe with `UntilWithSync`.

