Bug 1691089

Summary: Unexpected error encountered in Namespaces [Serial] should always delete fast
Product: OpenShift Container Platform
Reporter: ewolinet
Component: Master
Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, bparees, jokerman, jsafrane, maszulik, mmccomas
Target Milestone: ---
Keywords: Reopened
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-04-11 13:55:49 UTC
Type: Bug
Attachments:
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (flags: none)

Description ewolinet 2019-03-20 20:26:25 UTC
Failure encountered in conformance test

fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:49]: Expected error:
    <*errors.errorString | 0xc420327af0>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
not to have occurred


Ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Comment 1 W. Trevor King 2019-03-22 06:07:51 UTC
Created attachment 1546782 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed: .*ALL of 100 namespaces in 150 seconds'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 2 Tomáš Nožička 2019-03-22 11:00:31 UTC
The apiserver failed; the particular e2e flake is irrelevant. Likely cert rotation.

Comment 3 Tomáš Nožička 2019-03-22 12:53:45 UTC

*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 4 Jan Safranek 2019-04-11 09:29:39 UTC
While the bug could be caused by cert rotation or any other API server failure, the test framework should retry (as any decent API server client would).

The root cause is that the test framework uses `UntilWithoutRetry` to wait for ServiceAccount creation in a new namespace. As the name suggests, `UntilWithoutRetry` does not retry on errors, so random tests flake during initial namespace creation.

func waitForServiceAccountInNamespace(c clientset.Interface, ns, serviceAccountName string, timeout time.Duration) error {
...
	_, err = watchtools.UntilWithoutRetry(ctx, w, conditions.ServiceAccountHasSecrets)
	return err
}

The test framework should use something less error-prone, like `UntilWithSync`. Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.

Comment 5 Tomáš Nožička 2019-04-11 13:55:49 UTC
This is not the root cause of the error, just a manifestation of an unstable control plane.

We wait for a stable control plane before running the suite, so the hiccups or rollovers are bugs. The whole test suite can't restart watches. We have started fixing this upstream, but at best it needs kube 1.14 for the apimachinery. Some PRs will land in later kube releases. Fixing those also doesn't have priority at this point, as it is not the actual failure.

In that CI log you can see the control plane hiccup. Reporting any sort of watch error when the control plane failed during e2e is not needed. It is nice to have, but the test would have failed anyway on the operator state.

*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 6 Tomáš Nožička 2019-04-11 14:00:49 UTC
> Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.

It was called `Until`, but the code is the same. I have renamed it to `UntilWithoutRetry` so people realize what it actually does. 

I think I actually fixed this one somewhere around the 1.14 timeframe with UntilWithSync.