1691089 – Unexpected error encountered in Namespaces [Serial] should always delete fast

Summary: Unexpected error encountered in Namespaces [Serial] should always delete fast

Keywords:
Status:	CLOSED DUPLICATE of bug 1691055
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Tomáš Nožička
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-03-20 20:26 UTC by ewolinet
Modified:	2019-04-11 14:00 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-04-11 13:55:49 UTC
Target Upstream Version:
Embargoed:

Attachments
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (337.05 KB, image/svg+xml) 2019-03-22 06:07 UTC, W. Trevor King

Description ewolinet 2019-03-20 20:26:25 UTC

Failure encountered in conformance test

fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:49]: Expected error:
    <*errors.errorString | 0xc420327af0>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
not to have occurred


Ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Comment 1 W. Trevor King 2019-03-22 06:07:51 UTC

Created attachment 1546782
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed: .*ALL of 100 namespaces in 150 seconds'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 2 Tomáš Nožička 2019-03-22 11:00:31 UTC

The apiserver failed; the particular e2e flake is irrelevant. Likely cert rotation.

Comment 3 Tomáš Nožička 2019-03-22 12:53:45 UTC


*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 4 Jan Safranek 2019-04-11 09:29:39 UTC

While the bug could be caused by cert rotation or any other API server failure, the test framework should retry (as any decent API server client).

The root cause is that the test framework uses `UntilWithoutRetry` to wait for ServiceAccount creation in a new namespace. As the name suggests, `UntilWithoutRetry` does not retry on errors, so random tests flake during initial namespace creation.

func waitForServiceAccountInNamespace(c clientset.Interface, ns, serviceAccountName string, timeout time.Duration) error {
...
	_, err = watchtools.UntilWithoutRetry(ctx, w, conditions.ServiceAccountHasSecrets)
	return err
}

The test framework should use something less error-prone, like `UntilWithSync`. Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.
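
A minimal, self-contained sketch of the behaviour described in this comment, using only the Go standard library. It is not the client-go implementation: `startWatch`, `untilNoRetry`, `untilWithRetry`, `flakyServer` and the `"secret-created"` event are hypothetical names invented for illustration, and the real `UntilWithSync` is built differently (on an informer, as far as I understand) rather than as a plain retry loop. The sketch only shows why a single watch that the server closes early fails the wait, while re-opening the watch survives the same hiccup.

```go
package main

import (
	"errors"
	"fmt"
)

// errWatchClosed is an illustrative error, loosely modelled on the
// "watch closed before ... timeout" failure in the description.
var errWatchClosed = errors.New("watch closed before condition was met")

// startWatch is a hypothetical stand-in for opening a watch: it returns
// a channel of events that the "server" may close at any time.
type startWatch func() <-chan string

// untilNoRetry consumes exactly one watch and fails as soon as that
// watch closes without the condition having been met.
func untilNoRetry(start startWatch, done func(string) bool) error {
	for ev := range start() {
		if done(ev) {
			return nil
		}
	}
	return errWatchClosed
}

// untilWithRetry re-opens the watch up to maxAttempts times, so a single
// closed watch does not fail the whole wait.
func untilWithRetry(start startWatch, done func(string) bool, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := untilNoRetry(start, done); err == nil {
			return nil
		}
	}
	return errWatchClosed
}

// flakyServer simulates a control plane hiccup: the first `failures`
// watches are closed without delivering any event; later watches
// deliver one "secret-created" event and then close.
func flakyServer(failures int) startWatch {
	calls := 0
	return func() <-chan string {
		calls++
		ch := make(chan string, 1)
		if calls > failures {
			ch <- "secret-created"
		}
		close(ch)
		return ch
	}
}

// hasSecret is the (hypothetical) condition being waited for.
func hasSecret(ev string) bool { return ev == "secret-created" }

func main() {
	fmt.Println("no retry:  ", untilNoRetry(flakyServer(1), hasSecret))
	fmt.Println("with retry:", untilWithRetry(flakyServer(1), hasSecret, 3))
}
```

With one simulated hiccup, the non-retrying wait returns the "watch closed" error and the retrying wait succeeds on its second attempt. A real client would also need to resume from a known resource version or re-list so that events are not missed between watches, which this sketch leaves out.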

Comment 5 Tomáš Nožička 2019-04-11 13:55:49 UTC

This is not the root cause of the error, just a manifestation of an unstable control plane.

We wait for a stable control plane before running the suite, and the hiccups or rollovers are bugs. The whole test suite can't restart watches. We have started fixing this upstream, but at best it needs kube 1.14 for the apimachinery. Some PRs will land in a later kube. Fixing those also doesn't have priority at this point, as it is not the actual failure.

In that CI log you can see the control plane hiccup. Reporting any sort of watch error when the control plane failed during e2e is not needed; it is nice to have, but the test would have failed anyway on the operator state.

*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 6 Tomáš Nožička 2019-04-11 14:00:49 UTC

> Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.

It was called `Until`, but the code is the same. I have renamed it to `UntilWithoutRetry` so people realize what it actually does. 

I think I've actually fixed this one somewhere around the 1.14 timeframe with UntilWithSync.
