Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1691089

Summary:

Unexpected error encountered in Namespaces [Serial] should always delete fast

Product:

OpenShift Container Platform

Reporter:

ewolinet

Component:

Master

Assignee:

Tomáš Nožička <tnozicka>

Status:

CLOSED DUPLICATE

QA Contact:

Xingxing Xia <xxia>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

4.1.0

CC:

aos-bugs, bparees, jokerman, jsafrane, maszulik, mmccomas

Target Milestone:

---

Keywords:

Reopened

Target Release:

4.1.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-04-11 13:55:49 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC	none

Description ewolinet 2019-03-20 20:26:25 UTC

Failure encountered in conformance test

fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:49]: Expected error:
    <*errors.errorString | 0xc420327af0>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
not to have occurred


Ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Comment 1 W. Trevor King 2019-03-22 06:07:51 UTC

Created attachment 1546782 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed: .*ALL of 100 namespaces in 150 seconds'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 2 Tomáš Nožička 2019-03-22 11:00:31 UTC

The apiserver failed; the particular e2e flake is irrelevant. Likely cert rotation.

Comment 3 Tomáš Nožička 2019-03-22 12:53:45 UTC


*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 4 Jan Safranek 2019-04-11 09:29:39 UTC

While the bug could be caused by cert rotation or any other API server failure, the test framework should retry (as any decent API server client would).

The root cause is that the test framework uses `UntilWithoutRetry` to wait for ServiceAccount creation in a new namespace. As the name suggests, `UntilWithoutRetry` does not retry on errors, so random tests flake during initial namespace creation.

func waitForServiceAccountInNamespace(c clientset.Interface, ns, serviceAccountName string, timeout time.Duration) error {
...
	_, err = watchtools.UntilWithoutRetry(ctx, w, conditions.ServiceAccountHasSecrets)
	return err
}

The test framework should use something less error-prone, like `UntilWithSync`. Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.
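
To illustrate the difference being discussed, here is a minimal stdlib-only sketch, not the client-go implementation. The names `event`, `untilOnce`, `untilWithRestart` and `flakyOpener` are hypothetical and only model the behaviour: a single watch that fails as soon as the stream closes, versus a wrapper that re-opens the watch. It leaves out timeouts, re-listing and resource version tracking, which a real informer-based helper would need.

```go
package main

import (
	"errors"
	"fmt"
)

// errWatchClosed models the failure in this report: the event stream ended
// before the condition was satisfied.
var errWatchClosed = errors.New("watch closed before condition was met")

// event is a hypothetical stand-in for a watch event; it is not a client-go type.
type event struct {
	name       string
	hasSecrets bool
}

// untilOnce consumes a single watch and gives up as soon as the channel
// closes, which is the no-retry behaviour described above.
func untilOnce(events <-chan event, cond func(event) bool) error {
	for ev := range events {
		if cond(ev) {
			return nil
		}
	}
	return errWatchClosed
}

// untilWithRestart re-opens the watch when it closes early, up to maxWatches
// attempts. This is an assumption-laden simplification of what a retrying
// helper does.
func untilWithRestart(open func() <-chan event, cond func(event) bool, maxWatches int) error {
	err := errWatchClosed
	for i := 0; i < maxWatches; i++ {
		if err = untilOnce(open(), cond); err == nil {
			return nil
		}
	}
	return err
}

// flakyOpener returns an opener whose first watch closes without delivering
// any event (simulating an API server hiccup); later watches deliver a
// ServiceAccount-like event that satisfies the condition.
func flakyOpener() func() <-chan event {
	calls := 0
	return func() <-chan event {
		calls++
		ch := make(chan event, 1)
		if calls > 1 {
			ch <- event{name: "default", hasSecrets: true}
		}
		close(ch)
		return ch
	}
}

// hasSecrets is the condition the waiter is looking for.
func hasSecrets(ev event) bool { return ev.hasSecrets }

func main() {
	fmt.Println("single watch:", untilOnce(flakyOpener()(), hasSecrets))
	fmt.Println("with restart:", untilWithRestart(flakyOpener(), hasSecrets, 3))
}
```

In this model the single watch reports the closed-watch error on the first hiccup, while the restarting variant succeeds on its second watch.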

Comment 5 Tomáš Nožička 2019-04-11 13:55:49 UTC

This is not the root cause of the error, just a manifestation of an unstable control plane.

We wait for a stable control plane before running the suite, and the hiccups or rollovers are bugs. The whole test suite can't restart watches. We have started fixing this upstream, but at best it needs kube 1.14 for the apimachinery. Some PRs will land in later kube releases. Fixing those also doesn't have priority at this point, as it is not the actual failure.

In that CI log you can see the control plane hiccup. Reporting any sort of watch error when the control plane failed during e2e is not needed. It would be nice to have, but the test would have failed anyway on the operator state.

*** This bug has been marked as a duplicate of bug 1691055 ***

Comment 6 Tomáš Nožička 2019-04-11 14:00:49 UTC

> Question is, why there is UntilWithoutRetry at all, there must have been a good reason for that.

It was called `Until`, but the code is the same. I have renamed it to `UntilWithoutRetry` so people realize what it actually does. 

I think I actually fixed this one somewhere around the 1.14 timeframe with `UntilWithSync`.