Bug 1929248 - API priority test case flaking in CI [NEEDINFO]
Summary: API priority test case flaking in CI
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks: 1929012
 
Reported: 2021-02-16 14:44 UTC by Abu Kashem
Modified: 2021-04-28 11:02 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1929012
Environment:
Last Closed: 2021-04-28 11:02:34 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?



Description Abu Kashem 2021-02-16 14:44:49 UTC
+++ This bug was initially created as a clone of Bug #1929012 +++

Description of problem:

A frequent flake in CI happens around API priority and fairness. Two cases that seem related fail periodically:

  openshift-tests.[sig-api-machinery] API priority and fairness should ensure that requests can be classified by testing flow-schemas/priority-levels [Suite:openshift/conformance/parallel] [Suite:k8s]

You can see it in this job:

  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#release-openshift-origin-installer-e2e-gcp-4.7

Sometimes the test fails on the first attempt and succeeds on the second, so the job ends up passing, but it does
sometimes fail back to back, which marks the job as failed.

It seems to show up in ~5% of failing jobs, going by this search:
https://search.ci.openshift.org/?search=API+priority+and+fairness+should+ensure+that+requests+can+be+classified+by+testing+flow-schemas%2Fpriority-levels&maxAge=168h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job
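
For context: the APF filter in kube-apiserver stamps every response with X-Kubernetes-PF-FlowSchema-UID and X-Kubernetes-PF-PriorityLevel-UID headers, which is roughly what this test checks. A minimal client-go sketch (not the upstream e2e test itself; the headerCapture wrapper and file name are illustrative, and it assumes a kubeconfig at the default location) to observe those headers for an arbitrary request:

  // apfheaders.go - a minimal sketch of observing how a request was classified:
  // the APF filter in kube-apiserver sets the X-Kubernetes-PF-FlowSchema-UID and
  // X-Kubernetes-PF-PriorityLevel-UID headers on every response.
  package main

  import (
    "context"
    "fmt"
    "net/http"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
  )

  // headerCapture records the APF classification headers of the last response.
  // It is an illustrative helper, not part of client-go.
  type headerCapture struct {
    rt               http.RoundTripper
    flowSchemaUID    string
    priorityLevelUID string
  }

  func (c *headerCapture) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := c.rt.RoundTrip(req)
    if resp != nil {
      c.flowSchemaUID = resp.Header.Get("X-Kubernetes-PF-FlowSchema-UID")
      c.priorityLevelUID = resp.Header.Get("X-Kubernetes-PF-PriorityLevel-UID")
    }
    return resp, err
  }

  func main() {
    // Assumes a kubeconfig at the default location (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
      panic(err)
    }
    capture := &headerCapture{}
    cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
      capture.rt = rt
      return capture
    })
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
      panic(err)
    }
    // Any request goes through APF; listing namespaces is just a convenient example.
    if _, err := client.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{}); err != nil {
      panic(err)
    }
    fmt.Println("FlowSchema UID:    ", capture.flowSchemaUID)
    fmt.Println("PriorityLevel UID: ", capture.priorityLevelUID)
  }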


I don't know the root cause, but looking at this job:

  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361412835921367040

and taking its must-gather:

  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361412835921367040/artifacts/e2e-gcp/must-gather.tar

and searching around the time frame of the first failure ("21:21:22"), I found this log snippet:

  2021-02-15T21:21:22.730001901Z I0215 21:21:22.729862      19 trace.go:205] Trace[179810853]: "Create" url:/apis/rbac.authorization.k8s.io/v1/namespaces/e2e-apf-8687/rolebindings,user-agent:openshift-controller-manager/v0.0.0 (linux/amd64) kubernetes/$Format/system:serviceaccount:openshift-infra:default-rolebindings-controller,client:10.0.0.5 (15-Feb-2021 21:21:22.215) (total time: 514ms):
  2021-02-15T21:21:22.730001901Z Trace[179810853]: ---"Object stored in database" 514ms (21:21:00.729)
  2021-02-15T21:21:22.730001901Z Trace[179810853]: [514.715556ms] [514.715556ms] END
  2021-02-15T21:21:25.313279055Z E0215 21:21:25.313135      19 wrap.go:54] timeout or abort while handling: GET "/apis/oauth.openshift.io/v1/oauthclients"

I think that is the request having trouble, and the "timeout or abort" message might be a clue. That log file is:
  namespaces/openshift-kube-apiserver/pods/kube-apiserver-ci-op-zr3hl6j2-c38ab-8njmq-master-0/kube-apiserver/kube-apiserver/logs/current.log

I didn't get much further than that.
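
For anyone repeating this kind of must-gather search with something other than grep, here is a small, hypothetical helper (the flag names and defaults are made up for illustration) that prints matching lines within a time window, assuming the CRI-style RFC3339Nano timestamp prefix visible in the snippet above:

  // scanlogs.go - a hypothetical helper (not tooling referenced in this bug) that
  // mirrors the manual search above: print lines from a must-gather container log
  // that fall inside a time window and contain a substring.
  package main

  import (
    "bufio"
    "flag"
    "fmt"
    "os"
    "strings"
    "time"
  )

  func main() {
    file := flag.String("file", "current.log", "path to a must-gather container log")
    match := flag.String("match", "timeout or abort", "substring to look for")
    from := flag.String("from", "2021-02-15T21:21:00Z", "window start (RFC3339)")
    to := flag.String("to", "2021-02-15T21:22:00Z", "window end (RFC3339)")
    flag.Parse()

    start, err := time.Parse(time.RFC3339, *from)
    if err != nil {
      panic(err)
    }
    end, err := time.Parse(time.RFC3339, *to)
    if err != nil {
      panic(err)
    }

    f, err := os.Open(*file)
    if err != nil {
      panic(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // apiserver log lines can be long
    for scanner.Scan() {
      line := scanner.Text()
      // Each line starts with "<RFC3339Nano timestamp> <message>".
      parts := strings.SplitN(line, " ", 2)
      if len(parts) != 2 {
        continue
      }
      ts, err := time.Parse(time.RFC3339Nano, parts[0])
      if err != nil {
        continue
      }
      if ts.Before(start) || ts.After(end) || !strings.Contains(parts[1], *match) {
        continue
      }
      fmt.Println(line)
    }
    if err := scanner.Err(); err != nil {
      panic(err)
    }
  }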

--- Additional comment from Michal Fojtik on 2021-02-16 08:29:12 UTC ---

Looks like this is already tracked upstream in https://github.com/kubernetes/kubernetes/issues/96803 and fixed by https://github.com/kubernetes/kubernetes/pull/96984.

As this is only a test flake, setting the severity appropriately.

Comment 3 Michal Fojtik 2021-03-18 16:20:25 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 Clayton Coleman 2021-03-30 16:10:58 UTC
I was told that this is fixed in the rebase, but it is still flaking.

Comment 5 Michal Fojtik 2021-03-30 16:23:19 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 6 Maciej Szulik 2021-04-08 09:48:54 UTC
I'm bumping this to high, since this has to be fixed for 4.8 right after we land https://github.com/openshift/origin/pull/26054.

Comment 7 Maciej Szulik 2021-04-28 11:02:34 UTC
https://github.com/openshift/origin/pull/26056 didn't merge so the test wasn't disabled. Closing.

