Bug 1929012 - API priority test case flaking in CI
Summary: API priority test case flaking in CI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1929248
Blocks:
 
Reported: 2021-02-16 01:08 UTC by jamo luhrsen
Modified: 2021-03-10 11:24 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1929248 (view as bug list)
Environment:
Last Closed: 2021-03-10 11:24:00 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 572 0 None closed Bug 1929012: UPSTREAM: 96984: APF e2e: wait for steady state before proceeding 2021-02-27 05:14:41 UTC
Red Hat Product Errata RHBA-2021:0678 0 None None None 2021-03-10 11:24:31 UTC

Description jamo luhrsen 2021-02-16 01:08:42 UTC
Description of problem:

A frequent flake in CI happens around API priority and fairness. Two cases that seem related fail periodically, including:

  openshift-tests.[sig-api-machinery] API priority and fairness should ensure that requests can be classified by testing flow-schemas/priority-levels [Suite:openshift/conformance/parallel] [Suite:k8s]

you can see it in this job:

  https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#release-openshift-origin-installer-e2e-gcp-4.7

Sometimes the test fails the first time and succeeds the second time, so our job ends up passing, but it does
fail back to back sometimes, which marks the job as failed.

It seems to be showing up in ~5% of failing jobs going by this search:
https://search.ci.openshift.org/?search=API+priority+and+fairness+should+ensure+that+requests+can+be+classified+by+testing+flow-schemas%2Fpriority-levels&maxAge=168h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job
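
For reference, the test exercises API Priority and Fairness (APF) request classification. A rough way to poke at the same classification manually on a live cluster is sketched below; the exact commands are just an illustration (not taken from the test itself), and the APF debug response headers may only be returned for sufficiently privileged users:

  # list the APF objects requests are classified against
  $ oc get flowschemas.flowcontrol.apiserver.k8s.io
  $ oc get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io

  # at high client verbosity the response headers are logged; the apiserver
  # reports the matched flow-schema / priority-level UIDs in them
  $ oc get --raw /version -v=8 2>&1 | grep -i 'x-kubernetes-pf-'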


I don't know the root cause, but looking at this job:

  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361412835921367040

and taking its must-gather:

  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361412835921367040/artifacts/e2e-gcp/must-gather.tar

and searching around the time frame of the first failure ("21:21:22"), I found this log snippet:

  2021-02-15T21:21:22.730001901Z I0215 21:21:22.729862      19 trace.go:205] Trace[179810853]: "Create" url:/apis/rbac.authorization.k8s.io/v1/namespaces/e2e-apf-8687/rolebindings,user-agent:openshift-controller-manager/v0.0.0 (linux/amd64) kubernetes/$Format/system:serviceaccount:openshift-infra:default-rolebindings-controller,client:10.0.0.5 (15-Feb-2021 21:21:22.215) (total time: 514ms):
  2021-02-15T21:21:22.730001901Z Trace[179810853]: ---"Object stored in database" 514ms (21:21:00.729)
  2021-02-15T21:21:22.730001901Z Trace[179810853]: [514.715556ms] [514.715556ms] END
  2021-02-15T21:21:25.313279055Z E0215 21:21:25.313135      19 wrap.go:54] timeout or abort while handling: GET "/apis/oauth.openshift.io/v1/oauthclients"

I think that is the request that ran into trouble, and the "timeout or abort" message might be a clue. That log file is:
  namespaces/openshift-kube-apiserver/pods/kube-apiserver-ci-op-zr3hl6j2-c38ab-8njmq-master-0/kube-apiserver/kube-apiserver/logs/current.log
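
(For anyone wanting to reproduce the log search, something along these lines should work; the directory layout inside the tarball may differ, so treat the paths as approximate:)

  $ curl -L -o must-gather.tar 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361412835921367040/artifacts/e2e-gcp/must-gather.tar'
  $ mkdir must-gather && tar -xf must-gather.tar -C must-gather
  # grep the kube-apiserver logs around the first failure timestamp
  $ find must-gather -path '*openshift-kube-apiserver*/kube-apiserver/logs/current.log' \
      -exec grep '21:21:2' {} +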

I didn't get much further than that.

Comment 1 Michal Fojtik 2021-02-16 08:29:12 UTC
Looks like this is already tracked upstream: https://github.com/kubernetes/kubernetes/issues/96803 and fixed in https://github.com/kubernetes/kubernetes/pull/96984
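
The PR title suggests the e2e test now waits for APF to reach a steady state before asserting. As an aside, a quick way to eyeball whether APF has settled on a live cluster is to look at the flowcontrol gauges; the metric names below are my assumption from the apiserver_flowcontrol_* family (not taken from the PR) and may vary by version:

$ oc get --raw /metrics | grep -E 'apiserver_flowcontrol_current_(inqueue|executing)_requests' | head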

As this is only a test flake, setting severity appropriately.

Comment 3 jamo luhrsen 2021-02-16 23:51:08 UTC
(In reply to Michal Fojtik from comment #1)
> Looks like this is already tracked upstream:
> https://github.com/kubernetes/kubernetes/issues/96803 and fixed in
> https://github.com/kubernetes/kubernetes/pull/96984
> 
> As this is only test flake, setting severity appropriately.

PR 96984 was merged a month ago. Do we have to wait for a version bump or cherry-pick before we get that fix
in our OpenShift CI?

Comment 6 Ke Wang 2021-02-19 08:08:21 UTC
$ git clone https://github.com/openshift/kubernetes

$ cd kubernetes

$ git pull
Already up to date.

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-18-110409 | grep 'hyperkube'
  hyperkube          https://github.com/openshift/kubernetes            bd9e4421804c212e6ac7ee074050096f08dda543

$ git log --date=local --pretty="%h %an %cd - %s" bd9e442 | grep '96803'

$ git log --date=local --pretty="%h %an %cd - %s" bd9e442 | grep '96984'

$ git log --date=local --pretty="%h %an %cd - %s" bd9e442 | head -3
bd9e4421804 OpenShift Merge Robot Fri Feb 12 07:05:38 2021 - Merge pull request #566 from openshift-cherrypick-robot/cherry-pick-558-to-release-4.7
128f057a8b7 Dr. Stefan Schimanski Thu Feb 11 07:33:24 2021 - UPSTREAM: <carry>: kube-apiserver: ignore SIGTERM/INT after the first one
ba455830ecb OpenShift Merge Robot Sat Feb 6 06:18:43 2021 - Merge pull request #549 from tkashem/pick-96901-4.7

From the above check, we can see that the most recent merge was on Feb 12; the upstream PRs 96803 and 96984 have not been merged into our repo github.com/openshift/kubernetes.

Hi akashem, please help merge the required PRs, thanks. I am assigning the bug back for now.

Comment 8 Abu Kashem 2021-02-19 13:35:00 UTC
kewang,
The PR merged into 4.7 branch - https://github.com/openshift/kubernetes/pull/572

Comment 9 Ke Wang 2021-02-24 11:03:48 UTC
akashem, I did a quick check and the related PR has landed in the latest OCP 4.7 payload (see below). Please move the bug to ON_QA and I will verify it, thanks.

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-22-210958 | grep hyperkube
  hyperkube      https://github.com/openshift/kubernetes      5fbfd197c16d3c5facbaa1d7b9f3ea58cf6b36e9

$ git log --date=local --pretty="%h %an %cd - %s" 5fbfd19 | grep '#572.*tkashem'
5fbfd197c16 OpenShift Merge Robot Wed Feb 17 23:21:33 2021 - Merge pull request #572 from tkashem/pick-96984-4.7

Comment 10 W. Trevor King 2021-02-27 05:14:42 UTC
Formally linking the PR and moving to MODIFIED.  ART's tooling will sweep it into ON_QA soon.

Comment 13 Ke Wang 2021-03-02 11:18:51 UTC
Did a quick check of case [1] in the results at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#release-openshift-origin-installer-e2e-gcp-4.7 after the PR merge date of Feb 17th. There are a few failed runs; checking the errors, they fall into the following two types:
1)
fail [k8s.io/kubernetes.0/test/e2e/storage/csi_mock_volume.go:1465]: failed: creating the directory: command terminated with exit code 1
Unexpected error:
    <exec.CodeExitError>: {
        Err: {
            s: "command terminated with exit code 1",
        },
        Code: 1,
    }
    command terminated with exit code 1
occurred

2)
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Feb 21 12:15:20.601: matching user doesnt received UID for the testing priority-level and flow-schema

The following error from the bug description can no longer be found:

2021-02-15T21:21:25.313279055Z E0215 21:21:25.313135      19 wrap.go:54] timeout or abort while handling: GET "/apis/oauth.openshift.io/v1/oauthclients"

Overall, case [1] has worked well over the past week, so moving the bug to VERIFIED.

[1] openshift-tests.[sig-api-machinery] API priority and fairness should ensure that requests can be classified by testing flow-schemas/priority-levels [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 15 errata-xmlrpc 2021-03-10 11:24:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0678

