Bug 1708187 - CI Operator template e2e AWS could not determine if template instance was ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Infrastructure
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 3.11.z
Assignee: Steve Kuznetsov
QA Contact:
URL:
Whiteboard: buildcop
Duplicates: 1707486
Depends On:
Blocks:
 
Reported: 2019-05-09 10:50 UTC by lserven
Modified: 2019-07-11 15:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-26 09:08:09 UTC
Target Upstream Version:
Embargoed:


Attachments
Two time ranges where this led to runs on origin batch jobs (327.87 KB, image/png)
2019-05-09 22:28 UTC, W. Trevor King


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:1605 | 0 | None | None | None | 2019-06-26 09:08:23 UTC

Description lserven 2019-05-09 10:50:38 UTC
Description of problem:

This morning for a given batch of PRs, the CI operator failed to run e2e-aws 22 times in a row, from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8576 - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8598.

This issue appears 38 times in the last week, and 0 times in the week before that. This could indicate that some flaky behavior was recently merged into CI Operator and deployed.

This failure occurs within the first ~2 minutes of the test run and produces the following logs:

error: could not run steps: step e2e-aws failed: could not wait for template instance to be ready: could not determine if template instance was ready: failed to create objects: rolebindings.authorization.openshift.io "e2e-aws-image-puller" already exists
rolebindings.authorization.openshift.io "e2e-aws-namespace-editors" already exists
pods "e2e-aws" already exists
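
The "already exists" lines in the log above name the exact objects that collided, so they can be pulled apart mechanically when triaging many failed runs. The following is a minimal sketch, assuming the message format shown in this report; the function name `find_conflicts` is hypothetical and is not part of CI Operator.

```python
import re

# Matches fragments such as: pods "e2e-aws" already exists
# (assumes the message format quoted in this bug report)
ALREADY_EXISTS = re.compile(
    r'(?P<resource>[\w.\-]+) "(?P<name>[^"]+)" already exists'
)


def find_conflicts(log_text):
    """Return (resource, name) pairs for every 'already exists' message."""
    return [
        (match.group("resource"), match.group("name"))
        for match in ALREADY_EXISTS.finditer(log_text)
    ]
```

Run against the log above, this would report two rolebindings and one pod, which makes it easy to see whether the same object names recur across failing jobs.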

Comment 1 Steve Kuznetsov 2019-05-09 14:58:20 UTC
Very interesting. We have not had any merges to CI Operator in that time-frame, but we did upgrade the OpenShift masters to v3.11.0+e5dbec2-186 on May 2nd. What search tool are you using to find these failures? This is perhaps an issue with OpenShift.

Comment 2 Steve Kuznetsov 2019-05-09 15:24:07 UTC
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1707486

Comment 3 lserven 2019-05-09 15:43:21 UTC
Steve, I'm searching here: https://search.svc.ci.openshift.org/?search=could+not+wait+for+template+instance+to+be+ready%3A+could+not+determine+if+template+instance+was+ready&maxAge=12h&context=-1&type=all
Hmm, with these search criteria it shows up 135 times in the last week and only once in the week before that.

This issue is occurring for PRs related to 4.1 and thus adding risk to 4.1 (though I understand that CI is running on a 3.11 cluster).

Comment 4 Steve Kuznetsov 2019-05-09 15:56:06 UTC
For clarity, we upgraded

v3.11.0+5be504e-155 --> v3.11.0+e5dbec2-186

It looks like the occurrence of this is well correlated with the upgrade and is affecting ~5% of jobs.

Comment 5 Ben Parees 2019-05-09 16:55:09 UTC
Both of these templates try to create the same rolebinding (depending on the variable/parameter substitution, anyway):
https://github.com/openshift/release/blob/master/ci-operator/templates/master-sidecar-3.yaml#L22
https://github.com/openshift/release/blob/master/ci-operator/templates/master-sidecar-4.yaml#L28

so if you are instantiating both of these in the same namespace, that would cause this.
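
The collision described above can be checked before instantiating two templates in one namespace. This is a rough sketch, assuming each rendered template has already been reduced to a list of (kind, name) pairs; the helper `colliding_objects` and the sample object names are hypothetical and are not taken from the linked templates.

```python
def colliding_objects(first, second):
    """Return the (kind, name) pairs that both rendered templates would create."""
    return sorted(set(first) & set(second))


# Hypothetical rendered objects, for illustration only.
template_a = [("RoleBinding", "example-image-puller"), ("Pod", "example-a")]
template_b = [("RoleBinding", "example-image-puller"), ("Pod", "example-b")]

conflicts = colliding_objects(template_a, template_b)
```

A non-empty result means that whichever template is instantiated second would fail with an "already exists" error for those objects.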

Comment 6 W. Trevor King 2019-05-09 22:28:01 UTC
Created attachment 1566443
Two time ranges where this led to runs on origin batch jobs

Pasting a typical error message so I can find this bug later [1]:

  could not wait for template instance to be ready: unable to retrieve existing template instance: templateinstances.template.openshift.io "e2e-aws-serial" not found

When this happens, it can lead to a run on a failing batch job with all sorts of exciting knock-on errors (not sure why that's specific to batch jobs).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-serial/6060

Comment 7 Steve Kuznetsov 2019-05-09 23:50:12 UTC
One reason we're hitting this is that Prow does not abort previous batch jobs. That is fixed with this PR:

https://github.com/kubernetes/test-infra/pull/12578
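
The idea of aborting previous batch jobs can be illustrated with a small sketch. It is not Prow's implementation (the linked PR is the actual fix); the `BatchJob` type, its fields, and `abort_superseded` are hypothetical names, and the sketch assumes that only the most recently started pending run of each job should survive.

```python
from dataclasses import dataclass


@dataclass
class BatchJob:
    job_name: str           # e.g. the name of the e2e job
    started: int            # start time, seconds since some epoch
    state: str = "pending"  # "pending", "aborted", or a finished state


def abort_superseded(jobs):
    """Mark every pending run as aborted except the newest one per job name."""
    newest = {}
    for job in jobs:
        if job.state != "pending":
            continue
        current = newest.get(job.job_name)
        if current is None or job.started > current.started:
            newest[job.job_name] = job

    aborted = []
    for job in jobs:
        if job.state == "pending" and newest[job.job_name] is not job:
            job.state = "aborted"
            aborted.append(job)
    return aborted
```

With this policy, at most one pending run per job name remains at any time, so an older batch run and its replacement are never both left pending.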

Comment 9 Steve Kuznetsov 2019-05-14 23:08:26 UTC
We deployed the fix to Prow this evening and we should see batch jobs get aborted correctly from now on. I do not expect to see this any longer.

Comment 10 Steve Kuznetsov 2019-05-17 15:15:44 UTC
We are seeing batch jobs aborted correctly; this should not happen again.

Comment 12 errata-xmlrpc 2019-06-26 09:08:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

Comment 13 Steve Kuznetsov 2019-07-11 15:02:54 UTC
*** Bug 1707486 has been marked as a duplicate of this bug. ***

