Bug 1708187

Summary: CI Operator template e2e AWS could not determine if template instance was ready
Product: OpenShift Container Platform
Component: Test Infrastructure
Reporter: lserven
Assignee: Steve Kuznetsov <skuznets>
Status: CLOSED ERRATA
Severity: low
Priority: low
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: buildcop
CC: agreene, bparees, rmeggins, sponnaga, wking
Last Closed: 2019-06-26 09:08:09 UTC
Type: Bug

Attachments:
Two time ranges where this led to runs on origin batch jobs

Description lserven 2019-05-09 10:50:38 UTC
Description of problem:

This morning, for a given batch of PRs, the CI operator failed to run e2e-aws 22 times in a row, from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8576 through https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8598.

This issue appears 38 times in the last week, and 0 times in the week before that. This could indicate that some flaky behavior was recently merged into CI Operator and deployed.

This failure occurs within the first ~2 minutes of the test run and produces the following logs:

error: could not run steps: step e2e-aws failed: could not wait for template instance to be ready: could not determine if template instance was ready: failed to create objects: rolebindings.authorization.openshift.io "e2e-aws-image-puller" already exists
rolebindings.authorization.openshift.io "e2e-aws-namespace-editors" already exists
pods "e2e-aws" already exists
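The "already exists" errors indicate the step tried to create objects left behind by an earlier run in the same namespace. A minimal sketch of the difference between strict creation (which fails as in the log above) and idempotent creation that tolerates leftovers; the class and function names here are hypothetical, not ci-operator's actual code:

```python
class AlreadyExists(Exception):
    pass

class FakeCluster:
    """Toy stand-in for objects in an API-server namespace (illustrative only)."""
    def __init__(self):
        self.objects = set()

    def create(self, kind, name):
        # Strict creation: fails if the object is already present,
        # which is the failure mode in the log above.
        key = (kind, name)
        if key in self.objects:
            raise AlreadyExists(f'{kind} "{name}" already exists')
        self.objects.add(key)

def ensure(cluster, kind, name):
    # Idempotent creation: treat AlreadyExists as success, so a
    # retried instantiation would not trip over leftover objects.
    try:
        cluster.create(kind, name)
    except AlreadyExists:
        pass
```

Calling `ensure` twice for the same rolebinding succeeds, while a second raw `create` raises, mirroring the failed run.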

Comment 1 Steve Kuznetsov 2019-05-09 14:58:20 UTC
Very interesting. We have not had any merges to CI Operator in that time-frame, but we did upgrade the OpenShift masters to v3.11.0+e5dbec2-186 on May 2nd. What search tool are you using to find these failures? This is perhaps an issue with OpenShift.

Comment 2 Steve Kuznetsov 2019-05-09 15:24:07 UTC
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1707486

Comment 3 lserven 2019-05-09 15:43:21 UTC
Steve, I'm searching here: https://search.svc.ci.openshift.org/?search=could+not+wait+for+template+instance+to+be+ready%3A+could+not+determine+if+template+instance+was+ready&maxAge=12h&context=-1&type=all
Hmm, with this search query it shows up 135 times in the last week and only once in the week before that.

This issue is occurring for PRs related to 4.1 and thus adds risk to 4.1 (though I understand that CI itself runs on a 3.11 cluster).

Comment 4 Steve Kuznetsov 2019-05-09 15:56:06 UTC
For clarity, we upgraded

v3.11.0+5be504e-155 --> v3.11.0+e5dbec2-186

It looks like the occurrence of this is well correlated with the upgrade and is affecting ~5% of jobs.

Comment 5 Ben Parees 2019-05-09 16:55:09 UTC
Both of these templates try to create the same rolebinding (depending on variable/parameter substitution):
https://github.com/openshift/release/blob/master/ci-operator/templates/master-sidecar-3.yaml#L22
https://github.com/openshift/release/blob/master/ci-operator/templates/master-sidecar-4.yaml#L28

So if you instantiate both of these in the same namespace, that would cause this error.
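The collision in comment 5 can be sketched as follows. OpenShift templates build object names via `${PARAM}` substitution, which Python's `string.Template` happens to mimic; the field values and the `JOB_NAME_SAFE` parameter below are simplified assumptions, not the exact contents of the linked YAML files:

```python
from string import Template

# Hypothetical, simplified name fields for the rolebinding each
# template declares (the real files are linked above).
SIDECAR_3_ROLEBINDING = "${JOB_NAME_SAFE}-image-puller"
SIDECAR_4_ROLEBINDING = "${JOB_NAME_SAFE}-image-puller"

def resolve(name_field, params):
    # Substitute template parameters into an object name field.
    return Template(name_field).substitute(params)

params = {"JOB_NAME_SAFE": "e2e-aws"}  # assumed parameter value
name_3 = resolve(SIDECAR_3_ROLEBINDING, params)
name_4 = resolve(SIDECAR_4_ROLEBINDING, params)

# Both templates resolve to the same name, so instantiating both in
# one namespace makes the second create fail with "already exists".
collides = name_3 == name_4
```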

Comment 6 W. Trevor King 2019-05-09 22:28:01 UTC
Created attachment 1566443 [details]
Two time ranges where this led to runs on origin batch jobs

Pasting a typical error message so I can find this bug later [1]:

  could not wait for template instance to be ready: unable to retrieve existing template instance: templateinstances.template.openshift.io "e2e-aws-serial" not found

When this happens, it can lead to a run on a failing batch job with all sorts of exciting knock-on errors (not sure why that's specific to batch jobs).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-serial/6060

Comment 7 Steve Kuznetsov 2019-05-09 23:50:12 UTC
One reason we're hitting this is that Prow does not abort previous batch jobs. That is fixed with this PR:

https://github.com/kubernetes/test-infra/pull/12578

Comment 9 Steve Kuznetsov 2019-05-14 23:08:26 UTC
We deployed the fix to Prow this evening and we should see batch jobs get aborted correctly from now on. I do not expect to see this any longer.

Comment 10 Steve Kuznetsov 2019-05-17 15:15:44 UTC
We are seeing batch jobs aborted correctly; this should not happen again.

Comment 12 errata-xmlrpc 2019-06-26 09:08:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

Comment 13 Steve Kuznetsov 2019-07-11 15:02:54 UTC
*** Bug 1707486 has been marked as a duplicate of this bug. ***