Description of problem:
This morning, for a given batch of PRs, the CI operator failed to run e2e-aws 22 times in a row, from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8576 through https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8598.
This issue appears 38 times in the last week, and 0 times in the week before that. This could indicate that some flaky behavior was recently merged into CI Operator and deployed.
This failure occurs within the first ~2 minutes of the test run and produces the following logs:
error: could not run steps: step e2e-aws failed: could not wait for template instance to be ready: could not determine if template instance was ready: failed to create objects: rolebindings.authorization.openshift.io "e2e-aws-image-puller" already exists
rolebindings.authorization.openshift.io "e2e-aws-namespace-editors" already exists
pods "e2e-aws" already exists
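To make it easier to see which objects collided, here is a minimal sketch that pulls the resource/name pairs out of a log like the one above. The regular expression and the `find_conflicts` helper are illustrative assumptions, not part of CI Operator.

```python
import re

# Matches fragments such as:
#   rolebindings.authorization.openshift.io "e2e-aws-image-puller" already exists
ALREADY_EXISTS = re.compile(r'([a-z][a-z0-9.]*) "([^"]+)" already exists')


def find_conflicts(log_text):
    """Return (resource, name) pairs for every 'already exists' error in the log."""
    return ALREADY_EXISTS.findall(log_text)


# The log text quoted in this report.
log = '''error: could not run steps: step e2e-aws failed: could not wait for template instance to be ready: could not determine if template instance was ready: failed to create objects: rolebindings.authorization.openshift.io "e2e-aws-image-puller" already exists
rolebindings.authorization.openshift.io "e2e-aws-namespace-editors" already exists
pods "e2e-aws" already exists'''

conflicts = find_conflicts(log)
for resource, name in conflicts:
    print(resource, name)
```

All three conflicting objects carry the e2e-aws job name, which would be consistent with two instantiations of the same step landing in one namespace.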
Very interesting. We have not had any merges to CI Operator in that time-frame, but we did upgrade the OpenShift masters to v3.11.0+e5dbec2-186 on May 2nd. What search tool are you using to find these failures? This is perhaps an issue with OpenShift.
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1707486
Steve, I'm searching here: https://search.svc.ci.openshift.org/?search=could+not+wait+for+template+instance+to+be+ready%3A+could+not+determine+if+template+instance+was+ready&maxAge=12h&context=-1&type=all
Hmm, with these search criteria it shows up 135 times in the last week and only once in the week before that.
This issue is occurring for PRs related to 4.1 and thus adding risk to 4.1 (though I understand that CI is running on a 3.11 cluster).
For clarity, we upgraded
v3.11.0+5be504e-155 --> v3.11.0+e5dbec2-186
It looks like the occurrence of this is well correlated with the upgrade and is affecting ~5% of jobs.
both of these templates tried to create the same rolebinding (depending on the variable/parameter substitution anyway):
so if you are instantiating both of these in the same namespace, that would cause this.
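A rough sketch of that collision check, assuming templates are plain dictionaries with an `objects` list and `${PARAM}`-style names. The two templates, the `JOB_NAME_SAFE` parameter, and the helper names below are hypothetical stand-ins, not the actual templates discussed here.

```python
from string import Template


def rendered_keys(template, params):
    """Return the set of (kind, name) pairs a template would create after substitution."""
    keys = set()
    for obj in template["objects"]:
        # string.Template understands the ${NAME} placeholder syntax.
        name = Template(obj["metadata"]["name"]).safe_substitute(params)
        keys.add((obj["kind"], name))
    return keys


def collisions(template_a, params_a, template_b, params_b):
    """Return the sorted (kind, name) pairs that both templates would try to create."""
    return sorted(rendered_keys(template_a, params_a) & rendered_keys(template_b, params_b))


# Hypothetical templates, for illustration only.
template_a = {"objects": [
    {"kind": "RoleBinding", "metadata": {"name": "${JOB_NAME_SAFE}-image-puller"}},
    {"kind": "Pod", "metadata": {"name": "${JOB_NAME_SAFE}"}},
]}
template_b = {"objects": [
    {"kind": "RoleBinding", "metadata": {"name": "${JOB_NAME_SAFE}-image-puller"}},
    {"kind": "Pod", "metadata": {"name": "${JOB_NAME_SAFE}-serial"}},
]}

# Same parameter value in the same namespace: the rolebinding name collides.
same = collisions(template_a, {"JOB_NAME_SAFE": "e2e-aws"},
                  template_b, {"JOB_NAME_SAFE": "e2e-aws"})

# Different parameter values: no overlap.
different = collisions(template_a, {"JOB_NAME_SAFE": "e2e-aws"},
                       template_b, {"JOB_NAME_SAFE": "e2e-aws-serial"})
print(same, different)
```

Whether a collision occurs depends entirely on the substituted parameter values, which matches the caveat about variable/parameter substitution.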
Created attachment 1566443 [details]
Two time ranges where this led to runs on origin batch jobs
Pasting a typical error message so I can find this bug later:
could not wait for template instance to be ready: unable to retrieve existing template instance: templateinstances.template.openshift.io "e2e-aws-serial" not found
When this happens, it can lead to a run on a failing batch job with all sorts of exciting knock-on errors (not sure why that's specific to batch jobs).
One reason we're hitting this is that Prow does not abort previous batch jobs. That is fixed with this PR:
As build cop today I ran into this bug 4 times:
We deployed the fix to Prow this evening and we should see batch jobs get aborted correctly from now on. I do not expect to see this any longer.
We are seeing batch jobs aborted correctly; this should not happen again.
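For readers unfamiliar with the idea, here is a minimal sketch of what aborting superseded batch jobs could look like. This is not Prow's implementation: the `BatchJob` type, its fields, and `abort_superseded` are hypothetical, and the sketch assumes a higher build ID means a newer run.

```python
from dataclasses import dataclass


@dataclass
class BatchJob:
    job: str              # job name, e.g. "e2e-aws"
    build_id: int         # assumed to increase with each new run
    state: str = "pending"


def abort_superseded(jobs):
    """Mark every pending job except the newest per job name as aborted.

    Returns the build IDs that were aborted, in input order.
    """
    newest = {}
    for j in jobs:
        if j.state != "pending":
            continue
        if j.job not in newest or j.build_id > newest[j.job].build_id:
            newest[j.job] = j

    aborted = []
    for j in jobs:
        if j.state == "pending" and newest[j.job] is not j:
            j.state = "aborted"
            aborted.append(j.build_id)
    return aborted


jobs = [
    BatchJob("e2e-aws", 8576),
    BatchJob("e2e-aws", 8577),
    BatchJob("e2e-aws", 8598),
]
aborted_ids = abort_superseded(jobs)
print(aborted_ids, [j.state for j in jobs])
```

With only one live run per job name, a newer run would presumably no longer find the older run's rolebindings and pods already present.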
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1707486 has been marked as a duplicate of this bug. ***