Bug 1708187

Summary: CI Operator template e2e AWS could not determine if template instance was ready
Product: OpenShift Container Platform
Component: Test Infrastructure
Reporter: lserven
Assignee: Steve Kuznetsov <skuznets>
Status: CLOSED ERRATA
Severity: low
Priority: low
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: buildcop
CC: agreene, bparees, rmeggins, sponnaga, wking
Last Closed: 2019-06-26 09:08:09 UTC
Type: Bug

Attachments:
Two time ranges where this led to runs on origin batch jobs

Description lserven 2019-05-09 10:50:38 UTC
Description of problem:

This morning, for a given batch of PRs, the CI operator failed to run e2e-aws 22 times in a row, from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8576 through https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8598.

This issue appears 38 times in the last week, and 0 times in the week before that. This could indicate that some flaky behavior was recently merged into CI Operator and deployed.

This failure occurs within the first ~2 minutes of the test run and produces the following logs:

error: could not run steps: step e2e-aws failed: could not wait for template instance to be ready: could not determine if template instance was ready: failed to create objects: rolebindings.authorization.openshift.io "e2e-aws-image-puller" already exists
rolebindings.authorization.openshift.io "e2e-aws-namespace-editors" already exists
pods "e2e-aws" already exists
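The "already exists" errors indicate the step tried to create objects left behind by an earlier run in the same namespace. A minimal sketch of the difference between strict creation (which fails as in the log above) and idempotent creation that tolerates leftovers; the class and function names here are hypothetical, not ci-operator's actual code:

```python
class AlreadyExists(Exception):
    pass

class FakeCluster:
    """Toy stand-in for objects in an API-server namespace (illustrative only)."""
    def __init__(self):
        self.objects = set()

    def create(self, kind, name):
        # Strict creation: fails if the object is already present,
        # which is the failure mode in the log above.
        key = (kind, name)
        if key in self.objects:
            raise AlreadyExists(f'{kind} "{name}" already exists')
        self.objects.add(key)

def ensure(cluster, kind, name):
    # Idempotent creation: treat AlreadyExists as success, so a
    # retried instantiation would not trip over leftover objects.
    try:
        cluster.create(kind, name)
    except AlreadyExists:
        pass
```

Calling `ensure` twice for the same rolebinding succeeds, while a second raw `create` raises, mirroring the failed run.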

Comment 1 Steve Kuznetsov 2019-05-09 14:58:20 UTC
Very interesting. We have not had any merges to CI Operator in that time-frame, but we did upgrade the OpenShift masters to v3.11.0+e5dbec2-186 on May 2nd. What search tool are you using to find these failures? This is perhaps an issue with OpenShift.

Comment 2 Steve Kuznetsov 2019-05-09 15:24:07 UTC
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1707486

Comment 3 lserven 2019-05-09 15:43:21 UTC
Steve, I'm searching here: https://search.svc.ci.openshift.org/?search=could+not+wait+for+template+instance+to+be+ready%3A+could+not+determine+if+template+instance+was+ready&maxAge=12h&context=-1&type=all
Hmm, with this search query it shows up 135 times in the last week and only once in the week before that.

This issue is occurring for PRs related to 4.1 and thus adds risk to 4.1 (though I understand that CI itself runs on a 3.11 cluster).

Comment 4 Steve Kuznetsov 2019-05-09 15:56:06 UTC
For clarity, we upgraded

v3.11.0+5be504e-155 --> v3.11.0+e5dbec2-186

It looks like the occurrence of this is well correlated with the upgrade and is affecting ~5% of jobs.

Comment 5 Ben Parees 2019-05-09 16:55:09 UTC
Both of these templates try to create the same rolebinding (depending on variable/parameter substitution):
https://github.com/openshift/release/blob/master/ci-operator/templates/master-sidecar-3.yaml#L22
https://github.com/openshift/release/blob/master/ci-operator/templates/master-sidecar-4.yaml#L28

So if you instantiate both of these in the same namespace, that would cause this error.
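The collision in comment 5 can be sketched as follows. OpenShift templates build object names via `${PARAM}` substitution, which Python's `string.Template` happens to mimic; the field values and the `JOB_NAME_SAFE` parameter below are simplified assumptions, not the exact contents of the linked YAML files:

```python
from string import Template

# Hypothetical, simplified name fields for the rolebinding each
# template declares (the real files are linked above).
SIDECAR_3_ROLEBINDING = "${JOB_NAME_SAFE}-image-puller"
SIDECAR_4_ROLEBINDING = "${JOB_NAME_SAFE}-image-puller"

def resolve(name_field, params):
    # Substitute template parameters into an object name field.
    return Template(name_field).substitute(params)

params = {"JOB_NAME_SAFE": "e2e-aws"}  # assumed parameter value
name_3 = resolve(SIDECAR_3_ROLEBINDING, params)
name_4 = resolve(SIDECAR_4_ROLEBINDING, params)

# Both templates resolve to the same name, so instantiating both in
# one namespace makes the second create fail with "already exists".
collides = name_3 == name_4
```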

Comment 6 W. Trevor King 2019-05-09 22:28:01 UTC
Created attachment 1566443 [details]
Two time ranges where this led to runs on origin batch jobs

Pasting a typical error message so I can find this bug later [1]:

  could not wait for template instance to be ready: unable to retrieve existing template instance: templateinstances.template.openshift.io "e2e-aws-serial" not found

When this happens, it can lead to a run on a failing batch job with all sorts of exciting knock-on errors (not sure why that's specific to batch jobs).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-serial/6060

Comment 7 Steve Kuznetsov 2019-05-09 23:50:12 UTC
One reason we're hitting this is that Prow does not abort previous batch jobs. That is fixed with this PR:

https://github.com/kubernetes/test-infra/pull/12578

Comment 9 Steve Kuznetsov 2019-05-14 23:08:26 UTC
We deployed the fix to Prow this evening and we should see batch jobs get aborted correctly from now on. I do not expect to see this any longer.

Comment 10 Steve Kuznetsov 2019-05-17 15:15:44 UTC
We are seeing batch jobs aborted correctly; this should not happen again.

Comment 12 errata-xmlrpc 2019-06-26 09:08:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

Comment 13 Steve Kuznetsov 2019-07-11 15:02:54 UTC
*** Bug 1707486 has been marked as a duplicate of this bug. ***