Bug 2006947 - e2e-aws-proxy for 4.10 is permafailing with samples operator errors
Summary: e2e-aws-proxy for 4.10 is permafailing with samples operator errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Gabe Montero
QA Contact: Jitendar Singh
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-22 17:18 UTC by Stephen Benjamin
Modified: 2022-03-10 16:13 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-proxy=all
Last Closed: 2022-03-10 16:12:52 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-samples-operator pull 397 | 0 | None | open | Bug 2006947: fix proxy portion of tbr inaccessible check | 2021-09-22 23:45:54 UTC
Github | openshift release pull 22168 | 0 | None | open | Bug 2006947: add optional e2e-aws-proxy to samples operator master branch | 2021-09-22 23:58:17 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:13:13 UTC

Description Stephen Benjamin 2021-09-22 17:18:25 UTC
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy is permafailing. The 4.9 job doesn't look great, either.

The job is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy


Example job failure:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1440227332513075200

They mostly seem to be failing on pulling images:

fail [github.com/openshift/origin/test/extended/tbr_health/check.go:18]: Expected
    <string>: Failed to import expected imagestreams, latest error status: ImageStream Error: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"imagestreams.image.openshift.io \"ruby\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0019e2ea0), Code:404}} 


Looking into the cluster-samples-operator logs I see things like this:

2021-09-21T08:41:03.969380858Z time="2021-09-21T08:41:03Z" level=error msg="unable to sync: config.samples.operator.openshift.io \"cluster\" not found, requeuing"
2021-09-21T08:41:08.426343198Z time="2021-09-21T08:41:08Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:41:17.534254816Z time="2021-09-21T08:41:17Z" level=info msg="Received watch event imagestream/driver-toolkit but not upserting since deletion of the Config is in progress"
2021-09-21T08:41:28.425834834Z time="2021-09-21T08:41:28Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:41:48.426345247Z time="2021-09-21T08:41:48Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:42:08.425404137Z time="2021-09-21T08:42:08Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:42:22.293899280Z W0921 08:42:22.293835       8 reflector.go:441] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 25; INTERNAL_ERROR") has prevented the request from succeeding
2021-09-21T08:42:25.890437860Z time="2021-09-21T08:42:25Z" level=error msg="unable to sync: config.samples.operator.openshift.io \"cluster\" not found, requeuing"
2021-09-21T08:42:28.426772151Z time="2021-09-21T08:42:28Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:42:43.428406867Z time="2021-09-21T08:42:43Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:42:43.428483787Z time="2021-09-21T08:42:43Z" level=info msg="unable to establish HTTPS connection to registry.redhat.io after 3 minutes, bootstrap to Removed"
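
For triage, the pattern in the excerpt above (repeated direct-dial timeouts followed by a bootstrap to Removed) can be pulled out of a log mechanically. This is only a rough sketch: the helper name `summarize_samples_log` is hypothetical, and the two regular expressions are assumptions based solely on the message text quoted in this report, not on the operator's source.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this report (assumption: the
# message wording is stable across runs).
DIAL_TIMEOUT = re.compile(r"dial tcp (\d+\.\d+\.\d+\.\d+):443: i/o timeout")
BOOTSTRAP_REMOVED = re.compile(r"bootstrap to Removed")


def summarize_samples_log(lines):
    """Count dial timeouts per IP and flag whether the operator bootstrapped as Removed."""
    timeouts = Counter()
    removed = False
    for line in lines:
        match = DIAL_TIMEOUT.search(line)
        if match:
            timeouts[match.group(1)] += 1
        if BOOTSTRAP_REMOVED.search(line):
            removed = True
    return {"timeouts": dict(timeouts), "bootstrapped_removed": removed}


# Three lines taken from the excerpt above (timestamp prefix trimmed).
sample = [
    'level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"',
    'level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"',
    'level=info msg="unable to establish HTTPS connection to registry.redhat.io after 3 minutes, bootstrap to Removed"',
]
print(summarize_samples_log(sample))
```

A result with timeouts against public IPs on port 443 plus `bootstrapped_removed` set is the signature described in this report.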


I don't see any proxy configuration in the cluster-samples-operator deployment.
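
To illustrate why a missing proxy configuration would produce direct dials like the ones in the log, here is a minimal sketch of how a client might choose between the proxy and the target host. It is not the operator's code. `pick_proxy`, `dial_target`, and the example proxy URL are hypothetical, and the NO_PROXY handling is deliberately simplified (hostnames and domain suffixes only, no CIDR ranges or ports).

```python
from urllib.parse import urlsplit


def pick_proxy(host, env):
    """Return the proxy URL for an HTTPS request to host, or None to dial directly.

    Simplified assumption: NO_PROXY is a comma-separated list of hostnames
    or domain suffixes.
    """
    proxy = env.get("HTTPS_PROXY") or env.get("https_proxy")
    if not proxy:
        return None
    no_proxy = env.get("NO_PROXY") or env.get("no_proxy") or ""
    for entry in (e.strip().lstrip(".") for e in no_proxy.split(",")):
        if entry and (host == entry or host.endswith("." + entry)):
            return None
    return proxy


def dial_target(host, env):
    """Return the (hostname, port) the client would open its TCP connection to."""
    proxy = pick_proxy(host, env)
    if proxy is None:
        # No proxy applies: the client dials the registry itself on 443.
        return (host, 443)
    parts = urlsplit(proxy)
    default_port = 443 if parts.scheme == "https" else 80
    return (parts.hostname, parts.port or default_port)


# Hypothetical proxy settings for illustration only.
env = {"HTTPS_PROXY": "http://proxy.example.com:3128", "NO_PROXY": ".cluster.local,localhost"}
print(dial_target("registry.redhat.io", env))
print(dial_target("registry.redhat.io", {}))
```

With an empty environment the sketch dials registry.redhat.io directly on port 443, which in a proxy-only cluster would time out the way the log shows.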

Comment 1 Gabe Montero 2021-09-22 23:33:29 UTC
Yes, my https://github.com/openshift/cluster-samples-operator/pull/394 (bz 2002368) broke proxy.

I'll take this one

The fix is pretty straightforward, but I also want to add e2e-aws-proxy as an optional job on the samples operator repo to vet the fix.

Comment 2 Gabe Montero 2021-09-23 13:21:04 UTC
UPDATE:

Adding e2e-aws-proxy to the samples repo proved interesting in that the rehearsal job passes without my fix to samples.

I had copied the e2e-aws-proxy job def from openshift/builder

So all this makes me wonder whether the build API team has been setting up e2e-aws-proxy incorrectly.

I'm going to try to compare the setup for periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy
with our PR tests and see if I can find the difference.

While the fix is pretty straightforward, I'd still rather vet it with the equivalent of a periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy run
before it merges.

I've copied David (now owner of samples) and Adam (build api team lead) for awareness.

Comment 3 Gabe Montero 2021-09-23 14:21:30 UTC
I've compared our e2e-aws-proxy job configs with https://raw.githubusercontent.com/openshift/release/master/ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml and see no relevant diffs.

I've asked the DPTP help desk over in #forum-testplatform.

In the interim, I'm going to move forward with my openshift/release PR and create some dummy/test PRs in openshift/builder and openshift/cluster-samples-operator, separate from my fix PR, to cross-reference and vet.

Comment 4 Gabe Montero 2021-09-23 14:37:53 UTC
Petr Muller responded .... hopefully can fix our e2e-aws-proxy def

Comment 5 Gabe Montero 2021-09-27 17:02:11 UTC
I'll inspect the must-gather of the next few periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy runs later today / tomorrow and confirm the samples operator is not an issue.

The key element will be that samples no longer bootstraps as Removed, which is what was occurring before, but as the default, Managed.

Bootstrapping as Removed led to a bunch of sig-builds tests failing because the language sample imagestreams those tests depend on did not exist.
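
The check described above can be sketched as a small helper run against the samples Config resource from a must-gather, once it has been loaded into a dictionary. This is a sketch under assumptions: the function names are hypothetical, and reading the state from `spec.managementState` reflects my understanding of the config.samples.operator.openshift.io resource rather than anything stated in this bug.

```python
def management_state(config):
    """Return spec.managementState from a samples Config dict, or None if absent."""
    return config.get("spec", {}).get("managementState")


def samples_expected(config):
    """True when the operator is Managed, so sample imagestreams such as ruby should exist."""
    return management_state(config) == "Managed"


# Hand-written stand-ins for the resource in a failing and a healthy run.
failing = {"kind": "Config", "metadata": {"name": "cluster"}, "spec": {"managementState": "Removed"}}
healthy = {"kind": "Config", "metadata": {"name": "cluster"}, "spec": {"managementState": "Managed"}}

for name, cfg in (("failing", failing), ("healthy", healthy)):
    print(name, management_state(cfg), samples_expected(cfg))
```

If `samples_expected` is false for a proxy run, the sig-builds failures on missing imagestreams would be a consequence of the bootstrap state rather than separate bugs.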

Comment 7 Gabe Montero 2021-09-27 18:06:21 UTC
Looking at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy the job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1442546511056474112 *MIGHT* have our fix here.

But rather than sort out the commit levels now, I'll let it finish, inspect, and we'll go from there.

Comment 13 errata-xmlrpc 2022-03-10 16:12:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
