2006947 – e2e-aws-proxy for 4.10 is permafailing with samples operator errors

Summary: e2e-aws-proxy for 4.10 is permafailing with samples operator errors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Samples
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Gabe Montero
QA Contact:	Jitendar Singh
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-09-22 17:18 UTC by Stephen Benjamin
Modified:	2022-03-10 16:13 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:	job=periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-proxy=all
Last Closed:	2022-03-10 16:12:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-samples-operator pull 397	None	open	Bug 2006947: fix proxy portion of tbr inaccessible check	2021-09-22 23:45:54 UTC
Github	openshift release pull 22168	None	open	Bug 2006947: add optional e2e-aws-proxy to samples operator master branch	2021-09-22 23:58:17 UTC
Red Hat Product Errata	RHSA-2022:0056	None	None	None	2022-03-10 16:13:13 UTC

Description Stephen Benjamin 2021-09-22 17:18:25 UTC

periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy is permafailing. The 4.9 job doesn't look great, either.

The job is failing frequently in CI; see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy


Example job failure:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1440227332513075200

They mostly seem to be failing on pulling images:

fail [github.com/openshift/origin/test/extended/tbr_health/check.go:18]: Expected
    <string>: Failed to import expected imagestreams, latest error status: ImageStream Error: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"imagestreams.image.openshift.io \"ruby\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0019e2ea0), Code:404}} 


Looking into the cluster-samples-operator logs I see things like this:

2021-09-21T08:41:03.969380858Z time="2021-09-21T08:41:03Z" level=error msg="unable to sync: config.samples.operator.openshift.io \"cluster\" not found, requeuing"
2021-09-21T08:41:08.426343198Z time="2021-09-21T08:41:08Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:41:17.534254816Z time="2021-09-21T08:41:17Z" level=info msg="Received watch event imagestream/driver-toolkit but not upserting since deletion of the Config is in progress"
2021-09-21T08:41:28.425834834Z time="2021-09-21T08:41:28Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:41:48.426345247Z time="2021-09-21T08:41:48Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:42:08.425404137Z time="2021-09-21T08:42:08Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:42:22.293899280Z W0921 08:42:22.293835       8 reflector.go:441] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 25; INTERNAL_ERROR") has prevented the request from succeeding
2021-09-21T08:42:25.890437860Z time="2021-09-21T08:42:25Z" level=error msg="unable to sync: config.samples.operator.openshift.io \"cluster\" not found, requeuing"
2021-09-21T08:42:28.426772151Z time="2021-09-21T08:42:28Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:42:43.428406867Z time="2021-09-21T08:42:43Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:42:43.428483787Z time="2021-09-21T08:42:43Z" level=info msg="unable to establish HTTPS connection to registry.redhat.io after 3 minutes, bootstrap to Removed"


I don't see any proxy configuration in the cluster-samples-operator deployment.

Comment 1 Gabe Montero 2021-09-22 23:33:29 UTC

Yeah, my https://github.com/openshift/cluster-samples-operator/pull/394 (bz 2002368) broke proxy support.

I'll take this one

The fix is pretty straightforward, but I also want to add e2e-aws-proxy as an option in the samples operator repo to vet the fix.

Comment 2 Gabe Montero 2021-09-23 13:21:04 UTC

UPDATE:

Adding e2e-aws-proxy to the samples operator repo proved interesting: the rehearsal job passes without my fix to samples.

I had copied the e2e-aws-proxy job def from openshift/builder

So all this makes me wonder if build api team has not been setting up e2e-aws-proxy correctly.

I'm going to try to compare the setup for periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy
with our PR tests and see if I can find the difference.

While the fix is pretty straightforward, I'd still rather vet it with the equivalent run of periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy
before it merges.

I've copied David (now owner of samples) and Adam (build api team lead) for awareness.

Comment 3 Gabe Montero 2021-09-23 14:21:30 UTC

I've compared our e2e-aws-proxy job configs with https://raw.githubusercontent.com/openshift/release/master/ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml and see no relevant diffs.

I've asked the DPTP help desk over in #forum-testplatform.

In the interim, I'm going to move forward with my openshift/release PR and create some dummy/test PRs in openshift/builder and openshift/cluster-samples-operator, separate from my fix PR, to cross-reference and vet.

Comment 4 Gabe Montero 2021-09-23 14:37:53 UTC

Petr Muller responded; hopefully our e2e-aws-proxy def can be fixed.

Comment 5 Gabe Montero 2021-09-27 17:02:11 UTC

I'll inspect the must-gather of the next few periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy runs later today / tomorrow and confirm the samples operator is not an issue.

The key element will be that samples does not bootstrap as Removed, which is what was occurring before, but as the default, Managed.

That led to a bunch of sig-builds tests failing because the language sample imagestreams they depended on did not exist.
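
As a rough aid for that kind of inspection, here is a minimal Go sketch that scans samples operator log text for the two messages quoted in the description: the connection-test timeout and the "bootstrap to Removed" decision. The message substrings are taken from the log excerpt in this bug; `logSummary` and `summarize` are hypothetical names, and how the log is extracted from a must-gather is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// logSummary (hypothetical type) records what the scan found.
type logSummary struct {
	DialTimeouts        int  // lines reporting a failed connection test
	BootstrappedRemoved bool // true if the operator fell back to Removed
}

// summarize scans operator log text line by line and counts the messages
// of interest. Matching is by plain substring, so it is deliberately loose.
func summarize(log string) logSummary {
	var s logSummary
	sc := bufio.NewScanner(strings.NewReader(log))
	// Allow long lines; reflector warnings in these logs can be lengthy.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "test connection with timeout failed") {
			s.DialTimeouts++
		}
		if strings.Contains(line, "bootstrap to Removed") {
			s.BootstrappedRemoved = true
		}
	}
	return s
}

func main() {
	// Two lines copied from the log excerpt in the description.
	sample := `time="2021-09-21T08:42:43Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
time="2021-09-21T08:42:43Z" level=info msg="unable to establish HTTPS connection to registry.redhat.io after 3 minutes, bootstrap to Removed"`

	fmt.Printf("%+v\n", summarize(sample))
}
```

A healthy proxy run would be expected to report `BootstrappedRemoved` as false; the timeout count alone is less conclusive, since isolated timeouts could occur for other reasons.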

Comment 7 Gabe Montero 2021-09-27 18:06:21 UTC

Looking at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy the job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1442546511056474112 *MIGHT* have our fix here.

But rather than sort out the commit levels now, I'll let it finish, inspect, and we'll go from there.

Comment 8 Gabe Montero 2021-09-27 20:23:52 UTC

Yeah https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1442546511056474112 was missing the fix.  Waiting for the next invocation

Comment 10 Gabe Montero 2021-09-28 12:16:23 UTC

And we have green e2e's 

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1442674651170869248
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1442770312683851776

verification for this will be handled via the verification for https://bugzilla.redhat.com/show_bug.cgi?id=2002368

This bug addressed a regression introduced by that 4.10-only change.

marking verified

Comment 13 errata-xmlrpc 2022-03-10 16:12:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
