Bug 2006947

Summary: e2e-aws-proxy for 4.10 is permafailing with samples operator errors
Product: OpenShift Container Platform
Reporter: Stephen Benjamin <stbenjam>
Component: Samples
Assignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA
QA Contact: Jitendar Singh <jitsingh>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: adam.kaplan, aos-bugs, dperaza, gmontero, sippy, wking
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment: job=periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-proxy=all
Last Closed: 2022-03-10 16:12:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Stephen Benjamin 2021-09-22 17:18:25 UTC
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy is permafailing. The 4.9 job doesn't look great, either.

The job is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy


Example job failure:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1440227332513075200

They mostly seem to be failing on pulling images:

fail [github.com/openshift/origin/test/extended/tbr_health/check.go:18]: Expected
    <string>: Failed to import expected imagestreams, latest error status: ImageStream Error: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"imagestreams.image.openshift.io \"ruby\" not found", Reason:"NotFound", Details:(*v1.StatusDetails)(0xc0019e2ea0), Code:404}} 


Looking into the cluster-samples-operator logs I see things like this:

2021-09-21T08:41:03.969380858Z time="2021-09-21T08:41:03Z" level=error msg="unable to sync: config.samples.operator.openshift.io \"cluster\" not found, requeuing"
2021-09-21T08:41:08.426343198Z time="2021-09-21T08:41:08Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:41:17.534254816Z time="2021-09-21T08:41:17Z" level=info msg="Received watch event imagestream/driver-toolkit but not upserting since deletion of the Config is in progress"
2021-09-21T08:41:28.425834834Z time="2021-09-21T08:41:28Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:41:48.426345247Z time="2021-09-21T08:41:48Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:42:08.425404137Z time="2021-09-21T08:42:08Z" level=info msg="test connection with timeout failed with dial tcp 104.119.21.151:443: i/o timeout"
2021-09-21T08:42:22.293899280Z W0921 08:42:22.293835       8 reflector.go:441] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.Template ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 25; INTERNAL_ERROR") has prevented the request from succeeding
2021-09-21T08:42:25.890437860Z time="2021-09-21T08:42:25Z" level=error msg="unable to sync: config.samples.operator.openshift.io \"cluster\" not found, requeuing"
2021-09-21T08:42:28.426772151Z time="2021-09-21T08:42:28Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:42:43.428406867Z time="2021-09-21T08:42:43Z" level=info msg="test connection with timeout failed with dial tcp 104.100.22.132:443: i/o timeout"
2021-09-21T08:42:43.428483787Z time="2021-09-21T08:42:43Z" level=info msg="unable to establish HTTPS connection to registry.redhat.io after 3 minutes, bootstrap to Removed"


I don't see any proxy configuration in the cluster-samples-operator deployment.

Comment 1 Gabe Montero 2021-09-22 23:33:29 UTC
Yes, my PR https://github.com/openshift/cluster-samples-operator/pull/394 (bz 2002368) broke proxy support.

I'll take this one

The fix is pretty straightforward, but I also want to add e2e-aws-proxy as an option on the samples operator repo to vet the fix.

Comment 2 Gabe Montero 2021-09-23 13:21:04 UTC
UPDATE:

Adding e2e-aws-proxy to the samples operator repo proved interesting: the rehearsal job passes without my fix to samples.

I had copied the e2e-aws-proxy job def from openshift/builder

All this makes me wonder whether the build API team has been setting up e2e-aws-proxy correctly.

I'm going to try to compare the setup for periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy
with our PR tests and see if I can find the difference.

While the fix is pretty straightforward, I'd still rather vet it with the equivalent run of periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy
before it merges.

I've copied David (now owner of samples) and Adam (build api team lead) for awareness.

Comment 3 Gabe Montero 2021-09-23 14:21:30 UTC
I've compared our e2e-aws-proxy job configs with https://raw.githubusercontent.com/openshift/release/master/ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml and see no relevant diffs.

I've asked the DPTP help desk over in #forum-testplatform.

In the interim, I'm going to move forward with my openshift/release PR and create some dummy/test PRs in openshift/builder and openshift/cluster-samples-operator, separate from my fix PR, to cross-reference and vet.

Comment 4 Gabe Montero 2021-09-23 14:37:53 UTC
Petr Muller responded; hopefully our e2e-aws-proxy def can be fixed.

Comment 5 Gabe Montero 2021-09-27 17:02:11 UTC
I'll inspect the must-gather of the next few periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy runs later today / tomorrow and confirm the samples operator is not an issue.

The key element will be that samples does not bootstrap as Removed, which is what was occurring before, but as the default, Managed.

Bootstrapping as Removed led to a bunch of sig-builds tests failing, because the language sample imagestreams those tests depended on did not exist.

Comment 7 Gabe Montero 2021-09-27 18:06:21 UTC
Looking at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy the job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy/1442546511056474112 *MIGHT* have our fix here.

But rather than sort out the commit levels now, I'll let it finish, inspect, and we'll go from there.

Comment 13 errata-xmlrpc 2022-03-10 16:12:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056