Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1845661

Summary: Proxy CI Jobs fail during bootstrap - cannot pull release image
Product: OpenShift Container Platform
Component: Release
Version: 4.5
Target Release: 4.6.0
Target Milestone: ---
Status: CLOSED WORKSFORME
Last Closed: 2020-09-29 21:39:57 UTC
Severity: high
Priority: low
Reporter: Ben Parees <bparees>
Assignee: ewolinet
QA Contact: Wei Sun <wsun>
CC: amcdermo, aos-bugs, dwalsh, eparis, ewolinet, gpei, jokerman, nagrawal, pruan, scuppett, sponnaga, tsweeney, wking, xtian
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug

Description Ben Parees 2020-06-09 18:51:16 UTC
Description of problem:

bootstrap fails to pull images in a proxy environment:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.5/93/artifacts/e2e-aws-proxy/installer/log-bundle-20200606223021.tar

shows

Jun 06 22:06:41 ip-10-0-10-69 release-image-download.sh[1773]: Error: error pulling image "registry.svc.ci.openshift.org/ocp/release:4.5": unable to pull registry.svc.ci.openshift.org/ocp/release:4.5: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ocp/release:4.5: error pinging docker registry registry.svc.ci.openshift.org: Get https://registry.svc.ci.openshift.org/v2/: proxyconnect tcp: dial tcp 3.85.43.45:3128: i/o timeout
Jun 06 22:06:41 ip-10-0-10-69 release-image-download.sh[1773]: Pull failed. Retrying registry.svc.ci.openshift.org/ocp/release:4.5...

in bootstrap/journals/release-image.log



Starting with the containers team since it's podman that's failing to pull the image; perhaps it's not handling the proxy config properly?
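
For anyone triaging a run like this, here is a minimal sketch of how the failing pull could be reproduced and narrowed down from the bootstrap node, assuming the release-image service exports the usual HTTP_PROXY/HTTPS_PROXY/NO_PROXY variables (the proxy host below is a placeholder, not this job's endpoint):

$ export HTTPS_PROXY=http://<proxy-host>:3128        # placeholder, not the job's real proxy
$ export NO_PROXY=localhost,127.0.0.1,.cluster.local

# The error above is a proxyconnect timeout, i.e. the TCP connection to the
# proxy itself never completes, so check the proxy before blaming the registry:
$ curl -sS -o /dev/null -w '%{http_code}\n' -x "$HTTPS_PROXY" https://registry.svc.ci.openshift.org/v2/

# podman honors the same proxy environment variables, so this should fail in
# the same way the release-image-download.sh retry loop does:
$ podman pull registry.svc.ci.openshift.org/ocp/release:4.5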

Comment 1 Tom Sweeney 2020-06-10 14:51:19 UTC
Miloslav, could you take a look at this?  I'm guessing it might be in c/image and might already be fixed by a recent update.

Comment 3 Ben Parees 2020-06-11 02:28:24 UTC
BZ dropped my comment: it's possible the proxy is actually not accessible for some reason. I don't know why the job would have suddenly stopped functioning properly in the last few weeks, but it's certainly worth sanity-checking that the proxy is actually being stood up and accessible.  Eric and Trevor have the most familiarity with the job itself, I believe.

Comment 4 W. Trevor King 2020-06-11 03:54:31 UTC
Job-detail is [1] for folks who want the Spyglass rendering or access to other assets.  Gathered Squid logs are empty [2], but maybe those are a relic from when we were launching a new proxy for each job.  Looks like we maybe never landed that, and instead have a long-running proxy somewhere in the CI cluster.  Maybe:

$ oc -n ci-test-ewolinet get pods
NAME                        READY   STATUS    RESTARTS   AGE
egress-proxy-7-wkts2        1/1     Running   0          5h
egress-proxy-tls-12-9rbzc   1/1     Running   0          5h

And maybe the failure was a networking/node hiccup in the CI cluster that killed/evicted the proxy pods at some point?  I don't think they're set up to be highly available or anything.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.5/93
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.5/93/artifacts/e2e-aws-proxy/proxy/
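
If the job really does depend on long-running proxy pods like these, a quick health check along the following lines might help (namespace and pod name are taken from the listing above; whether the job actually points at them is an assumption):

$ oc -n ci-test-ewolinet get pods -o wide                        # are the proxy pods still scheduled and Ready?
$ oc -n ci-test-ewolinet logs egress-proxy-7-wkts2 --tail=50     # Squid access log should show CONNECT lines if traffic arrives
$ oc -n ci-test-ewolinet get events --sort-by=.lastTimestamp     # look for evictions or node problems around the failure time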

Comment 5 Ben Parees 2020-06-11 04:08:18 UTC
The job has been consistently failing since May 21st, so it's not a hiccup in the proxy.

Comment 8 ewolinet 2020-06-11 14:08:29 UTC
We currently spin up a proxy per job; the change that was still pending [1] was to create a blackhole in the EC2 private subnet so that there wasn't direct internet access, only access via the proxy.
The proxies in the CI cluster were only being used by Daneyon and me for local testing.

It may be that we need to add to the `noProxy` field so that the bootstrap node can pull images; however, I thought we were already doing that...


[1] https://github.com/openshift/release/pull/5308
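
For reference, the `noProxy` change mentioned above would land in the install-config proxy stanza; a rough sketch, with placeholder hosts and CIDRs rather than the job's real values:

$ cat >> install-config.yaml <<'EOF'
proxy:
  httpProxy: http://<user>:<password>@<proxy-host>:3128
  httpsProxy: http://<user>:<password>@<proxy-host>:3128
  noProxy: .cluster.local,.svc,169.254.169.254,10.0.0.0/16
EOF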

Comment 9 Tom Sweeney 2020-06-15 19:39:34 UTC
Is there anything actionable to do on this BZ at the moment?  Given OCP 4.5 is closing in 5 days, I'd like to move this to OCP 4.6 as suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1845661#c7.  Any objections?

Comment 10 Ben Parees 2020-06-15 19:50:57 UTC
I'd at least like to see feedback from QE that they've done successful 4.5 testing in proxied environments, since right now we have zero other proof that it still works.

Comment 11 Peter Ruan 2020-06-16 18:13:18 UTC
QE is able to install proxy-enabled OCP; I was able to do so here: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/98134/console

Comment 13 Miloslav Trmač 2020-06-16 19:28:49 UTC
(In reply to ewolinet from comment #8)
> It may be that we need to add to the `noProxy` field so that the bootstrap
> node can pull images, however I thought we were already doing that...

The error reports that the proxy is not reachable. The build.log contains entries like
> Proxy is available at http://$user@pass:$host:3128/

so, shouldn’t the proxy be reachable from the bootstrap node in principle? If the proxy is not reachable, adding noProxy entries won’t change that.
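
One way to separate the two failure modes from the bootstrap node, assuming shell access (the proxy host below is a placeholder):

# Failure mode 1: the proxy itself is unreachable (matches the proxyconnect timeout above):
$ timeout 5 bash -c '</dev/tcp/<proxy-host>/3128' && echo "proxy port reachable" || echo "proxy port NOT reachable"

# Failure mode 2: the proxy is up but this host should have been in noProxy;
# bypassing the proxy for just the registry would then succeed (this only works
# while the subnet still has direct egress, i.e. before any blackhole change):
$ NO_PROXY=registry.svc.ci.openshift.org curl -sS -o /dev/null -w '%{http_code}\n' https://registry.svc.ci.openshift.org/v2/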

Comment 19 Ben Parees 2020-06-17 16:48:51 UTC
on_qa is not helpful here.  This bug is to fix the broken CI job.

It's great that QE has confirmed the product works, but the CI jobs still need to be fixed, and since they are currently still failing, this cannot be verified.

I will update the BZ title to make that clear.

Comment 20 Ben Parees 2020-06-17 16:50:21 UTC
And I agree with Scott that this is not the install team's responsibility (since QE has confirmed that the product works with proxies, it's no team's responsibility, unfortunately).  Again, we just need someone who's willing and able to investigate what has gone wrong in the CI job itself and fix it.  Historically that has been Trevor or Eric W., since they were involved in defining the job in the first place (not that it should have to fall on them forever).

Comment 21 Abhinav Dahiya 2020-06-17 16:56:44 UTC
Moving this to low priority for the installer team. We will try to get to it as soon as possible.

Comment 23 Ben Parees 2020-06-18 13:45:27 UTC
This currently appears to be an issue with the proxy EC2 machine not coming up (we're not getting any logs from it), so it's a job configuration issue, not an installer issue.  Eric and Trevor are continuing to look into it.

I don't know what component we can logically assign it to, but I'll assign it to Eric.
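
A couple of hypothetical AWS-side checks for whether the proxy EC2 instance ever came up (the tag name/value and instance id are placeholders; the job may tag the machine differently):

$ aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=<cluster-name>-proxy" \
    --query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' \
    --output table
$ aws ec2 get-console-output --instance-id <instance-id> --output text    # boot log, if the instance got far enough to write one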

Comment 27 Andrew McDermott 2020-07-09 12:14:49 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 28 Ben Parees 2020-07-09 13:32:24 UTC
I think Trevor and Eric have made a lot of progress on fixing the proxy job in the last 2 weeks.

Comment 29 ewolinet 2020-07-09 13:57:28 UTC
I believe the issue is with the proxy installation itself. My update to how we set up the proxy for CI passed its rehearsal jobs; however, part of what needed to change was the ignition file. We should update our proxy CI tests to use that workflow instead of the upi-aws template, and that should fix the issue being seen.
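
Purely as an illustration of the kind of Ignition payload such a proxy machine might carry (the spec version, unit contents, and image are assumptions, not the CI template's actual file):

$ cat > proxy-ignition.json <<'EOF'
{
  "ignition": { "version": "3.1.0" },
  "systemd": {
    "units": [
      {
        "name": "squid.service",
        "enabled": true,
        "contents": "[Unit]\nDescription=Squid proxy in a container\nAfter=network-online.target\n\n[Service]\nExecStart=/usr/bin/podman run --rm -p 3128:3128 <squid-image>\nRestart=always\n\n[Install]\nWantedBy=multi-user.target\n"
      }
    ]
  }
}
EOF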

Comment 32 Wei Sun 2020-09-29 07:41:56 UTC
Hi Ben, QE could install the proxy cluster successfully. Could you please help check whether CI works well for this bug, and whether we could close it? Thanks!

Comment 34 W. Trevor King 2020-09-29 21:39:57 UTC
Lots of changes recently with Proxy CI, mostly around CI infra and the step workflow.  A recent job passed for both 4.6 [1] and 4.5 [2].  There are a few e2e test-cases being skipped, but they have trackers:

* Bug 1882486, possibly setting proxy stuff in the test pod.
  * [sig-network][Feature:Router] The HAProxy router should respond with 503 to unrecognized hosts [Suite:openshift/conformance/parallel]
  * [sig-network][Feature:Router] The HAProxy router should serve routes that were created from an ingress [Suite:openshift/conformance/parallel]
  * [sig-network][Feature:Router] The HAProxy router should set Forwarded headers appropriately [Suite:openshift/conformance/parallel]
  * [sig-network][Feature:Router] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Suite:openshift/conformance/parallel]

  * [sig-imageregistry][Feature:ImageAppend] Image append should create images by appending them [Suite:openshift/conformance/parallel]
  * [sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]

* Bug 1882845, possibly setting proxy stuff in the test pod.
  * [sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]

* Bug 1882850, possibly setting proxy stuff in the test pod.
  * [sig-network] Internal connectivity for TCP and UDP on ports 9000-9999 is allowed [Suite:openshift/conformance/parallel]

* Bug 1882556, git:// is not proxied; we should use https:// (an illustrative rewrite is sketched after this list)
  * [sig-builds][Feature:Builds] build have source revision metadata  started build should contain source revision information [Suite:openshift/conformance/parallel]

* Bug 1882853
  * [sig-arch] Managed cluster should should expose cluster services outside the cluster [Suite:openshift/conformance/parallel]

* Bug 1882855
  * [sig-cluster-lifecycle][Feature:Machines] Managed cluster should have machine resources [Suite:openshift/conformance/parallel]

* Bug 1861746, fixed on the 23rd, maybe unrelated to proxy
  * [sig-cli] oc adm must-gather runs successfully for audit logs [Suite:openshift/conformance/parallel]
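
For the git:// item above (Bug 1882556), one illustrative way to force https:// without touching individual build configs would be a git URL rewrite; whether the actual fix took this form is not confirmed here:

$ git config --global url."https://github.com/".insteadOf git://github.com/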

I'm going to close this one WORKSFORME, and if we have outstanding proxy issues that aren't covered in the above bugs, we can open new tickets about those issues.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1310980018502897664
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-aws-proxy/1310006408917291008