Bug 1845661
| Summary: | Proxy CI Jobs fail during bootstrap - cannot pull release image | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Release | Assignee: | ewolinet |
| Status: | CLOSED WORKSFORME | QA Contact: | Wei Sun <wsun> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.5 | CC: | amcdermo, aos-bugs, dwalsh, eparis, ewolinet, gpei, jokerman, nagrawal, pruan, scuppett, sponnaga, tsweeney, wking, xtian |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-29 21:39:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ben Parees
2020-06-09 18:51:16 UTC
Miloslav could you take a look at this. I'm guessing it might be in c/image and might be fixed with a recent update?

BZ dropped my comment: it's possible the proxy is actually not accessible for some reason, I don't know why the job would have suddenly stopped functioning properly in the last few weeks, but it's certainly worth sanity checking that the proxy is actually being stood up and accessible. Eric + Trevor have the most familiarity w/ the job itself I believe.

Job-detail is [1] for folks who want the Spyglass rendering or access to other assets. Gathered Squid logs are empty [2], but maybe those are a relic from when we were launching a new proxy for each job. Looks like we maybe never landed that, and instead have a long-running proxy somewhere in the CI cluster. Maybe:

  $ oc -n ci-test-ewolinet get pods
  NAME                        READY   STATUS    RESTARTS   AGE
  egress-proxy-7-wkts2        1/1     Running   0          5h
  egress-proxy-tls-12-9rbzc   1/1     Running   0          5h

And maybe the failure was a networking/node hiccup in the CI cluster that killed/evicted the proxy pods at some point? I don't think they're set up to be highly available or anything.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.5/93
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.5/93/artifacts/e2e-aws-proxy/proxy/

job has been consistently failing since May 21st, so it's not a hiccup in the proxy. We currently spin up a proxy per job, the change that was still pending [1] was to create a blackhole in the ec2 private subnet so that there wasn't direct internet access, only via the proxy. The proxies in the ci cluster were only being used by myself and Daneyon for local testing. It may be that we need to add to the `noProxy` field so that the bootstrap node can pull images, however I thought we were already doing that...
[1] https://github.com/openshift/release/pull/5308

Is there anything actionable to do on this BZ at the moment? Given OCP 4.5 is closing in 5 days, I'd like to move this to OCP 4.6 as suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1845661#c7. Any objections?

I'd at least like to see feedback from QE that they've done successful 4.5 testing in proxied environments, since right now we have zero other proof that it still works.

QE is able to install a proxy-enabled OCP cluster; I did so in this run: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/98134/console

(In reply to ewolinet from comment #8)
> It may be that we need to add to the `noProxy` field so that the bootstrap
> node can pull images, however I thought we were already doing that...

The error reports that the proxy is not reachable. The build.log contains entries like

> Proxy is available at http://$user@pass:$host:3128/

so, shouldn’t the proxy be reachable from the bootstrap node in principle? If the proxy is not reachable, adding noProxy entries won’t change that.

on_qa is not helpful here. This bug is to fix the broken CI job. It's great that QE has confirmed the product works, but the CI jobs still need to be fixed and they are currently still failing, so this cannot be verified. I will update the BZ title to make that clear.

And I agree w/ Scott that this is not the install team's responsibility (since QE has confirmed that the product works with proxies, it's no team's responsibility unfortunately. Again, we just need someone who's willing/able to go investigate what has gone wrong in the CI job itself and fix it). Historically that has been Trevor or Eric W. since they were involved in defining the job in the first place (not that it should have to fall on them forever).

Moving this to low priority for the installer team. We will try to get to it as soon as possible.
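The distinction drawn above between `noProxy` entries and proxy reachability can be illustrated with a minimal Python sketch. It is not taken from the CI job or the installer. `bypasses_proxy` uses simplified suffix matching as an assumption (real no-proxy handling also covers CIDR ranges and ports and varies by client), and `proxy_reachable` only checks that a TCP connection can be opened. Both function names and all hostnames in the usage note are hypothetical.

```python
import socket


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if host matches an entry in a comma-separated no-proxy list.

    Assumption: simplified exact/suffix matching only. CIDR ranges and
    port-qualified entries are not handled here.
    """
    host = host.lower().rstrip(".")
    for entry in no_proxy.split(","):
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*" or host == entry or host.endswith("." + entry):
            return True
    return False


def proxy_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the proxy endpoint can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

With a hypothetical proxy at `proxy.example.internal:3128`, a destination for which `bypasses_proxy(...)` returns False still has to go through the proxy, so if `proxy_reachable("proxy.example.internal", 3128)` is False, the pull fails no matter what else is listed in `noProxy`.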
This currently appears to be an issue w/ the proxy ec2 machine not coming up (we're not getting any logs from it), so a job configuration issue, not an installer issue. Eric+Trevor are continuing to look into it. I don't know what component we can logically assign it to, but I'll assign it to Eric.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I think Trevor and Eric have made a lot of progress on fixing the proxy job in the last 2 weeks.

I believe the issue is with the proxy installation itself. My update to how we set up the proxy for ci passed its rehearsal jobs, however part of what needed to change was the ignition file. We should update our proxy ci tests to use that workflow instead of the upi-aws template and that should fix the issue being seen.

Hi Ben, QE could install the proxy cluster successfully. Could you please help check whether CI works well for this bug, and whether we could close it? Thanks!

proxy job itself continues to fail:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.6
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.5

Lots of changes recently with Proxy CI, mostly around CI infra and the step workflow. A recent job passed for both 4.6 [1] and 4.5 [2]. There are a few e2e test-cases being skipped, but they have trackers:

* Bug 1882486, possibly setting proxy stuff in the test pod.
  * [sig-network][Feature:Router] The HAProxy router should respond with 503 to unrecognized hosts [Suite:openshift/conformance/parallel]
  * [sig-network][Feature:Router] The HAProxy router should serve routes that were created from an ingress [Suite:openshift/conformance/parallel]
  * [sig-network][Feature:Router] The HAProxy router should set Forwarded headers appropriately [Suite:openshift/conformance/parallel]
  * [sig-network][Feature:Router] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Suite:openshift/conformance/parallel]
  * [sig-imageregistry][Feature:ImageAppend] Image append should create images by appending them [Suite:openshift/conformance/parallel]
  * [sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]
* Bug 1882845, possibly setting proxy stuff in the test pod.
  * [sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
* Bug 1882850, possibly setting proxy stuff in the test pod.
  * [sig-network] Internal connectivity for TCP and UDP on ports 9000-9999 is allowed [Suite:openshift/conformance/parallel]
* Bug 1882556, git:// is not proxied; we should use https://
  * [sig-builds][Feature:Builds] build have source revision metadata started build should contain source revision information [Suite:openshift/conformance/parallel]
* Bug 1882853
  * [sig-arch] Managed cluster should should expose cluster services outside the cluster [Suite:openshift/conformance/parallel]
* Bug 1882855
  * [sig-cluster-lifecycle][Feature:Machines] Managed cluster should have machine resources [Suite:openshift/conformance/parallel]
* Bug 1861746, fixed on the 23rd, maybe unrelated to proxy
  * [sig-cli] oc adm must-gather runs successfully for audit logs [Suite:openshift/conformance/parallel]

I'm going to close this one WORKSFORME, and if we have outstanding proxy issues that aren't covered in the above bugs, we can open new tickets about those issues.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1310980018502897664
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-aws-proxy/1310006408917291008