Bug 1875773
| Summary: | periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy provider requests timeout | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alberto <agarcial> |
| Component: | Installer | Assignee: | W. Trevor King <wking> |
| Installer sub component: | openshift-installer | QA Contact: | Gaoyun Pei <gpei> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | adahiya, aos-bugs, astoycos, deads, ewolinet, lszaszki, pkrupa, pmahajan, scuppett, sdodson, walters |
| Version: | 4.6 | Keywords: | Reopened |
| Target Milestone: | --- | ||
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-28 12:28:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Alberto
2020-09-04 10:41:43 UTC
The network edge team has agreed to maintain the proxy test jobs with assistance from Eric.

*** Bug 1876967 has been marked as a duplicate of this bug. ***

This is blocking the entire proxy job. Marking urgent.

The logs in https://bugzilla.redhat.com/show_bug.cgi?id=1875773#c0 contain several of the following errors:
E0904 07:18:04.613353 1 reconciler.go:236] ci-op-pvgw8zi5-f516a-g4tlf-master-0: error getting existing instances: RequestError: send request failed caused by: Post "https://ec2.us-west-2.amazonaws.com/": dial tcp 54.240.253.45:443: i/o timeout
This indicates reconciler.go is making an https call to the EC2 public api endpoint in region us-west-2. The e2e-aws-proxy job implementation requires all cluster-external calls to be proxied. The machine-api controller must implement the cluster-wide egress proxy feature so this call is proxied. Please see [1] for a reference implementation of this feature. I'm reassigning to the MCO team to complete the implementation.
[1] https://github.com/openshift/cluster-ingress-operator/pull/334/files

>This indicates reconciler.go is making an https call to the EC2 public api endpoint in region us-west-2. The e2e-aws-proxy job implementation requires all cluster-external calls to be proxied. The machine-api controller must implement the cluster-wide egress proxy feature so this call is proxied. Please see [1] for a reference implementation of this feature. I'm reassigning to the MCO team to complete the implementation.
There's an RFE for the machine API to honour the proxy in later releases. As of 4.6 and earlier, the machine API intentionally does not honour the proxy. The infra topology setup is assumed to let requests to the hosting cloud provider succeed. Moving this to the installer, as I assume they own this setup.
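
For readers unfamiliar with what "honouring the proxy" involves, here is a minimal Python sketch of the routing decision an HTTPS client makes from HTTPS_PROXY/NO_PROXY-style settings. It is not the machine-API or ingress-operator code (both are Go). The proxy URL, the NO_PROXY entries, and the function names `bypasses_proxy` and `proxy_for` are all hypothetical, and the suffix matching is a simplification of what real clients do.

```python
from typing import Optional
from urllib.parse import urlsplit


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if host matches a NO_PROXY entry (exact name or domain suffix)."""
    host = host.lower()
    for entry in no_proxy.split(","):
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*" or host == entry or host.endswith("." + entry):
            return True
    return False


def proxy_for(url: str, https_proxy: Optional[str], no_proxy: str) -> Optional[str]:
    """Return the proxy to use for an https request, or None for a direct dial."""
    host = urlsplit(url).hostname or ""
    if not https_proxy or bypasses_proxy(host, no_proxy):
        return None
    return https_proxy


# Hypothetical cluster-wide settings, for illustration only.
HTTPS_PROXY = "http://proxy.example.com:3128"
NO_PROXY = ".cluster.local,.svc,localhost,127.0.0.1"

ec2 = "https://ec2.us-west-2.amazonaws.com/"
internal = "https://kubernetes.default.svc/"

print(proxy_for(ec2, HTTPS_PROXY, NO_PROXY))       # external call: goes via the proxy
print(proxy_for(internal, HTTPS_PROXY, NO_PROXY))  # in-cluster call: dialed directly
print(proxy_for(ec2, None, NO_PROXY))              # no proxy configured: dialed directly
```

The last case is the one that matters here: a client with no proxy configuration dials the EC2 address directly, which is consistent with the `dial tcp ...:443: i/o timeout` errors if the test network only permits egress through the proxy.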
The proxy jobs are maintained by the routing / network edge team. Assigning to Dane to take another look.

*** Bug 1879633 has been marked as a duplicate of this bug. ***

*** Bug 1874914 has been marked as a duplicate of this bug. ***

Bringing back to 4.6, because the changes in the linked [1] have fixed EC2 access for the machine-API provider. Tests still aren't green, but the issue there is no longer machine-API-provider EC2 timeouts, so they'll need different bugs.
[1]: https://github.com/openshift/release/pull/11723

Example with a successful install [1]. No need for QE validation or errata attachment or anything for this CI-infra change, so I'll just move this to CLOSED WORKSFORME.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1308949438701506560

I'm reopening as the proxy jobs keep failing https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy with the same issue. For example in https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1315906418754195456 MCO
"failed\ncaused by: Post \"https://ec2.us-west-1.amazonaws.com/\": dial tcp 176.32.118.30:443: i/o timeout" "controller"="machine_controller" "name"="ci-op-lty3p9l4-f516a-mwj66-worker-us-west-1a-5chxw" "namespace"="openshift-machine-api"

Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.

Dropping priority. We're pretty close to picking a 4.6 GA release, and nobody is hounding me to have green CI from these proxy informers as a 4.6 blocker. We have some time to get it fixed before folks start getting worried about 4.7 breaking. Discussion around infra issues in internal Slack's #forum-proxy channel.

Proxy CI is still sad [1], although we're a lot better since the most recent infra recovery [2]. DPP should have the most recent infra in its reaper allowlist, so hopefully we don't break again. Example job [3] died on
[sig-imageregistry][Feature:ImageExtract] Image extract should extract content from an image [Suite:openshift/conformance/parallel]
fail [k8s.io/kubernetes.0/test/e2e/framework/pods.go:212]: wait for pod "append-test" to succeed
Expected success, but got an error:
<*errors.errorString | 0xc001e624c0>: {
s: "pod \"append-test\" failed with reason: \"\", message: \"\"",
}
pod "append-test" failed with reason: "", message: ""
That might be related to bug 1891759. But it's not this bug's "provider requests timeout", so marking this one WORKSFORME.
[1]: https://prow.ci.openshift.org/?job=periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1321245748595003392
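
Since this thread repeatedly has to tell the "provider requests timeout" signature apart from unrelated failures, here is a rough Python sketch for checking a log line against it. The pattern is only inferred from the two error lines quoted in this bug. The names `TIMEOUT_RE` and `provider_timeout` are hypothetical, and real job logs may format the message differently.

```python
import re
from typing import Dict, Optional

# Matches the AWS endpoint dial timeouts quoted in this bug. The quotes around
# the URL may be plain or backslash-escaped, depending on the logger.
TIMEOUT_RE = re.compile(
    r'Post \\?"https://(?P<host>[a-z0-9.-]+\.amazonaws\.com)/\\?": '
    r'dial tcp (?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+): i/o timeout'
)


def provider_timeout(line: str) -> Optional[Dict[str, str]]:
    """Return endpoint details if the line shows a provider request timeout, else None."""
    match = TIMEOUT_RE.search(line)
    return match.groupdict() if match else None


# The two provider errors quoted in the comments, plus an unrelated test failure.
reconciler_line = (
    'E0904 07:18:04.613353 1 reconciler.go:236] ci-op-pvgw8zi5-f516a-g4tlf-master-0: '
    'error getting existing instances: RequestError: send request failed caused by: '
    'Post "https://ec2.us-west-2.amazonaws.com/": dial tcp 54.240.253.45:443: i/o timeout'
)
controller_line = (
    r'"failed\ncaused by: Post \"https://ec2.us-west-1.amazonaws.com/\": '
    r'dial tcp 176.32.118.30:443: i/o timeout" "controller"="machine_controller"'
)
unrelated_line = 'pod "append-test" failed with reason: "", message: ""'

for line in (reconciler_line, controller_line, unrelated_line):
    print(provider_timeout(line))
```

Only the first two lines match. The `append-test` failure returns `None`, which is the distinction drawn when closing this bug.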