Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1875773

Summary: periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy provider requests timeout
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Version: 4.6
Keywords: Reopened
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: Alberto <agarcial>
Assignee: W. Trevor King <wking>
QA Contact: Gaoyun Pei <gpei>
CC: adahiya, aos-bugs, astoycos, deads, ewolinet, lszaszki, pkrupa, pmahajan, scuppett, sdodson, walters
Last Closed: 2020-10-28 12:28:32 UTC
Type: Bug

Description Alberto 2020-09-04 10:41:43 UTC
Description of problem:

Spin-off of https://bugzilla.redhat.com/show_bug.cgi?id=1874914

To address 2 - periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy:
The machine API can't reach AWS, and its requests time out. This is likely because the infra pre-created for the test is misconfigured and does not let the cloud requests succeed.

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1301773744699609088/artifacts/e2e-aws-proxy/gather-extra/pods/openshift-machine-api_machine-api-controllers-64f45bf95d-rq6jw_machine-controller.log

https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy

Comment 1 Scott Dodson 2020-09-04 13:04:13 UTC
The network edge team has agreed to maintain the proxy test jobs with assistance from Eric.

Comment 2 Alberto 2020-09-08 15:34:57 UTC
*** Bug 1876967 has been marked as a duplicate of this bug. ***

Comment 3 David Eads 2020-09-08 15:41:59 UTC
This is blocking the entire proxy job.  Marking urgent.

Comment 4 Daneyon Hansen 2020-09-08 21:48:50 UTC
The logs in https://bugzilla.redhat.com/show_bug.cgi?id=1875773#c0 contain several of the following errors:

E0904 07:18:04.613353       1 reconciler.go:236] ci-op-pvgw8zi5-f516a-g4tlf-master-0: error getting existing instances: RequestError: send request failed
caused by: Post "https://ec2.us-west-2.amazonaws.com/": dial tcp 54.240.253.45:443: i/o timeout

This indicates reconciler.go is making an HTTPS call to the EC2 public API endpoint in region us-west-2. The e2e-aws-proxy job implementation requires all cluster-external calls to be proxied. The machine-api controller must implement the cluster-wide egress proxy feature so this call is proxied. Please see [1] for a reference implementation of this feature. I'm reassigning to the MCO team to complete the implementation.

[1] https://github.com/openshift/cluster-ingress-operator/pull/334/files
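
To illustrate the behaviour requested above, here is a minimal Python sketch (not the machine-api controller's code, which is written in Go) of how a client could choose between a direct connection and a cluster-wide egress proxy based on the conventional HTTPS_PROXY and NO_PROXY environment variables. The proxy URL, the NO_PROXY entries, and the `select_proxy` helper are hypothetical, and the NO_PROXY matching is simplified to exact-host and domain-suffix checks.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit


def select_proxy(url: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the proxy URL to use for `url`, or None for a direct connection.

    Simplified: only <SCHEME>_PROXY and exact-host or domain-suffix
    NO_PROXY entries are handled (no CIDR ranges, no ports).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    for entry in env.get("NO_PROXY", "").split(","):
        entry = entry.strip().lower().lstrip(".")
        if entry and (host == entry or host.endswith("." + entry)):
            return None  # destination is exempt from proxying

    return env.get(f"{parts.scheme.upper()}_PROXY") or None


# Hypothetical values, for illustration only.
env = {
    "HTTPS_PROXY": "http://proxy.example.com:3128",
    "NO_PROXY": ".cluster.local,.svc,localhost",
}

print(select_proxy("https://ec2.us-west-2.amazonaws.com/", env))
print(select_proxy("https://kubernetes.default.svc/", env))
```

Under these assumed settings, the EC2 endpoint from the log is routed through the proxy, while an in-cluster name covered by NO_PROXY is contacted directly.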

Comment 6 Alberto 2020-09-09 07:42:43 UTC
> This indicates reconciler.go is making an HTTPS call to the EC2 public API endpoint in region us-west-2. The e2e-aws-proxy job implementation requires all cluster-external calls to be proxied. The machine-api controller must implement the cluster-wide egress proxy feature so this call is proxied. Please see [1] for a reference implementation of this feature. I'm reassigning to the MCO team to complete the implementation.

There's an RFE for the machine API to honour the proxy in later releases. As of <= 4.6, the machine API intentionally does not honour the proxy. The infra topology is assumed to let requests to the hosting cloud provider succeed. Moving this to the installer, as I assume they own this setup.
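
The assumption that the infra lets direct requests to the cloud provider succeed can be checked with a plain TCP probe. The following is a minimal sketch, assuming it runs from a host or pod inside the test infrastructure; the `can_connect` helper is hypothetical and not part of any OpenShift component.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a direct TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, and DNS failures
        return False


# Example call, using the endpoint from the log in comment 4.
# The result depends entirely on where this is run:
#   can_connect("ec2.us-west-2.amazonaws.com", 443)
```

A False result from inside the cluster network would be consistent with the "dial tcp ... i/o timeout" errors reported by the machine controller.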

Comment 7 Scott Dodson 2020-09-09 12:29:10 UTC
The proxy jobs are maintained by the routing / network edge team.

Comment 8 Andrew McDermott 2020-09-09 16:13:38 UTC
Assigning to Dane to take another look.

Comment 10 Alberto 2020-09-18 07:58:10 UTC
*** Bug 1879633 has been marked as a duplicate of this bug. ***

Comment 11 Alberto 2020-09-22 14:55:02 UTC
*** Bug 1874914 has been marked as a duplicate of this bug. ***

Comment 14 W. Trevor King 2020-09-24 23:10:08 UTC
Bringing back to 4.6, because the changes in the linked pull request [1] have fixed EC2 access for the machine-API provider.  Tests still aren't green, but the issue there is no longer machine-API-provider EC2 timeouts, so they'll need different bugs.

[1]: https://github.com/openshift/release/pull/11723

Comment 15 W. Trevor King 2020-09-24 23:12:13 UTC
Example with a successful install [1].  No need for QE validation or errata attachment or anything for this CI-infra change, so I'll just move this to CLOSED WORKSFORME.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1308949438701506560

Comment 16 Lukasz Szaszkiewicz 2020-10-14 09:28:17 UTC
I'm reopening as the proxy job keeps failing with the same issue: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy

For example, in https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1315906418754195456 MCO reports:

"failed\ncaused by: Post \"https://ec2.us-west-1.amazonaws.com/\": dial tcp 176.32.118.30:443: i/o timeout" "controller"="machine_controller" "name"="ci-op-lty3p9l4-f516a-mwj66-worker-us-west-1a-5chxw" "namespace"="openshift-machine-api"
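
To tell this bug's "provider requests timeout" failure apart from other failures in a job's logs, the log lines can be scanned for the dial-timeout pattern. Below is a minimal sketch, assuming the errors look like the two fragments quoted in this bug (with or without backslash-escaped quotes); the `count_dial_timeouts` helper and its regular expression are hypothetical and may need adjusting for other log formats.

```python
import re
from collections import Counter
from typing import Iterable

# Matches: Post "https://<endpoint>/": dial tcp <ip>:<port>: i/o timeout
# The quotes around the URL may be backslash-escaped in structured logs.
TIMEOUT_RE = re.compile(
    r'Post \\?"(?P<url>https://[^"\\]+)\\?": '
    r'dial tcp (?P<ip>[0-9.]+):(?P<port>\d+): i/o timeout'
)


def count_dial_timeouts(lines: Iterable[str]) -> Counter:
    """Count 'dial tcp ... i/o timeout' errors per endpoint URL."""
    counts: Counter = Counter()
    for line in lines:
        for match in TIMEOUT_RE.finditer(line):
            counts[match.group("url")] += 1
    return counts


# The two error fragments quoted in this bug (comments 4 and 16).
sample = [
    'caused by: Post "https://ec2.us-west-2.amazonaws.com/": dial tcp 54.240.253.45:443: i/o timeout',
    r'"failed\ncaused by: Post \"https://ec2.us-west-1.amazonaws.com/\": dial tcp 176.32.118.30:443: i/o timeout"',
]
print(count_dial_timeouts(sample))
```

An empty result for a failed run would suggest the failure has a different cause and belongs in a separate bug.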

Comment 17 Stephen Cuppett 2020-10-14 11:45:32 UTC
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.

Comment 19 W. Trevor King 2020-10-22 17:53:11 UTC
Dropping priority.  We're pretty close to picking a 4.6 GA release, and nobody is hounding me to have green CI from these proxy informers as a 4.6 blocker.  We have some time to get it fixed before folks start getting worried about 4.7 breaking.  Discussion around infra issues is happening in the internal Slack's #forum-proxy channel.

Comment 20 W. Trevor King 2020-10-28 12:28:32 UTC
Proxy CI is still sad [1], although we're a lot better since the most recent infra recovery [2].  DPP should have the most recent infra in its reaper allowlist, so hopefully we don't break again.  Example job [3] died on 

  [sig-imageregistry][Feature:ImageExtract] Image extract should extract content from an image [Suite:openshift/conformance/parallel]
  fail [k8s.io/kubernetes.0/test/e2e/framework/pods.go:212]: wait for pod "append-test" to succeed
  Expected success, but got an error:
    <*errors.errorString | 0xc001e624c0>: {
        s: "pod \"append-test\" failed with reason: \"\", message: \"\"",
    }
    pod "append-test" failed with reason: "", message: ""

That might be related to bug 1891759.  But it's not this bug's "provider requests timeout", so marking this one WORKSFORME.

[1]: https://prow.ci.openshift.org/?job=periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1321245748595003392