Bug 1868158 - internal load balancer on azure times out intermittently
Summary: internal load balancer on azure times out intermittently
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1869788 1873000
Depends On:
Blocks: 1845414 1881143 1869790
 
Reported: 2020-08-11 20:26 UTC by David Eads
Modified: 2020-09-21 15:39 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1881143
Environment:
Last Closed:
Target Upstream Version:




Links
Github openshift machine-config-operator pull 2011 (closed): Bug 1868158: gcp, azure: Handle azure vips similar to GCP (last updated 2020-09-19 18:44:15 UTC)
Github openshift machine-config-operator pull 2061 (closed): Bug 1868158: machine-config-daemon-pull: Use the MCO image (last updated 2020-09-18 15:03:15 UTC)

Description David Eads 2020-08-11 20:26:56 UTC
Created in installer component at request of ffranz from SPLAT.

The Azure internal load balancer has frequent, short, intermittent timeouts. During these times, direct access to the kube-apiserver endpoints themselves (the pods) does not experience any disruption.

We know this based on the check-endpoints data contained in must-gather. It makes TCP connections to the kube-apiserver directly and via the load balancer every second and records the results.
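Roughly speaking, the check is equivalent to the following sketch (placeholder addresses and a plain Go probe only, not the actual check-endpoints code):

package main

import (
	"fmt"
	"net"
	"time"
)

// probe makes a single TCP connection to addr and reports whether it
// succeeded within the timeout.
func probe(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// Placeholder targets: the internal load balancer frontend and one
	// kube-apiserver endpoint reached directly.
	targets := []string{
		"api-int.example.cluster:6443", // via the internal load balancer
		"10.0.0.5:6443",                // a kube-apiserver endpoint directly
	}
	for range time.Tick(time.Second) { // one attempt per target every second
		for _, t := range targets {
			start := time.Now()
			if err := probe(t, 2*time.Second); err != nil {
				fmt.Printf("%s FAIL after %v: %v\n", t, time.Since(start), err)
			} else {
				fmt.Printf("%s ok in %v\n", t, time.Since(start))
			}
		}
	}
}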

One example is here: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1293090769866854400. In must-gather.tar, under registry-svc-ci-openshift-org-ocp-4-6-2020-08-11-074107-sha256-0eb469557fb7b90527742e6604a3be64bf5727db4993dd0a00aa9dd58154c5a1/namespaces/openshift-apiserver/controlplane.operator.openshift.io/podnetworkconnectivitychecks, the **-api-internal.yaml files show numerous short-lived outages, while the endpoints themselves are reliable.

We are adding an e2e test (https://github.com/openshift/origin/pull/25291) to highlight these problems more clearly so we can count them effectively, but we have already seen this behavior in several failed promotion jobs.

It often shows up as a failure to install.

Comment 1 Fabiano Franz 2020-08-11 20:47:11 UTC
I'm asking ARO whether they have ever faced this and, if so, where it is being tracked.

Comment 2 Elana Hashman 2020-08-11 21:19:00 UTC
Is it possible you are encountering this issue? https://docs.microsoft.com/en-us/azure/load-balancer/concepts#limitations

> Outbound flow from a backend VM to a frontend of an internal Load Balancer will fail.

Basically, if you have a Kubernetes master behind an ILB and you try to use the ILB to route traffic back to the originating VM, it will fail. Hence, with three masters behind the load balancer, you can see connections from a master fail about 1/3 of the time (whenever the ILB happens to pick the originating VM as the backend).

In ARO 3.11, we avoided this issue by pinning all master traffic to the local apiserver: https://github.com/openshift/openshift-azure/issues/1632
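One way to do that kind of pinning, sketched very roughly below (placeholder host name; this is not the actual openshift-azure change), is to resolve the internal API host to the local node so master-originated traffic never goes back through the ILB frontend:

package main

import (
	"log"
	"os"
)

func main() {
	// Sketch only: append an /etc/hosts entry on each master so the internal
	// API host name resolves to the local apiserver instead of the ILB VIP.
	// The host name is a placeholder.
	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString("127.0.0.1 api-int.example.cluster\n"); err != nil {
		log.Fatal(err)
	}
}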

Comment 3 David Eads 2020-08-12 12:25:14 UTC
Based on a recommendation from @casey (https://coreos.slack.com/archives/CB48XQ4KZ/p1597234926141800?thread_ts=1597234292.136800&cid=CB48XQ4KZ), I'm assigning this to sttts to work with casey on figuring out how to apply something like gcp-routes.service to this.
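To illustrate the idea (a minimal sketch under assumed addresses and ports, not the actual machine-config-operator change): on each master, outbound traffic destined for the ILB frontend IP is rewritten to go to the local kube-apiserver instead, so it never hairpins through the load balancer.

package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Placeholders: the internal load balancer frontend IP and the address
	// where the local kube-apiserver is reachable on this master.
	const (
		ilbVIP   = "10.0.0.4"
		localAPI = "10.0.0.5:6443"
	)
	// DNAT locally generated traffic aimed at the ILB VIP to the local
	// apiserver, so master-to-API connections bypass the Azure ILB hairpin.
	args := []string{
		"-t", "nat", "-A", "OUTPUT",
		"-p", "tcp", "-d", ilbVIP, "--dport", "6443",
		"-j", "DNAT", "--to-destination", localAPI,
	}
	if out, err := exec.Command("iptables", args...).CombinedOutput(); err != nil {
		log.Fatalf("iptables failed: %v: %s", err, out)
	}
	fmt.Printf("redirected %s:6443 to %s\n", ilbVIP, localAPI)
}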

Comment 4 Abhinav Dahiya 2020-08-18 23:22:32 UTC
The installer team cannot fix the Azure platform restriction, and it seems like the apiserver and SDN teams will have to help fix this issue. So, moving to the networking team to help provide a fix.

Comment 5 zhaozhanqi 2020-08-19 01:31:35 UTC
This seems to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1825219.

Comment 6 Stefan Schimanski 2020-08-28 07:59:42 UTC
*** Bug 1873000 has been marked as a duplicate of this bug. ***

Comment 11 Stefan Schimanski 2020-09-11 14:49:11 UTC
*** Bug 1869788 has been marked as a duplicate of this bug. ***

