Created in installer component at request of ffranz from SPLAT.
The Azure internal load balancer has frequent, short, intermittent timeouts. During these periods, direct access to the kube-apiserver endpoints themselves (the pods) does not experience any disruption.
We know this based on the check-endpoints data contained in must-gather. It makes TCP connections to the kube-apiserver both directly and via the load balancer every second and records the results.
One example is https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1293090769866854400: in must-gather.tar, under registry-svc-ci-openshift-org-ocp-4-6-2020-08-11-074107-sha256-0eb469557fb7b90527742e6604a3be64bf5727db4993dd0a00aa9dd58154c5a1/namespaces/openshift-apiserver/controlplane.operator.openshift.io/podnetworkconnectivitychecks, the **-api-internal.yaml files show numerous short-lived outages via the load balancer, while the endpoints themselves are reliable.
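For context, here is a minimal sketch of the kind of per-second TCP probe check-endpoints performs. This is not the actual check-endpoints code, and both addresses are hypothetical placeholders; the real results land in the PodNetworkConnectivityCheck objects collected by must-gather.

    // probe.go: dial the apiserver directly and via the LB once per second and log results.
    package main

    import (
    	"fmt"
    	"net"
    	"time"
    )

    func main() {
    	targets := []string{
    		"10.0.0.4:6443",   // hypothetical: a kube-apiserver endpoint, reached directly
    		"10.0.0.100:6443", // hypothetical: the api-internal load balancer frontend
    	}
    	for range time.Tick(time.Second) {
    		for _, addr := range targets {
    			start := time.Now()
    			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
    			if err != nil {
    				fmt.Printf("%s failure to %s after %v: %v\n", start.Format(time.RFC3339), addr, time.Since(start), err)
    				continue
    			}
    			conn.Close()
    			fmt.Printf("%s success to %s in %v\n", start.Format(time.RFC3339), addr, time.Since(start))
    		}
    	}
    }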
We are adding an e2e test (https://github.com/openshift/origin/pull/25291) to surface these problems more clearly so we can count them effectively, but we have already seen this behavior in several failed promotion jobs.
It often shows up as a failure to install.
I'm asking ARO whether they have ever faced this and, if so, where it is being tracked.
Is it possible you are encountering this issue? https://docs.microsoft.com/en-us/azure/load-balancer/concepts#limitations
> Outbound flow from a backend VM to a frontend of an internal Load Balancer will fail.
Basically, if you have a Kubernetes master behind an ILB and you try to use the ILB to route traffic back to the originating VM, the connection will fail. Hence, with three masters behind the load balancer, you can see connections from a master fail roughly 1/3 of the time (whenever the ILB hashes the flow back to the VM that originated it).
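A quick way to check this from a master would be to dial the internal LB frontend repeatedly and count failures. A rough sketch (the frontend address is a hypothetical placeholder, not taken from any actual cluster):

    // hairpin_check.go: count how many TCP connections to the ILB frontend fail.
    package main

    import (
    	"fmt"
    	"net"
    	"time"
    )

    func main() {
    	const frontend = "10.0.0.100:6443" // hypothetical api-internal ILB frontend
    	const attempts = 300
    	failures := 0
    	for i := 0; i < attempts; i++ {
    		conn, err := net.DialTimeout("tcp", frontend, 2*time.Second)
    		if err != nil {
    			failures++
    			continue
    		}
    		conn.Close()
    	}
    	// With three masters behind the ILB, roughly one in three flows should
    	// hash back to the originating VM and fail if the limitation applies.
    	fmt.Printf("%d/%d connections to %s failed (~%.0f%%)\n",
    		failures, attempts, frontend, 100*float64(failures)/float64(attempts))
    }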
In ARO 3.11, we avoided this issue by pinning all master traffic to the local apiserver: https://github.com/openshift/openshift-azure/issues/1632
Based on a recommendation from @casey (https://coreos.slack.com/archives/CB48XQ4KZ/p1597234926141800?thread_ts=1597234292.136800&cid=CB48XQ4KZ), I'm assigning this to sttts to work with casey to figure out how to apply something like gcp-routes.service here.
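For illustration only, one possible shape of such a node-local redirect (in the spirit of gcp-routes.service, not its actual implementation, and not a proposed fix) would be to DNAT a master's own traffic for the ILB frontend to its local apiserver so it never hairpins through the load balancer. Both IPs below are hypothetical placeholders, and this needs root and iptables on the node:

    // redirect_sketch.go: rewrite locally-originated traffic for the LB VIP to the local apiserver.
    package main

    import (
    	"log"
    	"os/exec"
    )

    func main() {
    	const lbFrontend = "10.0.0.100"  // hypothetical api-internal ILB frontend IP
    	const localAPI = "10.0.0.4:6443" // hypothetical: this master's own apiserver

    	// Add a nat OUTPUT rule: traffic this node sends to VIP:6443 goes to the local apiserver instead.
    	cmd := exec.Command("iptables",
    		"-t", "nat", "-A", "OUTPUT",
    		"-p", "tcp", "-d", lbFrontend, "--dport", "6443",
    		"-j", "DNAT", "--to-destination", localAPI)
    	if out, err := cmd.CombinedOutput(); err != nil {
    		log.Fatalf("iptables failed: %v: %s", err, out)
    	}
    	log.Printf("redirected %s:6443 to %s on this node", lbFrontend, localAPI)
    }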
The installer team cannot fix this Azure platform restriction, and it seems the apiserver and SDN teams will have to help fix this issue. So I'm moving this to the networking team to help provide a fix.
This seems to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1825219.
*** Bug 1873000 has been marked as a duplicate of this bug. ***
*** Bug 1869788 has been marked as a duplicate of this bug. ***