See https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1305155086455934976

1. All kube-apiservers are functioning.
2. The kube-apiserver (host network) can reach the internal/external load balancer consistently.
3. The openshift-apiserver (pod network) can access each kube-apiserver by direct IP.
4. The openshift-apiserver (pod network) canNOT access the internal/external load balancer consistently.

Our connectivity check in must-gather shows connection interruptions to the load balancers on roughly every third attempt.

To summarize: host network to LB access works. Pod network to LB access does not work reliably.
The logic around https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/gcp/files/opt-libexec-openshift-gcp-routes-sh.yaml#L60 is missing from the Azure script. Its absence matches the symptoms: non-local clients (e.g. the pod network namespace) cannot contact the LB about 1/3 of the time.
Verified this bug on 4.6.0-0.nightly-2020-09-22-051033.

rsh into the openshift-apiserver pod, then curl the external LB 20 times. All 20 attempts succeeded.

    oc rsh -n openshift-apiserver apiserver-74b6465579-6zdpc
    for i in `seq 20` ; do curl -I --connect-timeout 5 https://api.zzhaoazu.qe.azure.devcluster.openshift.com:6443 -k ; done

Moving this bug to 'verified'.
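For anyone repeating the verification, the loop above can be wrapped so that it reports a failure count instead of relying on reading the curl output by eye. This is only a rough sketch: the helper name `count_failures` and the host name in the usage comment are hypothetical and not part of any OpenShift tooling.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs
# exited with a non-zero status.
count_failures() {
    attempts=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Discard the command's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Usage from inside the pod (placeholder host name, substitute the cluster's API URL):
#   count_failures 20 curl -sS -k -I --connect-timeout 5 https://api.example.invalid:6443
```

On an affected cluster, the symptoms described earlier in this bug suggest the count would be well above zero (interruptions on roughly every third attempt); on a fixed build it should print 0.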
*** Bug 1877880 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196