Bug 1868158

Summary: internal load balancer on azure times out intermittently
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Reporter: David Eads <deads>
Assignee: Stefan Schimanski <sttts>
QA Contact: Michael Nguyen <mnguyen>
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.6.0
CC: adahiya, anbhat, bbennett, ehashman, ffranz, geliu, gmarkley, jhixson, miabbott, nstielau, tnozicka, wking
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1881143 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:27:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1845414, 1869790, 1881143

Description David Eads 2020-08-11 20:26:56 UTC
Created in installer component at request of ffranz from SPLAT.

The Azure internal load balancer has frequent, short, intermittent timeouts. During these periods, direct access to the kube-apiserver endpoints themselves (the pods) doesn't experience any disruption.

We know this based on the check-endpoints data contained in must-gather. It makes TCP connections to the kube-apiserver, both directly and via the load balancer, every second and records the results.
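The check-endpoints probing described above can be approximated from a shell for illustration. This is a hypothetical sketch, not the actual check-endpoints implementation; the host and port arguments are placeholders:

```shell
#!/bin/bash
# Hypothetical approximation of per-second TCP reachability probing:
# attempt one TCP connection and record the timestamped result.
# The host/port values below are placeholders, not values from this bug.

probe_once() {
    local host=$1 port=$2
    if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "$(date -u +%FT%TZ) ${host}:${port} reachable"
    else
        echo "$(date -u +%FT%TZ) ${host}:${port} unreachable"
    fi
}

# Example: a closed local port reports unreachable.
probe_once 127.0.0.1 1
```

Running `probe_once` in a one-second loop against both the load-balancer VIP and each endpoint IP would reproduce the kind of side-by-side reachability record the must-gather data contains.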

One example is https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1293090769866854400. In must-gather.tar, under registry-svc-ci-openshift-org-ocp-4-6-2020-08-11-074107-sha256-0eb469557fb7b90527742e6604a3be64bf5727db4993dd0a00aa9dd58154c5a1/namespaces/openshift-apiserver/controlplane.operator.openshift.io/podnetworkconnectivitychecks, the **-api-internal.yaml files show numerous short-lived outages, while the endpoints themselves are reliable.

We are adding an e2e test (https://github.com/openshift/origin/pull/25291) to surface these problems more clearly so we can count them effectively; we've already seen this behavior in several failed promotion jobs.

It often shows up as a failure to install.

Comment 1 Fabiano Franz 2020-08-11 20:47:11 UTC
I'm asking ARO whether they have ever faced this and, if so, where it's being tracked.

Comment 2 Elana Hashman 2020-08-11 21:19:00 UTC
Is it possible you are encountering this issue? https://docs.microsoft.com/en-us/azure/load-balancer/concepts#limitations

> Outbound flow from a backend VM to a frontend of an internal Load Balancer will fail.

Basically, if you have a Kubernetes master behind an ILB and you try to use the ILB to route traffic back to the originating VM, it will fail. With three masters behind the ILB, that is why you can see connections fail about 1/3 of the time.

In ARO 3.11, we avoided this issue by pinning all master traffic to the local apiserver: https://github.com/openshift/openshift-azure/issues/1632
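For illustration only, the "pin master traffic to the local apiserver" idea can be expressed as an OUTPUT-chain DNAT that rewrites the ILB VIP to the node's own apiserver address before the packet leaves the host. The addresses, the port, and the DNAT approach here are assumptions for the sketch, not what openshift-azure actually shipped; by default the script only prints the rule it would apply:

```shell
#!/bin/bash
# Sketch of pinning a master's apiserver traffic to its local instance.
# ILB_VIP and LOCAL_APISERVER are placeholder values; with DRY_RUN=1
# (the default) the iptables command is printed, not executed.

ILB_VIP="10.0.0.8"          # assumed ILB frontend IP (placeholder)
LOCAL_APISERVER="10.0.0.6"  # assumed local apiserver IP (placeholder)
DRY_RUN=${DRY_RUN:-1}

# Rewrite locally originated traffic for the VIP to the local apiserver,
# so it never hairpins through the Azure ILB.
rule=(iptables -t nat -A OUTPUT -p tcp -d "$ILB_VIP" --dport 6443
      -j DNAT --to-destination "${LOCAL_APISERVER}:6443")

if [ "$DRY_RUN" = 1 ]; then
    echo "would run: ${rule[*]}"
else
    "${rule[@]}"
fi
```

A DNAT in the OUTPUT chain affects only traffic originated by the node itself, which is exactly the hairpin case; traffic from other nodes still goes through the ILB normally.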

Comment 3 David Eads 2020-08-12 12:25:14 UTC
Based on a recommendation from @casey (https://coreos.slack.com/archives/CB48XQ4KZ/p1597234926141800?thread_ts=1597234292.136800&cid=CB48XQ4KZ), I'm assigning this to sttts to work with Casey on figuring out how to apply something like gcp-routes.service here.

Comment 4 Abhinav Dahiya 2020-08-18 23:22:32 UTC
The installer team cannot fix this Azure platform restriction; it seems the apiserver and SDN teams will have to help fix this issue. Moving to the networking team to provide a fix.

Comment 5 zhaozhanqi 2020-08-19 01:31:35 UTC
This seems to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1825219

Comment 6 Stefan Schimanski 2020-08-28 07:59:42 UTC
*** Bug 1873000 has been marked as a duplicate of this bug. ***

Comment 11 Stefan Schimanski 2020-09-11 14:49:11 UTC
*** Bug 1869788 has been marked as a duplicate of this bug. ***

Comment 14 Micah Abbott 2020-09-23 17:13:30 UTC
@Fabiano were you able to verify this BZ while testing BZ#1878794?

Comment 15 Micah Abbott 2020-09-24 17:43:31 UTC
On a cluster running 4.6.0-0.nightly-2020-09-24-111253, I was able to confirm that the new `openshift-azure-routes` service has landed on the masters and is operating successfully. I will mark this VERIFIED in a few days unless Fabiano comes back to say it wasn't properly fixed.

```
$ oc get clusterversion                                                                                                                                                                                                                            
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                                                                                                                                        
version   4.6.0-0.nightly-2020-09-24-111253   True        False         80m     Cluster version is 4.6.0-0.nightly-2020-09-24-111253

$ oc get nodes                                                                                                                                                                                                                                     
NAME                                                STATUS   ROLES    AGE    VERSION                                                                                                                                                                                                                                          
ci-ln-2h9bjmk-002ac-mp4xd-master-0                  Ready    master   103m   v1.19.0+8a39924                                                                                                                                                                                                                                  
ci-ln-2h9bjmk-002ac-mp4xd-master-1                  Ready    master   104m   v1.19.0+8a39924                                                                                                                                                                                                                                  
ci-ln-2h9bjmk-002ac-mp4xd-master-2                  Ready    master   102m   v1.19.0+8a39924                                                                                                                                                                                                                                  
ci-ln-2h9bjmk-002ac-mp4xd-worker-centralus1-24tsn   Ready    worker   91m    v1.19.0+8a39924                                                                                                                                                                                                                                  
ci-ln-2h9bjmk-002ac-mp4xd-worker-centralus2-z7ghd   Ready    worker   91m    v1.19.0+8a39924                                                                                                                                                                                                                                  
ci-ln-2h9bjmk-002ac-mp4xd-worker-centralus3-cphdn   Ready    worker   91m    v1.19.0+8a39924                                                                                                                                                                                                                                  

$ oc debug node/ci-ln-2h9bjmk-002ac-mp4xd-master-0                                                                                                                                                                                                 
Starting pod/ci-ln-2h9bjmk-002ac-mp4xd-master-0-debug ...                                                                                                                                                                                                                                                                     
To use host binaries, run `chroot /host`                                                                                                                                                                                                                                                                                      
Pod IP: 10.0.0.6                                                                                                                                                                                                                                                                                                              
If you don't see a command prompt, try pressing enter.                                                                                                                                                                                                                                                                        
sh-4.4# chroot /host
sh-4.4# head /opt/libexec/openshift-azure-routes.sh                                                                                                                                                                                                                                                                           
#!/bin/bash                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                              
# Prevent hairpin traffic when the apiserver is up                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                              
# As per the Azure documentation (https://docs.microsoft.com/en-us/azure/load-balancer/concepts#limitations),                    
# if a backend is load-balanced to itself, then the traffic will be dropped.
#
# This is because the L3LB does DNAT, so while the outgoing packet has a destination
# IP of the VIP, the incoming load-balanced packet has a destination IP of the
# host. That means that it "sees" a syn with the source and destination
sh-4.4# systemctl status openshift-azure-routes
● openshift-azure-routes.service - Work around Azure load balancer hairpin
   Loaded: loaded (/etc/systemd/system/openshift-azure-routes.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2020-09-24 16:10:34 UTC; 1h 29min ago
  Process: 44091 ExecStart=/bin/bash /opt/libexec/openshift-azure-routes.sh start (code=exited, status=0/SUCCESS)
 Main PID: 44091 (code=exited, status=0/SUCCESS)
      CPU: 47ms

Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: processing v4 vip 10.0.0.8
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: done applying vip rules
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 47ms CPU time
sh-4.4# journalctl -u openshift-azure-routes               
-- Logs begin at Thu 2020-09-24 15:44:44 UTC, end at Thu 2020-09-24 17:40:10 UTC. --
Sep 24 15:53:28 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 15:53:28 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[1899]: done applying vip rules
Sep 24 15:53:28 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 61ms CPU time
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[14270]: processing v4 vip 10.0.0.8
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[14270]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[14270]: done applying vip rules
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 51ms CPU time
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[23866]: removing stale vip 10.0.0.8 for local clients
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[23866]: done applying vip rules
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 43ms CPU time
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[28256]: processing v4 vip 10.0.0.8
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[28256]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[28256]: done applying vip rules
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 53ms CPU time
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[40284]: removing stale vip 10.0.0.8 for local clients
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[40284]: done applying vip rules
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 43ms CPU time
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: processing v4 vip 10.0.0.8
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: done applying vip rules
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 47ms CPU time
```
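Based on the log messages above ("ensuring rule for 10.0.0.8 for internal clients", "removing stale vip 10.0.0.8 for local clients"), the workaround appears to install local routing state so that traffic to the VIP is delivered to the node itself while its apiserver is healthy. The following is a hypothetical dry-run sketch of that mechanism, not the shipped script; the routing table and exact route form are assumptions:

```shell
#!/bin/bash
# Hypothetical sketch of the hairpin workaround: when the local
# apiserver is healthy, deliver traffic for the ILB VIP locally so it
# never round-trips through the Azure ILB. Dry run: commands are
# echoed, not executed. VIP and table name are placeholder values.

VIP="10.0.0.8"   # ILB frontend IP as it appears in the logs above
TABLE="local"    # assumed target routing table

run() { echo "would run: $*"; }   # dry-run wrapper

ensure_vip() {
    # Route the VIP to the host itself for locally originated traffic.
    run ip route add local "${VIP}/32" dev lo table "$TABLE"
}

remove_stale_vip() {
    # Undo local delivery when the local apiserver is no longer healthy.
    run ip route del local "${VIP}/32" dev lo table "$TABLE"
}

ensure_vip
```

The alternating "ensuring rule"/"removing stale vip" log lines suggest the service is triggered periodically and reconciles this state against the current health of the local apiserver.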

Comment 16 Micah Abbott 2020-10-01 14:06:01 UTC
No updates for about a week; marking VERIFIED per comment #15.

Comment 19 errata-xmlrpc 2020-10-27 16:27:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196