Bug 1868158
Summary: | internal load balancer on azure times out intermittently | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | David Eads <deads> | |
Component: | Machine Config Operator | Assignee: | Stefan Schimanski <sttts> | |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.6 | CC: | adahiya, anbhat, bbennett, ehashman, ffranz, geliu, gmarkley, jhixson, miabbott, nstielau, tnozicka, wking | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
: | 1881143 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:27:34 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Bug Blocks: | 1845414, 1869790, 1881143 |
Description
David Eads
2020-08-11 20:26:56 UTC
I'm asking the ARO team whether they have ever hit this and, if so, where it is being tracked.

Is it possible you are encountering this issue? https://docs.microsoft.com/en-us/azure/load-balancer/concepts#limitations

> Outbound flow from a backend VM to a frontend of an internal Load Balancer will fail.

Basically, if you have a Kubernetes master behind an ILB and you try to use the ILB to route traffic back to the originating VM, it will fail. Hence, you can see connections fail about 1/3 of the time (presumably because, with three masters behind the ILB, roughly one connection in three is balanced back to the master that opened it). In ARO 3.11, we avoided this issue by pinning all master traffic to the local apiserver: https://github.com/openshift/openshift-azure/issues/1632

Based on a recommendation from @casey (https://coreos.slack.com/archives/CB48XQ4KZ/p1597234926141800?thread_ts=1597234292.136800&cid=CB48XQ4KZ), I'm assigning this to sttts to work with casey on how to apply something like gcp-routes.service here.

The installer team cannot fix the Azure platform restriction, and it seems the apiserver and SDN teams will have to help fix this issue, so I'm moving this to the networking team to help provide a fix.

This seems to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1825219

*** Bug 1873000 has been marked as a duplicate of this bug. ***

*** Bug 1869788 has been marked as a duplicate of this bug. ***

@Fabiano were you able to verify this BZ while testing BZ#1878794?

On a cluster running 4.6.0-0.nightly-2020-09-24-111253, I was able to confirm that the new `openshift-azure-routes` service has landed on the masters and is operating successfully. I will mark this VERIFIED in a few days unless Fabiano comes back to say it wasn't properly fixed.
```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-24-111253   True        False         80m     Cluster version is 4.6.0-0.nightly-2020-09-24-111253
$ oc get nodes
NAME                                                STATUS   ROLES    AGE    VERSION
ci-ln-2h9bjmk-002ac-mp4xd-master-0                  Ready    master   103m   v1.19.0+8a39924
ci-ln-2h9bjmk-002ac-mp4xd-master-1                  Ready    master   104m   v1.19.0+8a39924
ci-ln-2h9bjmk-002ac-mp4xd-master-2                  Ready    master   102m   v1.19.0+8a39924
ci-ln-2h9bjmk-002ac-mp4xd-worker-centralus1-24tsn   Ready    worker   91m    v1.19.0+8a39924
ci-ln-2h9bjmk-002ac-mp4xd-worker-centralus2-z7ghd   Ready    worker   91m    v1.19.0+8a39924
ci-ln-2h9bjmk-002ac-mp4xd-worker-centralus3-cphdn   Ready    worker   91m    v1.19.0+8a39924
$ oc debug node/ci-ln-2h9bjmk-002ac-mp4xd-master-0
Starting pod/ci-ln-2h9bjmk-002ac-mp4xd-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# head /opt/libexec/openshift-azure-routes.sh
#!/bin/bash
# Prevent hairpin traffic when the apiserver is up
# As per the Azure documentation (https://docs.microsoft.com/en-us/azure/load-balancer/concepts#limitations),
# if a backend is load-balanced to itself, then the traffic will be dropped.
#
# This is because the L3LB does DNAT, so while the outgoing packet has a destination
# IP of the VIP, the incoming load-balanced packet has a destination IP of the
# host. That means that it "sees" a syn with the source and destination
sh-4.4# systemctl status openshift-azure-routes
● openshift-azure-routes.service - Work around Azure load balancer hairpin
   Loaded: loaded (/etc/systemd/system/openshift-azure-routes.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2020-09-24 16:10:34 UTC; 1h 29min ago
  Process: 44091 ExecStart=/bin/bash /opt/libexec/openshift-azure-routes.sh start (code=exited, status=0/SUCCESS)
 Main PID: 44091 (code=exited, status=0/SUCCESS)
      CPU: 47ms

Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: processing v4 vip 10.0.0.8
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: done applying vip rules
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 47ms CPU time
sh-4.4# journalctl -u openshift-azure-routes
-- Logs begin at Thu 2020-09-24 15:44:44 UTC, end at Thu 2020-09-24 17:40:10 UTC. --
Sep 24 15:53:28 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 15:53:28 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[1899]: done applying vip rules
Sep 24 15:53:28 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 61ms CPU time
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[14270]: processing v4 vip 10.0.0.8
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[14270]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[14270]: done applying vip rules
Sep 24 15:57:44 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 51ms CPU time
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[23866]: removing stale vip 10.0.0.8 for local clients
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[23866]: done applying vip rules
Sep 24 16:01:12 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 43ms CPU time
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[28256]: processing v4 vip 10.0.0.8
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[28256]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[28256]: done applying vip rules
Sep 24 16:02:18 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 53ms CPU time
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[40284]: removing stale vip 10.0.0.8 for local clients
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[40284]: done applying vip rules
Sep 24 16:08:30 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 43ms CPU time
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: processing v4 vip 10.0.0.8
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: ensuring rule for 10.0.0.8 for internal clients
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 openshift-azure-routes[44091]: done applying vip rules
Sep 24 16:10:34 ci-ln-2h9bjmk-002ac-mp4xd-master-0 systemd[1]: openshift-azure-routes.service: Consumed 47ms CPU time
```

No updates for about a week; marking VERIFIED per comment #15

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
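For readers wondering where the "about 1/3 of the time" figure in the description comes from, the following is a minimal sketch, not anything taken from the fix itself. It assumes the ILB spreads new connections uniformly across its backends, which is a simplification of Azure's actual distribution, and that a connection fails exactly when it is balanced back to the master that opened it. The function names `hairpin_fraction` and `simulate_hairpin` are hypothetical and exist only for this illustration.

```python
import random


def hairpin_fraction(num_backends: int) -> float:
    """Expected share of connections that land back on the originating backend.

    Assumption: the load balancer picks a backend uniformly at random.
    """
    if num_backends < 1:
        raise ValueError("need at least one backend")
    return 1.0 / num_backends


def simulate_hairpin(num_backends: int, attempts: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same fraction.

    Backend 0 plays the originating master; a pick of 0 models a hairpinned
    (and therefore dropped) connection.
    """
    rng = random.Random(seed)
    origin = 0
    failures = sum(
        1 for _ in range(attempts) if rng.randrange(num_backends) == origin
    )
    return failures / attempts


if __name__ == "__main__":
    masters = 3  # the cluster in the verification output has three masters
    print(f"expected:  {hairpin_fraction(masters):.3f}")
    print(f"simulated: {simulate_hairpin(masters, 30_000):.3f}")
```

Under these assumptions the expected fraction for three masters is one third, consistent with the intermittent timeouts reported here. Pinning master traffic to the local apiserver, as described for ARO 3.11 above, sidesteps the problem because those connections no longer pass through the ILB at all.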