+++ This bug was initially created as a clone of Bug #1934643 +++

Description of problem:

With multiple external gateways, we use ECMP to load-balance egress cluster traffic across them. However, if one of the gateways goes down, traffic sent to it is forwarded into a black hole. To fix this, most external routers support Bidirectional Forwarding Detection (BFD), and OVN now supports it as well. We can configure BFD on our ECMP routes, and as long as the gateway also runs BFD, we can detect gateway routing failures quickly and remove those routes in OVN.
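To illustrate the idea in the description, here is a minimal Python sketch of ECMP next-hop selection that skips gateways whose BFD session is not up. It is not OVN's implementation: the function name `pick_gateway`, the SHA-256 flow hash, the second gateway address, and the sample flow tuple are all assumptions made for the example.

```python
import hashlib


def pick_gateway(flow, gateways, bfd_status):
    """Return a next hop for `flow`, considering only gateways whose BFD
    session is reported as "up". Returns None if no gateway is usable."""
    live = [gw for gw in sorted(gateways) if bfd_status.get(gw) == "up"]
    if not live:
        return None
    # Hash the flow tuple so packets of one flow keep using the same gateway
    # for as long as the set of live gateways does not change.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return live[int.from_bytes(digest[:4], "big") % len(live)]


# 10.0.0.163 is the gateway from this report; 10.0.0.164 is hypothetical.
gateways = ["10.0.0.163", "10.0.0.164"]
# Hypothetical flow: (src IP, dst IP, IP protocol, src port, dst port).
flow = ("10.131.0.32", "203.0.113.7", 6, 40000, 443)

status = {"10.0.0.163": "up", "10.0.0.164": "up"}
print(pick_gateway(flow, gateways, status))  # one of the two gateways

status["10.0.0.164"] = "down"
print(pick_gateway(flow, gateways, status))  # prints 10.0.0.163
```

Without the BFD filter, flows hashed to the failed gateway would keep being sent there, which is the black-hole case described above.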
Verified on 4.7.0-0.nightly-2021-03-21-181832

sh-4.4# ovn-nbctl find bfd
_uuid               : 702f7fe9-7174-41a3-b6a0-a942ef8dcb6d
detect_mult         : []
dst_ip              : "10.0.0.163"
external_ids        : {}
logical_port        : rtoe-GR_ip-10-0-151-254.compute.internal
min_rx              : []
min_tx              : []
options             : {}
status              : up

sh-4.4# ovn-nbctl find Logical_Router_Static_Route bfd!=[]
_uuid               : b8e24084-22ba-437f-b9dd-def41948da1f
bfd                 : 702f7fe9-7174-41a3-b6a0-a942ef8dcb6d
external_ids        : {}
ip_prefix           : "10.131.0.32"
nexthop             : "10.0.0.163"
options             : {ecmp_symmetric_reply="true"}
output_port         : rtoe-GR_ip-10-0-151-254.compute.internal
policy              : src-ip

[root@ip-10-0-0-163 ec2-user]# vtysh -c "show bfd peers"
BFD Peers:
    peer 10.0.151.254
        ID: 1
        Remote ID: 828088889
        Status: up
        Uptime: 1 minute(s), 14 second(s)
        Diagnostics: ok
        Remote diagnostics: ok
        Local timers:
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo transmission interval: disabled
        Remote timers:
            Receive interval: 1000ms
            Transmission interval: 1000ms
            Echo transmission interval: 0ms

dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
23:34:39.683048 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:34:40.330692 IP ip-10-0-151-254.compute.internal.49152 > ip-10-0-0-163.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:34:40.613264 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:34:41.276886 IP ip-10-0-151-254.compute.internal.49152 > ip-10-0-0-163.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:34:41.403328 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:34:42.162088 IP ip-10-0-151-254.compute.internal.49152 > ip-10-0-0-163.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:34:42.163370 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24

sh-4.4# tcpdump -i ens5 port '(3784 or 3785)'
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), capture size 262144 bytes
23:36:23.697664 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:23.836563 IP ip-10-0-151-254.compute.internal.49152 > ip-10-0-0-163.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:24.457697 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:24.794813 IP ip-10-0-151-254.compute.internal.49152 > ip-10-0-0-163.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:25.327817 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:25.674966 IP ip-10-0-151-254.compute.internal.49152 > ip-10-0-0-163.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:26.238108 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24

sh-4.4# tcpdump -i br-ex port '(3784 or 3785)'
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes
23:36:52.261362 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:53.171569 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:53.931627 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
23:36:54.891804 IP ip-10-0-0-163.compute.internal.49152 > ip-10-0-151-254.ca-central-1.compute.internal.bfd-control: BFDv1, Control, State Up, Flags: [none], length: 24
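The manual check above (find the BFD rows, then confirm each BFD-enabled static route points at a session that is up) could be scripted roughly as follows. This is a sketch only: it assumes the `ovn-nbctl find` output has been captured as text with one `key : value` pair per line and a blank line between records, and the helper names `parse_records` and `routes_without_live_bfd` are made up for the example. The sample text is copied from the verification output in this comment.

```python
def parse_records(text):
    """Split `key : value` output into a list of dicts, one per record.
    Records are assumed to be separated by blank lines."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only; values may contain colons themselves.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip().strip('"')
    if current:
        records.append(current)
    return records


def routes_without_live_bfd(bfd_text, route_text):
    """Return static-route records whose `bfd` column does not reference
    a BFD row with status "up"."""
    status_by_uuid = {r["_uuid"]: r.get("status") for r in parse_records(bfd_text)}
    return [r for r in parse_records(route_text)
            if status_by_uuid.get(r.get("bfd")) != "up"]


BFD_OUTPUT = """
_uuid : 702f7fe9-7174-41a3-b6a0-a942ef8dcb6d
dst_ip : "10.0.0.163"
logical_port : rtoe-GR_ip-10-0-151-254.compute.internal
status : up
"""

ROUTE_OUTPUT = """
_uuid : b8e24084-22ba-437f-b9dd-def41948da1f
bfd : 702f7fe9-7174-41a3-b6a0-a942ef8dcb6d
ip_prefix : "10.131.0.32"
nexthop : "10.0.0.163"
options : {ecmp_symmetric_reply="true"}
policy : src-ip
"""

print(routes_without_live_bfd(BFD_OUTPUT, ROUTE_OUTPUT))  # prints []
```

An empty list matches the verified state here, where the only BFD-enabled route references a session whose status is up.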
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.4 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0957