Cause:
All OVN router policies for a pod were mistakenly removed whenever any one of its external gateways was removed.
Consequence:
When the external gateways of a pod that had multiple external gateways were scaled down, the pod would no longer send egress traffic to any of the still available external gateways. Instead, it would send its cluster egress traffic to the default gateway of the node.
Fix:
When external gateways are scaled down, only remove a pod's logical_router_policy on ovn_cluster_router when the pod has no external gateways left.
Result:
Pods now work correctly with external gateways when scaling down. Egress traffic is still sent to the remaining available external gateways and not to the node's default gateway.
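The fix described above can be illustrated with a minimal sketch. This is a hypothetical in-memory model written for illustration only, not the ovn-kubernetes code: the class ClusterRouterModel, its method names, and the pod IP 10.244.0.5 are all assumptions. It only shows the rule that a pod's policy is dropped once its last external gateway is gone, and not before.

```python
from collections import defaultdict


class ClusterRouterModel:
    """Hypothetical stand-in for the per-pod policy state on ovn_cluster_router."""

    def __init__(self):
        # pod IP -> set of external gateway IPs currently serving that pod
        self.gateways = defaultdict(set)
        # pod IPs that currently have a reroute policy installed
        self.policies = set()

    def add_gateway(self, pod_ip, gw_ip):
        self.gateways[pod_ip].add(gw_ip)
        self.policies.add(pod_ip)

    def remove_gateway(self, pod_ip, gw_ip):
        remaining = self.gateways[pod_ip]
        remaining.discard(gw_ip)
        # Fixed behaviour: remove the policy only when no gateways are left.
        # The buggy behaviour removed it unconditionally at this point.
        if not remaining:
            self.policies.discard(pod_ip)
            del self.gateways[pod_ip]

    def next_hops(self, pod_ip, default_gw):
        # With a policy in place, egress goes to the remaining external
        # gateways; otherwise it falls back to the default gateway.
        if pod_ip in self.policies:
            return sorted(self.gateways[pod_ip])
        return [default_gw]
```

In this model, a pod with gateways 172.0.0.4 and 172.0.0.5 keeps its policy after 172.0.0.4 is removed, so next_hops still returns 172.0.0.5. It falls back to the default gateway only after 172.0.0.5 is removed as well.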
Description of problem:
Consider a scenario where multiple pods act as external gateways for a pod, such as:
    ovn-worker1                      ovn-worker2
pod A----OVN--eth0 ----------- External GW Pod1 (172.0.0.4)
                      |
                      |----- External GW Pod2 (172.0.0.5)
                      |
                      |------ cluster default gateway (172.0.0.1)
pod A now has two ECMP routes, via 172.0.0.4 and 172.0.0.5. Now we delete External GW Pod1. pod A should still use 172.0.0.5 as its only remaining ECMP gateway. Instead, we see that deleting External GW Pod1 results in the deletion of the ovn_cluster_router policy for pod A. This causes traffic from pod A to go via the default cluster gateway (172.0.0.1).
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633