Bug 1996108
Summary: | Allow backwards compatibility of shared gateway mode to inject host-based routes into OVN | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Tim Rozet <trozet>
Component: | Networking | Assignee: | Tim Rozet <trozet>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | akaris, alosadag, bschmaus, bzhai, djuran, fbaudin, fherrman, fminafra, jseunghw, kkarampo, lars, mapandey, mavazque, mmethot, rcegan, rcernin, sasun, shishika, skanakal, surya, vfarias
Version: | 4.8 | Flags: | skanakal: needinfo-
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text:
Cause:
In 4.8, the default gateway mode for OVN-Kubernetes deployments moved from "local" to "shared". These modes affect ingress and egress traffic for a Kubernetes node. With local gateway mode, all traffic is routed through the host kernel networking stack before egressing from or ingressing into the cluster; in other words, the host routing table and iptables rules are evaluated on ingress/egress packets before they either enter OVN or are sent to the next hop outside the node. In shared gateway mode, ingress/egress traffic to/from OVN-networked pods bypasses the kernel routing table and is sent directly out of the NIC via OVS. The advantages of this include better performance, the ability to hardware offload, and less SNAT'ing. One of the disadvantages of shared mode (bypassing the kernel) is that a user's custom routes and iptables rules are not respected for egress traffic.
Consequence:
Users with custom host routing or iptables rules who upgrade to version 4.8 or newer may see unintended egress routing, because packets bypass the host kernel. There is currently no supported way to configure the equivalent routes inside OVN.
Fix:
Allow users to configure the gateway mode, so that users who depend on custom kernel routing rules to steer traffic keep that behavior. Note: migrating gateway modes in a cluster may result in a temporary ingress/egress traffic outage.
Result:
In releases 4.8 and 4.9, a config map needs to be created that signals the Cluster Network Operator (CNO) to switch the gateway mode:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gateway-mode-config
  namespace: openshift-network-operator
data:
  mode: "local"
immutable: true
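
On a cluster that is already running a fixed 4.8/4.9 version, one way to create this config map from the CLI is sketched below. This is an illustrative example only, assuming cluster-admin access; CNO then picks up the config map and switches the gateway mode:

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: gateway-mode-config
  namespace: openshift-network-operator
data:
  mode: "local"
immutable: true
EOF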
For releases 4.10 and later, a new API field called "routingViaHost" is exposed. Setting this in the CNO configuration routes egress traffic through the kernel before it leaves the node:
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      mtu: 1400
      genevePort: 6081
      gatewayConfig:
        routingViaHost: true
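
One hedged way to set this field from the CLI (a sketch, not an official procedure; assumes cluster-admin access and that the cluster is on 4.10 or later):

# enable routing egress traffic via the host kernel
oc patch networks.operator.openshift.io cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":true}}}}}'

# confirm the setting was applied
oc get networks.operator.openshift.io cluster \
  -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig}'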
Workaround:
Users on 4.8 or 4.9 in *shared* gateway mode who do not have the fixed 4.8/4.9 versions may attempt to use the config map mentioned above. However, after doing this, service traffic and egress firewall may no longer function correctly. To fix service traffic, a route matching the service CIDR needs to be manually deleted on each node. For example, assume a service CIDR of 10.96.0.0/16:
1. On shared gateway mode, there will be a route towards the br-ex interface like:
[root@ovn-worker ~]# ip route show 10.96.0.0/16
10.96.0.0/16 via 172.18.0.1 dev br-ex mtu 1400
2. In local gateway mode for versions < 4.10, this route needs to point to the ovn-k8s-mp0 interface. Manually remove the shared gateway route on each node (a CLI sketch for doing this across all nodes follows step 3):
[root@ovn-worker ~]# ip route del 10.96.0.0/16
[root@ovn-worker ~]# ip route show 10.96.0.0/16
[root@ovn-worker ~]#
3. Restart ovnkube-node. ovnkube-node will re-add the correct route, which should point towards ovn-k8s-mp0. For example:
[root@ovn-worker ~]# ip route show 10.96.0.0/16
10.96.0.0/16 via 10.244.1.1 dev ovn-k8s-mp0
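
For clusters where direct shell access to the nodes is impractical, the following is a rough sketch of performing steps 2 and 3 from the CLI. It is illustrative only: it assumes cluster-admin access, the default openshift-ovn-kubernetes labels, and that 10.96.0.0/16 is your service CIDR (check it first).

# confirm the service CIDR used in the example above
oc get network.config.openshift.io cluster -o jsonpath='{.status.serviceNetwork}'

# step 2: remove the shared-gateway service route on every node
for node in $(oc get nodes -o name); do
  oc debug "$node" -- chroot /host ip route del 10.96.0.0/16
done

# step 3: restart ovnkube-node so it re-adds the route via ovn-k8s-mp0
# (add --field-selector spec.nodeName=<node> to restrict this to one node)
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node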
Note, there is no current workaround to make egress firewall work. Users should upgrade to a fixed version to ensure egress firewall works.
Story Points: | --- | |
Clone Of: | | |
: | 2036977 (view as bug list) | Environment: |
Last Closed: | 2022-03-10 16:05:43 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2000007, 2036977 | |
Description
Tim Rozet
2021-08-20 14:58:13 UTC
A potential solution here is to provide a config flag, --force-host-network, where the following would happen:

1. route policies on the OVN DVR for all nodes are added to redirect to mp0 (forcing the traffic into the host to be routed from OVN pods)
2. modify the load balancers on the GR (host -> service traffic):
   a. host -> service traffic still goes via br-ex
   b. only the hairpin host network endpoint is rerouted back to the shared gw bridge and eventually the host.
   c. other host endpoints would need to be DNAT'ed to their host endpoints, then routes added to the GR to force traffic towards the DVR (snat'ed to the join subnet, 100.64.0.x)
   d. the route policy in the DVR (set in 1) forces the traffic to mp0, and the host routes the traffic and snats
   e. would need a route on the host for return traffic for the join subnet to go back into mp0

Now that I have a better understanding of the problem, ignore the comment 1 solution...and I've updated the description.

A potential workaround is to add custom routes to each OVN gateway router (GR). This needs to be done manually on each node. In order to add a custom route to each GR:

1. exec into the leader nbdb container for the ovnkube-master pod. You may need to try each one to find the leader. Inside the container you can issue ovn-nbctl show; if this command succeeds you are in the leader.
2. check the current routes in your node's gateway router. The name of the gateway router is always "GR_<node name>". For example, if my node name is "ovn-worker":
[root@ovn-control-plane ~]# ovn-nbctl lr-route-list GR_ovn-worker
IPv4 Routes
10.244.0.0/16 100.64.0.1 dst-ip
0.0.0.0/0 172.18.0.1 dst-ip rtoe-GR_ovn-worker
3. add your new route:
[root@ovn-control-plane ~]# ovn-nbctl lr-route-add GR_ovn-worker 192.168.1.0/24 192.168.0.3
[root@ovn-control-plane ~]# ovn-nbctl lr-route-list GR_ovn-worker
IPv4 Routes
192.168.1.0/24 192.168.0.3 dst-ip
10.244.0.0/16 100.64.0.1 dst-ip
0.0.0.0/0 172.18.0.1 dst-ip rtoe-GR_ovn-worker

There is no CRD or any API exposed today to add custom routes into OVN.

This seems like a critical problem: how are we supposed to access services on networks to which the host is directly attached? The workaround suggested here doesn't seem appropriate for directly connected host routes. We would need the OVN equivalent of:
ip route add 10.253.0.0/23 dev vlan210
The lr-route-add command doesn't seem like it will work because it requires a `nexthop` argument, but there won't be one for directly attached networks.
> One other possible solution is to have ovnkube read the host routing table for the subnet attached to br-ex, and automatically add those routes to each GR. I'm not sure we want to support this or not.
I don't think this would be sufficient: you don't want the routing table "for the subnet attached to br-ex"; you'd want the actual host routing table. Otherwise you still wouldn't have access to directly attached networks.
Tim has suggested via slack that we might be able to work around this with a route policy, such as:
ovn-nbctl lr-policy-add ovn_cluster_router 1004 \
'inport == "rtos-oct-03-26-compute" && ip4.dst == 10.253.0.0/23' \
reroute 10.130.0.2
This worked partially: we saw traffic egressing on ovn-k8s-mp0, which is what we wanted to see, but it didn't appear to leave the host. This also has the disadvantages that (a) it requires knowledge of the host IP on the target network, which ideally wouldn't be necessary, and (b) it requires one rule per host, which means we're stuck with manual work whenever we add a node.
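
If you experiment with a policy like the one above, it can be inspected and removed again from the nbdb leader; a small hedged sketch, using the router name, priority, and match from the example:

# list the logical router policies to verify the reroute rule is present
ovn-nbctl lr-policy-list ovn_cluster_router

# back the change out by deleting just this policy (priority and match as added above)
ovn-nbctl lr-policy-del ovn_cluster_router 1004 \
    'inport == "rtos-oct-03-26-compute" && ip4.dst == 10.253.0.0/23'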
David Guthrie has suggested adding the target network as an `additionalNetworks` configuration in the SDN config, so that pods get addresses directly on the target network. Does that seem like a viable solution? I guess it would solve the routing problem, but we would have to pay closer attention to IP address availability if we were consuming one address per container when we only need one per host.
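
For context, a rough, hypothetical sketch of what that suggestion could look like as an `additionalNetworks` entry in the cluster network operator config is shown below. The network name, namespace, master interface, and IPAM choice are all made up for illustration and would need to match the actual target network.

# hypothetical example only; note a merge patch replaces any existing additionalNetworks list
oc patch networks.operator.openshift.io cluster --type=merge -p '
{"spec":{"additionalNetworks":[{
  "name":"target-net",
  "namespace":"default",
  "type":"Raw",
  "rawCNIConfig":"{\"cniVersion\":\"0.3.1\",\"name\":\"target-net\",\"type\":\"macvlan\",\"master\":\"vlan210\",\"ipam\":{\"type\":\"dhcp\"}}"
}]}}'

Pods would then request an address on that network via the k8s.v1.cni.cncf.io/networks annotation; as noted above, this consumes one address per pod rather than one per host.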
We were able to get the connectivity we wanted by implementing the routing policy that Tim suggested...

ovn-nbctl lr-policy-add ovn_cluster_router 1004 \
    'inport == "rtos-oct-03-26-compute" && ip4.dst == 10.253.0.0/23' \
    reroute 10.130.0.2

...and then adding a NAT rule on the nodes:

iptables -t nat -I POSTROUTING 1 -s 10.128.0.0/14 -d 10.253.0.0/23 -j MASQUERADE

(Where 10.128.0.0/14 is the cluster network.)

We've opted to try switching to OpenShift-SDN instead because that ultimately seems like a simpler solution, since the above changes require manual operation on the OVN controllers for every node (and every new node as we add them), combined with a machineconfig to set up the iptables rules on the workers, and would probably result in supportability questions at some point.

After some discussion we will support the previous functionality of routing all egress traffic via the kernel. This will allow the previous behavior to continue working. This mode of ovn-kubernetes is called "local gateway" mode, while the default mode in 4.8 and later is called "shared gateway" mode. Local gateway mode still exists in 4.8 and later; it is just only enabled right now via a hidden configuration. As a workaround for now, I'll provide the instructions for enabling this mode. However, we will come up with a proper configuration knob exposed via the cluster network config to switch between gateway modes. Note, for now we have only validated that migrating from local gateway mode -> shared gateway mode works, and not the reverse. We will validate/fix this though.

If a customer is relying on custom routes/iptables rules to steer egress traffic, it is advised to stay on local gateway mode when upgrading from 4.7 -> 4.8 -> 4.9. To deploy 4.8 or later with local gateway mode, you need to create a config map indicating that gateway mode and have it present at deploy/upgrade time. To do this for a fresh install:

1. put your install-config.yaml in your <install folder>
2. openshift-install create-manifests --dir=<install folder>
3. create a file like this in the newly created manifests dir:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gateway-mode-config
  namespace: openshift-network-operator
data:
  mode: "local"
immutable: true
4. openshift-install create cluster --dir=<install folder>

I'll keep this bug open to address any issues with switching from shared gateway back to local gateway mode.

*** Bug 2000007 has been marked as a duplicate of this bug. ***

Hello, my customer has already deployed OCP 4.8 (with the default "shared gateway" mode). You only provided instructions on how to enable "local gateway" mode at installation time. Is it possible to change the configuration for a cluster that is already installed? What would be the change? Just adding the missing configmap in the openshift-network-operator namespace?

Hi, we have just faced the problem with the default gateway mode set to shared in 4.8. The use case for the customer is setting up an external ODF (Ceph) cluster. All nodes are connected to the Ceph external cluster network via a secondary interface, which is not the br-ex one. We then set up a couple of routes in the host network routing table via nmstate configs. That was working OK in 4.6 EUS; however, we noticed this change in 4.8 and needed to switch back to local gateway mode. I think this is a valid use case, since it makes sense to avoid mixing cluster traffic with storage traffic on the same br-ex interface.
The customer would like to know whether it is expected that policies or rules can be applied directly in the OVN nbdb, so we can keep shared gateway mode configured (since it is the default configuration and probably more efficient than local gateway mode).

@trozet is it possible to switch the gateway mode of an existing cluster (absent either an upgrade or a fresh install)? We're hitting this on a second cluster that was upgraded from 4.6 -> 4.8 and suddenly stopped working. I can roll back to OpenShiftSDN, but I would be interested in trying to modify the gateway mode instead if that's possible.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056