1880259 – additional network + OVN network installation failed

Summary: additional network + OVN network installation failed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Tim Rozet
QA Contact:	Qin Ping
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1883916

Reported:	2020-09-18 06:04 UTC by Qin Ping
Modified:	2021-02-24 15:20 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The route metric configured for the OVN-Kubernetes default gateway route may be higher than that of a default gateway route added via a secondary interface. This is most likely to happen when hot-plugging or adding a secondary NIC after deployment. Consequence: Cluster traffic no longer functions correctly on the node if the newly added NIC provides a default route with a lower metric than the OVN-Kubernetes default route on the br-ex interface. Fix: Enforce that the metric configured on the OVN-Kubernetes interface (br-ex) is always set to 100. Result: Any dynamically added default gateway route should have a metric higher than 100, so the OVN-Kubernetes default route via br-ex is always preserved.
Clone Of:
Clones:	1883916
Environment:
Last Closed:	2021-02-24 15:18:37 UTC
Target Upstream Version:
Embargoed:

Attachments
Logs get from bootstrap node. (2.10 MB, application/gzip) 2020-12-15 07:19 UTC, Qin Ping	no flags
ovn-configuration.service log (10.39 KB, text/plain) 2020-12-15 07:19 UTC, Qin Ping	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2136	None	closed	Bug 1880259: ovs-configuration: use NM default ethernet route metric	2021-02-11 21:40:12 UTC
Github	openshift machine-config-operator pull 2304	None	closed	Bug 1880259: Fixes setting route metric for ovs-config	2021-02-11 21:40:07 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:20:53 UTC

Comment 2 Martin André 2020-09-18 14:44:14 UTC

This might be an infra issue with PSI. We'll look into it.

Comment 3 W. Trevor King 2020-09-23 21:54:52 UTC

Still NEW (not ASSIGNED) and 4.6 is closing in.  Punting to 4.7, because the "infra issue with PSI" theory does not sound like a product-side issue that would block 4.6 GA.

Comment 4 Martin André 2020-09-25 15:37:56 UTC

Could you please describe how you add the secondary network to your instances?
Normally you should already have a default route with a lower metric than the new default route, which gets metric 101.

Here, we have a default gateway to 10.0.128.1, coming from the node subnet:

[core@mfedosin-6vvnj-bootstrap ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
169.254.169.254 10.0.128.11     255.255.255.255 UGH   100    0        0 ens3

Now we attach the instance to the manila subnet:

❯ openstack subnet show manila_subnet -f yaml -c gateway_ip -c network_id
gateway_ip: 172.16.32.1
network_id: 27671b90-c2bc-483f-b783-cc856f20ee5d
❯ openstack server add network 40ba2453-99a1-4b1e-bcfc-f75a9cb20030 27671b90-c2bc-483f-b783-cc856f20ee5d

We get a new default gateway of 172.16.32.1, but with higher metric:

[core@mfedosin-6vvnj-bootstrap ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
0.0.0.0         172.16.32.1     0.0.0.0         UG    101    0        0 ens6
10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
169.254.169.254 10.0.128.11     255.255.255.255 UGH   100    0        0 ens3
169.254.169.254 172.16.34.2     255.255.255.255 UGH   101    0        0 ens6
172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens6

We're also still able to talk to the API endpoints.
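
To make the metric comparison explicit, here is a minimal Python sketch that parses `route -n` output like the second table above and picks the default route with the lowest metric, which is the one the kernel prefers when several routes match the same destination. The helper names (`parse_route_n`, `preferred_default`) are made up for illustration and are not part of any tool; the sample input is a shortened copy of the table above.

```python
from typing import List, NamedTuple


class Route(NamedTuple):
    destination: str
    gateway: str
    metric: int
    iface: str


# Shortened copy of the `route -n` output shown above.
ROUTE_N_OUTPUT = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
0.0.0.0         172.16.32.1     0.0.0.0         UG    101    0        0 ens6
10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens6
"""


def parse_route_n(text: str) -> List[Route]:
    """Parse the eight-column table printed by `route -n`."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 8 columns and a numeric Metric column;
        # this skips the title line and the column header.
        if len(fields) != 8 or not fields[4].isdigit():
            continue
        dest, gateway, _mask, _flags, metric, _ref, _use, iface = fields
        routes.append(Route(dest, gateway, int(metric), iface))
    return routes


def preferred_default(routes: List[Route]) -> Route:
    """Return the default route (destination 0.0.0.0) with the lowest metric."""
    defaults = [r for r in routes if r.destination == "0.0.0.0"]
    return min(defaults, key=lambda r: r.metric)


best = preferred_default(parse_route_n(ROUTE_N_OUTPUT))
print(best.iface, best.gateway, best.metric)  # ens3 10.0.128.1 100
```

With this input the original `ens3` route (metric 100) still wins over the newly attached `ens6` route (metric 101), which matches the observation that the API endpoints stay reachable.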

Comment 5 Qin Ping 2020-09-27 02:35:37 UTC

Hello Martin,

I attached the second network interface from the PSI web console.

I could not reproduce this issue on an OCP cluster that uses OpenShift SDN.

The partial route information of the node with this issue is:
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    101    0        0 ens6
default         host-192-168-0- 0.0.0.0         UG    800    0        0 br-ex

I realize it might be an issue specific to OCP with the OVN network; let me try to reproduce it with OVN.

Comment 7 Qin Ping 2020-09-27 06:19:33 UTC

This is the route info for OVN before attaching the second network interface:
sh-4.4# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         host-192-168-0- 0.0.0.0         UG    800    0        0 br-ex
10.128.0.0      10.128.2.1      255.252.0.0     UG    0      0        0 ovn-k8s-mp0
10.128.2.0      0.0.0.0         255.255.254.0   U     0      0        0 ovn-k8s-mp0
link-local      0.0.0.0         255.255.240.0   U     0      0        0 ovn-k8s-gw0
169.254.169.254 host-192-168-0- 255.255.255.255 UGH   800    0        0 br-ex
172.30.0.0      10.128.2.1      255.255.0.0     UG    0      0        0 ovn-k8s-mp0
192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex

Comment 8 Qin Ping 2020-09-28 06:01:32 UTC

I tried to install an OCP cluster with an additional network + OVN network, and the installation failed.

The root cause appears to be that when a master has both OVN and an additional network configured, it cannot communicate with the internet.

The route info of master node:
$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    101    0        0 ens4
default         _gateway        0.0.0.0         UG    800    0        0 br-ex
169.254.169.254 host-172-16-34- 255.255.255.255 UGH   101    0        0 ens4
169.254.169.254 192.168.0.11    255.255.255.255 UGH   800    0        0 br-ex
172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens4
192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex

Kubelet logs:
Sep 28 05:33:45 piqin-9281-wx96r-master-1 hyperkube[3292]: E0928 05:33:45.895590    3292 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: Get "https://api-int.piqin-9281.0928-ew6.qe.rhcloud.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/piqin-9281-wx96r-master-1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

So, I'll update the bug title and increase the Severity and Priority.

It blocks the additional network + OVN profile test.

Comment 10 Martin André 2020-09-29 16:07:08 UTC

Moving this over to Networking -> ovn-kubernetes, as this doesn't seem to be an issue with the installer.

Comment 11 Dan Williams 2020-09-29 21:38:19 UTC

Why does ens4 get the default route?

It's overriding the normal node default route, and that means all your internet traffic is going out ens4, not br-ex (which is where it should be going).

Whatever is adding ens4, and setting the default route, is the likely culprit.  Is that the PSI web console?

Comment 16 Tim Rozet 2020-09-30 13:17:48 UTC

Adding an interface post install with a default route with a lower metric than br-ex is going to cause this to happen. We can investigate using the same metric as the original interface when we install instead of getting 800, but this isn't a blocker for 4.6. Moving to 4.6z.

Comment 20 Qin Ping 2020-12-15 07:19:05 UTC

Created attachment 1739244 [details]
Logs get from bootstrap node.

Comment 21 Qin Ping 2020-12-15 07:19:55 UTC

Created attachment 1739245 [details]
ovn-configuration.service log

Comment 23 Tim Rozet 2020-12-15 21:40:53 UTC

It looks like the fix is not right. The metric needs to be placed on the interface instead of the bridge:

sh-4.4# nmcli conn modify ovs-if-br-ex ipv4.route-metric 100 
sh-4.4# nmcli conn modify br-ex ipv4.route-metric -1 
sh-4.4# ip route
default via 10.0.32.1 dev br-ex proto dhcp metric 800 
10.0.32.1 dev br-ex proto dhcp scope link metric 800 
10.128.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0 
10.128.8.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.8.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
172.30.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0 
sh-4.4# systemctl restart NetworkManager
sh-4.4# ip route
default via 10.0.32.1 dev br-ex proto dhcp metric 100 
10.0.32.1 dev br-ex proto dhcp scope link metric 100 
10.128.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0 
10.128.8.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.8.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
172.30.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0
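
The manual steps above could be scripted roughly as follows. This is only a sketch: it reuses the exact `nmcli` and `systemctl` invocations from the session above, defaults to a dry run that just prints the commands, and is not the change merged in the machine-config-operator pull requests. The function names `metric_fix_commands` and `apply` are hypothetical.

```python
import shlex
import subprocess
from typing import List


def metric_fix_commands(metric: int = 100) -> List[List[str]]:
    """Return the commands from the session above as argv lists."""
    return [
        # Put the explicit metric on the OVS interface connection.
        ["nmcli", "conn", "modify", "ovs-if-br-ex", "ipv4.route-metric", str(metric)],
        # Reset the bridge connection; -1 tells NetworkManager to pick
        # the metric automatically.
        ["nmcli", "conn", "modify", "br-ex", "ipv4.route-metric", "-1"],
        # Restart NetworkManager so the routes are re-created with the new metric.
        ["systemctl", "restart", "NetworkManager"],
    ]


def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Dry run: only prints the three commands, nothing is executed.
apply(metric_fix_commands())
```

Calling `apply(metric_fix_commands(), dry_run=False)` would run the commands for real, which requires root on the node and briefly interrupts networking while NetworkManager restarts.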

Comment 27 Qin Ping 2021-01-06 02:20:28 UTC

Verified with: 4.7.0-fc.1

# ip route
default via 192.168.0.1 dev br-ex proto dhcp metric 100 
default via 172.16.32.1 dev ens4 proto dhcp metric 101 
10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0 
10.128.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.2.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
169.254.169.254 via 192.168.0.10 dev br-ex proto dhcp metric 100 
169.254.169.254 via 172.16.34.1 dev ens4 proto dhcp metric 101 
172.16.32.0/20 dev ens4 proto kernel scope link src 172.16.35.179 metric 101 
172.30.0.0/16 via 10.128.2.1 dev ovn-k8s-mp0 
192.168.0.0/18 dev br-ex proto kernel scope link src 192.168.1.16 metric 100
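
A check like this can be automated. The sketch below parses the `default` lines of `ip route` output and confirms that `br-ex` holds the lowest-metric default route. The function names are illustrative, and it assumes every default line consists of key/value tokens (`via`, `dev`, `proto`, `metric`) as in the output above; lines carrying bare flags such as `onlink` would need extra handling. The "before" input is not captured output: it is the same text with the br-ex metric swapped to the pre-fix value of 800 reported in Comment 8.

```python
from typing import Dict

# Shortened copy of the verified `ip route` output above.
IP_ROUTE_OUTPUT = """\
default via 192.168.0.1 dev br-ex proto dhcp metric 100
default via 172.16.32.1 dev ens4 proto dhcp metric 101
10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0
192.168.0.0/18 dev br-ex proto kernel scope link src 192.168.1.16 metric 100
"""

# Illustrative pre-fix variant: br-ex default route with metric 800.
BEFORE_FIX = IP_ROUTE_OUTPUT.replace(
    "dev br-ex proto dhcp metric 100", "dev br-ex proto dhcp metric 800"
)


def default_route_metrics(text: str) -> Dict[str, int]:
    """Map each interface that has a default route to that route's metric."""
    metrics = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        # After "default", treat the rest of the line as key/value pairs.
        pairs = dict(zip(tokens[1::2], tokens[2::2]))
        # `ip route` omits the metric token when the metric is 0.
        metrics[pairs["dev"]] = int(pairs.get("metric", 0))
    return metrics


def br_ex_is_preferred(text: str, bridge: str = "br-ex") -> bool:
    """True if the bridge owns the default route with the lowest metric."""
    metrics = default_route_metrics(text)
    return bridge in metrics and min(metrics, key=metrics.get) == bridge


print(br_ex_is_preferred(IP_ROUTE_OUTPUT))  # True
print(br_ex_is_preferred(BEFORE_FIX))       # False
```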

Comment 29 errata-xmlrpc 2021-02-24 15:18:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
