Bug 1880259 - additional network + OVN network installation failed
Summary: additional network + OVN network installation failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Tim Rozet
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks: 1883916
 
Reported: 2020-09-18 06:04 UTC by Qin Ping
Modified: 2021-02-24 15:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The route metric set for the OVN-Kubernetes default gateway may be higher than that of a default gateway route added through a secondary interface. This is most likely to happen when a secondary NIC is hot-plugged or added post-deployment.
Consequence: Cluster traffic no longer functions correctly on the node if the newly added NIC provides a default route with a lower metric than the OVN-Kubernetes default route on the br-ex interface.
Fix: Enforce that the metric configured on the OVN-Kubernetes interface (br-ex) is always set to 100.
Result: Any dynamically added default gateway route should always have a metric higher than 100, so the OVN-Kubernetes default route on br-ex is always preserved.
Clone Of:
: 1883916 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:18:37 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs get from bootstrap node. (2.10 MB, application/gzip)
2020-12-15 07:19 UTC, Qin Ping
ovn-configuration.service log (10.39 KB, text/plain)
2020-12-15 07:19 UTC, Qin Ping


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2136 0 None closed Bug 1880259: ovs-configuration: use NM default ethernet route metric 2021-02-11 21:40:12 UTC
Github openshift machine-config-operator pull 2304 0 None closed Bug 1880259: Fixes setting route metric for ovs-config 2021-02-11 21:40:07 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:20:53 UTC

Comment 2 Martin André 2020-09-18 14:44:14 UTC
This might be an infra issue with PSI. We'll look into it.

Comment 3 W. Trevor King 2020-09-23 21:54:52 UTC
Still NEW (not ASSIGNED) and 4.6 is closing in.  Punting to 4.7, because the "infra issue with PSI" theory does not sound like a product-side issue that would block 4.6 GA.

Comment 4 Martin André 2020-09-25 15:37:56 UTC
Could you please describe how you add the secondary network to your instances?
You should normally have an existing default route whose metric is lower than 101, the metric assigned to the new default route.

Here, we have a default gateway to 10.0.128.1, coming from the node subnet:

[core@mfedosin-6vvnj-bootstrap ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
169.254.169.254 10.0.128.11     255.255.255.255 UGH   100    0        0 ens3

Now we attach the instance to the manila subnet:

❯ openstack subnet show manila_subnet -f yaml -c gateway_ip -c network_id
gateway_ip: 172.16.32.1
network_id: 27671b90-c2bc-483f-b783-cc856f20ee5d
❯ openstack server add network 40ba2453-99a1-4b1e-bcfc-f75a9cb20030 27671b90-c2bc-483f-b783-cc856f20ee5d

We get a new default gateway of 172.16.32.1, but with higher metric:

[core@mfedosin-6vvnj-bootstrap ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
0.0.0.0         172.16.32.1     0.0.0.0         UG    101    0        0 ens6
10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
169.254.169.254 10.0.128.11     255.255.255.255 UGH   100    0        0 ens3
169.254.169.254 172.16.34.2     255.255.255.255 UGH   101    0        0 ens6
172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens6

We're also still able to talk to the API endpoints.
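
The comparison above relies on the kernel preferring, among routes to the same destination, the one with the lowest metric. A minimal sketch of that selection, assuming the `route -n` column layout shown in this comment; the function names are hypothetical and the sample is copied from the output above:

```python
def default_routes(route_n_output):
    """Return (gateway, metric, iface) for each default route in `route -n` output."""
    routes = []
    for line in route_n_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns; destination 0.0.0.0 with genmask 0.0.0.0
        # marks a default route. Header lines never match this test.
        if len(fields) == 8 and fields[0] == "0.0.0.0" and fields[2] == "0.0.0.0":
            routes.append((fields[1], int(fields[4]), fields[7]))
    return routes


def preferred_default(route_n_output):
    """Pick the default route with the lowest metric, or None if there is none."""
    return min(default_routes(route_n_output), key=lambda r: r[1], default=None)


SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
0.0.0.0         172.16.32.1     0.0.0.0         UG    101    0        0 ens6
10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
"""

print(preferred_default(SAMPLE))  # ('10.0.128.1', 100, 'ens3')
```

With these values the original ens3 route keeps precedence, which matches the observation that the API endpoints stay reachable.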

Comment 5 Qin Ping 2020-09-27 02:35:37 UTC
Hello Martin,

I attached the second network interface from the PSI web console.

I could not reproduce this issue on an OCP cluster that uses OpenShift SDN.

The partial route information of the node with this issue is:
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    101    0        0 ens6
default         host-192-168-0- 0.0.0.0         UG    800    0        0 br-ex

I now realize it might be an issue specific to OCP with the OVN network; let me try to reproduce it with OVN.

Comment 7 Qin Ping 2020-09-27 06:19:33 UTC
This is the route info for OVN before attaching the second network interface:
sh-4.4# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         host-192-168-0- 0.0.0.0         UG    800    0        0 br-ex
10.128.0.0      10.128.2.1      255.252.0.0     UG    0      0        0 ovn-k8s-mp0
10.128.2.0      0.0.0.0         255.255.254.0   U     0      0        0 ovn-k8s-mp0
link-local      0.0.0.0         255.255.240.0   U     0      0        0 ovn-k8s-gw0
169.254.169.254 host-192-168-0- 255.255.255.255 UGH   800    0        0 br-ex
172.30.0.0      10.128.2.1      255.255.0.0     UG    0      0        0 ovn-k8s-mp0
192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex

Comment 8 Qin Ping 2020-09-28 06:01:32 UTC
I tried to install an OCP cluster with an additional network + OVN network, and the installation failed.

The root cause appears to be that when the master has both OVN and an additional network configured, it cannot communicate with the internet.

The route info of master node:
$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    101    0        0 ens4
default         _gateway        0.0.0.0         UG    800    0        0 br-ex
169.254.169.254 host-172-16-34- 255.255.255.255 UGH   101    0        0 ens4
169.254.169.254 192.168.0.11    255.255.255.255 UGH   800    0        0 br-ex
172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens4
192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex

Kubelet logs:
Sep 28 05:33:45 piqin-9281-wx96r-master-1 hyperkube[3292]: E0928 05:33:45.895590    3292 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: Get "https://api-int.piqin-9281.0928-ew6.qe.rhcloud.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/piqin-9281-wx96r-master-1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

So, I'll update the bug title and increase the Severity and Priority.

It blocks the additional network + OVN profile test.

Comment 10 Martin André 2020-09-29 16:07:08 UTC
Moving this over to Networking -> ovn-kubernetes as this doesn't seem like an issue with the installer.

Comment 11 Dan Williams 2020-09-29 21:38:19 UTC
Why does ens4 get the default route?

It's overriding the normal node default route, and that means all your internet traffic is going out ens4, not br-ex (which is where it should be going).

Whatever is adding ens4, and setting the default route, is the likely culprit.  Is that the PSI web console?

Comment 16 Tim Rozet 2020-09-30 13:17:48 UTC
Adding an interface post install with a default route with a lower metric than br-ex is going to cause this to happen. We can investigate using the same metric as the original interface when we install instead of getting 800, but this isn't a blocker for 4.6. Moving to 4.6z.
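
To illustrate why the br-ex metric matters here, the following is a rough model of the choice between competing default routes, not the actual kernel or NetworkManager logic. The function name is hypothetical; the metrics 800 and 101 come from the route tables earlier in this bug, and 100 is the value named in the Doc Text:

```python
def egress_interface(default_route_metrics):
    """Given {interface: metric} for the default routes, return the interface
    whose default route has the lowest metric (the first one wins a tie)."""
    return min(default_route_metrics, key=default_route_metrics.get)


# br-ex at metric 800: a hot-plugged NIC at metric 101 takes over the default route.
print(egress_interface({"br-ex": 800, "ens4": 101}))  # ens4

# br-ex pinned to metric 100: the same NIC no longer displaces it.
print(egress_interface({"br-ex": 100, "ens4": 101}))  # br-ex
```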

Comment 20 Qin Ping 2020-12-15 07:19:05 UTC
Created attachment 1739244 [details]
Logs get from bootstrap node.

Comment 21 Qin Ping 2020-12-15 07:19:55 UTC
Created attachment 1739245 [details]
ovn-configuration.service log

Comment 23 Tim Rozet 2020-12-15 21:40:53 UTC
It looks like the fix is not right. The metric needs to be placed on the interface instead of the bridge:

sh-4.4# nmcli conn modify ovs-if-br-ex ipv4.route-metric 100 
sh-4.4# nmcli conn modify br-ex ipv4.route-metric -1 
sh-4.4# ip route
default via 10.0.32.1 dev br-ex proto dhcp metric 800 
10.0.32.1 dev br-ex proto dhcp scope link metric 800 
10.128.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0 
10.128.8.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.8.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
172.30.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0 
sh-4.4# systemctl restart NetworkManager
sh-4.4# ip route
default via 10.0.32.1 dev br-ex proto dhcp metric 100 
10.0.32.1 dev br-ex proto dhcp scope link metric 100 
10.128.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0 
10.128.8.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.8.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
172.30.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0

Comment 27 Qin Ping 2021-01-06 02:20:28 UTC
Verified with: 4.7.0-fc.1

# ip route
default via 192.168.0.1 dev br-ex proto dhcp metric 100 
default via 172.16.32.1 dev ens4 proto dhcp metric 101 
10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0 
10.128.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.2.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
169.254.169.254 via 192.168.0.10 dev br-ex proto dhcp metric 100 
169.254.169.254 via 172.16.34.1 dev ens4 proto dhcp metric 101 
172.16.32.0/20 dev ens4 proto kernel scope link src 172.16.35.179 metric 101 
172.30.0.0/16 via 10.128.2.1 dev ovn-k8s-mp0 
192.168.0.0/18 dev br-ex proto kernel scope link src 192.168.1.16 metric 100
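
The verification above can also be expressed as a small check over `ip route` output. This is only a sketch under stated assumptions: the function names are hypothetical, the first sample is abridged from the output in this comment, and the second sample is a made-up variant with the br-ex metric put back to 800.

```python
def default_routes(ip_route_output):
    """Yield (device, metric) for each 'default' line of `ip route` output."""
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default" or "dev" not in tokens:
            continue
        dev = tokens[tokens.index("dev") + 1]
        # `ip route` omits the 'metric' keyword when the metric is 0.
        metric = int(tokens[tokens.index("metric") + 1]) if "metric" in tokens else 0
        yield dev, metric


def br_ex_is_preferred(ip_route_output, bridge="br-ex"):
    """True if the bridge has a default route with a strictly lower metric
    than every default route on any other device."""
    routes = list(default_routes(ip_route_output))
    bridge_metrics = [m for d, m in routes if d == bridge]
    other_metrics = [m for d, m in routes if d != bridge]
    if not bridge_metrics:
        return False
    return all(min(bridge_metrics) < m for m in other_metrics)


VERIFIED = """\
default via 192.168.0.1 dev br-ex proto dhcp metric 100
default via 172.16.32.1 dev ens4 proto dhcp metric 101
10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0
192.168.0.0/18 dev br-ex proto kernel scope link src 192.168.1.16 metric 100
"""

# Hypothetical pre-fix variant: same table with the br-ex default route at metric 800.
HYPOTHETICAL_BROKEN = VERIFIED.replace(
    "dev br-ex proto dhcp metric 100", "dev br-ex proto dhcp metric 800"
)

print(br_ex_is_preferred(VERIFIED))             # True
print(br_ex_is_preferred(HYPOTHETICAL_BROKEN))  # False
```

The check requires a strictly lower metric on br-ex, so a tie with another device counts as a failure rather than depending on route ordering.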

Comment 29 errata-xmlrpc 2021-02-24 15:18:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

