Bug 1880259
Summary: | additional network + OVN network installation failed |
---|---|---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Qin Ping <piqin> |
Component: | Networking | Assignee: | Tim Rozet <trozet> |
Networking sub component: | ovn-kubernetes | QA Contact: | Qin Ping <piqin> |
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium |
Priority: | medium | CC: | anusaxen, aos-bugs, bbennett, dcbw, jsafrane, m.andre, trozet, wking |
Version: | 4.6 |
Target Milestone: | --- |
Target Release: | 4.7.0 |
Hardware: | Unspecified |
OS: | Unspecified |
Whiteboard: |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause:
The route metric set for the OVN-Kubernetes default gateway can be higher than the metric of a default gateway route added through a secondary interface. This is most likely to happen when hot plugging or adding a secondary NIC post-deployment.
Consequence:
Cluster traffic no longer functions correctly on the node if the newly added NIC provides a default route with a lower metric than the OVN-Kubernetes default route on the br-ex interface.
Fix:
Enforce that the route metric configured on the OVN-Kubernetes interface (br-ex) is always set to 100.
Result:
Any dynamically added default gateway route should have a metric higher than 100, so the OVN-Kubernetes default route through br-ex remains the preferred default route.
|
Story Points: | --- |
Clone Of: |
: | 1883916 | Environment: |
Last Closed: | 2021-02-24 15:18:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1883916 |
Attachments: |
|
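The Doc Text above relies on the Linux rule that, among routes to the same destination, the one with the lowest metric is preferred. As a minimal sketch of that rule (the `DefaultRoute` type and `preferred_default` helper are hypothetical names introduced for illustration, not part of OVN-Kubernetes), using only the interface names and metrics reported in this bug:

```python
from typing import List, NamedTuple


class DefaultRoute(NamedTuple):
    """Hypothetical record for one default route (illustration only)."""
    iface: str
    metric: int


def preferred_default(routes: List[DefaultRoute]) -> DefaultRoute:
    """Return the default route with the lowest metric, which Linux prefers."""
    return min(routes, key=lambda r: r.metric)


# Metrics reported on the failing master: br-ex at 800, added NIC at 101.
before_fix = [DefaultRoute("br-ex", 800), DefaultRoute("ens4", 101)]

# Metrics reported after the fix: br-ex pinned to 100, added NIC still at 101.
after_fix = [DefaultRoute("br-ex", 100), DefaultRoute("ens4", 101)]

print(preferred_default(before_fix).iface)  # ens4
print(preferred_default(after_fix).iface)   # br-ex
```

With br-ex at 800, any NIC that NetworkManager brings up at metric 101 takes over the default route; with br-ex at 100 it does not.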
Comment 2
Martin André
2020-09-18 14:44:14 UTC
Still NEW (not ASSIGNED) and 4.6 is closing in. Punting to 4.7, because the "infra issue with PSI" theory does not sound like a product-side issue that would block 4.6 GA.

Could you please describe how you add the secondary network to your instances? You should normally have an existing default route, with a lower metric than the new default of metric 101. Here, we have a default gateway to 10.0.128.1, coming from the node subnet:

    [core@mfedosin-6vvnj-bootstrap ~]$ route -n
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
    10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
    169.254.169.254 10.0.128.11     255.255.255.255 UGH   100    0        0 ens3

Now we attach the instance to the manila subnet:

    ❯ openstack subnet show manila_subnet -f yaml -c gateway_ip -c network_id
    gateway_ip: 172.16.32.1
    network_id: 27671b90-c2bc-483f-b783-cc856f20ee5d
    ❯ openstack server add network 40ba2453-99a1-4b1e-bcfc-f75a9cb20030 27671b90-c2bc-483f-b783-cc856f20ee5d

We get a new default gateway of 172.16.32.1, but with a higher metric:

    [core@mfedosin-6vvnj-bootstrap ~]$ route -n
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
    0.0.0.0         172.16.32.1     0.0.0.0         UG    101    0        0 ens6
    10.0.128.0      0.0.0.0         255.255.128.0   U     100    0        0 ens3
    169.254.169.254 10.0.128.11     255.255.255.255 UGH   100    0        0 ens3
    169.254.169.254 172.16.34.2     255.255.255.255 UGH   101    0        0 ens6
    172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens6

We're also still able to talk to API endpoints.

Hello Martin,

I attached the second network interface from the PSI web console. I did not reproduce this issue on an OCP cluster with the OpenShift SDN network.
The partial route information of the node with this issue is:

    # route
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    default         _gateway        0.0.0.0         UG    101    0        0 ens6
    default         host-192-168-0- 0.0.0.0         UG    800    0        0 br-ex

Then I realized it might be an issue specific to OCP with the OVN network, so let me try to reproduce it with the OVN network.

This is the route info for OVN before attaching the second network interface:

    sh-4.4# route
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    default         host-192-168-0- 0.0.0.0         UG    800    0        0 br-ex
    10.128.0.0      10.128.2.1      255.252.0.0     UG    0      0        0 ovn-k8s-mp0
    10.128.2.0      0.0.0.0         255.255.254.0   U     0      0        0 ovn-k8s-mp0
    link-local      0.0.0.0         255.255.240.0   U     0      0        0 ovn-k8s-gw0
    169.254.169.254 host-192-168-0- 255.255.255.255 UGH   800    0        0 br-ex
    172.30.0.0      10.128.2.1      255.255.0.0     UG    0      0        0 ovn-k8s-mp0
    192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex

I tried to install an OCP cluster with an additional network + OVN network, and the installation failed. The root cause appears to be that when a master has both OVN and an additional network configured, it cannot communicate with the internet.
The route info of the master node:

    $ route
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    default         _gateway        0.0.0.0         UG    101    0        0 ens4
    default         _gateway        0.0.0.0         UG    800    0        0 br-ex
    169.254.169.254 host-172-16-34- 255.255.255.255 UGH   101    0        0 ens4
    169.254.169.254 192.168.0.11    255.255.255.255 UGH   800    0        0 br-ex
    172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens4
    192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex

Kubelet logs:

    Sep 28 05:33:45 piqin-9281-wx96r-master-1 hyperkube[3292]: E0928 05:33:45.895590 3292 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: Get "https://api-int.piqin-9281.0928-ew6.qe.rhcloud.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/piqin-9281-wx96r-master-1?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

So, I'll update the bug title and increase the Severity and Priority. It blocks the additional network + OVN profile test.

Moving this over to Networking -> ovn-kubernetes, as this doesn't seem like an issue with the installer.

Why does ens4 get the default route? It's overriding the normal node default route, and that means all your internet traffic is going out ens4, not br-ex (which is where it should be going). Whatever is adding ens4, and setting the default route, is the likely culprit. Is that the PSI web console?

Adding an interface post-install with a default route that has a lower metric than br-ex is going to cause this to happen. We can investigate using the same metric as the original interface when we install, instead of getting 800, but this isn't a blocker for 4.6. Moving to 4.6z.

Created attachment 1739244
Logs get from bootstrap node.

Created attachment 1739245
ovn-configuration.service log
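The diagnosis above (a default route on ens4 with a lower metric than the one on br-ex) can be read directly from the legacy `route` table. The following is a minimal sketch, assuming the eight-column net-tools layout shown in this report; `default_routes` and `egress_interface` are hypothetical helper names, and the sample text is the master node's table quoted earlier:

```python
ROUTE_OUTPUT = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    101    0        0 ens4
default         _gateway        0.0.0.0         UG    800    0        0 br-ex
169.254.169.254 host-172-16-34- 255.255.255.255 UGH   101    0        0 ens4
169.254.169.254 192.168.0.11    255.255.255.255 UGH   800    0        0 br-ex
172.16.32.0     0.0.0.0         255.255.240.0   U     101    0        0 ens4
192.168.0.0     0.0.0.0         255.255.192.0   U     800    0        0 br-ex
"""


def default_routes(route_text):
    """Hypothetical helper: return (iface, metric) for every default route."""
    routes = []
    for line in route_text.splitlines():
        fields = line.split()
        # Data rows have 8 columns with a numeric metric; both header lines fail this.
        if len(fields) != 8 or not fields[4].isdigit():
            continue
        destination, genmask, metric, iface = fields[0], fields[2], fields[4], fields[7]
        if destination in ("default", "0.0.0.0") and genmask == "0.0.0.0":
            routes.append((iface, int(metric)))
    return routes


def egress_interface(route_text):
    """Hypothetical helper: interface of the lowest-metric default route, or None."""
    routes = default_routes(route_text)
    if not routes:
        return None
    return min(routes, key=lambda r: r[1])[0]


iface = egress_interface(ROUTE_OUTPUT)
if iface != "br-ex":
    print(f"default traffic leaves via {iface}, not br-ex")
```

On the table above this reports ens4, which matches the observation that the node's internet-bound traffic was leaving through the additional network.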
It looks like the fix is not right. The metric needs to be placed on the interface instead of the bridge:

    sh-4.4# nmcli conn modify ovs-if-br-ex ipv4.route-metric 100
    sh-4.4# nmcli conn modify br-ex ipv4.route-metric -1
    sh-4.4# ip route
    default via 10.0.32.1 dev br-ex proto dhcp metric 800
    10.0.32.1 dev br-ex proto dhcp scope link metric 800
    10.128.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0
    10.128.8.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.8.2
    169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1
    172.30.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0
    sh-4.4# systemctl restart NetworkManager
    sh-4.4# ip route
    default via 10.0.32.1 dev br-ex proto dhcp metric 100
    10.0.32.1 dev br-ex proto dhcp scope link metric 100
    10.128.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0
    10.128.8.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.8.2
    169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1
    172.30.0.0/16 via 10.128.8.1 dev ovn-k8s-mp0

Verified with: 4.7.0-fc.1

    # ip route
    default via 192.168.0.1 dev br-ex proto dhcp metric 100
    default via 172.16.32.1 dev ens4 proto dhcp metric 101
    10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0
    10.128.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.2.2
    169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1
    169.254.169.254 via 192.168.0.10 dev br-ex proto dhcp metric 100
    169.254.169.254 via 172.16.34.1 dev ens4 proto dhcp metric 101
    172.16.32.0/20 dev ens4 proto kernel scope link src 172.16.35.179 metric 101
    172.30.0.0/16 via 10.128.2.1 dev ovn-k8s-mp0
    192.168.0.0/18 dev br-ex proto kernel scope link src 192.168.1.16 metric 100

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
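The verification step above (br-ex holds the default route at metric 100 and every other default route has a higher metric) could be automated along these lines. This is a sketch only: `parse_default_routes` and `br_ex_is_preferred` are hypothetical names, the parser assumes `ip route` lines made of plain key/value pairs after the prefix (flag tokens such as `onlink` are not handled), and the sample text is taken from the outputs quoted in this report:

```python
VERIFIED = """\
default via 192.168.0.1 dev br-ex proto dhcp metric 100
default via 172.16.32.1 dev ens4 proto dhcp metric 101
10.128.0.0/14 via 10.128.2.1 dev ovn-k8s-mp0
169.254.169.254 via 192.168.0.10 dev br-ex proto dhcp metric 100
192.168.0.0/18 dev br-ex proto kernel scope link src 192.168.1.16 metric 100
"""

BEFORE_RESTART = """\
default via 10.0.32.1 dev br-ex proto dhcp metric 800
10.0.32.1 dev br-ex proto dhcp scope link metric 800
"""


def parse_default_routes(ip_route_text):
    """Hypothetical helper: map device name to metric for each default route."""
    routes = {}
    for line in ip_route_text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        # Assumption: everything after "default" is a sequence of key/value pairs.
        attrs = dict(zip(tokens[1::2], tokens[2::2]))
        # A route printed without a metric has metric 0.
        routes[attrs["dev"]] = int(attrs.get("metric", 0))
    return routes


def br_ex_is_preferred(ip_route_text, bridge="br-ex", expected_metric=100):
    """Hypothetical check: the bridge has the expected metric and all others are higher."""
    routes = parse_default_routes(ip_route_text)
    if routes.get(bridge) != expected_metric:
        return False
    return all(m > expected_metric for dev, m in routes.items() if dev != bridge)


print(br_ex_is_preferred(VERIFIED))        # True
print(br_ex_is_preferred(BEFORE_RESTART))  # False
```

The check fails on the pre-restart output because br-ex still carries metric 800 there, which is the state in which a NIC added at metric 101 would take over the default route.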