Bug 1879077

Summary:	Nodes tainted after configuring additional host iface
Product:	OpenShift Container Platform	Reporter:	Michal Minar <miminar>
Component:	Networking	Assignee:	Federico Paolinelli <fpaoline>
Networking sub component:	openshift-sdn	QA Contact:	Federico Paolinelli <fpaoline>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	aconstan, anbhat, bbennett, danw, dcbw, djdumas, fpaoline, jhopper, vpickard, zzhao
Version:	4.5	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.8.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:32:55 UTC	Type:	Bug
Bug Blocks:	1962637

Description Michal Minar 2020-09-15 12:02:06 UTC

Description of problem:
  Nodes get tainted when additional interfaces get enabled. The additional interfaces have lower MTU than the initial interfaces.

Version-Release number of selected component (if applicable):
  4.5.7

How reproducible: once out of 1 attempts

Steps to Reproduce:
1. deploy OCP on hosts with multiple interfaces, configure only one iface connected to PLAN with jumbo frames enabled (in our case MTU=5550)
2. after the installation, verify the cluster is operational (OpenShift SDN's MTU was automatically configured to 5500)
3. configure additional interface on compute nodes connected to the management network with standard MTU (1500)
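
The MTU values in these steps are consistent with a 50-byte VXLAN encapsulation overhead (5550 - 50 = 5500). A minimal Python sketch of that arithmetic follows; the constant and function names are illustrative and are not taken from openshift-sdn.

```python
# Illustrative only: these names are hypothetical, not openshift-sdn identifiers.
VXLAN_OVERHEAD = 50  # bytes of VXLAN encapsulation assumed per packet


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest pod-network MTU that still fits in the underlay after encapsulation."""
    return underlay_mtu - VXLAN_OVERHEAD


def underlay_is_large_enough(underlay_mtu: int, sdn_mtu: int) -> bool:
    """True when an interface can carry SDN packets of sdn_mtu bytes plus the overhead."""
    return underlay_mtu >= sdn_mtu + VXLAN_OVERHEAD


print(overlay_mtu(5550))                     # PLAN interface -> 5500
print(underlay_is_large_enough(1500, 5500))  # management interface -> False
```

With an SDN MTU of 5500, the management interface (MTU 1500) cannot carry encapsulated pod traffic, which is the condition the taint name refers to.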

Actual results:
   all the nodes with configured management interface get tainted with network.openshift.io/mtu-too-small:NoSchedule

Expected results:
  the nodes with an additional interface configured are not tainted and schedulable without tolerations

Additional info:
- the tainted nodes are schedulable with pod tolerations
- removing the taint lasts for a couple of seconds until an operator sets it again
- overriding the taint with network.openshift.io/mtu-too-small:PreferNoSchedule makes the nodes schedulable for daemonsets without adding tolerations
  - this taint is not overridden by the operator
- our environment is bare metal
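
The scheduling behaviour in the notes above follows from how Kubernetes matches tolerations against taints. The sketch below models that matching in Python under simplifying assumptions; the class and function names are hypothetical and this is not scheduler code.

```python
# Simplified model of Kubernetes taint/toleration matching (illustrative names).
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key is only meaningful with operator "Exists"
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def blocks_scheduling(taint: Taint, tolerations: list) -> bool:
    """PreferNoSchedule is a soft preference, so it never filters a node out."""
    if taint.effect == "PreferNoSchedule":
        return False
    return not any(tolerates(t, taint) for t in tolerations)


KEY = "network.openshift.io/mtu-too-small"
hard = Taint(KEY, "NoSchedule")
print(blocks_scheduling(hard, []))
print(blocks_scheduling(hard, [Toleration(key=KEY, operator="Exists", effect="NoSchedule")]))
print(blocks_scheduling(Taint(KEY, "PreferNoSchedule"), []))
```

In this model a pod without tolerations is kept off a node carrying the NoSchedule taint, while the PreferNoSchedule variant leaves the node eligible.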

Iface overview:
    PLAN interface: eno4
    Mgmt interface: eno1np0
    [core@compute1 ~]$ ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether b0:26:28:12:60:88 brd ff:ff:ff:ff:ff:ff
    3: eno1np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether b0:26:28:12:60:8a brd ff:ff:ff:ff:ff:ff
        inet 10.76.34.24/23 brd 10.76.35.255 scope global dynamic noprefixroute eno1np0
           valid_lft 27001sec preferred_lft 27001sec
    4: eno4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 5550 qdisc mq state UP group default qlen 1000
        link/ether b0:26:28:12:60:89 brd ff:ff:ff:ff:ff:ff
        inet 192.168.51.36/24 brd 192.168.51.255 scope global dynamic noprefixroute eno4
           valid_lft 459813sec preferred_lft 459813sec
        inet6 fe80::b226:28ff:fe12:6089/64 scope link
           valid_lft forever preferred_lft forever
    5: eno2np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether b0:26:28:12:60:8b brd ff:ff:ff:ff:ff:ff
    10: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
        link/ether f6:91:1c:4f:ea:30 brd ff:ff:ff:ff:ff:ff
    11: br0: <BROADCAST,MULTICAST> mtu 5500 qdisc noop state DOWN group default qlen 1000
        link/ether d2:00:ec:99:09:42 brd ff:ff:ff:ff:ff:ff
    ...
    
    [core@compute1 ~]$ ip route
    default via 10.76.34.1 dev eno1np0 proto dhcp metric 100
    default via 192.168.51.32 dev eno4 proto dhcp metric 500
    10.76.34.0/23 dev eno1np0 proto kernel scope link src 10.76.34.24 metric 100
    10.128.0.0/14 dev tun0 scope link
    172.30.0.0/16 dev tun0
    192.168.51.0/24 dev eno4 proto kernel scope link src 192.168.51.36 metric 500

    [root@compute1 network-scripts]# grep '.' ifcfg-*
    ifcfg-eno1np0:# management network
    ifcfg-eno1np0:NAME=eno1np0
    ifcfg-eno1np0:DEVICE=eno1np0
    ifcfg-eno1np0:BROWSER_ONLY=no
    ifcfg-eno1np0:DEFROUTE=yes
    ifcfg-eno1np0:IPV4_FAILURE_FATAL="no"
    ifcfg-eno1np0:IPV4_ROUTE_METRIC=100
    ifcfg-eno1np0:METRIC=100
    ifcfg-eno1np0:NM_CONTROLLED=yes
    ifcfg-eno1np0:ONBOOT=yes
    ifcfg-eno1np0:PEERDNS=no
    ifcfg-eno1np0:PEERROUTES=yes
    ifcfg-eno1np0:TYPE=Ethernet
    ifcfg-eno1np0:PROXY_METHOD=none
    ifcfg-eno1np0:IPV6_DISABLED=yes
    ifcfg-eno1np0:BOOTPROTO=dhcp
    ifcfg-eno4:DEVICE=eno4
    ifcfg-eno4:ONBOOT=yes
    ifcfg-eno4:BOOTPROTO=dhcp
    ifcfg-eno4:IPV6INIT=no
    ifcfg-eno4:IPV6_AUTOCONF=no
    ifcfg-eno4:TYPE=Ethernet
    ifcfg-eno4:NAME="OpenShift Private VLAN"
    ifcfg-eno4:METRIC=500
    ifcfg-eno4:PROXY_METHOD=none
    ifcfg-eno4:BROWSER_ONLY=no
    ifcfg-eno4:DEFROUTE=yes
    ifcfg-eno4:IPV4_FAILURE_FATAL=yes
    ifcfg-eno4:IPV4_ROUTE_METRIC=500
    ifcfg-eno4:MTU=5550

SDN pod log:
    $ oc logs sdn-ghgns | grep -i mtu
    I0908 08:42:16.881304    3065 node.go:245] Checking default interface MTU
    I0908 10:36:08.521126    2782 node.go:245] Checking default interface MTU
    I0908 10:48:53.378994    2723 node.go:245] Checking default interface MTU
    I0908 11:00:51.293230    2754 node.go:245] Checking default interface MTU
    I0908 14:31:11.635782    3956 node.go:245] Checking default interface MTU
    I0908 14:31:11.651405    3956 node.go:296] Default interface MTU is less than VXLAN overhead, tainting node...
    I0908 14:34:08.655947    2917 node.go:245] Checking default interface MTU

No idea how the default interface is determined (alphabetically?). METRIC does not seem to play a role.

Comment 2 Michal Minar 2020-09-23 13:26:14 UTC

I was wrong about this one: "this taint is not overridden by the operator"

OpenShift SDN container taints the node on its startup.

To amend that, we are now using this work-around: https://gist.github.com/miminar/1399627ef114f96245f011185fa3747b

Comment 9 Dan Williams 2020-10-14 01:45:55 UTC

openshift-sdn does not currently support installation on systems with multiple interfaces, where the VXLAN interface does not have the lowest-metric (eg most preferred) default route. It has always been this way, and there are already RFEs to support multiple interfaces, but other features have been a priority.
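
For reference, the "lowest-metric default route" can be read off the `ip route` output posted in the description. The Python sketch below is only an illustration of that rule, not openshift-sdn's own lookup; the function names are hypothetical, and a route without an explicit metric is treated as metric 0.

```python
# Illustrative parser for `ip route` text; not how openshift-sdn finds the interface.
def default_routes(ip_route_output: str) -> list:
    """Return (metric, device) pairs for every default route in the output."""
    routes = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default" or "dev" not in fields:
            continue
        dev = fields[fields.index("dev") + 1]
        # A route printed without "metric" has metric 0, the most preferred value.
        metric = int(fields[fields.index("metric") + 1]) if "metric" in fields else 0
        routes.append((metric, dev))
    return routes


def preferred_default_device(ip_route_output: str):
    """Device of the lowest-metric default route, or None if there is none."""
    routes = default_routes(ip_route_output)
    return min(routes, key=lambda r: r[0])[1] if routes else None


sample = """\
default via 10.76.34.1 dev eno1np0 proto dhcp metric 100
default via 192.168.51.32 dev eno4 proto dhcp metric 500
10.76.34.0/23 dev eno1np0 proto kernel scope link src 10.76.34.24 metric 100
192.168.51.0/24 dev eno4 proto kernel scope link src 192.168.51.36 metric 500
"""
print(preferred_default_device(sample))
```

On the reporter's node this selects eno1np0 (metric 100), the management interface, rather than eno4 (metric 500), which carries the jumbo-frame network.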

Comment 14 Dan Winship 2021-03-01 17:38:26 UTC

(In reply to Dan Williams from comment #9)
> openshift-sdn does not currently support installation on systems with
> multiple interfaces, where the VXLAN interface does not have the
> lowest-metric (eg most preferred) default route.

Actually, the MTU-tainting code is buggy; it taints the node if there is _any_ interface with a default route and a too-small MTU, even if it's not the interface that will actually get used.

It probably ought to only check the interface that holds the primary node IP, since that's guaranteed to be the interface that at least inbound VXLAN traffic will use. If someone has a cluster with zany asymmetric routing such that outbound VXLAN uses a different interface, then they can just be responsible for sanity-checking the MTU on that interface themselves.
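
The difference between the two checks can be sketched in Python as below. This is a model of the behaviour described in this comment, not the openshift-sdn code: all names are hypothetical, "too small" is assumed to mean an interface MTU below the SDN MTU plus a 50-byte VXLAN overhead, and the node's primary IP is assumed to be the eno4 address from the description.

```python
# Model of the two tainting rules discussed above (hypothetical names throughout).
from dataclasses import dataclass
from ipaddress import ip_address, ip_interface

VXLAN_OVERHEAD = 50  # assumed encapsulation overhead in bytes


@dataclass(frozen=True)
class Iface:
    name: str
    mtu: int
    address: str             # CIDR notation, as printed by `ip a`
    has_default_route: bool


def too_small(iface: Iface, sdn_mtu: int) -> bool:
    return iface.mtu < sdn_mtu + VXLAN_OVERHEAD


def taint_any_default_iface(ifaces: list, sdn_mtu: int) -> bool:
    """Described behaviour: any default-route interface with a small MTU taints the node."""
    return any(too_small(i, sdn_mtu) for i in ifaces if i.has_default_route)


def taint_node_ip_iface(ifaces: list, sdn_mtu: int, node_ip: str) -> bool:
    """Suggested behaviour: only the interface holding the primary node IP is checked."""
    for i in ifaces:
        if ip_interface(i.address).ip == ip_address(node_ip):
            return too_small(i, sdn_mtu)
    return False  # node IP not found on any listed interface: nothing to check here


ifaces = [
    Iface("eno1np0", 1500, "10.76.34.24/23", True),
    Iface("eno4", 5550, "192.168.51.36/24", True),
]
print(taint_any_default_iface(ifaces, 5500))
print(taint_node_ip_iface(ifaces, 5500, "192.168.51.36"))  # node IP is an assumption
```

Under these assumptions the first rule taints the reporter's node because of eno1np0, while the second rule leaves it untainted because eno4 is large enough.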

Comment 15 Federico Paolinelli 2021-04-19 12:20:36 UTC

@vpickard taking this assuming you are not working on it, feel free to ping in case you are

Comment 16 Victor Pickard 2021-04-19 12:53:48 UTC

@fpaoline Thanks for taking this one!

Comment 18 zhaozhanqi 2021-05-08 14:14:36 UTC

Verified this bug on 4.8.0-0.nightly-2021-05-07-075528

There are two interfaces, eno1 and eno2; eno2 is set to a lower MTU (1400) than eno1 (1500).


step 1:

sh-4.4# ip a show eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether dc:f4:01:e7:5d:84 brd ff:ff:ff:ff:ff:ff
    inet 10.73.116.54/23 brd 10.73.117.255 scope global dynamic noprefixroute eno1
       valid_lft 34688sec preferred_lft 34688sec
    inet6 2620:52:0:4974:25d3:20de:2f60:293/64 scope global dynamic noprefixroute 
       valid_lft 2591941sec preferred_lft 604741sec
    inet6 fe80::8e:f470:4074:97e9/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


sh-4.4# ip a show eno2
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc mq state UP group default qlen 1000
    link/ether dc:f4:01:e7:5d:85 brd ff:ff:ff:ff:ff:ff
    inet 192.168.222.112/24 brd 192.168.222.255 scope global dynamic noprefixroute eno2
       valid_lft 1524sec preferred_lft 1524sec
    inet6 fe80::def4:1ff:fee7:5d85/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

sh-4.4# ip a show br0
14: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether b6:5f:07:08:a2:4a brd ff:ff:ff:ff:ff:ff


sh-4.4# ip route
default via 192.168.222.101 dev eno2 
default via 10.73.117.254 dev eno1 proto dhcp metric 100 
10.73.116.0/23 dev eno1 proto kernel scope link src 10.73.116.54 metric 100 
10.128.0.0/14 dev tun0 scope link 
172.30.0.0/16 dev tun0 
192.168.222.0/24 dev eno2 proto kernel scope link src 192.168.222.112 metric 101


step 2:

Delete the sdn pod so that it is recreated on this node

step 3: 

Check the logs of the newly created sdn pod

# oc logs sdn-lf7xb -n openshift-sdn -c sdn | grep -i mtu
I0508 14:06:25.729983  398667 node.go:247] Checking default interface MTU


step 4:

Create one rc, scale it to 20 pods, and confirm that pods can be scheduled to this node.

Comment 21 errata-xmlrpc 2021-07-27 22:32:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438