Bug 1877570

Summary: Configuring Jumbo Frame MTU results in network.openshift.io/mtu-too-small and cluster install failure
Product: OpenShift Container Platform Reporter: Robert Bost <rbost>
Component: Documentation Assignee: Vikram Goyal <vigoyal>
Status: CLOSED CURRENTRELEASE QA Contact: Xiaoli Tian <xtian>
Severity: medium Docs Contact: Vikram Goyal <vigoyal>
Priority: medium    
Version: 4.5 CC: aconstan, adahiya, agarcial, alchan, aos-bugs, bleanhar, danw, eminguez, jboxman, jcallen, jocolema, jokerman, jupittma, shsaxena, walters
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-04-07 15:49:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Robert Bost 2020-09-09 20:53:40 UTC
Description of problem:

When the MTU is set to 8950 (for a Jumbo Frames MTU of 9000) in cluster-network-03-config.yml, the cluster fails to install because all masters are tainted with network.openshift.io/mtu-too-small.

Manually inspecting the masters shows the default interface has a small MTU (1500 in vSphere), which I believe is what triggers the taint.

Version-Release number of the following components:
openshift-installer 4.5.8
vSphere IPI

How reproducible: Always when configuring MTU to 8950 in cluster-network-03-config.yml

Steps to Reproduce:
1. Create cluster-network-03-config.yml. See example config below.
2. Configure MTU to 8950 (50 smaller than 9000 for Jumbo Frames)
3. Start installation
4. Watch for pods to hang in Pending; Events will show they are stuck because of the mtu-too-small taint.

Example Config:
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
  - cidr: 192.168.0.0/16
    hostPrefix: 25
  serviceNetwork:
  - 172.28.128.0/17
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: NetworkPolicy
      mtu: 8950
      vxlanPort: 4789
  kubeProxyConfig:
    iptablesSyncPeriod: 30s
    proxyArguments:
      iptables-min-sync-period:
      - 30s

Expected results:
Expect vSphere IPI to configure the MTU properly on nodes so the install can complete.

Additional Information:
I was able to reproduce a similar issue using AWS IPI but the customer impacted by this is on vSphere.

Comment 4 Abhinav Dahiya 2020-09-10 18:13:49 UTC
This modification is done using openshift-install create manifests, and there is no guarantee that ALL changes will work, only the documented ones. Can you please point to where you found the documentation?

The network operator needs to materialize the API it apparently supports. The compute machines are created by the machine API and configured by machine configs. So I think the network operator needs to work with the machine-api operator or machine-config operator to make the network API field real, or prevent people from setting such invalid values.

There is nothing that the installer can do to fix this, afaik.

Comment 5 Robert Bost 2020-09-10 18:15:21 UTC
> This modification is done using openshift-install create manifests, and there is no guarantee that ALL changes will work, only the documented ones. Can you please point to where you found the documentation?

https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere-installer-provisioned-network-customizations.html

Comment 7 zhaozhanqi 2020-09-11 10:44:31 UTC
How about using 8951? I remember the AWS default interface MTU is 9001. I tried 8951 and it works well on my side.

Comment 8 Robert Bost 2020-09-11 15:52:43 UTC
The main issue is in vSphere but I could reproduce similar behaviour in AWS (AWS is all I have access to for a reproducer). The issue is that the default interface MTU doesn't seem to be affected by the network config mentioned in c#0

Comment 10 Dan Winship 2020-09-14 15:13:21 UTC
(In reply to Robert Bost from comment #8)
> The main issue is in vSphere but I could reproduce similar behaviour in AWS
> (AWS is all I have access to for a reproducer). The issue is that the
> default interface MTU doesn't seem to be affected by the network config
> mentioned in c#0

It's not supposed to be affected by that. The network operator config specifies the configuration of the SDN, not of the underlying network. You're saying "regardless of what the underlay network MTU appears to be, make the openshift-sdn tunnel MTU be 8950", and openshift-sdn is saying "I can't do that, that MTU is bigger than the underlay network".
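The relationship described above can be sketched in a few lines. This is only an illustration, not the actual openshift-sdn check: the function names are hypothetical, and the 50-byte VXLAN overhead is taken from the values used in this report (8950 for a 9000-byte underlay, 8951 for 9001).

```python
# Hypothetical sketch; this is NOT the real openshift-sdn implementation.
# Assumption: the tunnel needs 50 bytes of headroom below the node MTU,
# matching the 8950/9000 and 8951/9001 pairs mentioned in this bug.
VXLAN_OVERHEAD = 50


def max_tunnel_mtu(node_mtu: int) -> int:
    """Largest SDN tunnel MTU that fits on a node interface with node_mtu."""
    return node_mtu - VXLAN_OVERHEAD


def tunnel_mtu_fits(tunnel_mtu: int, node_mtu: int) -> bool:
    """True if the configured tunnel MTU fits inside the underlay MTU."""
    return tunnel_mtu <= max_tunnel_mtu(node_mtu)


if __name__ == "__main__":
    # Print, for each underlay MTU, the largest usable tunnel MTU and
    # whether the 8950 value from the report would fit.
    for node_mtu in (1500, 9000, 9001):
        print(node_mtu, max_tunnel_mtu(node_mtu), tunnel_mtu_fits(8950, node_mtu))
```

With a node MTU of 1500 the requested 8950 does not fit, which corresponds to the mtu-too-small taint seen by the reporter.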


In AWS, etc., nodes come up with the correct interface MTU automatically. I would expect that to happen in vSphere as well. Are you certain that the vSphere cluster is configured correctly and actually has a 9000 MTU?

Comment 11 John Coleman 2020-09-16 16:58:08 UTC
Hi Dan,

Robert is currently out of office - I will ask the customer to be sure of this - I did not see verification in the case.  I'll let you know, thanks!

Comment 12 Juan Luis de Sousa-Valadas 2020-09-23 14:30:07 UTC
We requested more info from the customer on the 14th and have not received a reply so far, so I'm lowering the severity and priority to medium.

Comment 20 Dan Winship 2020-10-01 15:44:14 UTC
ah, on AWS:

Oct 01 14:53:43 ip-10-0-135-76 NetworkManager[744]: <info>  [1601564023.5503] dhcp4 (ens5): option interface_mtu        => '9001'

and [NetworkManager does process this option](https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/master/src/dhcp/nm-dhcp-utils.c#L536). So the reason nodes get the right MTU on AWS is because Amazon's DHCP servers tell us what it is. So probably the fix here is that we need to figure out how to make the vSphere DHCP servers tell us the MTU that was configured in vSphere for the node network.
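To check whether a DHCP server advertises an MTU on some other platform, one could scan the node's journal for the same NetworkManager message. The sketch below is an assumption-laden helper: the function name and the regular expression are hypothetical and are only modelled on the single log line quoted above, so other NetworkManager versions may format the message differently.

```python
import re
from typing import Dict

# Assumption: the message format matches the AWS log line quoted above.
MTU_OPTION = re.compile(
    r"dhcp4 \((?P<iface>[^)]+)\): option interface_mtu\s+=> '(?P<mtu>\d+)'"
)


def dhcp_advertised_mtus(journal_text: str) -> Dict[str, int]:
    """Map interface name to the MTU announced by DHCP, if any was logged."""
    found: Dict[str, int] = {}
    for match in MTU_OPTION.finditer(journal_text):
        found[match.group("iface")] = int(match.group("mtu"))
    return found


# The log line from this comment, used as sample input.
SAMPLE = (
    "Oct 01 14:53:43 ip-10-0-135-76 NetworkManager[744]: <info>  "
    "[1601564023.5503] dhcp4 (ens5): option interface_mtu        => '9001'"
)

if __name__ == "__main__":
    print(dhcp_advertised_mtus(SAMPLE))
```

An empty result would suggest that the DHCP server is not sending the interface MTU option, which is the situation suspected for vSphere here.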

Since this is IPI, let's try reassigning to "Installer", though again, this might need to get assigned to somewhere more vSphere-specific, but I don't know where that would be. (It is also possible that there is no way to get vSphere's DHCP server to do this, in which case the customer would have to fix the node MTU by hand via MachineConfigs, and this would become a documentation bug.)

Comment 23 Abhinav Dahiya 2020-10-01 16:40:19 UTC
(In reply to Dan Winship from comment #20)
> ah, on AWS:
> 
> Oct 01 14:53:43 ip-10-0-135-76 NetworkManager[744]: <info> 
> [1601564023.5503] dhcp4 (ens5): option interface_mtu        => '9001'
> 
> and [NetworkManager does process this
> option](https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/
> master/src/dhcp/nm-dhcp-utils.c#L536). So the reason nodes get the right MTU
> on AWS is because Amazon's DHCP servers tell us what it is. So probably the
> fix here is that we need to figure out how to make the vSphere DHCP servers
> tell us the MTU that was configured in vSphere for the node network.
> 
> Since this is IPI, let's try reassigning to "Installer", though again, this
> might need to get assigned to somewhere more vSphere-specific, but I don't
> know where that would be. (It is also possible that there is no way to get
> vSphere's DHCP server to do this, in which case the customer would have to
> fix the node MTU by hand via MachineConfigs, and this would become a
> documentation bug.)

Moving back to SDN team.

The team owns the network config object and allows users to set the MTU when they can't really support this change in a sane way. AWS seems to allow a high MTU, but the vSphere environment used by the reporter apparently doesn't, and the network component still accepts the larger value when it should either have rejected it or not tried to use an invalid value for pods, causing failure.

It is not the installer team's responsibility to document or validate this.

Comment 24 Dan Winship 2020-10-01 21:50:42 UTC
Sorry, it sucks when a bug gets huge and the "bug" under discussion changes halfway through. The original CNO configuration discussion is a red herring; the customer had an MTU-related problem, so they found the only configuration option in the OCP docs that had the word "MTU" in it, and tried to use it, hoping it was relevant to their problem. (Narrator: It _wasn't_ relevant.)

ie, what happened is:

  1. Customer sets up vSphere, creates a vSphere virtual network with MTU 9000

  2. Customer installs OpenShift, sees their nodes coming up with MTU 1500 instead of 9000,
     is unhappy

  3. Customer finds the incorrect configuration option in the docs and tries again, thinking
     that they're telling OCP "make the nodes have an MTU of 9000", when actually they're
     telling it "make the VXLAN tunnel MTU be 8950 regardless of what the node MTU is"

  4. openshift-sdn starts on each node, sees that the node it is running on has an MTU of 1500,
     and says "I can't create a VXLAN tunnel with an MTU of 8950 on this node"

Yes there are improvements that can occur in step 4, but that's just improving the error message that tells the user that they changed the wrong option anyway. The problem is in step 2; the customer expected the cluster to come up with nodes with MTU 9000, and that didn't happen. So how does the customer make the nodes come up with the correct MTU? CNO can't help here because it's not involved in the configuration of the node network. But the installer is, so it seemed to me like maybe it could do something; the installer already pokes at the configuration of the cloud DHCP server on AWS at least; maybe it could poke at the configuration of the cloud DHCP server on vSphere too, to get it to send the MTU option.

(If there had been a "vSphere experts" component in bugzilla, I would have assigned the bug to that, but there's not...)

If there is nothing the installer can do, and there are no other vSphere-specific components that can do something, then as I said, I think this goes to Documentation, so they can explain how to use MachineConfigs to override the autodetected MTU on the nodes.

Comment 26 Joseph Callen 2020-10-02 14:07:24 UTC
vSphere is not a cloud, so assume there are just virtual machines and virtual switches (layer 2).
vSphere provides no DHCP, DNS, or LB out of the box (yes, there is NSX-T, but we have to assume they don't have it).

I think having a customer set option 26 on their DHCP scope would be problematic. 

1.) There is an assumption that all the nodes within that scope should have a different MTU which may not be the case.
2.) Based on Slack messages and email I have received, customers do not want to run DHCP.

At a minimum, there needs to be a documented manual process to change the MTU.

Comment 27 Colin Walters 2020-10-02 17:06:10 UTC
The network interfaces can be configured by writing a MachineConfig that writes network configuration files, either traditional RH initscripts networking or NetworkManager keyfiles.

It's basically: take https://access.redhat.com/solutions/6305 and encode in MachineConfig.
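A rough sketch of what such a MachineConfig could look like, generated with Python so the keyfile is encoded consistently. Everything specific here is an assumption for illustration: the interface name ens192, the object name 99-worker-mtu, the DHCP-based [ipv4] section, and Ignition spec version 3.1.0 (releases that use an earlier Ignition spec would need a different file layout). It is not taken from the official docs.

```python
import base64
import json


def nm_keyfile(interface: str, mtu: int) -> str:
    """NetworkManager keyfile that pins the MTU on one ethernet interface."""
    return (
        "[connection]\n"
        f"id={interface}\n"
        "type=ethernet\n"
        f"interface-name={interface}\n"
        "\n"
        "[ethernet]\n"
        f"mtu={mtu}\n"
        "\n"
        "[ipv4]\n"
        "method=auto\n"  # assumption: the node still gets its address via DHCP
    )


def mtu_machine_config(role: str, interface: str, mtu: int) -> dict:
    """Build a MachineConfig (as a dict) that writes the keyfile to the node."""
    encoded = base64.b64encode(nm_keyfile(interface, mtu).encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": f"99-{role}-mtu",  # hypothetical name
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.1.0"},  # assumption, see lead-in
                "storage": {
                    "files": [
                        {
                            "path": f"/etc/NetworkManager/system-connections/{interface}.nmconnection",
                            "mode": 0o600,  # serialized as decimal 384 in JSON
                            "overwrite": True,
                            "contents": {
                                "source": f"data:text/plain;charset=utf-8;base64,{encoded}"
                            },
                        }
                    ]
                },
            }
        },
    }


if __name__ == "__main__":
    # Emit a JSON manifest; the interface name and MTU are placeholders.
    print(json.dumps(mtu_machine_config("worker", "ens192", 9000), indent=2))
```

The printed JSON could be saved as a manifest file; the interface name must match the real primary interface on the nodes, and a separate object would be needed for each machine role.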

Comment 28 Colin Walters 2020-10-02 17:07:04 UTC
And yes we need an "advanced OS networking" guide with examples for these in the docs.

Comment 29 Colin Walters 2020-10-02 17:27:58 UTC
Also worth mentioning: with the new vSphere afterburn networking support (https://github.com/openshift/installer/pull/4121), one can specify the MTU as part of the cmdline, which should work. More info in https://man7.org/linux/man-pages/man7/dracut.cmdline.7.html

A large benefit of using this approach is that the correct MTU would then be set before the system sends any packets at all.
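For reference, the dracut.cmdline(7) page linked above documents an ip=<interface>:{dhcp|on|any|dhcp6|auto6}[:[<mtu>][:<macaddr>]] form. A small helper that builds such an argument might look like the following; the function name and the interface name ens192 are hypothetical, and how the resulting string is handed to the VM is not covered here.

```python
# Autoconfiguration methods accepted by the interface-only ip= form
# in dracut.cmdline(7).
ALLOWED_METHODS = {"dhcp", "on", "any", "dhcp6", "auto6"}


def dracut_ip_karg(interface: str, mtu: int, method: str = "dhcp") -> str:
    """Build an ip= kernel argument that sets the MTU for one interface."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    if mtu <= 0:
        raise ValueError("mtu must be a positive integer")
    return f"ip={interface}:{method}:{mtu}"


if __name__ == "__main__":
    # Hypothetical interface name; use the node's real primary interface.
    print(dracut_ip_karg("ens192", 9000))
```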

Comment 30 Colin Walters 2020-10-02 17:50:49 UTC
Reassigning to Documentation per the above - I think the RHCOS team would be happy to help expand the
https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-bare-metal-network-customizations.html
page with some of this if the docs team starts a draft based on the above.

Comment 32 Dan Winship 2020-11-04 14:10:20 UTC
*** Bug 1891887 has been marked as a duplicate of this bug. ***

Comment 33 Dan Winship 2021-02-01 17:46:50 UTC
*** Bug 1920983 has been marked as a duplicate of this bug. ***

Comment 34 Jason Boxman 2021-03-03 02:17:43 UTC
In the meantime, Dan Winship created a docs PR[0] for this. And I expect to have that merged soon.

[0] https://github.com/openshift/openshift-docs/pull/29016

Comment 35 Jason Boxman 2021-03-16 04:18:05 UTC
The docs PR is now merged.

As a docs issue bug, I think this is now closed.

But if we need to create new content, feel free to create an OSDOCS project Jira issue for new work and we'll work to get that content included if it makes sense.

Thanks!