Bug 1846485 - Deployment with Jumbo Frames (MTU 9000) on Baremetal fails
Summary: Deployment with Jumbo Frames (MTU 9000) on Baremetal fails
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antoni Segura Puimedon
QA Contact: Nataf Sharabi
URL:
Whiteboard:
Duplicates: 1846499
Depends On:
Blocks:
 
Reported: 2020-06-11 16:58 UTC by Sai Sindhur Malleni
Modified: 2020-08-04 13:37 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 13:37:25 UTC
Target Upstream Version:
Embargoed:



Description Sai Sindhur Malleni 2020-06-11 16:58:06 UTC
Description of problem:
Deploying OpenShift on a baremetal environment with jumbo frames enabled fails: the install never completes, with some pods (kube-apiserver, etc.) never coming up. I believe this is strongly tied to the jumbo MTU configuration, as the install on the same hardware with the same OCP version and the default MTU succeeds flawlessly.


Version-Release number of selected component (if applicable):
4.4.6

How reproducible:
100%

Steps to Reproduce:
1. Follow the instructions at https://docs.openshift.com/container-platform/4.4/installing/installing_bare_metal/installing-bare-metal-network-customizations.html to set the MTU to 8950 in cluster-network-03-config.yml for OpenShiftSDN.
2. Since the manifests do not configure the MTU on the baremetal interface (they only configure the MTU for the VXLAN interface and veths), use ignition files to create ifcfg files that set the MTU to 9000 on the baremetal interface (a rough sketch of this flow follows these steps).
3. Run openshift-install.
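
For reference, a rough sketch of how steps 1 and 2 fit into the usual install flow (the clusterconfigs directory name is taken from the shell prompts later in this bug; treat the exact layout as an assumption):

# Step 1: generate manifests, then add the custom network config
openshift-install create manifests --dir=clusterconfigs
cp cluster-network-03-config.yml clusterconfigs/manifests/

# Step 2: generate ignition configs, then append the ifcfg files
# setting MTU=9000 to master.ign/worker.ign (see comment 7)
openshift-install create ignition-configs --dir=clusterconfigs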

Actual results:
The MTU on the vxlan and veth interfaces is correctly configured to 8950 and the baremetal interface has an MTU of 9000 as expected, yet the install fails.


Expected results:


Additional info:

Comment 1 Dan Williams 2020-06-11 19:16:23 UTC
It would be interesting to know what is different at a NIC/SDN level between this setup and an AWS setup, which also uses jumbo frames and which we know works correctly because we test it on every commit in CI...

Comment 2 Sai Sindhur Malleni 2020-06-15 18:35:54 UTC
I initially thought this could be due to the bootstrap VM MTU and the baremetal bridge MTU not being set to 9000. On a subsequent deployment attempt, I set the bootstrap VM MTU as well as the baremetal bridge/interface MTU on the provisioning host to jumbo, yet the deploy still hangs:

NAMESPACE                                               NAME                                                              READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator                            openshift-apiserver-operator-f79557665-pktwj                      0/1     Pending   0          108m
openshift-authentication-operator                       authentication-operator-64d4ddc475-kgz6s                          0/1     Pending   0          108m
openshift-cluster-machine-approver                      machine-approver-6d54996f4-6g8p6                                  0/2     Pending   0          109m
openshift-cluster-node-tuning-operator                  cluster-node-tuning-operator-6dcd5cbfcc-p4pt8                     0/1     Pending   0          109m
openshift-cluster-storage-operator                      csi-snapshot-controller-operator-bf96f6cc7-852cv                  0/1     Pending   0          108m
openshift-cluster-version                               cluster-version-operator-7c44bdbb69-ncgrp                         0/1     Pending   0          109m
openshift-controller-manager-operator                   openshift-controller-manager-operator-7976ddf498-hxj9h            0/1     Pending   0          108m
openshift-dns-operator                                  dns-operator-778fd8fbb5-g2ssg                                     0/2     Pending   0          109m
openshift-etcd-operator                                 etcd-operator-5998db5474-c25rv                                    0/1     Pending   0          108m
openshift-kube-apiserver-operator                       kube-apiserver-operator-d4dfb74f8-lkhzd                           0/1     Pending   0          108m
openshift-kube-controller-manager-operator              kube-controller-manager-operator-787c59c5bf-qvttb                 0/1     Pending   0          108m
openshift-kube-scheduler-operator                       openshift-kube-scheduler-operator-59d76b9498-w4ncg                0/1     Pending   0          108m
openshift-kube-storage-version-migrator-operator        kube-storage-version-migrator-operator-55f49cff56-l68ng           0/1     Pending   0          109m
openshift-machine-config-operator                       machine-config-operator-54d7fd979-f8vcn                           0/1     Pending   0          109m
openshift-network-operator                              network-operator-66ddfd8657-4hbnc                                 0/1     Pending   0          108m
openshift-operator-lifecycle-manager                    catalog-operator-565b8d557b-hpdj4                                 0/1     Pending   0          109m
openshift-operator-lifecycle-manager                    olm-operator-5fc685ddfd-l5bfh                                     0/1     Pending   0          109m
openshift-service-ca-operator                           service-ca-operator-5b9647cbd6-bl4pz                              0/1     Pending   0          108m
openshift-service-catalog-apiserver-operator            openshift-service-catalog-apiserver-operator-66c64c4f64-4qsqj     0/1     Pending   0          108m
openshift-service-catalog-controller-manager-operator   openshift-service-catalog-controller-manager-operator-75c7kjxbl   0/1     Pending   0          108m

Comment 3 Beth White 2020-06-16 16:19:06 UTC
*** Bug 1846499 has been marked as a duplicate of this bug. ***

Comment 4 Steven Hardy 2020-06-16 16:20:07 UTC
Sounds like this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1846499 ?

I think we need more information to proceed. Can you provide the exact ignition customizations used, details of the provisioning host configuration, and more detail on the exact steps to reproduce?

Is it possible the provisioning host networking where the installer is run did not get correctly configured? That part is not automated by the installer at the moment.

Comment 5 Julia Kreger 2020-06-16 16:25:30 UTC
Sai, out of curiosity, could you provide `ip link` output from the baremetal host and the bootstrap VM? I seem to remember a kernel behavior change within the last few years that started to truncate packets across bridges, so the ip link information would be super helpful for understanding exactly what is occurring.

Comment 6 Sai Sindhur Malleni 2020-06-16 16:33:42 UTC
Steve, Julia: I will get you the information you need shortly, and could possibly even give you access to the environment when/where this happens, if that is something you are interested in.

However, I have a question here. This bug is about deployments that fail when a jumbo MTU is used. By contrast, https://bugzilla.redhat.com/show_bug.cgi?id=1846499, which has been marked as a duplicate of this bug, addresses a very specific case: the MTU in the manifests is not translated into an MTU on the baremetal interfaces, so a custom ignition file is needed to configure it. While these bugs are related, I don't think https://bugzilla.redhat.com/show_bug.cgi?id=1846499 is a duplicate, because that one is not about deployments failing. It's about the MTU on the baremetal interface not automatically being driven off the SDN MTU in the manifests.

Comment 7 Sai Sindhur Malleni 2020-06-18 01:32:17 UTC
So, here's the manifest I used to set the custom MTU (cluster-network-03-config.yml):

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec: 
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: NetworkPolicy
      mtu: 8950
      vxlanPort: 4789

Using ignition files, I also set the MTU on the baremetal interface to 9000 (the SDN MTU plus 50 bytes to account for VXLAN overhead).
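
The 50-byte delta is the standard VXLAN encapsulation overhead; worked out against the values above (the outer Ethernet header does not count against the NIC MTU):

9000 (NIC MTU) - 20 (outer IPv4) - 8 (UDP) - 8 (VXLAN) - 14 (inner Ethernet) = 8950 (SDN MTU)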

master.ign

{"ignition": {"config": {"append": [{"source": "https://10.1.59.3:22623/config/master", "verification": {}}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJUldDbDRhK1dLNXd3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl3TURZeE9EQXdOVFUwTWxvWApEVE13TURZeE5qQXdOVFUwTWxvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUFvUVBId29LVGtXVy8KVnh2czdnKzB4THhjWTkwdlJvbmVVYlJDRm9PRk54WXVtODFoNUd6U0dQdkJRTEFPV3IzOG44MU9ZM2dUd2trYQpvNmRjd3VEZFFJMDZJUTJERlpocWwxRGZSbUxsd3Nyamd2S20zbVFrOGxUd2N1dDliRER3emFxaXI4R0FWbll5CnJ6MWJESWR6VWljc01WWCtkQmMzekpyZWhaczVCOVNvdkRpdTQySjloeXI4RlR5TldvSCsvV1RjNU1tVEM5cjYKODdUZnUvbFl5WWpwVTAwQmxOUVFVVEF5amNHOEV2YytOZnRLRVhHTk1PZmp5SWw3Y2NmZ1VnOE5vbnJ6bDJJMwo1TktPZkt0SXNPbnpqYkRqZWtkanFrQXlaNE9IUlVuclBmZTJCZVEyN1k5cVZYOFdXT2l6MG1BNWlhc25pL3p1CnJ0U0VCajVxcFFJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVThlWFBTY253QkxQcGxpcjF4QzRFeG1pNVdFTXdEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUNWRHcxZWJZR3JiNmIyWVd6amNQcGt4OWcySCtqNGt4cndsU2Y5Ylg1TGpHOXFiMXU0WU9pTVpJMFp4CjRJdkZsRmErSDFONWxWUlBxemxIL3JHY3JtdWFzTDR5ZU55b0hzRHBsanVOVkZuRW12YVNUcEYrdXMzd09GQzQKUkdMMC9XZjZ2UVdjbEV3dDVKTHNRTVpMd3R6bExtd2gyOE9nZUsyU0xaanJnYlFvSURxUlNyM1NNRG8zVk9tVwoxd1Uxd3dPNEN2K2ZNMnhQMzNKQXFzdndPTHlHcElDNU91UHZpc3dRclNVemlaZTh2UVRGWEVxTUlqUnhJWHZICk5QMXVDVm5SdmdjSU4yOXQ0UzQ4R0VwTjVwQzFQb1hPVE5KbzV4RW9oa05VU3ZNa1dKV2dNVS9nOVZBeFRvajQKd2NuU2owYmJWUmVLMWF2Vnd0Q3FydWZFalRzPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==", "verification": {}}]}}, "timeouts": {}, "version": "2.2.0"}, "networkd": {}, "passwd": {}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-ens2f1", "filesystem": "root", "mode": 436, "contents": {"source": "data:,DEVICE%3Dens2f1%0ABOOTPROTO%3Ddhcp%0AONBOOT%3Dyes%0AMTU%3D9000%0A"}}, {"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "filesystem": "root", "mode": 436, "contents": {"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}, "systemd": {}}

worker.ign

{"ignition": {"config": {"append": [{"source": "https://10.1.59.3:22623/config/worker", "verification": {}}]}, "security": {"tls": {"certificateAuthorities": [{"source": "data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJUldDbDRhK1dLNXd3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl3TURZeE9EQXdOVFUwTWxvWApEVE13TURZeE5qQXdOVFUwTWxvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUFvUVBId29LVGtXVy8KVnh2czdnKzB4THhjWTkwdlJvbmVVYlJDRm9PRk54WXVtODFoNUd6U0dQdkJRTEFPV3IzOG44MU9ZM2dUd2trYQpvNmRjd3VEZFFJMDZJUTJERlpocWwxRGZSbUxsd3Nyamd2S20zbVFrOGxUd2N1dDliRER3emFxaXI4R0FWbll5CnJ6MWJESWR6VWljc01WWCtkQmMzekpyZWhaczVCOVNvdkRpdTQySjloeXI4RlR5TldvSCsvV1RjNU1tVEM5cjYKODdUZnUvbFl5WWpwVTAwQmxOUVFVVEF5amNHOEV2YytOZnRLRVhHTk1PZmp5SWw3Y2NmZ1VnOE5vbnJ6bDJJMwo1TktPZkt0SXNPbnpqYkRqZWtkanFrQXlaNE9IUlVuclBmZTJCZVEyN1k5cVZYOFdXT2l6MG1BNWlhc25pL3p1CnJ0U0VCajVxcFFJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVThlWFBTY253QkxQcGxpcjF4QzRFeG1pNVdFTXdEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUNWRHcxZWJZR3JiNmIyWVd6amNQcGt4OWcySCtqNGt4cndsU2Y5Ylg1TGpHOXFiMXU0WU9pTVpJMFp4CjRJdkZsRmErSDFONWxWUlBxemxIL3JHY3JtdWFzTDR5ZU55b0hzRHBsanVOVkZuRW12YVNUcEYrdXMzd09GQzQKUkdMMC9XZjZ2UVdjbEV3dDVKTHNRTVpMd3R6bExtd2gyOE9nZUsyU0xaanJnYlFvSURxUlNyM1NNRG8zVk9tVwoxd1Uxd3dPNEN2K2ZNMnhQMzNKQXFzdndPTHlHcElDNU91UHZpc3dRclNVemlaZTh2UVRGWEVxTUlqUnhJWHZICk5QMXVDVm5SdmdjSU4yOXQ0UzQ4R0VwTjVwQzFQb1hPVE5KbzV4RW9oa05VU3ZNa1dKV2dNVS9nOVZBeFRvajQKd2NuU2owYmJWUmVLMWF2Vnd0Q3FydWZFalRzPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==", "verification": {}}]}}, "timeouts": {}, "version": "2.2.0"}, "networkd": {}, "passwd": {}, "storage": {"files": [{"path": "/etc/sysconfig/network-scripts/ifcfg-ens2f1", "filesystem": "root", "mode": 436, "contents": {"source": "data:,DEVICE%3Dens2f1%0ABOOTPROTO%3Ddhcp%0AONBOOT%3Dyes%0AMTU%3D9000%0A"}}, {"path": "/etc/sysconfig/network-scripts/ifcfg-eno1", "filesystem": "root", "mode": 436, "contents": {"source": "data:,DEVICE%3Deno1%0ABOOTPROTO%3Dnone%0AONBOOT%3Dno%0A"}}]}, "systemd": {}}


Before kicking off the deploy, I changed the MTU of the baremetal interface on my provisioning host. Here's the output of ip a on the provisioning host:
=======================================================================================================================================================

[kni@e19-h24-b04-fc640 clusterconfigs]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:4e:01:3f:31:78 brd ff:ff:ff:ff:ff:ff
    inet 10.1.39.6/22 brd 10.1.39.255 scope global dynamic noprefixroute eno1
       valid_lft 231182sec preferred_lft 231182sec
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:4e:01:3f:31:79 brd ff:ff:ff:ff:ff:ff
4: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master provisioning state UP group default qlen 1000
    link/ether 3c:fd:fe:e7:94:50 brd ff:ff:ff:ff:ff:ff
5: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master baremetal state UP group default qlen 1000
    link/ether 3c:fd:fe:e7:94:51 brd ff:ff:ff:ff:ff:ff
9: baremetal: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:e7:94:51 brd ff:ff:ff:ff:ff:ff
    inet 10.1.59.1/24 brd 10.1.59.255 scope global noprefixroute baremetal
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:13b:bbd2:3811:34b3:452a/64 scope global dynamic noprefixroute 
       valid_lft 2591791sec preferred_lft 604591sec
    inet6 fe80::c8ce:23b7:4e83:2880/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
11: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:49:21:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
12: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master virbr0 state DOWN group default qlen 1000
    link/ether 52:54:00:49:21:3f brd ff:ff:ff:ff:ff:ff
14: provisioning: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:e7:94:50 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.1/24 brd 172.22.0.255 scope global noprefixroute provisioning
       valid_lft forever preferred_lft forever
15: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel master baremetal state UNKNOWN group default qlen 1000
    link/ether fe:54:00:da:5e:68 brd ff:ff:ff:ff:ff:ff
16: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master provisioning state UNKNOWN group default qlen 1000
    link/ether fe:54:00:82:32:24 brd ff:ff:ff:ff:ff:ff
=========================================================================================================================================================


I do not set the MTU to jumbo in my bootstrap ignition, but as soon as the bootstrap VM was spawned I logged into the VM and changed the MTU (before the pods got spawned).
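The exact command used is not recorded in this comment; a standard iproute2 invocation along these lines would do it (interface name taken from the ip a output below):

# change the MTU on the bootstrap VM's baremetal-side interface
ip link set ens3 mtu 9000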
==========================================================================================================================================================

Here's the output of ip a inside the bootstrap VM:
[root@localhost core]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:da:5e:68 brd ff:ff:ff:ff:ff:ff
    inet 10.1.59.79/24 brd 10.1.59.255 scope global dynamic noprefixroute ens3
       valid_lft 3469sec preferred_lft 3469sec
    inet 10.1.59.2/24 scope global secondary ens3
       valid_lft forever preferred_lft forever
    inet 10.1.59.3/24 scope global secondary ens3
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:13b:13b3:8dad:7631:7615/64 scope global dynamic noprefixroute 
       valid_lft 2591871sec preferred_lft 604671sec
    inet6 fe80::a335:27d0:a4db:8a24/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:82:32:24 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.2/24 brd 172.22.0.255 scope global noprefixroute ens4
       valid_lft forever preferred_lft forever
    inet6 fe80::9de8:19ba:dd86:48f8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
=========================================================================================================================================================

Since the pods use host networking, they also have the jumbo MTU:

============================================================================================================================================================
[root@localhost core]# podman exec -it mariadb bash
[root@localhost /]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:da:5e:68 brd ff:ff:ff:ff:ff:ff
    inet 10.1.59.79/24 brd 10.1.59.255 scope global dynamic noprefixroute ens3
       valid_lft 3462sec preferred_lft 3462sec
    inet 10.1.59.2/24 scope global secondary ens3
       valid_lft forever preferred_lft forever
    inet 10.1.59.3/24 scope global secondary ens3
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:13b:13b3:8dad:7631:7615/64 scope global dynamic noprefixroute 
       valid_lft 2591864sec preferred_lft 604664sec
    inet6 fe80::a335:27d0:a4db:8a24/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:82:32:24 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.2/24 brd 172.22.0.255 scope global noprefixroute ens4
       valid_lft forever preferred_lft forever
    inet6 fe80::9de8:19ba:dd86:48f8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

===================================================================================================================================================

Comment 8 Sai Sindhur Malleni 2020-06-18 01:54:12 UTC
The install keeps getting stuck and does not make progress:
No resources found in openshift-kni-infra namespace.
[kni@e19-h24-b04-fc640 clusterconfigs]$ oc get pods -A                                                                                                                                                                                       
NAMESPACE                                               NAME                                                              READY   STATUS    RESTARTS   AGE                                                                                   
openshift-apiserver-operator                            openshift-apiserver-operator-8596449546-z54mc                     0/1     Pending   0          41m                                                                                   
openshift-authentication-operator                       authentication-operator-66f85cff9-tj6bh                           0/1     Pending   0          41m                                                                                   
openshift-cloud-credential-operator                     cloud-credential-operator-695f4895db-dvdjh                        0/1     Pending   0          42m                                                                                   
openshift-cluster-machine-approver                      machine-approver-685c8468fb-7928j                                 0/2     Pending   0          41m                                                                                   
openshift-cluster-node-tuning-operator                  cluster-node-tuning-operator-6688b7b566-bp7nt                     0/1     Pending   0          42m                                                                                   
openshift-cluster-storage-operator                      csi-snapshot-controller-operator-84dd5b859b-89dxz                 0/1     Pending   0          41m                                                                                   
openshift-cluster-version                               cluster-version-operator-79bbd9b569-whbsl                         0/1     Pending   0          42m
openshift-controller-manager-operator                   openshift-controller-manager-operator-7ff98b7969-7p9gf            0/1     Pending   0          41m
openshift-dns-operator                                  dns-operator-7c947d89c6-mqrb6                                     0/2     Pending   0          42m
openshift-etcd-operator                                 etcd-operator-5d97b6445f-cj7f4                                    0/1     Pending   0          41m
openshift-kube-apiserver-operator                       kube-apiserver-operator-8d9b94dbb-mht5m                           0/1     Pending   0          41m
openshift-kube-controller-manager-operator              kube-controller-manager-operator-6fdcc5987c-dsxn5                 0/1     Pending   0          41m
openshift-kube-scheduler-operator                       openshift-kube-scheduler-operator-68c9564886-95vjp                0/1     Pending   0          41m
openshift-kube-storage-version-migrator-operator        kube-storage-version-migrator-operator-5fd77bc4c8-4pjrx           0/1     Pending   0          41m
openshift-machine-api                                   machine-api-operator-5c4dd5d794-484p7                             0/2     Pending   0          41m
openshift-machine-config-operator                       machine-config-operator-78db57d645-m5qrm                          0/1     Pending   0          42m
openshift-must-gather-6q6jv                             must-gather-dn7m7                                                 0/1     Pending   0          12m
openshift-network-operator                              network-operator-7856c8dd68-4l56z                                 0/1     Pending   0          41m
openshift-operator-lifecycle-manager                    catalog-operator-59dc594f8f-9fxws                                 0/1     Pending   0          42m
openshift-operator-lifecycle-manager                    olm-operator-778c69c9f6-h9p22                                     0/1     Pending   0          42m
openshift-service-ca-operator                           service-ca-operator-5b4f8f7649-4x8xl                              0/1     Pending   0          41m
openshift-service-catalog-apiserver-operator            openshift-service-catalog-apiserver-operator-5f5f55469f-pclbl     0/1     Pending   0          41m
openshift-service-catalog-controller-manager-operator   openshift-service-catalog-controller-manager-operator-79784tcm2   0/1     Pending   0          41m
[kni@e19-h24-b04-fc640 clusterconfigs]$ oc get co
NAME               VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential             True        False         False      41m
[kni@e19-h24-b04-fc640 clusterconfigs]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          42m     Working towards 4.4.4: 73% complete

Some pods seem to be running on the masters:

[core@master-0 ~]$ sudo su
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[root@master-0 core]# crictl ps
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                 ATTEMPT             POD ID         
773b67ad92827       ee7065c322c2add50de27f32cc37656366c004cd5868b5993f50a37d9dea2a76                                                         20 minutes ago      Running             coredns-monitor      0                   4d23a14542b3a  
86bf319780841       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6594664ba965e195e06a70814e88895b2e92dc4746bdb1ec17b068f082405baf   20 minutes ago      Running             mdns-publisher       0                   f650eecb018e8  
e68f6e275c2ba       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29ddffd83d2035f76a649223a0fa850ad63a3ca441f6d217a721574465f47338   20 minutes ago      Running             coredns              0                   4d23a14542b3a  
ae6fd5b009327       ee7065c322c2add50de27f32cc37656366c004cd5868b5993f50a37d9dea2a76                                                         20 minutes ago      Running             keepalived-monitor   0                   e4d9465bda1df  
33f9413033c84       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:78241892eb5ceb7e7c8f6d2b2f890b8a0514a94152ed81b2781024385d984b42   20 minutes ago      Running             keepalived           0                   e4d9465bda1df  
1703529fae9c3       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0f9b3b61b5bfdc543373de25764958c4c1bbc639501924268c6cf4cd455f53e   20 minutes ago      Running             haproxy-monitor      0                   9864c2194e4d7  
3d0107ea55fcb       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d34b20bb3302bbd408a46002c47678f5bf613cf6c15966126967e7abd26c49d3   20 minutes ago      Running             haproxy              0                   9864c2194e4d7  


ip a output from a master during the deploy (after it stopped making progress):

[root@master-0 core]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:4e:01:40:b9:51 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:4e:01:40:b9:52 brd ff:ff:ff:ff:ff:ff
4: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:fd:fe:e7:a3:00 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.207/24 brd 172.22.0.255 scope global dynamic noprefixroute ens2f0
       valid_lft 2207sec preferred_lft 2207sec
    inet6 fe80::2420:3bd9:dc23:afdd/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
5: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 3c:fd:fe:e7:a3:01 brd ff:ff:ff:ff:ff:ff
    inet 10.1.59.10/24 brd 10.1.59.255 scope global dynamic noprefixroute ens2f1
       valid_lft 2207sec preferred_lft 2207sec
    inet6 fe80::3efd:feff:fee7:a301/64 scope link 
       valid_lft forever preferred_lft forever

Comment 11 Dan Winship 2020-07-30 13:32:32 UTC
> 1. Follow the instructions at
> https://docs.openshift.com/container-platform/4.4/installing/
> installing_bare_metal/installing-bare-metal-network-customizations.html to
> set the MTU to 8950 in cluster-network-03-config.yml for OpenShiftSDN
> 2. Since the manifests do not configure the MTU on the baremetal interface
> (configures only MTU for VXLAN interface and veths), use ignition files to
> create ifcfg files to set the MTU to 9000 on the baremetal interface

You do not need to create a cluster-network-03-config.yml manifest. You only need to set the MTU manually if the MTU is not consistent across the cluster and therefore can't be autodetected. If the MTU is correctly configured on the baremetal interfaces, CNO will configure the VXLAN MTU correctly automatically.
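
One way to check what CNO ended up with (a sketch, assuming OpenShiftSDN; the node name is taken from this report, and the mtu field is empty when left to autodetection):

# what MTU, if any, was explicitly requested in the operator config
oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.openshiftSDNConfig.mtu}'

# what MTU the SDN actually applied on a node (tun0 carries the SDN MTU)
oc debug node/master-0 -- chroot /host ip link show tun0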

Comment 13 Sai Sindhur Malleni 2020-07-30 16:24:20 UTC
(In reply to Dan Winship from comment #11)
> > 1. Follow the instructions at
> > https://docs.openshift.com/container-platform/4.4/installing/
> > installing_bare_metal/installing-bare-metal-network-customizations.html to
> > set the MTU to 8950 in cluster-network-03-config.yml for OpenShiftSDN
> > 2. Since the manifests do not configure the MTU on the baremetal interface
> > (configures only MTU for VXLAN interface and veths), use ignition files to
> > create ifcfg files to set the MTU to 9000 on the baremetal interface
> 
> You do not need to create a cluster-network-03-config.yml manifest. You only
> need to set the MTU manually if the MTU is not consistent across the cluster
> and therefore can't be autodetected. If the MTU is correctly configured on
> the baremetal interfaces, CNO will configure the VXLAN MTU correctly
> automatically.
So it looks like we need to fix our docs too, then...

Comment 14 Sai Sindhur Malleni 2020-08-03 18:19:12 UTC
I just did a deployment with 4.5.4 with OVNKubernetes, configured the MTU on the baremetal interface using DHCP option 26 (instead of ifcfg files through ignition, as I previously did), and did not touch any OpenShift manifests. The install went through successfully. I'm going to try again with OpenShiftSDN and report back here, as the original bug was against OpenShiftSDN. BTW, I think we need to fix our docs as stated in my previous comment: we no longer need to muck with the manifests to set the SDN MTU, since CNO should be able to set the appropriate MTU by reading the MTU of the baremetal interface.
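
For reference, if dnsmasq is the DHCP server on the baremetal network (the server actually used here is not stated), advertising option 26 looks like this:

# dnsmasq.conf: advertise an interface MTU of 9000 (DHCP option 26) to clients
dhcp-option=26,9000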

Comment 15 Sai Sindhur Malleni 2020-08-03 19:46:35 UTC
This is working with OpenShiftSDN as well on 4.5.4; I think we can safely close this bug.

