Bug 1441310 - 4.10 mtu restriction causes failures in openshift networking
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 25
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2017-04-11 16:11 UTC by Dusty Mabe
Modified: 2017-04-17 20:52 UTC

Fixed In Version: kernel-4.10.10-100.fc24 kernel-4.10.10-200.fc25
Last Closed: 2017-04-17 20:52:08 UTC
Type: Bug



Description Dusty Mabe 2017-04-11 16:11:12 UTC
Description of problem:

With 4.10 kernels we can't bring up OpenShift. Here is example status output from the failing origin-node service:

```
 origin-node[10803]: I0411 00:15:34.995594   10859 kubelet_node_status.go:377] Recording NodeHasNoDiskPressure event message for node 10.0.111.54
 origin-node[10803]: I0411 00:15:34.995740   10859 kubelet_node_status.go:73] Attempting to register node 10.0.111.54
 origin-node[10803]: I0411 00:15:35.001804   10859 kubelet_node_status.go:112] Node 10.0.111.54 was previously registered
 origin-node[10803]: I0411 00:15:35.001845   10859 kubelet_node_status.go:76] Successfully registered node 10.0.111.54
 origin-node[10803]: I0411 00:15:35.033412   10859 kubelet_node_status.go:377] Recording NodeNotSchedulable event message for node 10.0.111.54
 origin-node[10803]: I0411 00:15:35.073467   10859 manager.go:290] Recovery completed
 origin-node[10803]: I0411 00:15:35.159477   10859 conversion.go:133] failed to handle multiple devices for container. Skipping Filesystem stats
 origin-node[10803]: E0411 00:15:35.878364   10859 controller.go:176] Error removing 10.128.0.0/23 route from dev tun0: timed out waiting for the condition; if the route appears later it will not be deleted.
 origin-node[10803]: F0411 00:15:35.878402   10859 node.go:350] error: SDN node startup failed: exit status 2
 systemd[1]: origin-node.service: Main process exited, code=exited, status=255/n/a

```
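The records above appear to use the glog-style header that Kubernetes components emit (a severity letter, then MMDD, time, thread id, and `file:line]`). As a rough sketch under that assumption, the error and fatal records can be pulled out of a longer journal dump like this. The names `KLOG_RE` and `problem_lines` are made up for illustration and are not part of any OpenShift tooling:

```python
import re

# Assumed glog-style header: severity letter, MMDD, time, thread id, file:line]
KLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)"
)


def problem_lines(journal_text):
    """Return (severity, source, message) for every error (E) or fatal (F) record."""
    hits = []
    for line in journal_text.splitlines():
        m = KLOG_RE.search(line)
        if m and m.group("sev") in ("E", "F"):
            hits.append((m.group("sev"), m.group("src"), m.group("msg")))
    return hits


# Three records copied from the status output in this report.
sample = """\
 origin-node[10803]: I0411 00:15:35.073467   10859 manager.go:290] Recovery completed
 origin-node[10803]: E0411 00:15:35.878364   10859 controller.go:176] Error removing 10.128.0.0/23 route from dev tun0: timed out waiting for the condition; if the route appears later it will not be deleted.
 origin-node[10803]: F0411 00:15:35.878402   10859 node.go:350] error: SDN node startup failed: exit status 2
"""

for sev, src, msg in problem_lines(sample):
    print(sev, src, msg)
```

On this excerpt the filter leaves only the `controller.go:176` route error and the fatal `node.go:350` SDN startup failure.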

After some investigation with the openshift networking team, this kernel message turned out to be the smoking gun:

```
[ 2584.781290] tun0: Invalid MTU 8951 requested, hw max 1500
```

The kernel rejects the requested MTU of 8951 because it believes the maximum for tun0 is 1500. That means the kernel either refuses to "up" tun0 or takes it down right after we "up" it.
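To check other hosts for the same symptom, a small sketch like the following can scan saved `dmesg` output for the rejection message. The pattern is derived only from the single line quoted in this report, so other kernel versions may word it differently. `MTU_RE` and `rejected_mtu_requests` are hypothetical names:

```python
import re

# Pattern taken from the one kernel message quoted in this bug; an assumption
# for any other kernel version.
MTU_RE = re.compile(
    r"(?P<dev>\S+): Invalid MTU (?P<requested>\d+) requested, "
    r"hw max (?P<hw_max>\d+)"
)


def rejected_mtu_requests(dmesg_text):
    """Return (device, requested_mtu, hw_max) for each rejected MTU change."""
    found = []
    for line in dmesg_text.splitlines():
        m = MTU_RE.search(line)
        if m:
            found.append(
                (m.group("dev"), int(m.group("requested")), int(m.group("hw_max")))
            )
    return found


sample = "[ 2584.781290] tun0: Invalid MTU 8951 requested, hw max 1500"

for dev, requested, hw_max in rejected_mtu_requests(sample):
    print(f"{dev}: requested {requested}, limit {hw_max}, over by {requested - hw_max}")
```

Any device reported here with a requested MTU above the stated limit is a candidate for the same failure.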

The team found this upstream kernel commit, which appears to fix the problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=425df17c


Version-Release number of selected component (if applicable):

I noticed the change when performing this upgrade on the updates stream for Fedora 25 Atomic Host:

```
ostree diff commit old: 89f419931b1e78065d06a9115a361251bce0974f444833707dcf18028db1da1d^^^^^^^^^^^^^^ (6a71adb06bc296c19839e951c38dc0b71ee5d7a82262fef9612f256f0c2a70da)
ostree diff commit new: 89f419931b1e78065d06a9115a361251bce0974f444833707dcf18028db1da1d^^^^^^^^^^^^^ (6113186465394b1dc798d46464f50590fee32b0419efba7d37608b903f91bec7)
Upgraded:
  kernel 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64
  kernel-core 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64
  kernel-modules 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64
```
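A minimal sketch for spotting this kind of kernel series jump in an upgrade diff, assuming the `package old -> new` line layout shown above. The helper names are hypothetical, and the sketch only reports that the major.minor series changed; it does not decide whether a given kernel carries the MTU fix:

```python
import re

# Matches lines such as "  kernel 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64"
UPGRADE_RE = re.compile(r"^\s*(?P<pkg>\S+) (?P<old>\S+) -> (?P<new>\S+)\s*$")


def major_minor(version):
    """Reduce a version like '4.10.5-200.fc25.x86_64' to the tuple (4, 10)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def kernel_series_changes(diff_text):
    """Return (package, old_series, new_series) for kernel packages that changed series."""
    changes = []
    for line in diff_text.splitlines():
        m = UPGRADE_RE.match(line)
        if not m or not m.group("pkg").startswith("kernel"):
            continue
        old = major_minor(m.group("old"))
        new = major_minor(m.group("new"))
        if old != new:
            changes.append((m.group("pkg"), old, new))
    return changes


sample = """\
Upgraded:
  kernel 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64
  kernel-core 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64
  kernel-modules 4.9.14-200.fc25.x86_64 -> 4.10.5-200.fc25.x86_64
"""

for pkg, old, new in kernel_series_changes(sample):
    print(f"{pkg}: {old[0]}.{old[1]} -> {new[0]}.{new[1]}")
```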

How reproducible:

Always

Steps to Reproduce:
1. Boot Atomic Host (I used AWS c4.xlarge instances).
2. Install openshift using openshift-ansible@de6629a.
3. Notice that the origin-node service fails to start.

Comment 1 Laura Abbott 2017-04-11 16:19:25 UTC
Dropped the fix into F24/F25 branches. It should show up the next time we do a stable release.

Comment 2 Dusty Mabe 2017-04-11 16:27:49 UTC
I'm assuming the dev build is going to be a while from now, considering there is already one currently in testing. We are going to wait on this change for the next Fedora 25 Atomic release, so anything we could do to have that sooner rather than later would be great! I understand that it will probably not be "this week".

Comment 3 Dusty Mabe 2017-04-11 21:34:48 UTC
I have tested this with a scratch build kernel [1] and openshift networking is restored to normal. /me will patiently wait for this to hit updates. 

Thanks to everyone who helped me debug this.

[1] https://koji.fedoraproject.org/koji/taskinfo?taskID=18932765

Comment 4 Fedora Update System 2017-04-13 15:32:37 UTC
kernel-4.10.10-200.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2017-26c9ecd7a4

Comment 5 Fedora Update System 2017-04-13 15:33:42 UTC
kernel-4.10.10-100.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2017-8e7549fb91

Comment 6 Fedora Update System 2017-04-14 23:52:49 UTC
kernel-4.10.10-100.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-8e7549fb91

Comment 7 Fedora Update System 2017-04-15 00:28:12 UTC
kernel-4.10.10-200.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-26c9ecd7a4

Comment 8 Fedora Update System 2017-04-17 20:52:08 UTC
kernel-4.10.10-100.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 9 Fedora Update System 2017-04-17 20:52:35 UTC
kernel-4.10.10-200.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

