Dan Winship has been helping me find the root cause of this issue. Slack thread for reference: https://coreos.slack.com/archives/CDCP2LA9L/p1603141588175200

Summary:
> We see 'net/http: TLS handshake timeout' errors in the kas log, and oas has 'TLS handshake error from 10.130.0.1:36798: EOF'

Dan's suggestion: if the VXLAN MTU is set to the wrong value, then when the server tries to send its certificate the packets won't get delivered, and both sides end up waiting for the other one to talk. So we need to ask the customer whether they have tweaked the MTU settings.

Also, sople is working to get us a fresh set of data captured from the cluster: must-gather and a Prometheus metrics dump.
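In case it helps narrow that down, a quick spot check on a node (via oc debug node/<name> or SSH) is to compare the underlay NIC MTU with the SDN overlay MTU. This is only a sketch and assumes openshift-sdn with the usual eth0/tun0 interface names; with VXLAN the overlay MTU should normally be 50 bytes smaller than the underlay (e.g. 1450 vs 1500):

$ ip link show eth0 | grep -o 'mtu [0-9]*'    # underlay NIC MTU
$ ip link show tun0 | grep -o 'mtu [0-9]*'    # SDN overlay MTU
$ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.openshiftSDNConfig.mtu}{"\n"}'    # overlay MTU, if it was set explicitly in the operator config

If the customer has tweaked the MTU, the 50-byte headroom for the VXLAN header is the first thing that tends to go missing.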
This is likely the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1825219

There is a Knowledge Base article in progress at https://access.redhat.com/solutions/5252831 that documents a daemonset you can deploy to work around the issue.

The kernel team has investigated this extensively and found that we are erroneously getting a "needs fragmentation" packet from the underlay (either the kernel or the network) with the IP address of an OpenShift node. This causes the host kernel to lower the PMTU, which causes packets to get fragmented, which breaks the VXLAN networking. There is a kernel change to the PMTU code to make it handle this corner case, but the actual source of the packet has not yet been identified.
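If I understand the workaround correctly, the daemonset essentially watches for and clears the bogus per-destination PMTU exceptions that the spurious "needs fragmentation" packet leaves behind. A rough manual equivalent on an affected node (plain iproute2, not the actual daemonset from the KB article) would be:

$ ip route show cache     # list cached PMTU exceptions, e.g. "... via ... dev eth0 cache expires 574sec mtu 1450"
$ ip route flush cache    # drop them so full-size packets flow again (until the next bogus ICMP arrives)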
Assigning to 4.7 to identify the issue and the fix. If appropriate, we will backport to an earlier release.
@bbennett
> There is a Knowledge Base article in progress at https://access.redhat.com/solutions/5252831 that documents a daemonset you can deploy to work around the issue.

I worked with the customer today, following the instructions in solution #5252831, but could not reproduce the problem. We connected via SSH to each master node and crafted ICMP messages according to the instructions. Here is an excerpt showing that the nodes do not appear to be affected by the MTU problem:

$ ping -M do -c4 -s 1500 10.13.86.71
ping: local error: Message too long, mtu=1500
$ ping -M do -c4 -s 1473 10.13.86.71
ping: local error: Message too long, mtu=1500
$ ping -M do -c4 -s 1472 10.13.86.71
1480 bytes from 10.13.86.71: icmp_seq=1 ttl=64 time=0.866 ms

However, we observed an anomaly when pinging certain master nodes: in some cases, messages roughly ten times larger than the expected 1500-byte MTU limit still got a successful response. This only happened in specific directions:

From Master 0:
- ping Master 1, size 15K => unexpected success (no size limit)
- ping Master 2, size 15K => expected failure (MTU 1500)

From Master 1:
- ping Master 0, size 15K => expected failure (MTU 1500)
- ping Master 2, size 15K => unexpected success (no size limit)

From Master 2:
- ping Master 0, size 15K => expected failure (MTU 1500)
- ping Master 1, size 15K => expected failure (MTU 1500)

Is this expected behaviour or an anomaly?
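One thing that might help interpret runs like this: before pinging, check whether there is already a cached PMTU exception toward the peer, since a stale entry (or its absence right after a reboot) changes what ping reports. Using 10.13.86.71 from the excerpt above as the peer IP:

$ ip route get 10.13.86.71

On a clean path this prints just the route; on an affected node it shows a cached entry along the lines of "... dev eth0 cache expires <N>sec mtu 1450".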
We had a call to sync about this. The test in the comment above was run after the node had been rebooted, so the PMTU cache had been cleared and the commands would not have shown the problem.
Hello @bbennett, the DaemonSet has produced these logs (echoed lines have been removed for brevity):

Hi Rigel, here are the logs requested in today's call:

[root@aznpi000070 ~]# for name in $(oc get pod -n openshift-network-operator | grep "cachefix" | awk '{ print $1}'); do oc logs $name -n openshift-network-operator | grep -v "+"; done
I1026 12:08:02.781740835 - cachefix - start cachefix ocp-np-b8blg-worker-northeurope2-hr5lw
I1026 12:07:57.924032965 - cachefix - start cachefix ocp-np-b8blg-master-2
I1026 12:08:02.989800775 - cachefix - start cachefix ocp-np-b8blg-master-0
10.13.94.6 via 10.13.86.65 dev eth0 cache expires 574sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache expires 574sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache expires 574sec mtu 1450
10.13.94.5 via 10.13.86.65 dev eth0 cache expires 566sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.94.5 via 10.13.86.65 dev eth0 cache expires 566sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.94.5 via 10.13.86.65 dev eth0 cache expires 566sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.94.5 via 10.13.86.65 dev eth0 cache
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.86.71 dev eth0 cache expires 569sec mtu 1450
10.13.94.5 via 10.13.86.65 dev eth0 cache
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.94.5 via 10.13.86.65 dev eth0 cache
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.94.5 via 10.13.86.65 dev eth0 cache
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.86.71 dev eth0 cache expires 560sec mtu 1450
10.13.94.5 via 10.13.86.65 dev eth0 cache
10.13.94.6 via 10.13.86.65 dev eth0 cache
10.13.94.5 via 10.13.86.65 dev eth0 cache
10.13.94.6 via 10.13.86.65 dev eth0 cache
I1026 12:08:02.970139025 - cachefix - start cachefix ocp-np-b8blg-master-1
10.13.94.6 via 10.13.86.65 dev eth0 cache expires 581sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache expires 581sec mtu 1450
10.13.94.6 via 10.13.86.65 dev eth0 cache expires 581sec mtu 1450
I1026 12:08:02.617898458 - cachefix - start cachefix ocp-np-b8blg-worker-northeurope1-v8rxq
I1026 12:08:02.564567057 - cachefix - start cachefix ocp-np-b8blg-worker-northeurope3-pkgpg

What is your opinion? Are we seeing the MTU problem manifesting itself here?
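A hedged reading, assuming eth0 on these nodes has the standard 1500-byte MTU: the "cache ... mtu 1450" lines are per-destination PMTU exceptions on the host interface toward other node IPs, which looks like the lowered-PMTU symptom described earlier in this bug. A quick check on one of the masters might be:

$ ip link show eth0 | grep -o 'mtu [0-9]*'    # expected: mtu 1500 on the underlay NIC
$ ip route get 10.13.94.6                     # should show the same "cache ... mtu 1450" exception as in the daemonset log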
*** This bug has been marked as a duplicate of bug 1825219 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days