Description of problem:

When performing an MTU increase using the MTU migration procedure, in the final step, when the MTU migration configuration is cleared and while/after the master nodes reboot, unexpected KubeAggregatedAPIErrors and KubeAggregatedAPIDown alerts temporarily enter firing state.

Version-Release number of selected component (if applicable):

4.10 nightly

How reproducible:

Always

Steps to Reproduce:
1. Apply the MTU migration procedure for an MTU increase

Actual results:

KubeAggregatedAPIErrors and KubeAggregatedAPIDown alerts fire during the procedure.

Expected results:

No KubeAggregatedAPIErrors or KubeAggregatedAPIDown alerts fire during the procedure.

Additional info:

This appears to be caused by pods that are relocated during the node reboots having incorrect MTU settings. That, in turn, appears to be caused by openshift-sdn pods restarting immediately upon being configured with the new MTU settings, which then take effect immediately, whereas they are expected to take effect only after the node reboot. This happens because the MTU settings are rendered into the sdn-config config map, and openshift-sdn watches that config map for changes, restarting when a change is detected. An alternative could be to store the MTU settings in a different config map that openshift-sdn does not watch for changes.
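The mechanism described above can be sketched as follows. This is an illustrative simulation only, not the actual openshift-sdn code: it models a daemon that subscribes to its config map and restarts on any change, so a new MTU becomes effective immediately rather than at the node reboot that was meant to apply it. All names here (`WatchedConfig`, `Daemon`, the `mtu` values) are hypothetical.

```python
class WatchedConfig:
    """Mimics a watched ConfigMap (e.g. sdn-config): notifies
    subscribers whenever its data changes."""
    def __init__(self, data):
        self.data = dict(data)
        self.subscribers = []

    def update(self, data):
        self.data = dict(data)
        for callback in self.subscribers:
            callback(self.data)


class Daemon:
    """Mimics the described openshift-sdn behavior: restarts and
    re-reads its config on every change event."""
    def __init__(self, config):
        self.effective = dict(config.data)  # settings currently in effect
        self.restarts = 0
        config.subscribers.append(self.on_change)

    def on_change(self, data):
        self.restarts += 1          # watch fired -> pod restarts
        self.effective = dict(data) # new MTU active immediately


cfg = WatchedConfig({"mtu": 1400})
d = Daemon(cfg)
cfg.update({"mtu": 9000})           # MTU migration writes the new value
print(d.restarts, d.effective["mtu"])  # -> 1 9000: in effect before any reboot
```

The suggested alternative (storing the MTU in a config map the daemon does not watch) amounts to the daemon reading the MTU only at startup, so the new value would take effect only at the next reboot-driven restart.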
qq: Is this really a bug? Isn't MTU migration something that comes with "expected disruption"? The alerts are a way of saying "hey, something important is happening", so having an alert in firing state for a few minutes is not detrimental, right? In fact, it tells the cluster admin that the sdn pods are rolling in. When the alert fades away, we also know things are fine.
(In reply to Surya Seetharaman from comment #1)
> qq: Is this really a bug? Isn't MTU migration something that comes with
> "expected disruption"? The alerts are a way of saying "hey, something
> important is happening", so having an alert in firing state for a few
> minutes is not detrimental, right? In fact, it tells the cluster admin
> that the sdn pods are rolling in. When the alert fades away, we also
> know things are fine.

The alerts have a threshold that already accounts for temporary disruption. When they fire, the disruption was higher than the threshold and more than expected.
(In reply to Jaime Caamaño Ruiz from comment #2)
> (In reply to Surya Seetharaman from comment #1)
> > qq: Is this really a bug? Isn't MTU migration something that comes with
> > "expected disruption"? The alerts are a way of saying "hey, something
> > important is happening", so having an alert in firing state for a few
> > minutes is not detrimental, right? In fact, it tells the cluster admin
> > that the sdn pods are rolling in. When the alert fades away, we also
> > know things are fine.
>
> The alerts have a threshold that already accounts for temporary disruption.
> When they fire, the disruption was higher than the threshold and more than
> expected.

And just a note that this is specific to the API availability alerts. The MTU migration procedure reboots nodes in sequence, with compatible MTU settings across nodes at all times, with the intention of keeping the cluster operative during the procedure with minimal or no disruption. If we identify disruption for a reason that we can fix or improve upon, I guess it is all right to do it ;)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 365 days.