Bug 2049613

Summary: MTU migration on SDN IPv4 causes API alerts
Product: OpenShift Container Platform
Reporter: Jaime Caamaño Ruiz <jcaamano>
Component: Networking
Assignee: Patryk Diak <pdiak>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: medium
Priority: high
CC: anusaxen, surya
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:46:22 UTC
Type: Bug

Description Jaime Caamaño Ruiz 2022-02-02 13:10:28 UTC
Description of problem:

When performing an MTU increase using the MTU migration procedure, unexpected KubeAggregatedAPIErrors and KubeAggregatedAPIDown alerts temporarily enter the firing state. This happens in the final step, when the MTU migration configuration is cleared, while or after the master nodes reboot.

Version-Release number of selected component (if applicable): 4.10 nightly


How reproducible: always


Steps to Reproduce:
1. Apply an MTU migration procedure for an MTU increase


Actual results:

KubeAggregatedAPIErrors and KubeAggregatedAPIDown alerts in the firing state during the procedure


Expected results:

No KubeAggregatedAPIErrors or KubeAggregatedAPIDown alerts in the firing state during the procedure

Additional info:

This appears to be caused by pods that are relocated during the node reboots and end up with incorrect MTU settings.

In turn, this appears to be caused by openshift-sdn pods restarting as soon as they are configured with the new MTU settings, which puts those settings into effect immediately, whereas they are expected to take effect only after the node reboots. This happens because the MTU settings are rendered in the sdn-config config map, and openshift-sdn watches this config map and restarts when it detects changes. An alternative could be to store the MTU settings in a different config map that openshift-sdn does not watch.
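
The mechanism described above can be sketched as a toy model. This is only an illustration under stated assumptions, not openshift-sdn code: the `SdnNode` class, its methods, and the MTU values 1400 and 8900 are all hypothetical. It contrasts MTU stored in a watched config map (applied on restart, before the reboot) with MTU stored in an unwatched one (applied only at reboot).

```python
from dataclasses import dataclass


@dataclass
class SdnNode:
    """Toy model of a node running a network plugin pod (hypothetical)."""
    effective_mtu: int
    restarts: int = 0

    def on_watched_config_change(self, config: dict) -> None:
        # A change to the watched config map restarts the pod, and the
        # restarted pod reads every setting in that map, MTU included.
        self.restarts += 1
        self.effective_mtu = config["mtu"]

    def reboot(self, boot_config: dict) -> None:
        # On reboot the node picks up whatever MTU is configured at that time.
        self.effective_mtu = boot_config["mtu"]


# Reported behavior: the MTU lives in the watched map, so it is applied
# as soon as the map changes, before the node has rebooted.
node = SdnNode(effective_mtu=1400)
node.on_watched_config_change({"mtu": 8900})
mtu_before_reboot_watched = node.effective_mtu

# Suggested alternative: the MTU lives in a separate, unwatched map.
other = SdnNode(effective_mtu=1400)
unwatched = {"mtu": 8900}  # updated, but no restart is triggered
mtu_before_reboot_unwatched = other.effective_mtu
other.reboot(unwatched)
mtu_after_reboot_unwatched = other.effective_mtu
```

In this model the second node keeps its old MTU until `reboot` is called, which is the behavior the migration procedure expects.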

Comment 1 Surya Seetharaman 2022-02-02 14:09:49 UTC
qq: Is this really a bug? Isn't MTU migration something that comes with "expected disruption"? The alerts are a way of saying "hey, something important is happening", so having an alert in the firing state for a few minutes is not detrimental, right? In fact, it tells the cluster admin that the sdn pods are rolling in. And when the alert fades away, we also know things are fine.

Comment 2 Jaime Caamaño Ruiz 2022-02-02 14:17:17 UTC
(In reply to Surya Seetharaman from comment #1)
> qq: Is this really a bug? Isn't MTU migration something that comes with
> "expected disruption"? The alerts are a way of saying "hey, something
> important is happening", so having an alert in the firing state for a few
> minutes is not detrimental, right? In fact, it tells the cluster admin that
> the sdn pods are rolling in. And when the alert fades away, we also know
> things are fine.

The alerts have a threshold that already accounts for temporary disruption. When they fire, the disruption was higher than the threshold and more than expected.
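
As a rough illustration of the point about thresholds, the sketch below models an alert that tolerates short disruption and fires only when the unhealthy condition persists. The function name, the sample lists, and the `hold` value are hypothetical; the actual rules behind KubeAggregatedAPIErrors and KubeAggregatedAPIDown are not reproduced here.

```python
def alert_fires(samples, hold):
    """Return True if the condition held for more than `hold` consecutive samples."""
    run = 0
    for unhealthy in samples:
        # Count consecutive unhealthy samples; any healthy sample resets the run.
        run = run + 1 if unhealthy else 0
        if run > hold:
            return True
    return False


# One sample per evaluation interval (hypothetical data).
short_blip = [False, True, True, False, False, False]
long_outage = [False, True, True, True, True, False]

blip_fires = alert_fires(short_blip, hold=3)
outage_fires = alert_fires(long_outage, hold=3)
```

With these inputs the short blip stays below the tolerance and the longer outage exceeds it, so a firing alert indicates disruption beyond what the threshold already allows for.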

Comment 3 Jaime Caamaño Ruiz 2022-02-02 14:24:26 UTC
(In reply to Jaime Caamaño Ruiz from comment #2)
> (In reply to Surya Seetharaman from comment #1)
> > qq: Is this really a bug? Isn't MTU migration something that comes with
> > "expected disruption"? The alerts are a way of saying "hey, something
> > important is happening", so having an alert in the firing state for a few
> > minutes is not detrimental, right? In fact, it tells the cluster admin that
> > the sdn pods are rolling in. And when the alert fades away, we also know
> > things are fine.
> 
> The alerts have a threshold that already accounts for temporary disruption.
> When they fire, the disruption was higher than the threshold and more than
> expected.

And just a note that this is specific to API availability alerts.

The MTU migration procedure reboots nodes in sequence with compatible MTU settings across nodes at all times with the intention of keeping the cluster operative during the procedure with minimal or nonexistent disruption.
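
One way to read "compatible MTU settings across nodes" is that no node ever sends packets larger than some peer can accept. The sketch below checks that property. It is an assumption-laden illustration: `NodeMtu`, `incompatible_pairs`, the node names, and the MTU values are hypothetical and not taken from the procedure itself.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class NodeMtu:
    name: str
    send_mtu: int    # largest packet the node will transmit
    accept_mtu: int  # largest packet the node can receive


def incompatible_pairs(nodes):
    """Return (sender, receiver) pairs where the sender may exceed the receiver's limit."""
    return [
        (a.name, b.name)
        for a, b in permutations(nodes, 2)
        if a.send_mtu > b.accept_mtu
    ]


# Mid-migration as intended: every node still sends at the old MTU,
# and only the rebooted node already accepts the larger one.
intended = [
    NodeMtu("node-a", send_mtu=1400, accept_mtu=1400),
    NodeMtu("node-b", send_mtu=1400, accept_mtu=8900),
]

# Failure mode resembling this bug: node-b starts sending at the new MTU
# while node-a has not rebooted yet.
premature = [
    NodeMtu("node-a", send_mtu=1400, accept_mtu=1400),
    NodeMtu("node-b", send_mtu=8900, accept_mtu=8900),
]

intended_problems = incompatible_pairs(intended)
premature_problems = incompatible_pairs(premature)
```

Under these assumptions the intended state has no incompatible pairs, while the premature state reports node-b sending to node-a.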

If we identify disruption for a reason that we can fix or improve upon, I guess it is all right to do it ;)

Comment 11 errata-xmlrpc 2022-08-10 10:46:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
