Description of problem:

When performing an MTU increase using the MTU migration procedure, in the final step, when the MTU migration configuration is cleared and while/after the master nodes reboot, unexpected KubeAggregatedAPIErrors and KubeAggregatedAPIDown alerts temporarily enter firing state.

Version-Release number of selected component (if applicable):

4.10 nightly

How reproducible:

Always

Steps to Reproduce:
1. Apply the MTU migration procedure for an MTU increase

Actual results:

KubeAggregatedAPIErrors and KubeAggregatedAPIDown alerts fire during the procedure.

Expected results:

No KubeAggregatedAPIErrors or KubeAggregatedAPIDown alerts fire during the procedure.

Additional info:

This appears to be caused by pods that are relocated during the node reboots having incorrect MTU settings. That, in turn, appears to be caused by openshift-sdn pods restarting immediately upon being configured with the new MTU settings, which then take effect immediately, whereas they are expected to take effect only after the node reboot. This happens because the MTU settings are rendered into the sdn-config config map, and openshift-sdn watches that config map for changes, restarting when a change is detected. An alternative could be to store the MTU settings in a different config map that openshift-sdn does not watch for changes.
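The mechanism described above can be sketched as follows. This is an illustrative simulation only, not the actual openshift-sdn code: it models a daemon that subscribes to its config map and restarts on any change, so a new MTU becomes effective immediately rather than at the node reboot that was meant to apply it. All names here (`WatchedConfig`, `Daemon`, the `mtu` values) are hypothetical.

```python
class WatchedConfig:
    """Mimics a watched ConfigMap (e.g. sdn-config): notifies
    subscribers whenever its data changes."""
    def __init__(self, data):
        self.data = dict(data)
        self.subscribers = []

    def update(self, data):
        self.data = dict(data)
        for callback in self.subscribers:
            callback(self.data)


class Daemon:
    """Mimics the described openshift-sdn behavior: restarts and
    re-reads its config on every change event."""
    def __init__(self, config):
        self.effective = dict(config.data)  # settings currently in effect
        self.restarts = 0
        config.subscribers.append(self.on_change)

    def on_change(self, data):
        self.restarts += 1          # watch fired -> pod restarts
        self.effective = dict(data) # new MTU active immediately


cfg = WatchedConfig({"mtu": 1400})
d = Daemon(cfg)
cfg.update({"mtu": 9000})           # MTU migration writes the new value
print(d.restarts, d.effective["mtu"])  # -> 1 9000: in effect before any reboot
```

The suggested alternative (storing the MTU in a config map the daemon does not watch) amounts to the daemon reading the MTU only at startup, so the new value would take effect only at the next reboot-driven restart.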
qq: Is this really a bug? Isn't MTU migration something that comes with "expected disruption"? The alerts are a way of saying "hey, something important is happening", so having an alert in firing state for a few minutes is not detrimental, right? In fact, it tells the cluster admin that the sdn pods are rolling in. When the alert fades away, we also know things are fine.
(In reply to Surya Seetharaman from comment #1)
> qq: Is this really a bug? Isn't MTU migration something that comes with
> "expected disruption"? The alerts are a way of saying "hey, something
> important is happening", so having an alert in firing state for a few
> minutes is not detrimental, right? In fact, it tells the cluster admin
> that the sdn pods are rolling in. When the alert fades away, we also
> know things are fine.

The alerts have a threshold that already accounts for temporary disruption. When they fire, the disruption was higher than the threshold and more than expected.
(In reply to Jaime Caamaño Ruiz from comment #2)
> (In reply to Surya Seetharaman from comment #1)
> > qq: Is this really a bug? Isn't MTU migration something that comes with
> > "expected disruption"? The alerts are a way of saying "hey, something
> > important is happening", so having an alert in firing state for a few
> > minutes is not detrimental, right? In fact, it tells the cluster admin
> > that the sdn pods are rolling in. When the alert fades away, we also
> > know things are fine.
>
> The alerts have a threshold that already accounts for temporary disruption.
> When they fire, the disruption was higher than the threshold and more than
> expected.

And just a note that this is specific to the API availability alerts. The MTU migration procedure reboots nodes in sequence, with compatible MTU settings across nodes at all times, with the intention of keeping the cluster operative during the procedure with minimal or no disruption. If we identify disruption for a reason that we can fix or improve upon, I guess it is all right to do it ;)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 365 days.