Bug 1812559 - Need better error/exception for MTU apply failure
Summary: Need better error/exception for MTU apply failure
Keywords:
Status: CLOSED DUPLICATE of bug 2044150
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: nmstate
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Gris Ge
QA Contact: Mingyu Shi
URL:
Whiteboard:
Depends On:
Blocks: 1876539
 
Reported: 2020-03-11 15:17 UTC by Yossi Segev
Modified: 2023-06-09 20:26 UTC (History)
CC List: 17 users

Fixed In Version: nmstate-1.2.1-0.1.alpha1.el8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-14 08:10:53 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
NNCE output (6.68 KB, text/plain)
2020-03-11 15:19 UTC, Yossi Segev
journalctl output (79.21 KB, text/plain)
2020-03-11 15:20 UTC, Yossi Segev


Links
Red Hat Issue Tracker NMT-609 (Last Updated: 2023-06-09 20:26:21 UTC)

Description Yossi Segev 2020-03-11 15:17:50 UTC
Description of problem:
When applying a network policy with an invalid MTU, it should be indicated in the NM interfaces that the failure is due to an invalid MTU.
* This was found on an OCP cluster with CNV installed (nmstate-handler is part of the CNV installation).


Version-Release number of selected component (if applicable):
NetworkManager-1.20.0-5.el8_1.x86_64


How reproducible:
Always


Steps to Reproduce:
1. Apply a configuration policy like the following, which sets the MTU of an existing physical node interface to a value higher than the maximum supported:
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens7-state
spec:
  desiredState:
    interfaces:
    - name: ens7
      type: ethernet
      state: up
      mtu: 2000
  nodeSelector:
    kubernetes.io/hostname: "host-172-16-0-33"

2. After a timeout (about 30-60 seconds) the policy is declared as failed:
$ oc get nncp ens7-state 
NAME         STATUS
ens7-state   FailedToConfigure

3. Check the corresponding NNCE:
$ oc get nnce host-172-16-0-33.ens7-state -o yaml
(the result is attached in nnce.out)


Expected results:
An ERROR-labeled line specifying that the failure is due to an invalid MTU.


Actual results:
<BUG> Only DEBUG-labeled entries, and no indication that the policy failed due to an invalid MTU.


Additional info:
I enabled TRACE output of NM (by running "nmcli general logging level TRACE domains ALL", without having to restart NM).
In the journalctl output (attached) you can see that there are explicit NM messages indicating that the configuration failed due to an invalid MTU, for example:

Mar 11 12:52:36 host-172-16-0-33 NetworkManager[1482]: <warn>  [1583931156.5405] platform-linux: do-change-link[4]: failure changing link: failure 22 (Invalid argument)
Mar 11 12:52:36 host-172-16-0-33 NetworkManager[1482]: <debug> [1583931156.5405] platform-linux: sysctl: setting '/proc/sys/net/ipv6/conf/ens7/mtu' to '2000' (current value is '1450')
Mar 11 12:52:36 host-172-16-0-33 NetworkManager[1482]: <debug> [1583931156.5406] platform-linux: sysctl: failed to set '/proc/sys/net/ipv6/conf/ens7/mtu' to '2000': (22) Invalid argument

This must be reflected up to the user in the NNCE (if not in NNCP and NNS as well).

Comment 1 Yossi Segev 2020-03-11 15:19:22 UTC
Created attachment 1669337 [details]
NNCE output

Comment 2 Yossi Segev 2020-03-11 15:20:12 UTC
Created attachment 1669338 [details]
journalctl output

Comment 3 Gris Ge 2020-06-18 05:59:11 UTC
Hi Yossi,

In the `nnce.out`, nmstate states the detailed error as:

```
libnmstate.error.NmstateVerificationError:

desired
=======
---
name: ens7
type: ethernet
state: up
ipv4:
  address: []
  auto-dns: true
  auto-gateway: true
  auto-routes: true
  dhcp: true
  enabled: true
ipv6:
  enabled: false
mac-address: FA:16:3E:9D:E8:A3
mtu: 2000

current
=======
---
name: ens7
type: ethernet
state: up
ipv4:
  address: []
  auto-dns: true
  auto-gateway: true
  auto-routes: true
  dhcp: true
  enabled: true
ipv6:
  enabled: false
mac-address: FA:16:3E:9D:E8:A3
mtu: 1450

difference
==========
--- desired
+++ current
@@ -12,4 +12,4 @@
 ipv6:
   enabled: false
 mac-address: FA:16:3E:9D:E8:A3
-mtu: 2000
+mtu: 1450
```

Which means nmstate tried to apply 2000 but got 1450 after applying, hence the rollback.

Comment 4 Yossi Segev 2020-06-18 10:21:22 UTC
The diff in the NNS is good, but it's not enough.
If the NNCP state is "FailedToConfigure", then it necessarily means that an error occurred, so an ERROR line should appear in the NNCE.
NM publishes this error via journalctl, in this line, which I also added in the bug description:

Mar 11 12:52:36 host-172-16-0-33 NetworkManager[1482]: <debug> [1583931156.5406] platform-linux: sysctl: failed to set '/proc/sys/net/ipv6/conf/ens7/mtu' to '2000': (22) Invalid argument

So why not forward this line, as an ERROR message, to the NNCE/NNCP? It would enable much easier and more intuitive debugging for the user.

Comment 5 Gris Ge 2020-06-19 02:51:41 UTC
Hi Thomas,

When NM fails to set the MTU, it still indicates that the activation finished.

Is it possible for NM to fail the activation with the error message it states in the log:

sysctl: failed to set '/proc/sys/net/ipv6/conf/ens7/mtu' to '2000': (22) Invalid argument

Thank you.

Comment 6 Thomas Haller 2020-07-27 07:14:31 UTC
That doesn't seem so easy.

For one, when you try to configure the MTU of an interface, the kernel requires that the underlying interface's MTU is large enough. That means, for example, that the MTU of a VLAN must not be larger than the MTU of the ethernet below it, or that the MTU of an SR-IOV VF might need to be no larger than the MTU of the PF. Understanding the logic of how the kernel rejects and enforces MTU sizes is not trivial, so NetworkManager doesn't even try. Also, the MTU of a device gets reconfigured when the MTU of the underlying device changes. That means there are cases where the MTU of the interface cannot be configured until some time later, when the parent device is ready. Coordinating that (to consistently fail) is non-trivial.

Also, various link settings don't lead to a failure of the activation. E.g. if there is a failure to set autoneg/speed/duplex, the activation just proceeds; it doesn't fail. For one, that is again because it's hard to understand why the kernel fails to comply and how to properly handle that. Second, it's not clear that every such condition constitutes a hard failure.

So, maybe it's possible. But it doesn't seem easy. And is it really useful? Why? If you merely want to detect that the MTU was not in fact correctly set, then we could instead expose that on D-Bus (or you could check yourself).

Comment 7 Gris Ge 2020-07-27 07:46:44 UTC
Hi Thomas,

Hiding errors is not good API practice. If you think some failures should not block/fail the activation,
please report those errors in another way, e.g. via properties/methods of `NM.ActiveConnection`.

Showing a warning message in journal/syslog and treating it like a pass is not OK for me in this case, and it is very hard
for a normal user to know what failed.

Comment 8 Petr Horáček 2020-08-22 13:02:12 UTC
Is there any progress? Are there plans to tackle this issue?

Comment 9 Gris Ge 2020-08-24 06:10:55 UTC
Hi Thomas,

Is it still possible to request that NetworkManager fail the activation on MTU apply failure?

Comment 10 Thomas Haller 2020-08-24 13:38:36 UTC
(In reply to Gris Ge from comment #9)
> Hi Thomas,
> 
> Is it still possible to request that NetworkManager fail the activation on
> MTU apply failure?

You mean for 8.3? No, given the schedule, that is almost impossible (it would require a very strong effort).

Besides, the biggest problem is the change in behavior here (of starting to fail). Handling that without breaking existing setups is what makes it harder.


In general, there are plans to tackle this issue (otherwise, we would have closed the bug).

Comment 11 Gris Ge 2020-08-25 06:14:30 UTC
(In reply to Thomas Haller from comment #10)
> (In reply to Gris Ge from comment #9)
> > Hi Thomas,
> > 
> > Is it still possible to request that NetworkManager fail the activation on
> > MTU apply failure?
> 
> You mean for 8.3? No, given the schedule, that is almost impossible (it would
> require a very strong effort).

RHEL 8.4 is OK for me.

> 
> Besides, the biggest problem is the change in behavior here (of starting to
> fail). Handling that without breaking existing setups is what makes it
> harder.
> 
> 
> In general, there are plans to tackle this issue (otherwise, we would have
> closed the bug).

Do you need me to create an RFE bug for NM for this request?


Thank you!

Comment 12 Thomas Haller 2020-09-04 10:25:58 UTC
(In reply to Gris Ge from comment #11)
> Do you need me to create an RFE bug for NM for this request?

No. I think this bz suffices.

Comment 13 Gris Ge 2020-09-06 14:33:14 UTC
(In reply to Thomas Haller from comment #12)
> (In reply to Gris Ge from comment #11)
> > Do you need me to create an RFE bug for NM for this request?
> 
> No. I think this bz suffices.

Should I change the component to NetworkManager?

Comment 14 Gris Ge 2020-09-07 13:03:46 UTC
Created bug 1876539 for NetworkManager to improve error handling on MTU apply failure.

Comment 19 Gris Ge 2020-11-09 14:58:24 UTC
Hi Yossi Segev and Petr Horáček,


Currently, nmstate will fail with `libnmstate.error.NmstateVerificationError`, with the MTU difference in the output.

Are you expecting nmstate to raise a specific error like `libnmstate.error.NmstateMtuApplyError`, or just an error/warning
log line in the logging context of nmstate?

Thank you!

Comment 20 Yossi Segev 2020-11-09 15:21:15 UTC
As a start, an actual ERROR report is better than the current state, where there is only a DEBUG report without any indication of an invalid MTU.
If an MTU failure now results in an actual ERROR-labeled message, with an indication that the origin of the failure is the MTU, then we should be fine.
@Gris: can you please add an example of both the NNCE output and the nmstate-handler log upon this failure? It would help me understand whether it satisfies the expectations I had when I submitted this bug.

Comment 21 Gris Ge 2020-11-09 15:53:22 UTC
Hi Yossi,


Currently, NetworkManager only generates a line in journald about the invalid MTU error; nmstate cannot receive any indication
of the source of the failure. Nmstate can only verify whether the user got what they asked for and state the difference as the root cause.

Without NetworkManager buy-in, the only thing nmstate can do is raise a dedicated exception when NmstateVerificationError happens, looking into whether the MTU is the only root cause of the verification failure.
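
For illustration only, here is a minimal sketch (not nmstate's actual implementation; the helper name is hypothetical) of how a verification failure could be inspected to see whether the MTU is the only field that differs between the desired and current interface states:

```python
# Hypothetical sketch: decide whether an MTU-specific error could be raised
# after a verification failure by comparing desired vs. current interface state.
def differing_keys(desired, current):
    """Return the top-level keys whose desired values differ from the current ones."""
    return {key for key, value in desired.items() if current.get(key) != value}

desired = {"name": "ens7", "type": "ethernet", "state": "up", "mtu": 2000}
current = {"name": "ens7", "type": "ethernet", "state": "up", "mtu": 1450}

if differing_keys(desired, current) == {"mtu"}:
    # Only the MTU failed to apply, so a dedicated error/warning would be justified.
    print("Verification failed solely because the requested MTU was not applied.")
```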

It might take me a week or so to learn this NNCE stuff (I assume it is from kubernetes-nmstate). I will provide the example
later.

Comment 22 Petr Horáček 2020-12-07 10:19:55 UTC
Gris, can we help you with the kubernetes-nmstate part? Although I believe this could be reproducible with nmstatectl alone.

Comment 23 Gris Ge 2020-12-07 12:21:33 UTC
(In reply to Petr Horáček from comment #22)
> Gris, can we help you with the kubernetes-nmstate part? Although, I believe
> this could be reproducible with nmstatectl alone.

Yes please. Could you check whether NNCP/NNCE/NNS contains NmstateVerificationError with the mtu difference?

To reproduce the problem, simply set the MTU to a very big number.
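
As a minimal sketch of reproducing this with the libnmstate Python API alone (the interface name and the oversized MTU value below are assumptions; adjust them for the host):

```python
# Sketch: apply an oversized MTU with libnmstate and observe that the failure
# surfaces only as a generic NmstateVerificationError (a desired/current diff),
# which is what this bug complains about. Assumes an existing interface "ens7".
import libnmstate
from libnmstate.error import NmstateVerificationError

desired_state = {
    "interfaces": [
        {
            "name": "ens7",     # assumption: replace with a real interface name
            "type": "ethernet",
            "state": "up",
            "mtu": 999999,      # deliberately larger than the device maximum
        }
    ]
}

try:
    libnmstate.apply(desired_state)
except NmstateVerificationError as err:
    print(err)  # the MTU cause has to be inferred from the printed diff
```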

Comment 24 Yossi Segev 2020-12-07 12:45:37 UTC
From the NNCE attached to this BZ:
libnmstate.error.NmstateVerificationError:
      \ndesired\n=======\n---\nname: ens7\ntype: ethernet\nstate: up\nipv4:\n  address:
      []\n  auto-dns: true\n  auto-gateway: true\n  auto-routes: true\n  dhcp: true\n
      \ enabled: true\nipv6:\n  enabled: false\nmac-address: FA:16:3E:9D:E8:A3\nmtu:
      2000\n\ncurrent\n=======\n---\nname: ens7\ntype: ethernet\nstate: up\nipv4:\n
      \ address: []\n  auto-dns: true\n  auto-gateway: true\n  auto-routes: true\n
      \ dhcp: true\n  enabled: true\nipv6:\n  enabled: false\nmac-address: FA:16:3E:9D:E8:A3\nmtu:
      1450\n\ndifference\n==========\n--- desired\n+++ current\n@@ -12,4 +12,4 @@\n
      ipv6:\n   enabled: false\n mac-address: FA:16:3E:9D:E8:A3\n-mtu: 2000\n+mtu:
      1450\n\n\n'"

So there is an NmstateVerificationError, but it doesn't specify that the error is due to the invalid MTU; it just compares the desired state to the current state.

Comment 25 Gris Ge 2020-12-31 04:55:51 UTC
Hi Yossi,

The NmstateVerificationError has identified the cause of the failure: the MTU does not match the desired state.

What's your preferred way of error reporting on this?

Comment 27 Gris Ge 2021-01-01 05:39:21 UTC
I'm not sure whether kernel dmesg can redirect to netlink or not.
But yes, nmstate/NetworkManager should do better at showing the error message instead of `NmstateVerificationError`.
But I don't know how to do that yet. Let me investigate a little bit.

If you are asking for the error message format to change to include only the difference without the context, I can do that in RHEL 8.5.
Is the error message change enough for you?

Comment 28 Yossi Segev 2021-01-03 10:19:24 UTC
> If you are asking for the error message format to change to include only the difference without the context, I can do that in RHEL 8.5.
> Is the error message change enough for you?


In the absence of a better option, this is a compromise I can live with.
But I would really prefer a clear and explicit ERROR message, e.g.:
       12:52:37,337 root         ERROR    Unsupported MTU 2000 requested.

I believe that if NmstateVerificationError exists, then it should and can be "transformed" into a relevant ERROR-level message.
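
As a purely illustrative sketch (not the actual nmstate code), an ERROR-level line in roughly the shape quoted above could be emitted with Python's standard logging module:

```python
# Illustrative only: emit an explicit ERROR line when the applied MTU does not
# match the requested one, approximating the format of the example above.
import logging

logging.basicConfig(
    format="%(asctime)s %(name)-12s %(levelname)-8s %(message)s",
    level=logging.INFO,
)

requested_mtu, applied_mtu = 2000, 1450
if applied_mtu != requested_mtu:
    logging.error("Unsupported MTU %d requested.", requested_mtu)
```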

Comment 33 RHEL Program Management 2021-09-11 07:26:55 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 34 Gris Ge 2021-09-13 12:34:56 UTC
Reopening to continue the work.

Comment 36 Gris Ge 2021-09-17 09:57:59 UTC
The action plan for this bug is to request that NetworkManager raise the log priority of the MTU failure from trace to warning.

Comment 37 Gris Ge 2021-10-12 06:12:30 UTC
Hi Yossi,

NetworkManager only logs this failure to journald; it does not pass it to nmstate through its API, hence nmstate cannot help identify why the desired state verification failed.

With NetworkManager-1.32.10-2.el8.x86_64, the MTU failure is shown as a warning instead of a debug message, for example:

Oct 12 14:04:52 el8 NetworkManager[951]: <warn>  [1634018692.1414] platform-linux: do-change-link[3]: failure changing link: failure 22 (Invalid argument - mtu greater than device maximum)

This could help you debug this issue in the future.

Could you try it on your system and see whether it meets your expectations?


Thank you!

Comment 38 Yossi Segev 2021-10-21 07:15:04 UTC
Hi Gris,

Our product currently uses NetworkManager v1.30.0-10 (on our OpenShift 4.9 clusters, running RHEL 8.4 nodes), so I can't reproduce the issue and test whether the warning solution you suggested is sufficient.
Can you tell me when NetworkManager v1.32.10 is going to be available, i.e. on which OpenShift/RHEL versions it's expected to be used?

Thank you very much!
Yossi

Comment 51 Gris Ge 2022-01-24 06:31:58 UTC
Hi Petr,

I have created https://bugzilla.redhat.com/show_bug.cgi?id=2044150 for tracking the effort. Please check whether my proposed solution works or not.

This bug will focus on getting the MTU error shown in NetworkManager at the proper level (not trace/debug).

Thanks!

Comment 54 Yossi Segev 2022-01-24 15:04:19 UTC
If the NM messages you specified appear in an nmstate entity (most importantly in an NNCE, but they can also appear in NNS and NNCP), then that would meet my expectation.
Otherwise, if these messages only appear in journalctl, then I am afraid it doesn't change the current state, where one must drill through journalctl in order to find these NM messages instead of viewing them in the nmstate output.

Comment 55 Gris Ge 2022-01-25 00:35:24 UTC
Hi Yossi,

Thanks for the feedback! I will try my ideas to see whether it works or not.

Comment 58 Mingyu Shi 2022-01-25 03:49:57 UTC
(In reply to Gris Ge from comment #55)
> Hi Yossi,
> 
> Thanks for the feedback! I will try my ideas to see whether it works or not.

Hi Gris,

As I see you've opened https://bugzilla.redhat.com/show_bug.cgi?id=2044150 to provide the solution, shall we verify the current bug or wait?

Comment 61 Gris Ge 2022-02-14 08:10:53 UTC

*** This bug has been marked as a duplicate of bug 2044150 ***

