Bug 2314429

Summary: [ML2/OVN] ovn_emit_need_to_frag should be enabled by default
Product: Red Hat OpenStack Reporter: Lucas Alvares Gomes <lmartins>
Component: openstack-neutron Assignee: OSP Team <rhos-maint>
Status: CLOSED MIGRATED QA Contact: Fiorella Yanac <fyanac>
Severity: urgent Docs Contact:
Priority: high    
Version: 17.1 (Wallaby) CC: bcafarel, bmv, chrisw, gregraka, ihrachys, mburns, mtomaska, scohen, ykarel
Target Milestone: async Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
Currently, when MTUs mismatch, the communicating peers are unaware of the discrepancy, and the Networking service (neutron) can silently drop the packets. OVN is the cause of this problem because it fails to emit the message, `ICMP Fragmentation Needed`. *Workaround:* the preferred method is to adjust the MTU value to prevent packets that are too large from being transmitted. An alternative method is to set the `OVNEmitNeedToFrag` option in the tripleo templates. For more information, see the Knowledgebase solution, link:https://access.redhat.com/solutions/7092922[Neutron ML2/OVN packet fragmentation problems].
Story Points: ---
Clone Of:
: 2322938 (view as bug list) Environment:
Last Closed: 2025-01-10 09:51:38 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2322938    

Description Lucas Alvares Gomes 2024-09-24 13:03:19 UTC
Description of problem:

Enabling the "ovn_emit_need_to_frag" configuration option closes a gap between ML2/OVN and ML2/OVS, but it was left disabled by default because of a kernel requirement [0]. Now it seems that we could consider enabling this in the product for newer versions of OSP, as the kernel requirements are no longer a problem.

I am opening this BZ as requested in a Slack discussion about a customer use case. See the internal discussion for more information: https://redhat-internal.slack.com/archives/C046JULBVJ7/p1727170946243269

[0] This feature requires kernel version >= 5.2 (upstream). To verify support before enabling the option in our product, run the following command:

$ ovs-appctl -t ovs-vswitchd dpif/show-dp-features br-int

If the output contains "Check pkt length action: Yes", the feature is supported and can be enabled.

Enabling it on a non-supported kernel version will lead to critical performance degradation.
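
The check above can be scripted. The following is a minimal sketch, not part of the original report: the helper names `supports_check_pkt_len` and `query_dp_features` are hypothetical, and the parser only looks for the one output line quoted in this description.

```python
import subprocess

# The line this report says to look for in the command output.
FEATURE_LINE = "Check pkt length action: Yes"


def supports_check_pkt_len(dp_features_output: str) -> bool:
    """Return True if the datapath feature dump reports check_pkt_len support."""
    return FEATURE_LINE in dp_features_output


def query_dp_features(bridge: str = "br-int") -> str:
    """Run the command from this report and return its standard output.

    Needs a host with Open vSwitch running; raises CalledProcessError
    if ovs-appctl exits with a non-zero status.
    """
    result = subprocess.run(
        ["ovs-appctl", "-t", "ovs-vswitchd", "dpif/show-dp-features", bridge],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a compute node, `supports_check_pkt_len(query_dp_features())` would give a yes/no answer before anyone turns the option on. Keeping the parsing separate from the command call lets the parsing be tested without Open vSwitch installed.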

Version-Release number of selected component (if applicable):
17.1.3

Comment 4 Ihar Hrachyshka 2024-09-30 17:47:04 UTC
@Lucas, so what's the implication of the TRAC discussion? Let me write down what I think we should do, but please confirm I am not making things up.

1. Since the blocker is not accepted, and changing the default in 17.x may have unintended consequences, we are not going to backport the patch to 17.x. (There is also a kernel version concern, though I don't think it's valid.)

2. Instead, for 17.x, we are going to document the option in a configuration guide.

3. For 18.x, we are going to set the option in neutron-operator in the generated config. We are not going to backport the neutron patch.

4. We are not going to backport it upstream either (it's not backportable, as per policy).

---

That said, shouldn't the argument from (1) about unintended consequences apply to (3) too? We have released 18 GA. Is it OK to change behavior in the 18.x line? Or is the 18.x case somehow different from 17.x? Is it because of the age of 17.x?

Comment 5 Lucas Alvares Gomes 2024-10-01 08:31:34 UTC
(In reply to Ihar Hrachyshka from comment #4)
> @Lucas, so what's the implication of the TRAC discussion? Let me write down
> what I think we should, but please confirm I am not making things up.
> 
> 1. Since the blocker is not accepted, and changing the default in 17.x may
> have unintended consequences, we are not going to backport the patch to
> 17.x. (There is also a kernel version concern, though I don't think it's
> valid.)
> 
> 2. Instead, for 17.x, we are going to document the option in a configuration
> guide.
> 
> 3. For 18.x, we are going to set the option in neutron-operator in the
> generated config. We are not going to backport neutron patch.
> 
> 4. We are not going to backport it upstream either (it's not backportable,
> as per policy).
> 
> ---
> 
> That said, shouldn't the argument from (1) about unintended consequences
> apply to (3) too then? We have released 18 GA. Is it ok to change behavior
> in 18.x line? Or is 18.x case somehow different from 17.x? Is it because of
> 17.x age?

Whoever changes this default should check the kernel version delivered by our product to see if this is supported. The kernel change was merged upstream in 2019, so I think we are safe, but this needs to be double-checked anyway. AFAIK there's no problem with having this option enabled as long as the kernel supports it.

For point 2, I think we should document this in the migration guide, since ML2/OVS and ML2/OVN have different behaviors for this specific feature.

I agree with points 3 and 4. I don't think we can backport a change of default upstream, so we need to change it in OSP directly.

Comment 7 Ihar Hrachyshka 2024-10-14 13:29:46 UTC
For documentation purposes, the kernel patch that is required for this feature to work is:

```
commit 4d5ec89fc8d14dcdab7214a0c13a1c7321dc6ea9
Author: Numan Siddique <nusiddiq>
Date:   Tue Mar 26 06:13:46 2019 +0530

    net: openvswitch: Add a new action check_pkt_len

    This patch adds a new action - 'check_pkt_len' which checks the
    packet length and executes a set of actions if the packet
    length is greater than the specified length or executes
    another set of actions if the packet length is lesser or equal to.
```
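
To illustrate the semantics described in that commit message, here is a small Python model of the branching. It is only a sketch and not kernel code: the function signature and the branch labels (`"icmp-frag-needed"`, `"forward"`) are hypothetical, chosen to mirror the use case discussed in this bug, and the real datapath runs lists of OVS actions rather than Python callables.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def check_pkt_len(
    pkt_len: int,
    max_len: int,
    if_greater: Callable[[], T],
    if_less_or_equal: Callable[[], T],
) -> T:
    """Model of the action: choose a branch by comparing the packet length."""
    if pkt_len > max_len:
        return if_greater()
    return if_less_or_equal()


def handle(pkt_len: int, limit: int) -> str:
    """Hypothetical caller: oversized packets trigger a 'need to frag' reply."""
    return check_pkt_len(
        pkt_len,
        limit,
        if_greater=lambda: "icmp-frag-needed",
        if_less_or_equal=lambda: "forward",
    )
```

A packet exactly at the limit takes the "less or equal" branch, matching the wording of the commit message.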

Comment 8 Ihar Hrachyshka 2024-10-14 13:38:07 UTC
The kernel patch was backported to the rhel8 branch back in 2019, as:

```
* Tue Oct 08 2019 Phillip Lougher <plougher> [4.18.0-147.5.el8]
```

So it should be safe to enable the feature for both branches.
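
As a rough pre-check, the two thresholds mentioned in this bug (upstream 5.2 and the rhel8 build 4.18.0-147.5.el8) can be compared against a kernel release string. This is a heuristic sketch under stated assumptions: the function names are hypothetical, the release-string parsing is simplistic, and an earlier RHEL build may also carry the patch, so a negative result is not conclusive. The `dpif/show-dp-features` check from the description remains the authoritative test.

```python
import re

# Thresholds taken from this bug report.
UPSTREAM_MIN = (5, 2)
RHEL8_MIN = (4, 18, 0, 147, 5)


def release_tuple(release: str) -> tuple:
    """Turn a release such as '4.18.0-305.el8.x86_64' into (4, 18, 0, 305)."""
    # Drop the distribution suffix (".el8...", ".el9...") before reading numbers.
    base = re.split(r"\.el\d", release, maxsplit=1)[0]
    return tuple(int(number) for number in re.findall(r"\d+", base))


def likely_has_check_pkt_len(release: str) -> bool:
    """Heuristic only: compare a kernel release string to the known thresholds."""
    version = release_tuple(release)
    if ".el8" in release:
        return version >= RHEL8_MIN
    return version >= UPSTREAM_MIN
```

For example, `likely_has_check_pkt_len(platform.release())` could be used as a quick first filter on a host, with `platform` imported from the standard library.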

Comment 9 Ihar Hrachyshka 2024-10-14 14:11:26 UTC
Greg, I provided a known issue text. Please adjust if needed.

---

The current plan for the issue is:

1. Docs.

a) deliver the Known Issue (with proposed workarounds) to 17.1.x customers.
b) add a KCS on how and when to enable OVNEmitNeedToFrag.
c) update https://docs.redhat.com/en/documentation/red_hat_openstack_platform/17.1/html-single/overcloud_parameters/index#ref_networking-neutron-parameters_overcloud_parameters so that it does NOT claim that OVNEmitNeedToFrag requires "host kernel (version >= 5.2)", because that claim is not valid for RHEL kernels, which carry extensive backports to older versions. In reality, all RHEL8/9 kernels support this feature.

2. 17.x.
a) backport the fix to flip the default for ovn_emit_need_to_frag to True in neutron.
b) set default for tripleo OVNEmitNeedToFrag to True in wallaby.

3. 18.x.
a) backport the fix to flip the default for ovn_emit_need_to_frag to True in neutron.
b) set ovn_emit_need_to_frag=True in neutron-operator default config template.
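
For anyone auditing a deployment while these changes are pending, the following sketch reports whether a neutron configuration explicitly enables the option. It rests on assumptions not stated in this bug: that the option is read from the `[ovn]` section of the ML2 configuration, and that an unset option means disabled, which is only true before the default is flipped. The helper name and sample text are hypothetical.

```python
import configparser


def emit_need_to_frag_enabled(ini_text: str) -> bool:
    """Return True if [ovn] ovn_emit_need_to_frag is set to a true value.

    An unset option is treated as False, which models the default
    before the change proposed in this bug.
    """
    # Interpolation is disabled so literal '%' characters do not raise errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.getboolean("ovn", "ovn_emit_need_to_frag", fallback=False)


# Illustrative fragment, not copied from a real deployment.
SAMPLE = """
[ovn]
ovn_emit_need_to_frag = True
"""
```

Calling `emit_need_to_frag_enabled(SAMPLE)` returns `True`. Once the default is flipped, the `fallback` value in this sketch would need to change accordingly.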

Comment 10 Ihar Hrachyshka 2024-10-14 14:21:55 UTC
This bz will be used for the following from the plan above:

```
2. 17.x.
a) backport the fix to flip the default for ovn_emit_need_to_frag to True in neutron.
```

The rest will get their own bzs / jiras.

Comment 11 Ihar Hrachyshka 2024-10-14 14:30:57 UTC
Final tally:

1. docs: known issue doc text updated in this bz; kcs request: https://bugzilla.redhat.com/show_bug.cgi?id=2318544 ; fix OVNEmitNeedToFrag description in docs: https://bugzilla.redhat.com/show_bug.cgi?id=2318545
2. neutron backport: this BZ; tripleo enable by default: https://bugzilla.redhat.com/show_bug.cgi?id=2318546
3. 18.x Jira to enable it in operator and in neutron: https://issues.redhat.com/browse/OSPRH-10684

Comment 12 Ihar Hrachyshka 2024-10-14 14:34:52 UTC
I'm moving this to NEW since the backport was not posted. I am also unassigning myself to give the team a chance to consider it for planning, or for someone else to pick it up.