Bug 2218631 - After creating close to 1220 VMs on a single compute, networking breaks for all of them and we see the following errors in the logs
Summary: After creating close to 1220 VMs on a single compute, networking breaks for all of them and we see the following errors in the logs
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.15
Version: FDP 23.J
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Aaron Conole
QA Contact: ovs-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-29 17:36 UTC by David Hill
Modified: 2024-10-08 17:49 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-10-08 17:49:14 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker FD-2987 (last updated 2023-06-29 17:37:29 UTC)
Red Hat Knowledge Base (Solution) 7022257 (last updated 2023-06-29 18:57:22 UTC)

Description David Hill 2023-06-29 17:36:19 UTC
Description of problem:
After creating close to 1220 VMs on a single compute, networking breaks for all of them and we see the following errors in the logs:

Jun 29 17:20:57 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:20:57 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:20:59 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:20:59 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:00 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:07 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:07 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:07 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:07 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:08 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:08 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:08 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:13 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:13 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:14 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:14 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:15 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768
Jun 29 17:21:15 compute00 kernel: openvswitch: netlink: Flow action size exceeds max 32768


Version-Release number of selected component (if applicable):
RHOSP16.2.5

How reproducible:
Always

Steps to Reproduce:
1. Create close to 1220 VMs (at 1219 it should still work)

Actual results:
Global outage

Expected results:
Being able to create 2000 VMs on a single compute.

Additional info:

Comment 1 David Hill 2023-06-29 17:46:59 UTC
[dhill@knox ovs]$ grep MAX_ACTIONS_BUFSIZE * -r
#if LINUX_VERSION_CODE < KERNEL_VERSION(4,9,0)
#define MAX_ACTIONS_BUFSIZE     (16 * 1024)
#else
#define MAX_ACTIONS_BUFSIZE     (32 * 1024)
#endif

Comment 2 Ihar Hrachyshka 2023-06-29 17:48:07 UTC
Please share `dump-flows br-int` output before and after the error starts showing up. This will allow engineers to understand which flows are "overflowing" and perhaps find a way to adjust the flows so that they don't hit the kernel limit.

I also wonder whether the limit is something enforced by, e.g., the OpenFlow protocol, or whether it's just an implementation detail in the kernel that could in theory be changed.
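
For example, something along these lines on the affected compute should capture the requested data (a sketch; it assumes ovs-ofctl can reach the local switch, and on RHOSP 16.2 the commands may need to run inside the OVS container):

# while networking still works
ovs-ofctl dump-flows br-int > /tmp/br-int-flows-before.txt

# after the kernel starts logging "Flow action size exceeds max"
ovs-ofctl dump-flows br-int > /tmp/br-int-flows-after.txt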

Comment 3 David Hill 2023-06-29 17:56:26 UTC
Reading the kernel code, I'm not seeing where a short int would be used. It's using unsigned short or unsigned long, which gives up to 65k ... am I reading the kernel code wrong? I'm asking because it could also just be an OVS bug ...

Comment 4 Ihar Hrachyshka 2023-06-29 18:00:04 UTC
The request to share dump-flows has not been addressed; restoring needinfo.

Comment 7 David Hill 2023-06-29 18:47:55 UTC
I was looking at the master kernel branch instead of 4.9.y ... yes, we have this:

[dhill@knox openvswitch]$ grep -r MAX_ACTIONS_BUFSIZE *
flow_netlink.c:#define MAX_ACTIONS_BUFSIZE	(32 * 1024)
flow_netlink.c:	WARN_ON_ONCE(size > MAX_ACTIONS_BUFSIZE);
flow_netlink.c:	if (new_acts_size > MAX_ACTIONS_BUFSIZE) {
flow_netlink.c:		if ((next_offset + req_size) > MAX_ACTIONS_BUFSIZE) {
flow_netlink.c:				  MAX_ACTIONS_BUFSIZE);
flow_netlink.c:		new_acts_size = MAX_ACTIONS_BUFSIZE;
flow_netlink.c:	*sfa = nla_alloc_flow_actions(min(nla_len(attr), MAX_ACTIONS_BUFSIZE));
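
For context, those hits are in reserve_sfa_size(), which grows the buffer that holds a flow's actions while they are copied in from netlink. A lightly abridged sketch based on my reading of upstream net/openvswitch/flow_netlink.c (the exact code differs between kernel versions):

static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa,
                                       int attr_len, bool log)
{
        int req_size = NLA_ALIGN(attr_len);
        int next_offset = offsetof(struct sw_flow_actions, actions) +
                          (*sfa)->actions_len;
        int new_acts_size;

        /* fast path: the new action still fits in the current buffer */
        if (req_size <= (ksize(*sfa) - next_offset))
                goto out;

        /* grow the buffer, but never past the hard cap */
        new_acts_size = ksize(*sfa) * 2;

        if (new_acts_size > MAX_ACTIONS_BUFSIZE) {
                if ((next_offset + req_size) > MAX_ACTIONS_BUFSIZE) {
                        /* the message we see in the compute logs */
                        OVS_NLERR(log, "Flow action size exceeds max %u",
                                  MAX_ACTIONS_BUFSIZE);
                        return ERR_PTR(-EMSGSIZE);
                }
                new_acts_size = MAX_ACTIONS_BUFSIZE;
        }
        /* ... reallocate, copy the old actions over, return the tail ... */
}

So once a single flow's action list no longer fits in 32 KiB, installing that flow fails with EMSGSIZE and the message above is logged, which suggests this is a datapath implementation limit rather than anything in the OpenFlow protocol.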

Comment 9 David Hill 2023-06-29 18:49:15 UTC
Would changing MAX_ACTIONS_BUFSIZE in both the kernel and OVS in our products be possible? The code appears to have changed in the Linux master branch, and the limit might now be a bit bigger or unlimited?

Comment 30 Ihar Hrachyshka 2023-07-12 19:39:27 UTC
We are still waiting for the info requested at https://bugzilla.redhat.com/show_bug.cgi?id=2218631#c25

Comment 31 Ihar Hrachyshka 2023-07-12 19:43:45 UTC
The customer should not disable port security on their VMs unless there's a clear need to do so.

Perhaps it's non-intuitive, but setting port_security=false on ports results in them being added to the action lists of the flows that manage delivery of packets destined to "unknown" MAC addresses. This is because these ports have the "unknown" address added to their `addresses` field (the field also contains the VM IP and MAC, in addition to "unknown"). If you set port_security=false on a high number of ports in the same network and land all of them on the same chassis, then eventually you hit the OVS/kernel limit on the length of the action list that fans each packet out to all "unknown"-addressed ports, as in this BZ.

The customer still hasn't explained why they use port_security=false for regular lab VMs. If there's no clear reason, they should switch back to the default (port security enabled). Please clarify why this is not possible. Example commands for finding the affected ports and reverting follow below.
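
Something like the following should work (illustrative commands; they assume access to the OVN northbound DB and the OpenStack CLI, and the port ID is a placeholder):

# list logical switch ports whose addresses include "unknown"
ovn-nbctl --bare --columns=name,addresses list Logical_Switch_Port | grep -B1 unknown

# re-enable port security on a given Neutron port
openstack port set --enable-port-security <port-id>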

Comment 43 ovs-bot 2024-10-08 17:49:14 UTC
This bug did not meet the criteria for automatic migration and is being closed.
If the issue remains, please open a new ticket in https://issues.redhat.com/browse/FDP

