Bug 2218631
| Summary: | After creating close to 1220 VMs on a single compute, networking breaks for all of them and we see the following errors in the logs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | David Hill <dhill> |
| Component: | openvswitch2.15 | Assignee: | Aaron Conole <aconole> |
| Status: | CLOSED EOL | QA Contact: | ovs-qe |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | FDP 23.J | CC: | aconole, ctrautma, fleitner, ftaylor, ihrachys, jhsiao, lmartins, ralongi, rpawlik |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-10-08 17:49:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Hill
2023-06-29 17:36:19 UTC
```
[dhill@knox ovs]$ grep MAX_ACTIONS_BUFSIZE * -r
#if LINUX_VERSION_CODE < KERNEL_VERSION(4,9,0)
#define MAX_ACTIONS_BUFSIZE (16 * 1024)
#else
#define MAX_ACTIONS_BUFSIZE (32 * 1024)
#endif
```

Please share `dump-flows br-int` output before and after the error starts showing up. This will allow engineers to understand which flows are "overflowing" and perhaps find a way to adjust the flows so that they don't hit the kernel limit. I also wonder whether the limit is something enforced by e.g. the OpenFlow protocol, or whether it is just an implementation detail in the kernel that could in theory be changed.

Reading the kernel code, I don't see where a short int would be used. It uses unsigned short or unsigned long, which allows up to 65k... am I reading the kernel code wrong? I'm asking because this could also just be an OVS bug.

The request to share dump-flows was not addressed; restoring needinfo.

I was looking at the master kernel branch instead of 4.9.y... yes, we have this:

```
[dhill@knox openvswitch]$ grep -r MAX_ACTIONS_BUFSIZE *
flow_netlink.c:#define MAX_ACTIONS_BUFSIZE (32 * 1024)
flow_netlink.c: WARN_ON_ONCE(size > MAX_ACTIONS_BUFSIZE);
flow_netlink.c: if (new_acts_size > MAX_ACTIONS_BUFSIZE) {
flow_netlink.c: if ((next_offset + req_size) > MAX_ACTIONS_BUFSIZE) {
flow_netlink.c: MAX_ACTIONS_BUFSIZE);
flow_netlink.c: new_acts_size = MAX_ACTIONS_BUFSIZE;
flow_netlink.c: *sfa = nla_alloc_flow_actions(min(nla_len(attr), MAX_ACTIONS_BUFSIZE));
```

Would changing MAX_ACTIONS_BUFSIZE in both the kernel and OVS in our products be possible? The code appears to have changed in the linux master branch and the limit might now be somewhat larger, or removed entirely.

We are still waiting for the info requested at https://bugzilla.redhat.com/show_bug.cgi?id=2218631#c25.

The customer should not disable port security for their VMs unless there is a clear need to do so. Perhaps it is non-intuitive, but setting port_security=false on ports results in adding them to the action lists of the flows that deliver packets destined to "unknown" MAC addresses. This is because these ports have the "unknown" address added to their `addresses` field (the field also contains the VM IP and MAC, in addition to "unknown"). If you set port_security=false on a large number of ports in the same network and land all of them on the same chassis, you eventually hit the OVS/kernel limit on the length of the action list that fans each packet out to all "unknown"-addressed ports, as in this BZ. The customer still has not explained why they use port_security=false for regular lab VMs. If there is no clear reason, they should switch back to the default (port security enabled). Please clarify why this is not possible.

This bug did not meet the criteria for automatic migration and is being closed. If the issue remains, please open a new ticket in https://issues.redhat.com/browse/FDP
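
For anyone hitting the same symptoms, the flow dumps requested in the comments above would typically be collected along these lines; a sketch, assuming `br-int` is the OVN integration bridge (the usual OpenStack default) and that the commands are run on the affected compute node:

```
# OpenFlow flows on the integration bridge; capture one dump before the
# failure starts and one after, as requested in the comments above
ovs-ofctl dump-flows br-int > dump-flows-br-int-$(date +%s).txt

# Kernel datapath flows, i.e. what actually gets installed (or rejected)
# by the openvswitch kernel module
ovs-appctl dpctl/dump-flows > dpctl-dump-flows-$(date +%s).txt
```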
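The flow_netlink.c lines quoted above are where the kernel caps the serialized action list at MAX_ACTIONS_BUFSIZE (32 KiB on 4.9+ kernels) and refuses anything larger. A hedged way to look for the corresponding symptoms on an affected node (the exact log wording and paths vary by kernel and OVS version, so these greps are intentionally loose):

```
# Rate-limited "openvswitch: netlink:" messages emitted by flow_netlink.c
# when an action list exceeds MAX_ACTIONS_BUFSIZE (wording may differ)
dmesg -T | grep -i 'openvswitch'

# ovs-vswitchd also logs when it fails to install the offending datapath
# flows; the message format depends on the OVS version in use
grep -i 'failed' /var/log/openvswitch/ovs-vswitchd.log | tail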
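As a companion to the port_security explanation above, this is roughly how the "unknown"-address fan-out can be confirmed and undone; a sketch, assuming a standard OVN-on-OpenStack deployment, with `<lsp-name>` and `<port-uuid>` as placeholders:

```
# A logical switch port with port security disabled carries "unknown" in
# its addresses column, which puts it on the unknown-MAC delivery list
ovn-nbctl lsp-get-addresses <lsp-name>

# Rough count of northbound ports in that state
ovn-nbctl list Logical_Switch_Port | grep -c unknown

# Re-enable port security on the corresponding Neutron port
openstack port set --enable-port-security <port-uuid>
```

Re-enabling port security removes "unknown" from the port's addresses, which shrinks the action list of the unknown-MAC delivery flow on that chassis and keeps it under the kernel limit discussed in this bug.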