Bug 2104943

Summary: cluster unreachable after enabling IPFIX with sampling=1
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Joel Takvorian <jtakvori>
Component: openvswitch Assignee: Timothy Redaelli <tredaelli>
openvswitch sub component: daemons and tools QA Contact: Jiying Qiu <jiqiu>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: ctrautma, davegord, ffernand, jhsiao, jiqiu, mmichels, mpattric, nweinber, ralongi
Version: FDP 22.L   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-5.14.0-90.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-12-15 18:24:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2104957    

Description Joel Takvorian 2022-07-07 14:28:55 UTC
Description of problem:

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and setting the IPFIX sampling rate to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Note that:
- The same scenario worked fine in OCP 4.10.x
- In 4.11.0-rc0, a higher sampling rate works fine; even sampling=2 is still OK. Degradation starts only when setting sampling=1
- NetObserv has an alternative way to generate flows, using an eBPF agent. Using that alternative instead of OVS/IPFIX, with sampling=1, works correctly.

Also, there's another open bug that looks similar; it could have the same cause, but I'm not sure: https://bugzilla.redhat.com/show_bug.cgi?id=2103136

Version-Release number of selected component (if applicable):

Seems to be openvswitch2.17-2.17.0-22.el8fdp.x86_64
(version used in OCP 4.11.0-rc0)

How reproducible:

It is reproduced consistently by several people.

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS m6i.large instances
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit config (oc edit flowcollector cluster) to set ipfix sampling to 1 (in spec.ipfix.sampling)
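For step 3, instead of editing interactively, the sampling rate can be set with a single patch command. This is an illustrative sketch; the field path spec.ipfix.sampling matches the report above, but the exact schema depends on the NetObserv FlowCollector CRD version installed:

```shell
# Set IPFIX sampling to 1 on the cluster-wide FlowCollector resource.
# Field path (spec.ipfix.sampling) is taken from the report; verify it
# against your installed CRD with: oc explain flowcollector.spec.ipfix
oc patch flowcollector cluster --type=merge \
  -p '{"spec":{"ipfix":{"sampling":1}}}'
```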

Actual results:

Lose connectivity to the cluster


Expected results:

Stable cluster

Additional info:

Comment 2 Joel Takvorian 2022-07-12 08:18:30 UTC
Heads up: setting the IPFIX cache-max-flows parameter to 1000 (instead of the default 100) fixes the problem. My cluster is then stable, still with sampling=1.
So part of the job can be done in NetObserv by setting better default values.
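For context, this is roughly what the underlying OVS configuration looks like when sampling and cache_max_flows are set on the IPFIX record. This is a sketch only: OVN-Kubernetes manages this configuration itself, and the bridge name and collector target below are placeholders, not values from this cluster:

```shell
# Illustrative raw OVS equivalent (not the OVN-Kubernetes-managed config).
# Creates an IPFIX record on bridge br-int with per-packet sampling and a
# larger flow cache; "172.30.0.10:2055" is a placeholder collector address.
ovs-vsctl -- set Bridge br-int ipfix=@i \
  -- --id=@i create IPFIX targets='"172.30.0.10:2055"' \
     sampling=1 cache_active_timeout=60 cache_max_flows=1000
```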

However, that doesn't explain why this issue suddenly appeared in OCP 4.11. Something must be putting more pressure on OVS, or perhaps there is an increased number of sampled packets; I'd still like to understand what differs from OCP 4.10.

Comment 3 Joel Takvorian 2022-07-12 08:45:44 UTC
I spoke too soon :/
Actually I'm still losing my cluster after some time, even with cache-max-flows=1000

Comment 7 Mike Pattrick 2022-12-15 18:24:50 UTC
Closing as this issue is resolved with errata: https://access.redhat.com/errata/RHSA-2022:8267

*** This bug has been marked as a duplicate of bug 2080477 ***