Bug 2104943

Summary: cluster unreachable after enabling IPFIX with sampling=1
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Joel Takvorian <jtakvori>
Component: openvswitch Assignee: Timothy Redaelli <tredaelli>
openvswitch sub component: daemons and tools QA Contact: Jiying Qiu <jiqiu>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: ctrautma, davegord, ffernand, jhsiao, jiqiu, mmichels, mpattric, nweinber, ralongi
Version: FDP 22.L   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-5.14.0-90.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-12-15 18:24:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2104957    

Description Joel Takvorian 2022-07-07 14:28:55 UTC
Description of problem:

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and setting the IPFIX sampling rate to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Note that:
- The same scenario worked fine in OCP 4.10.x
- In 4.11.0-rc0, a higher sampling rate works fine; even sampling=2 is still OK. Degradation starts only when setting sampling=1
- NetObserv has an alternative way to generate flows, using an eBPF agent. Using that alternative instead of OVS/IPFIX, with sampling=1, works correctly.

Also, there's another open bug that looks similar; it could have the same cause, but I'm not sure: https://bugzilla.redhat.com/show_bug.cgi?id=2103136

Version-Release number of selected component (if applicable):

Seems to be openvswitch2.17-2.17.0-22.el8fdp.x86_64
(version used in OCP 4.11.0-rc0)

How reproducible:

It is reproduced consistently by several people.

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS m6i.large instances
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit config (oc edit flowcollector cluster) to set ipfix sampling to 1 (in spec.ipfix.sampling)
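For step 3, instead of editing interactively, the sampling rate can be set with a single patch command. This is an illustrative sketch; the field path spec.ipfix.sampling matches the report above, but the exact schema depends on the NetObserv FlowCollector CRD version installed:

```shell
# Set IPFIX sampling to 1 on the cluster-wide FlowCollector resource.
# Field path (spec.ipfix.sampling) is taken from the report; verify it
# against your installed CRD with: oc explain flowcollector.spec.ipfix
oc patch flowcollector cluster --type=merge \
  -p '{"spec":{"ipfix":{"sampling":1}}}'
```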

Actual results:

Lose connectivity to the cluster


Expected results:

Stable cluster

Additional info:

Comment 2 Joel Takvorian 2022-07-12 08:18:30 UTC
Heads up: setting the IPFIX cache-max-flows parameter to 1000 (instead of the default 100) fixes the problem. My cluster is then stable, still with sampling=1.
So part of the job can be done in NetObserv by setting better default values.
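For context, this is roughly what the underlying OVS configuration looks like when sampling and cache_max_flows are set on the IPFIX record. This is a sketch only: OVN-Kubernetes manages this configuration itself, and the bridge name and collector target below are placeholders, not values from this cluster:

```shell
# Illustrative raw OVS equivalent (not the OVN-Kubernetes-managed config).
# Creates an IPFIX record on bridge br-int with per-packet sampling and a
# larger flow cache; "172.30.0.10:2055" is a placeholder collector address.
ovs-vsctl -- set Bridge br-int ipfix=@i \
  -- --id=@i create IPFIX targets='"172.30.0.10:2055"' \
     sampling=1 cache_active_timeout=60 cache_max_flows=1000
```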

However, that doesn't explain why this issue suddenly appeared in OCP 4.11. Something must be putting more pressure on OVS, or perhaps there is an increased number of sampled packets; I'd still like to understand what differs from OCP 4.10.

Comment 3 Joel Takvorian 2022-07-12 08:45:44 UTC
I spoke too soon :/
Actually I'm still losing my cluster after some time, even with cache-max-flows=1000

Comment 7 Mike Pattrick 2022-12-15 18:24:50 UTC
Closing as this issue is resolved with errata: https://access.redhat.com/errata/RHSA-2022:8267

*** This bug has been marked as a duplicate of bug 2080477 ***