Bug 2104943 - cluster unreachable after enabling IPFIX with sampling=1
Keywords:
Status: CLOSED DUPLICATE of bug 2080477
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 22.L
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: Jiying Qiu
URL:
Whiteboard:
Depends On:
Blocks: 2104957
Reported: 2022-07-07 14:28 UTC by Joel Takvorian
Modified: 2022-12-21 18:25 UTC
CC List: 9 users

Fixed In Version: kernel-5.14.0-90.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-15 18:24:50 UTC
Target Upstream Version:
Embargoed:


Links
Red Hat Issue Tracker FD-2099 (last updated 2022-07-07 14:30:53 UTC)

Description Joel Takvorian 2022-07-07 14:28:55 UTC
Description of problem:

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and then setting IPFIX sampling to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Note that:
- The same scenario worked fine in OCP 4.10.x
- In 4.11.0-rc0, a higher sampling value works fine; even sampling=2 is still OK. It starts degrading only when sampling is set to 1.
- NetObserv has an alternative way to generate flows, using an eBPF agent. Using that alternative instead of OVS/IPFIX, with sampling=1, works correctly.

Also, there's another open bug that is similar - it could be the same cause, not sure: https://bugzilla.redhat.com/show_bug.cgi?id=2103136
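
For context, OVS IPFIX export is configured per bridge through the IPFIX table. A minimal sketch with ovs-vsctl, assuming a bridge named br-int and an illustrative collector address (on OCP, OVN-Kubernetes/NetObserv program this; the sketch only shows where the knobs live):

  # Attach an IPFIX record to br-int. The target address is illustrative;
  # sampling=N samples roughly 1 in N packets, so sampling=1 samples every
  # packet. cache_max_flows=100 mirrors the default mentioned in comment 2.
  ovs-vsctl -- set Bridge br-int ipfix=@i \
      -- --id=@i create IPFIX targets=\"10.0.0.10:2055\" \
         sampling=1 cache_max_flows=100

With sampling=1 every packet is sampled, i.e. the maximum possible load on the exporter, which is consistent with the degradation appearing only at that value.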

Version-Release number of selected component (if applicable):

Seems to be openvswitch2.17-2.17.0-22.el8fdp.x86_64
(version used in OCP 4.11.0-rc0)

How reproducible:

It is reproduced consistently by several people.

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS (m6i.large).
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set IPFIX sampling to 1 (spec.ipfix.sampling); an equivalent one-line patch is shown below.
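
The same change as a one-liner, using the spec.ipfix.sampling path named in step 3:

  oc patch flowcollector cluster --type=merge \
      -p '{"spec":{"ipfix":{"sampling":1}}}'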

Actual results:

Lost connectivity to the cluster.


Expected results:

Stable cluster

Additional info:

Comment 2 Joel Takvorian 2022-07-12 08:18:30 UTC
Heads up: setting the IPFIX cache-max-flows parameter to 1000 (instead of the default 100) fixes the problem. My cluster is then stable, still with sampling=1.
So there's a part of the job we can do in NetObserv by setting better default values (see the sketch below).
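
A minimal sketch of that tuning, assuming the FlowCollector exposes it as spec.ipfix.cacheMaxFlows (the field name is an assumption; the underlying OVS IPFIX column is cache_max_flows):

  # Field name cacheMaxFlows is assumed; adjust to the actual FlowCollector schema.
  oc patch flowcollector cluster --type=merge \
      -p '{"spec":{"ipfix":{"cacheMaxFlows":1000}}}'

  # Low-level equivalent on a node, against the existing IPFIX record
  # (<ipfix-uuid> is a placeholder for that record's UUID):
  ovs-vsctl set IPFIX <ipfix-uuid> cache_max_flows=1000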

However, it doesn't explain why this issue suddenly appeared in OCP 4.11. Something must be putting more pressure on OVS, or maybe there's an increased number of sampled packets; I'd still like to understand what differs from OCP 4.10.

Comment 3 Joel Takvorian 2022-07-12 08:45:44 UTC
I spoke too fast :/
Actually I'm still losing my cluster after some time, with cache-max-flows=1000

Comment 7 Mike Pattrick 2022-12-15 18:24:50 UTC
Closing as this issue is resolved with errata: https://access.redhat.com/errata/RHSA-2022:8267

*** This bug has been marked as a duplicate of bug 2080477 ***

