Bug 2104943 - cluster unreachable after enabling IPFIX with sampling=1
Keywords:
Status: CLOSED DUPLICATE of bug 2080477
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 22.L
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: Jiying Qiu
URL:
Whiteboard:
Depends On:
Blocks: 2104957
Reported: 2022-07-07 14:28 UTC by Joel Takvorian
Modified: 2022-12-21 18:25 UTC
CC List: 9 users

Fixed In Version: kernel-5.14.0-90.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-15 18:24:50 UTC
Target Upstream Version:
Embargoed:


Links
Red Hat Issue Tracker FD-2099 (last updated 2022-07-07 14:30:53 UTC)

Description Joel Takvorian 2022-07-07 14:28:55 UTC
Description of problem:

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and then setting IPFIX sampling to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Note that:
- The same scenario worked fine in OCP 4.10.x
- In 4.11.0-rc0, a higher sampling value works fine; even sampling=2 is still OK. It starts degrading only when sampling is set to 1.
- NetObserv has an alternative way to generate flows, using an eBPF agent. Using that alternative instead of OVS/IPFIX, with sampling=1, works correctly.

Also, there's another open bug that is similar - it could be the same cause, not sure: https://bugzilla.redhat.com/show_bug.cgi?id=2103136
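
For context, OVS IPFIX export is configured per bridge through the IPFIX table. A minimal sketch with ovs-vsctl, assuming a bridge named br-int and an illustrative collector address (on OCP, OVN-Kubernetes/NetObserv program this; the sketch only shows where the knobs live):

  # Attach an IPFIX record to br-int. The target address is illustrative;
  # sampling=N samples roughly 1 in N packets, so sampling=1 samples every
  # packet. cache_max_flows=100 mirrors the default mentioned in comment 2.
  ovs-vsctl -- set Bridge br-int ipfix=@i \
      -- --id=@i create IPFIX targets=\"10.0.0.10:2055\" \
         sampling=1 cache_max_flows=100

With sampling=1 every packet is sampled, i.e. the maximum possible load on the exporter, which is consistent with the degradation appearing only at that value.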

Version-Release number of selected component (if applicable):

Seems to be openvswitch2.17-2.17.0-22.el8fdp.x86_64
(version used in OCP 4.11.0-rc0)

How reproducible:

It is reproduced consistently by several people.

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS (m6i.large).
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set IPFIX sampling to 1 (spec.ipfix.sampling); an equivalent one-line patch is shown below.
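
The same change as a one-liner, using the spec.ipfix.sampling path named in step 3:

  oc patch flowcollector cluster --type=merge \
      -p '{"spec":{"ipfix":{"sampling":1}}}'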

Actual results:

Lost connectivity to the cluster.


Expected results:

Stable cluster

Additional info:

Comment 2 Joel Takvorian 2022-07-12 08:18:30 UTC
Heads up: setting the IPFIX cache-max-flows parameter to 1000 (instead of the default 100) fixes the problem. My cluster is then stable, still with sampling=1.
So there's a part of the job we can do in NetObserv by setting better default values (see the sketch below).
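
A minimal sketch of that tuning, assuming the FlowCollector exposes it as spec.ipfix.cacheMaxFlows (the field name is an assumption; the underlying OVS IPFIX column is cache_max_flows):

  # Field name cacheMaxFlows is assumed; adjust to the actual FlowCollector schema.
  oc patch flowcollector cluster --type=merge \
      -p '{"spec":{"ipfix":{"cacheMaxFlows":1000}}}'

  # Low-level equivalent on a node, against the existing IPFIX record
  # (<ipfix-uuid> is a placeholder for that record's UUID):
  ovs-vsctl set IPFIX <ipfix-uuid> cache_max_flows=1000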

However, it doesn't explain why this issue suddenly appeared in OCP 4.11. Something must be putting more pressure on OVS, or maybe there's an increased number of sampled packets; I'd still like to understand what differs from OCP 4.10.

Comment 3 Joel Takvorian 2022-07-12 08:45:44 UTC
I spoke too fast :/
Actually I'm still losing my cluster after some time, with cache-max-flows=1000

Comment 7 Mike Pattrick 2022-12-15 18:24:50 UTC
Closing as this issue is resolved with errata: https://access.redhat.com/errata/RHSA-2022:8267

*** This bug has been marked as a duplicate of bug 2080477 ***

