Description of problem:

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and setting the IPFIX sampling to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Note that:
- The same scenario worked fine in OCP 4.10.x
- In 4.11.0-rc0, a higher sampling rate works fine. Even sampling=2 is still OK. It starts degrading only when setting sampling=1
- NetObserv has an alternative way to generate flows, using an eBPF agent. Using that alternative instead of OVS/IPFIX, with sampling=1, works correctly.

Also, there's another open bug that is similar - it could be the same cause, not sure: https://bugzilla.redhat.com/show_bug.cgi?id=2103136

Version-Release number of selected component (if applicable):

Seems to be openvswitch2.17-2.17.0-22.el8fdp.x86_64 (the version used in OCP 4.11.0-rc0)

How reproducible:

Reproduced consistently by several people.

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS, m6i.large
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set IPFIX sampling to 1 (in spec.ipfix.sampling); see the sketch under Additional info below

Actual results:
Connectivity to the cluster is lost

Expected results:
Stable cluster

Additional info:
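For reference, step 3 above amounts to patching the FlowCollector resource. A minimal sketch, assuming the spec.ipfix.sampling path mentioned in this report (the exact schema depends on the operator version):

  oc patch flowcollector cluster --type=merge -p '{"spec":{"ipfix":{"sampling":1}}}'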
Heads up: setting the IPFIX cache-max-flows parameter to 1000 (instead of the default 100) fixes the problem. My cluster is then stable, still with sampling=1.

So there's a part of the job we can do on the NetObserv side by setting better default values. However, that doesn't explain why this issue suddenly appeared in OCP 4.11. Something must be putting more pressure on OVS, or maybe there's an increased number of sampled packets; I'd still like to understand what differs from OCP 4.10.
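For context, the sampling and cache-max-flows values configured in NetObserv presumably end up as columns on the OVS IPFIX table on br-int. A rough standalone equivalent, with a placeholder collector address and an illustrative timeout (neither taken from this report), would be:

  ovs-vsctl -- set Bridge br-int ipfix=@i \
      -- --id=@i create IPFIX targets=\"<collector-ip>:<port>\" \
         sampling=1 cache_active_timeout=60 cache_max_flows=1000

sampling, cache_active_timeout and cache_max_flows are standard columns of the OVS IPFIX table; this is only meant to show which knob cache-max-flows maps to at the OVS level.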
I spoke too fast :/ Actually I'm still losing my cluster after some time, with cache-max-flows=1000
Closing as this issue is resolved with errata: https://access.redhat.com/errata/RHSA-2022:8267

*** This bug has been marked as a duplicate of bug 2080477 ***