Bug 2104957

Summary: cluster becomes unstable after enabling IPFIX exports with sampling of 1:1
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Hardware: x86_64
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: Mehul Modi <memodi>
Assignee: ffernand <ffernand>
QA Contact: Anurag saxena <anusaxen>
CC: davegord, ffernandez, jtakvori, mifiedle, nweinber, rravaiol
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2023-01-05 15:51:38 UTC
Bug Depends On: 2080477, 2104943

Description Mehul Modi 2022-07-07 14:52:51 UTC
Description of problem:
Creating this tracker bug for the issue in OVS: https://bugzilla.redhat.com/show_bug.cgi?id=2104943

After enabling IPFIX flow export via NetObserv on OCP 4.11.0-rc0 and setting the IPFIX sampling rate to 1 (1:1, i.e. every packet is sampled), the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.



Version-Release number of selected component (if applicable):
OCP 4.11.0-rc0

How reproducible:
Consistently

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master on AWS (m6i.large).
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set IPFIX sampling to 1 (in spec.ipfix.sampling); see the sketch below.
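For reference, a minimal sketch of step 3 (assuming the FlowCollector instance is named "cluster", as in the command above; the exact CR layout may differ between NetObserv versions):

$ oc patch flowcollector cluster --type=merge \
    -p '{"spec":{"ipfix":{"sampling":1}}}'

or, equivalently, via "oc edit flowcollector cluster":

spec:
  ipfix:
    sampling: 1   # 1 means 1:1 sampling, i.e. every packet is exported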

Actual results:
The cluster becomes unstable, with sporadic connectivity issues.

Expected results:
The cluster should remain stable.

Additional info:

Comment 1 Mehul Modi 2022-07-07 16:49:53 UTC
Adding more info. Node status (oc get nodes) shows one master NotReady:

NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-141-43.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-148-46.us-east-2.compute.internal    NotReady   master   58m   v1.24.0+2dd8bb1
ip-10-0-154-197.us-east-2.compute.internal   Ready      worker   52m   v1.24.0+2dd8bb1
ip-10-0-182-158.us-east-2.compute.internal   Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-182-170.us-east-2.compute.internal   Ready      master   59m   v1.24.0+2dd8bb1
ip-10-0-190-24.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-203-222.us-east-2.compute.internal   Ready      master   58m   v1.24.0+2dd8bb1
ip-10-0-203-62.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-207-237.us-east-2.compute.internal   Ready      worker   53m   v1.24.0+2dd8bb1

Cluster operator (CO) states can transition to Progressing and eventually recover:

$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-rc.1   True        False         False      4s
baremetal                                  4.11.0-rc.1   True        False         False      97m
cloud-controller-manager                   4.11.0-rc.1   True        False         False      101m
cloud-credential                           4.11.0-rc.1   True        False         False      101m
cluster-autoscaler                         4.11.0-rc.1   True        False         False      97m
config-operator                            4.11.0-rc.1   True        False         False      98m
console                                    4.11.0-rc.1   True        False         False      86m
csi-snapshot-controller                    4.11.0-rc.1   True        True          False      98m     CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods
dns                                        4.11.0-rc.1   True        True          False      98m     DNS "default" reports Progressing=True: "Have 8 available DNS pods, want 9.\nHave 8 available node-resolver pods, want 9."
etcd                                       4.11.0-rc.1   True        False         False      96m
image-registry                             4.11.0-rc.1   True        False         False      92m
ingress                                    4.11.0-rc.1   True        False         False      92m
insights                                   4.11.0-rc.1   True        False         False      92m
kube-apiserver                             4.11.0-rc.1   True        False         False      93m
kube-controller-manager                    4.11.0-rc.1   True        False         False      94m
kube-scheduler                             4.11.0-rc.1   True        False         False      96m
kube-storage-version-migrator              4.11.0-rc.1   False       True          False      74s     KubeStorageVersionMigratorAvailable: Waiting for Deployment
machine-api                                4.11.0-rc.1   True        False         False      94m
machine-approver                           4.11.0-rc.1   True        False         False      98m
machine-config                             4.11.0-rc.1   True        False         False      97m
marketplace                                4.11.0-rc.1   True        False         False      98m
monitoring                                 4.11.0-rc.1   True        False         False      91m
network                                    4.11.0-rc.1   True        True          False      100m    DaemonSet "/openshift-multus/multus-admission-controller" is not available (awaiting 1 nodes)...
node-tuning                                4.11.0-rc.1   True        False         False      41m
openshift-apiserver                        4.11.0-rc.1   True        False         False      74s
openshift-controller-manager               4.11.0-rc.1   True        False         False      94m
openshift-samples                          4.11.0-rc.1   True        False         False      91m
operator-lifecycle-manager                 4.11.0-rc.1   True        False         False      97m
operator-lifecycle-manager-catalog         4.11.0-rc.1   True        False         False      98m
operator-lifecycle-manager-packageserver   4.11.0-rc.1   True        False         False      40m
service-ca                                 4.11.0-rc.1   True        False         False      98m
storage                                    4.11.0-rc.1   True        True          False      98m     AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods...

In some cases the operators eventually recover, as they did above. Note that in the above case I had AWS m5.2xlarge machines with 32 GB RAM and 8 vCPUs.
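As a possible diagnostic sketch (not captured in this report; <node-name> is a placeholder for any affected node), node CPU pressure and the host OVS daemon can be watched while sampling is set to 1, since 1:1 sampling makes OVS emit IPFIX records for every packet:

$ oc adm top nodes
$ oc get pods -n openshift-ovn-kubernetes -o wide
$ oc debug node/<node-name> -- chroot /host systemctl status ovs-vswitchd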

Comment 3 Mehul Modi 2022-07-07 17:42:17 UTC
Apologies for setting flags that I should not have.
Removing the blocker+ flag and the target release to let the engineering teams decide based on their triage.

Comment 9 Red Hat Bugzilla 2023-09-18 04:41:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.