Bug 2104957

Summary: cluster becomes unstable after enabling IPFIX exports with sampling of 1:1
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Hardware: x86_64
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: Mehul Modi <memodi>
Assignee: ffernand <ffernand>
QA Contact: Anurag saxena <anusaxen>
CC: davegord, ffernandez, jtakvori, mifiedle, nweinber, rravaiol
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2023-01-05 15:51:38 UTC
Bug Depends On: 2080477, 2104943

Description Mehul Modi 2022-07-07 14:52:51 UTC
Description of problem:
Creating this tracker bug for the issue in OVS: https://bugzilla.redhat.com/show_bug.cgi?id=2104943

After enabling IPFIX flow export via NetObserv on OCP 4.11.0-rc0 and setting the IPFIX sampling rate to 1 (1:1, i.e. every packet is sampled), the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.



Version-Release number of selected component (if applicable):
OCP 4.11.0-rc0

How reproducible:
Consistently

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master on AWS (m6i.large).
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set IPFIX sampling to 1 (in spec.ipfix.sampling); see the sketch below.
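For reference, a minimal sketch of step 3 (assuming the FlowCollector instance is named "cluster", as in the command above; the exact CR layout may differ between NetObserv versions):

$ oc patch flowcollector cluster --type=merge \
    -p '{"spec":{"ipfix":{"sampling":1}}}'

or, equivalently, via "oc edit flowcollector cluster":

spec:
  ipfix:
    sampling: 1   # 1 means 1:1 sampling, i.e. every packet is exported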

Actual results:
The cluster becomes unstable, with sporadic connectivity issues.

Expected results:
The cluster should remain stable.

Additional info:

Comment 1 Mehul Modi 2022-07-07 16:49:53 UTC
Adding more info. Node status (oc get nodes) shows one master NotReady:

NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-141-43.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-148-46.us-east-2.compute.internal    NotReady   master   58m   v1.24.0+2dd8bb1
ip-10-0-154-197.us-east-2.compute.internal   Ready      worker   52m   v1.24.0+2dd8bb1
ip-10-0-182-158.us-east-2.compute.internal   Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-182-170.us-east-2.compute.internal   Ready      master   59m   v1.24.0+2dd8bb1
ip-10-0-190-24.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-203-222.us-east-2.compute.internal   Ready      master   58m   v1.24.0+2dd8bb1
ip-10-0-203-62.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-207-237.us-east-2.compute.internal   Ready      worker   53m   v1.24.0+2dd8bb1

Cluster operator (CO) states can transition to Progressing and eventually recover:

$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-rc.1   True        False         False      4s
baremetal                                  4.11.0-rc.1   True        False         False      97m
cloud-controller-manager                   4.11.0-rc.1   True        False         False      101m
cloud-credential                           4.11.0-rc.1   True        False         False      101m
cluster-autoscaler                         4.11.0-rc.1   True        False         False      97m
config-operator                            4.11.0-rc.1   True        False         False      98m
console                                    4.11.0-rc.1   True        False         False      86m
csi-snapshot-controller                    4.11.0-rc.1   True        True          False      98m     CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods
dns                                        4.11.0-rc.1   True        True          False      98m     DNS "default" reports Progressing=True: "Have 8 available DNS pods, want 9.\nHave 8 available node-resolver pods, want 9."
etcd                                       4.11.0-rc.1   True        False         False      96m
image-registry                             4.11.0-rc.1   True        False         False      92m
ingress                                    4.11.0-rc.1   True        False         False      92m
insights                                   4.11.0-rc.1   True        False         False      92m
kube-apiserver                             4.11.0-rc.1   True        False         False      93m
kube-controller-manager                    4.11.0-rc.1   True        False         False      94m
kube-scheduler                             4.11.0-rc.1   True        False         False      96m
kube-storage-version-migrator              4.11.0-rc.1   False       True          False      74s     KubeStorageVersionMigratorAvailable: Waiting for Deployment
machine-api                                4.11.0-rc.1   True        False         False      94m
machine-approver                           4.11.0-rc.1   True        False         False      98m
machine-config                             4.11.0-rc.1   True        False         False      97m
marketplace                                4.11.0-rc.1   True        False         False      98m
monitoring                                 4.11.0-rc.1   True        False         False      91m
network                                    4.11.0-rc.1   True        True          False      100m    DaemonSet "/openshift-multus/multus-admission-controller" is not available (awaiting 1 nodes)...
node-tuning                                4.11.0-rc.1   True        False         False      41m
openshift-apiserver                        4.11.0-rc.1   True        False         False      74s
openshift-controller-manager               4.11.0-rc.1   True        False         False      94m
openshift-samples                          4.11.0-rc.1   True        False         False      91m
operator-lifecycle-manager                 4.11.0-rc.1   True        False         False      97m
operator-lifecycle-manager-catalog         4.11.0-rc.1   True        False         False      98m
operator-lifecycle-manager-packageserver   4.11.0-rc.1   True        False         False      40m
service-ca                                 4.11.0-rc.1   True        False         False      98m
storage                                    4.11.0-rc.1   True        True          False      98m     AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods...

In some cases the operators eventually recover, as they did above. Note that in the above case I had AWS m5.2xlarge machines with 32 GB RAM and 8 vCPUs.
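As a possible diagnostic sketch (not captured in this report; <node-name> is a placeholder for any affected node), node CPU pressure and the host OVS daemon can be watched while sampling is set to 1, since 1:1 sampling makes OVS emit IPFIX records for every packet:

$ oc adm top nodes
$ oc get pods -n openshift-ovn-kubernetes -o wide
$ oc debug node/<node-name> -- chroot /host systemctl status ovs-vswitchd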

Comment 3 Mehul Modi 2022-07-07 17:42:17 UTC
Apologies for setting flags that I should not have.
Removing the blocker+ flag and the target release to let the engineering teams decide based on their triage.

Comment 9 Red Hat Bugzilla 2023-09-18 04:41:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.