Description of problem:

Creating this tracker bug for the issue in OVS: https://bugzilla.redhat.com/show_bug.cgi?id=2104943

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and then setting the IPFIX sampling rate to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Version-Release number of selected component (if applicable):
OCP 4.11.0-rc0

How reproducible:
Consistently

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS m6i.large instances.
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set the IPFIX sampling rate to 1 (in spec.ipfix.sampling); an equivalent one-line patch command is shown under Additional info below.

Actual results:
The cluster becomes unstable, with sporadic connectivity issues.

Expected results:
The cluster should remain stable.

Additional info:
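For reference, step 3 can also be applied non-interactively. This is only a sketch assuming the default FlowCollector resource named "cluster", as used above; the patch command itself is not part of the original report:

$ oc patch flowcollector cluster --type=merge -p '{"spec":{"ipfix":{"sampling":1}}}'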
Adding more info:

NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-141-43.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-148-46.us-east-2.compute.internal    NotReady   master   58m   v1.24.0+2dd8bb1
ip-10-0-154-197.us-east-2.compute.internal   Ready      worker   52m   v1.24.0+2dd8bb1
ip-10-0-182-158.us-east-2.compute.internal   Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-182-170.us-east-2.compute.internal   Ready      master   59m   v1.24.0+2dd8bb1
ip-10-0-190-24.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-203-222.us-east-2.compute.internal   Ready      master   58m   v1.24.0+2dd8bb1
ip-10-0-203-62.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-207-237.us-east-2.compute.internal   Ready      worker   53m   v1.24.0+2dd8bb1

The COs' state could transition to Progressing and eventually recover:

$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-rc.1   True        False         False      4s
baremetal                                  4.11.0-rc.1   True        False         False      97m
cloud-controller-manager                   4.11.0-rc.1   True        False         False      101m
cloud-credential                           4.11.0-rc.1   True        False         False      101m
cluster-autoscaler                         4.11.0-rc.1   True        False         False      97m
config-operator                            4.11.0-rc.1   True        False         False      98m
console                                    4.11.0-rc.1   True        False         False      86m
csi-snapshot-controller                    4.11.0-rc.1   True        True          False      98m     CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods
dns                                        4.11.0-rc.1   True        True          False      98m     DNS "default" reports Progressing=True: "Have 8 available DNS pods, want 9.\nHave 8 available node-resolver pods, want 9."
etcd                                       4.11.0-rc.1   True        False         False      96m
image-registry                             4.11.0-rc.1   True        False         False      92m
ingress                                    4.11.0-rc.1   True        False         False      92m
insights                                   4.11.0-rc.1   True        False         False      92m
kube-apiserver                             4.11.0-rc.1   True        False         False      93m
kube-controller-manager                    4.11.0-rc.1   True        False         False      94m
kube-scheduler                             4.11.0-rc.1   True        False         False      96m
kube-storage-version-migrator              4.11.0-rc.1   False       True          False      74s     KubeStorageVersionMigratorAvailable: Waiting for Deployment
machine-api                                4.11.0-rc.1   True        False         False      94m
machine-approver                           4.11.0-rc.1   True        False         False      98m
machine-config                             4.11.0-rc.1   True        False         False      97m
marketplace                                4.11.0-rc.1   True        False         False      98m
monitoring                                 4.11.0-rc.1   True        False         False      91m
network                                    4.11.0-rc.1   True        True          False      100m    DaemonSet "/openshift-multus/multus-admission-controller" is not available (awaiting 1 nodes)...
node-tuning                                4.11.0-rc.1   True        False         False      41m
openshift-apiserver                        4.11.0-rc.1   True        False         False      74s
openshift-controller-manager               4.11.0-rc.1   True        False         False      94m
openshift-samples                          4.11.0-rc.1   True        False         False      91m
operator-lifecycle-manager                 4.11.0-rc.1   True        False         False      97m
operator-lifecycle-manager-catalog         4.11.0-rc.1   True        False         False      98m
operator-lifecycle-manager-packageserver   4.11.0-rc.1   True        False         False      40m
service-ca                                 4.11.0-rc.1   True        False         False      98m
storage                                    4.11.0-rc.1   True        True          False      98m     AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods...

In some cases the operators eventually recover, as they did above. Note that in the above case I had AWS m5.2xlarge machines with 32 GB RAM and 8 vCPUs.
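To follow these transitions live while reproducing, the node and cluster operator status can be watched; these watch commands are a suggestion and not part of the original report:

$ oc get nodes -w
$ oc get co -w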
Apologies for setting flags which I should not have. Removing the blocker+ flag and target release to let engineering teams decide based on their triage.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days