Description of problem:

Creating this tracker bug for the issue in OVS: https://bugzilla.redhat.com/show_bug.cgi?id=2104943

When enabling IPFIX flow export via NetObserv in OCP 4.11.0-rc0 and then setting the IPFIX sampling rate to 1, the cluster becomes unstable and eventually unreachable. I lose all ability to interact with it via oc/kubectl.

Version-Release number of selected component (if applicable):
OCP 4.11.0-rc0

How reproducible:
Consistently

Steps to Reproduce:
1. Set up an OCP 4.11.0-rc0 cluster with OVN-Kubernetes. My setup has 3 workers and 1 master, on AWS m6i.large instances.
2. Install NetObserv: https://github.com/netobserv/network-observability-operator/#getting-started
3. Edit the config (oc edit flowcollector cluster) to set the IPFIX sampling rate to 1 (in spec.ipfix.sampling); an equivalent one-line patch command is shown under Additional info below.

Actual results:
The cluster becomes unstable, with sporadic connectivity issues.

Expected results:
The cluster should remain stable.

Additional info:
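For reference, step 3 can also be applied non-interactively. This is only a sketch assuming the default FlowCollector resource named "cluster", as used above; the patch command itself is not part of the original report:

$ oc patch flowcollector cluster --type=merge -p '{"spec":{"ipfix":{"sampling":1}}}'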
Adding more info:

NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-141-43.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-148-46.us-east-2.compute.internal    NotReady   master   58m   v1.24.0+2dd8bb1
ip-10-0-154-197.us-east-2.compute.internal   Ready      worker   52m   v1.24.0+2dd8bb1
ip-10-0-182-158.us-east-2.compute.internal   Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-182-170.us-east-2.compute.internal   Ready      master   59m   v1.24.0+2dd8bb1
ip-10-0-190-24.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-203-222.us-east-2.compute.internal   Ready      master   58m   v1.24.0+2dd8bb1
ip-10-0-203-62.us-east-2.compute.internal    Ready      worker   49m   v1.24.0+2dd8bb1
ip-10-0-207-237.us-east-2.compute.internal   Ready      worker   53m   v1.24.0+2dd8bb1

The COs' state could transition to Progressing and eventually recover:

$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-rc.1   True        False         False      4s
baremetal                                  4.11.0-rc.1   True        False         False      97m
cloud-controller-manager                   4.11.0-rc.1   True        False         False      101m
cloud-credential                           4.11.0-rc.1   True        False         False      101m
cluster-autoscaler                         4.11.0-rc.1   True        False         False      97m
config-operator                            4.11.0-rc.1   True        False         False      98m
console                                    4.11.0-rc.1   True        False         False      86m
csi-snapshot-controller                    4.11.0-rc.1   True        True          False      98m     CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods
dns                                        4.11.0-rc.1   True        True          False      98m     DNS "default" reports Progressing=True: "Have 8 available DNS pods, want 9.\nHave 8 available node-resolver pods, want 9."
etcd                                       4.11.0-rc.1   True        False         False      96m
image-registry                             4.11.0-rc.1   True        False         False      92m
ingress                                    4.11.0-rc.1   True        False         False      92m
insights                                   4.11.0-rc.1   True        False         False      92m
kube-apiserver                             4.11.0-rc.1   True        False         False      93m
kube-controller-manager                    4.11.0-rc.1   True        False         False      94m
kube-scheduler                             4.11.0-rc.1   True        False         False      96m
kube-storage-version-migrator              4.11.0-rc.1   False       True          False      74s     KubeStorageVersionMigratorAvailable: Waiting for Deployment
machine-api                                4.11.0-rc.1   True        False         False      94m
machine-approver                           4.11.0-rc.1   True        False         False      98m
machine-config                             4.11.0-rc.1   True        False         False      97m
marketplace                                4.11.0-rc.1   True        False         False      98m
monitoring                                 4.11.0-rc.1   True        False         False      91m
network                                    4.11.0-rc.1   True        True          False      100m    DaemonSet "/openshift-multus/multus-admission-controller" is not available (awaiting 1 nodes)...
node-tuning                                4.11.0-rc.1   True        False         False      41m
openshift-apiserver                        4.11.0-rc.1   True        False         False      74s
openshift-controller-manager               4.11.0-rc.1   True        False         False      94m
openshift-samples                          4.11.0-rc.1   True        False         False      91m
operator-lifecycle-manager                 4.11.0-rc.1   True        False         False      97m
operator-lifecycle-manager-catalog         4.11.0-rc.1   True        False         False      98m
operator-lifecycle-manager-packageserver   4.11.0-rc.1   True        False         False      40m
service-ca                                 4.11.0-rc.1   True        False         False      98m
storage                                    4.11.0-rc.1   True        True          False      98m     AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods...

In some cases the operators eventually recover, as they did above. Note that in the above case I had AWS m5.2xlarge machines with 32 GB RAM and 8 vCPUs.
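To follow these transitions live while reproducing, the node and cluster operator status can be watched; these watch commands are a suggestion and not part of the original report:

$ oc get nodes -w
$ oc get co -w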
Apologies for setting flags which I should not have. Removing the blocker+ flag and target release to let engineering teams decide based on their triage.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days