Created attachment 1793243 [details]
grafana screenshot in case the url doesn't load

Description of problem:

During our performance test runs we noticed that OVN clusters on version 4.8+ have Prometheus consuming significantly more memory when subjecting the cluster to cluster_density tests at 25 and 50 node scales. At the 50 node scale the memory usage is high enough that the Prometheus pods get OOMKilled and can potentially exhaust the underlying node resources.

I'm not sure, but it might be related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1925061. However, we don't see the memory issue on 4.7 OVN.

Version-Release number of selected component (if applicable):
4.8/4.9

How reproducible:
Somewhat easily

Steps to Reproduce:
1. Have a 4.8 OVN cluster at 50 nodes
2. Run a cluster_density test with kube-burner to generate k8s objects in the cluster
3. Observe Prometheus memory usage

Actual results:
Prometheus uses more memory on OVN than on SDN on 4.8+

Expected results:
Prometheus memory usage stays relatively even across all versions of OVN/SDN

Additional info:

These clusters were tested as part of a workflow to performance test OCP. They were all part of a scheduled test run and each of them was given the exact same benchmark configurations defined here: https://github.com/whitleykeith/airflow-kubernetes/blob/master/dags/openshift_nightlies/tasks/benchmarks/defaults.json

The underlying scripts that run the benchmarks are here: https://github.com/cloud-bulldozer/e2e-benchmarking

4.8 OVN install configs: https://github.com/whitleykeith/airflow-kubernetes/blob/master/dags/openshift_nightlies/releases/4.8/aws/ovn/install.json

4.9 OVN install configs: https://github.com/whitleykeith/airflow-kubernetes/blob/master/dags/openshift_nightlies/releases/4.9/aws/ovn/install.json

Grafana with prom metrics from the clusters: http://dittybopper-dittybopper.apps.keith-cluster.perfscale.devcluster.openshift.com/d/oWe9aYxmke23/workload-metrics-thanos-ds-v2?orgId=1&from=1624288003575&to=1624298803576&var-platform=aws&var-openshift_version=4.8.0-0.nightly-2021-06-19-005119&var-openshift_version=4.9.0-0.nightly-2021-06-21-084703&var-openshift_version=4.7.0-0.nightly-2021-06-20-093308&var-network_type=OVNKubernetes&var-network_type=OpenShiftSDN&var-cluster_name=whitleykeith-4-7-aws-8fzqs&var-cluster_name=whitleykeith-4-7-aws-jfrz7&var-cluster_name=whitleykeith-4-8-aws-2vf5n&var-cluster_name=whitleykeith-4-9-aws-cgpbt&var-cluster_name=whitleykeith-4-8-aws-dc4cl&var-cluster_name=whitleykeith-4-9-aws-hkb26&var-machinesets=All&var-Node=All&var-Deployment=All&var-Statefulset=prometheus-k8s&var-Daemonset=All
https://bugzilla.redhat.com/show_bug.cgi?id=1925061 is likely unrelated. We know about the memory usage increase on updates, and it's connected to series churn when all containers restart. This one seems to be specific to the underlying provider. One straightforward theory would be that on OVN Prometheus ingests more series than with SDN. Is there a way to get an OVN/SDN pair of clusters, either 4.8 or 4.9, so we can investigate a bit?
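One quick way to check that theory on such a pair of clusters would be to compare per-metric series counts on both Prometheus instances. A minimal sketch in Go against the standard Prometheus HTTP query API (the localhost URL assumes a port-forward, and on an OpenShift cluster access may additionally need a bearer token; both are assumptions, not something from this report):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
    )

    func main() {
        // Assumes direct access to Prometheus, e.g. via a port-forward.
        promURL := "http://localhost:9090"
        // Top 10 metric names by number of active series; a large gap between
        // the OVN and SDN clusters would point at the offending metric(s).
        query := `topk(10, count by (__name__)({__name__=~".+"}))`

        resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }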
I would agree about OVN; however, I don't see this issue on 4.7 OVN, so it might be a relatively recent change there. We have our regular workloads running today, so we should get 4.8/4.9 clusters up. They're almost done installing now, but the workloads that reproduce this issue take a bit longer to run. I'll get the clusters into the state they were in so we can see what the differences are.
From the monitoring perspective, serviceMonitor/openshift-ovn-kubernetes/monitor-ovn-master-metrics is the offender here. One metric in particular seems to cause the brunt of the resource usage: ovnkube_master_sync_service_latency_seconds_bucket.

This metric carries a label called name whose value is a namespaced object name (probably among other things), e.g. name="cluster-density-374ea166-191f-46ba-8626-5f7859567ab3-1/deployment-1pod-1-1". The scaling test that exposed this creates many of these namespaces, and in turn we see a cardinality explosion for this metric (see screenshots attached). This dramatically increases Prometheus' resource usage and slows down the exporter quite a bit. Identifiers that can grow without constraint should not be used as label values.
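To illustrate the pattern (a hedged sketch using client_golang, not the actual ovn-kubernetes code; the variable and function names are made up): a histogram vector with an unbounded name label produces a full set of bucket series per distinct value.

    package main

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // Histogram vector keyed by a per-service "name" label: every distinct
    // namespace/service value adds its own _bucket, _sum and _count series,
    // so the series count grows with the number of objects in the cluster.
    var syncServiceLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "ovnkube_master_sync_service_latency_seconds",
            Help: "Latency of syncing a single service.",
        },
        []string{"name"},
    )

    func main() {
        prometheus.MustRegister(syncServiceLatency)

        start := time.Now()
        // Label values like "cluster-density-.../deployment-1pod-1-1" form an
        // unbounded value space driven by the test's namespaces.
        syncServiceLatency.WithLabelValues("cluster-density-example/deployment-1pod-1-1").
            Observe(time.Since(start).Seconds())
    }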
I suspect that the main issue is that the ovnkube_master_sync_service_latency_seconds_bucket series, once created, never go away when the respective namespace/pod is deleted. If I'm not mistaken, the data is exported here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/services/services_controller.go. I'm not sure how valuable this metric is, but either the metrics for deleted namespaces and pods must also be deleted, or it might be worth considering whether the name label is needed at all.
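If the name label were kept, the controller would also have to drop the child series when the corresponding object goes away. With client_golang that would look roughly like the following (a sketch reusing syncServiceLatency from the previous comment; handleServiceDelete is a hypothetical hook, not a function from the linked controller):

    // Hypothetical deletion hook; the real controller's delete path may differ.
    func handleServiceDelete(nsName string) {
        // Removes all child series (buckets, sum, count) for this label value,
        // so deleted namespaces/services stop contributing to cardinality.
        syncServiceLatency.DeleteLabelValues(nsName)
    }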
This is my fault; I didn't really understand the implications of labels on Prometheus metrics well. We can just have a single global metric, there is no need for such per-service granularity: https://github.com/ovn-org/ovn-kubernetes/pull/2279
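For reference, the shape of that change (just a sketch with the same imports as the earlier one; the actual fix is in the PR above) is a single unlabeled histogram, which keeps the series count fixed regardless of how many services exist:

    // One global histogram: a constant number of series for the whole cluster.
    var syncServiceLatency = prometheus.NewHistogram(
        prometheus.HistogramOpts{
            Name: "ovnkube_master_sync_service_latency_seconds",
            Help: "Latency of syncing services.",
        },
    )

    func recordSyncServiceLatency(start time.Time) {
        syncServiceLatency.Observe(time.Since(start).Seconds())
    }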
This fix made it downstream in https://github.com/openshift/ovn-kubernetes/pull/600
Hi, I tested OVN and SDN 4.8/4.9 side by side, with the exact same kind of workload at 50-node scale. Based on my observations, Prometheus memory usage for OVN 4.9 showed an improvement over OVN 4.8 of roughly ~16%.

Just to note: between OVN and SDN on 4.9, Prometheus on SDN used around ~12.6GB of memory (averaged across both replicas) while on OVN it used ~22GB (averaged across both replicas).

Since the fix that was merged was supposed to improve OVN, and that is what I observed, I am marking this as Verified. @kwhitley please open a new BZ if you think this issue needs further improvements.

Thanks, KK.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759