Description of problem:
The root_ca_cert_publisher_sync_duration_seconds metric tracks the sync duration in the root CA cert publisher per code and namespace [1]. The namespace label is problematic because series for a given namespace continue to be exposed even after the namespace has been deleted. On clusters with high churn of projects/namespaces (e.g. the CI cluster [2]), this can lead kube-controller-manager to expose more than 100,000 series to Prometheus which is in the .

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Always

Steps to Reproduce:
1. Execute the following PromQL query in the Prometheus UI:
   count(root_ca_cert_publisher_sync_duration_seconds_bucket)
2. Create and delete one hundred projects:
   for i in $(seq 0 99); do oc new-project "project-${i}"; done
   for i in $(seq 0 99); do oc delete project "project-${i}"; done
3. Execute the same PromQL query.
4. Execute the following PromQL query:
   root_ca_cert_publisher_sync_duration_seconds_bucket{exported_namespace="project-1"}

Actual results:
The count of series for root_ca_cert_publisher_sync_duration_seconds_bucket has increased and stays the same even though the projects have been deleted. The last PromQL query returns a result for a project that no longer exists.

Expected results:
Series for projects/namespaces that no longer exist shouldn't be exposed.

Additional info:
The cardinality issue was discussed in the upstream PR, but AFAICT it wasn't flagged as a big concern because other metrics have higher numbers of series and nobody noticed the impact caused by namespace churn.

[1] https://github.com/kubernetes/kubernetes/pull/98731
[2] https://prometheus-k8s-openshift-monitoring.apps.build01.ci.devcluster.openshift.com/tsdb-status
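
The behaviour described above can be illustrated with a minimal Python sketch. This is a toy model, not the kube-controller-manager code: the ToyHistogramVec class, the bucket bounds and the status code value are all hypothetical. It only shows why a metric keyed by label values keeps its series after the labelled namespace is gone, unless something deletes those series explicitly.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds; the real metric's buckets may differ.
BUCKETS = (0.001, 0.01, 0.1, 1.0, float("inf"))


class ToyHistogramVec:
    """Toy stand-in for a histogram partitioned by (code, namespace) labels."""

    def __init__(self):
        # One list of per-bucket counters per distinct label combination.
        self._series = defaultdict(lambda: [0] * len(BUCKETS))

    def observe(self, code, namespace, seconds):
        counts = self._series[(code, namespace)]
        # Histogram buckets are cumulative: bump every bucket whose bound fits.
        for i, upper in enumerate(BUCKETS):
            if seconds <= upper:
                counts[i] += 1

    def bucket_series_count(self):
        # Rough analogue of count(<metric>_bucket) in PromQL.
        return len(self._series) * len(BUCKETS)


vec = ToyHistogramVec()
live_namespaces = set()

before = vec.bucket_series_count()

# Create 100 namespaces; each sync records one observation.
for i in range(100):
    ns = f"project-{i}"
    live_namespaces.add(ns)
    vec.observe(200, ns, 0.005)

# Delete them again. Nothing tells the metric to drop the label values.
live_namespaces.clear()

after = vec.bucket_series_count()
print(before, after)
```

In this model the series count after the deletions is still the count reached at the peak, which matches the "Actual results" above: the metric has no notion of namespace lifetime.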
This one falls under sig-auth, so I'm sending it over to the auth team to investigate.
As discussed OOB:
- short term: we'll disable the metric downstream for the time being
- upstream: we'll suggest dropping the namespace label to reduce series churn.
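
To make the upstream suggestion concrete, here is a rough Python sketch that compares the number of distinct label combinations with and without the namespace label. The label_combinations helper, the status codes and the namespace names are hypothetical, and the model ignores histogram buckets, which multiply both numbers by the same factor.

```python
def label_combinations(observations, label_names):
    """Count distinct label sets when only `label_names` are kept."""
    return len({tuple(obs[name] for name in label_names) for obs in observations})


# Hypothetical sync observations across 100 short-lived namespaces.
observations = [
    {"code": code, "namespace": f"project-{i}"}
    for i in range(100)
    for code in ("200", "404")
]

with_namespace = label_combinations(observations, ("code", "namespace"))
without_namespace = label_combinations(observations, ("code",))
print(with_namespace, without_namespace)
```

With the namespace label, the number of combinations grows with every namespace ever seen; with only the code label, it is bounded by the number of distinct codes, regardless of namespace churn.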
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438