Bug 1966126
| Summary: | root_ca_cert_publisher_sync_duration_seconds metric can have an excessive cardinality | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | apiserver-auth | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | liyao |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.8 | CC: | aos-bugs, mfojtik, surbania, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:10:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

This one falls under sig-auth, so sending it over to the auth team to investigate.

As discussed OOB:
- short term: we'll disable the metric downstream for the time being
- upstream: we'll suggest dropping the namespace label to reduce series churn

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
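
To put rough numbers on the suggestion of dropping the namespace label, the sketch below estimates how many series a single histogram exposes with and without that label. The function name and the input figures (2 status codes, 15 finite buckets, 3,000 namespaces seen since the process started) are illustrative assumptions, not measurements taken from this bug.

```python
def histogram_series(label_combinations, finite_buckets):
    """Series exposed by one histogram metric.

    Per label combination: one _bucket series per finite bucket,
    one for the +Inf bucket, plus the _sum and _count series.
    """
    return label_combinations * (finite_buckets + 1 + 2)


# Illustrative inputs only (assumptions, not values from the cluster).
codes = 2
finite_buckets = 15
namespaces_seen = 3000

with_namespace = histogram_series(codes * namespaces_seen, finite_buckets)
without_namespace = histogram_series(codes, finite_buckets)

print(with_namespace, without_namespace)
```

With the namespace label the total grows linearly with every namespace ever observed; without it the total depends only on the number of codes and buckets.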

Description of problem:
The root_ca_cert_publisher_sync_duration_seconds metric tracks the sync duration in the root CA cert publisher per code and namespace [1]. The namespace label is problematic because series for a given namespace continue to be exposed even after the namespace has been deleted. On clusters with high churn of projects/namespaces (e.g. the CI cluster [2]), it can lead kube-controller-manager to expose more than 100,000 series to Prometheus which is in the .

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Always

Steps to Reproduce:
1. Execute the following PromQL query in the Prometheus UI:
   count(root_ca_cert_publisher_sync_duration_seconds_bucket)
2. Create and delete one hundred projects:
   for i in $(seq 0 99); do oc new-project "project-${i}"; done
   for i in $(seq 0 99); do oc delete project "project-${i}"; done
3. Execute the same PromQL query.
4. Execute the following PromQL query:
   root_ca_cert_publisher_sync_duration_seconds_bucket{exported_namespace="project-1"}

Actual results:
The count of series for root_ca_cert_publisher_sync_duration_seconds_bucket has increased and stays the same even though the projects have been deleted. The last PromQL query returns a result for a project that no longer exists.

Expected results:
Series for projects/namespaces that no longer exist shouldn't be exposed.

Additional info:
The cardinality issue was discussed in the upstream PR, but AFAICT it wasn't flagged as a big concern because other metrics have higher cardinality and nobody noticed the impact caused by namespace churn.

[1] https://github.com/kubernetes/kubernetes/pull/98731
[2] https://prometheus-k8s-openshift-monitoring.apps.build01.ci.devcluster.openshift.com/tsdb-status
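
The behaviour seen in the reproduction steps can be illustrated with a small stdlib-only simulation. This is a sketch, not the kube-controller-manager code: `SyncDurationHistogram`, its bucket bounds and the `observe` signature are hypothetical stand-ins for a label-keyed histogram that has no path for deleting label sets.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds; the real metric's buckets may differ.
BUCKETS = (0.01, 0.1, 1.0, float("inf"))


class SyncDurationHistogram:
    """Toy histogram keyed by (code, namespace), with no deletion path."""

    def __init__(self):
        # (code, namespace) -> cumulative observation count per bucket
        self._series = defaultdict(lambda: [0] * len(BUCKETS))

    def observe(self, code, namespace, seconds):
        counts = self._series[(code, namespace)]
        for i, upper in enumerate(BUCKETS):
            if seconds <= upper:
                counts[i] += 1

    def bucket_series_count(self):
        # Analogous to count(<metric>_bucket) in PromQL.
        return len(self._series) * len(BUCKETS)


hist = SyncDurationHistogram()
live_namespaces = set()

before = hist.bucket_series_count()

# Create 100 projects; each sync records one observation for its namespace.
for i in range(100):
    ns = f"project-{i}"
    live_namespaces.add(ns)
    hist.observe("200", ns, 0.05)

# Delete the projects; nothing removes their label sets from the histogram.
live_namespaces.clear()

after = hist.bucket_series_count()
print(before, after, len(live_namespaces))
```

The series count stays at its peak once the namespaces are gone, because the label sets are only ever added, which matches the "Actual results" above.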