Bug 2057832
Summary: | expr for record rule: "cluster:telemetry_selected_series:count" is improper | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
Component: | Monitoring | Assignee: | Joao Marcal <jmarcal> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4.10 | CC: | ademicev, amuller, anpicker, hongyli, spasquie, wking |
Target Milestone: | --- | ||
Target Release: | 4.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-08-10 10:51:15 UTC | Type: | Bug |
Description
Junqi Zhao
2022-02-24 06:44:29 UTC
The following docs also need to be updated:
https://github.com/openshift/cluster-monitoring-operator/Documentation/telemeter_query
https://github.com/openshift/cluster-monitoring-operator/Documentation/sample-metrics.md

Indeed, the rule expression isn't completely accurate, since it may count more series than what is effectively sent to telemeter. Something like this would work:

# series with 1 label selector on the metric name only
count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|..."})
+
# series with more than one label selector
count({__name__=~"ALERTS",alertstate="firing"})
+
count({__name__=~"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"})
+
...

Hi,
It looks like the issue appears in a bunch of techpreview jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-cap[…]operator-main-e2e-aws-capi-techpreview/1510920053355188224
This is currently blocking our CI.

An example with a full link:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-capi-operator/46/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview/1511259398029185024

I suspect that even if we fix the recording rule, the CI would still be failing. The test has a hardcoded value of 600 for what is considered to be the maximum number of series sent to telemeter, but it probably needs an update since we're continuously adding new metrics.

Thinking more about the recording rule itself, it would be less error-prone to use the metrics from telemeter-client, which capture exactly the number of samples being sent:

max(federate_samples - federate_filtered_samples)

Obviously this would have to be revisited once/if we switch to Prometheus remote write, but there's no concrete date for now.
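To illustrate why matching on the metric name alone can overcount, here is a minimal Python sketch. It is not the operator's code: the series, label values, and helper names (`count_name_only`, `count_per_selector`) are invented for illustration, and only equality matchers are modeled.

```python
import re

# Hypothetical series stored in Prometheus (label sets invented for illustration).
series = [
    {"__name__": "ALERTS", "alertstate": "firing", "alertname": "Watchdog"},
    {"__name__": "ALERTS", "alertstate": "pending", "alertname": "KubePodNotReady"},
    {"__name__": "cluster_version", "type": "current"},
    {"__name__": "instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",
     "quantile": "0.99"},
    {"__name__": "instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",
     "quantile": "0.5"},
]

# Assumed telemetry selectors: a metric name plus optional equality matchers.
selectors = [
    ("cluster_version", {}),
    ("ALERTS", {"alertstate": "firing"}),
    ("instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",
     {"quantile": "0.99"}),
]


def count_name_only(series, selectors):
    """Count series whose name matches any selector, ignoring the other labels."""
    pattern = "|".join(name for name, _ in selectors)
    # PromQL regex matchers are fully anchored, hence fullmatch.
    return sum(1 for s in series if re.fullmatch(pattern, s["__name__"]))


def count_per_selector(series, selectors):
    """Sum one count per selector, honoring every label matcher."""
    total = 0
    for name, labels in selectors:
        for s in series:
            name_ok = re.fullmatch(name, s["__name__"]) is not None
            labels_ok = all(s.get(k) == v for k, v in labels.items())
            if name_ok and labels_ok:
                total += 1
    return total


print(count_name_only(series, selectors))     # counts pending alerts and the 0.5 quantile too
print(count_per_selector(series, selectors))  # counts only what the selectors would send
```

With this sample data the name-only count includes the pending alert and the 0.5 quantile series, which the full selectors would exclude.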
Checked with 4.11.0-0.nightly-2022-06-11-120123. The expression has now changed to the one below, and the result is the same as with the previous correct expression.

# oc -n openshift-monitoring get PrometheusRule telemetry -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-06-13T03:02:13Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "21091"
  uid: 1b3bb6fb-1d3d-4983-9cb1-82660c126b92
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: max(federate_samples - federate_filtered_samples)
      record: cluster:telemetry_selected_series:count

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
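For readers less familiar with PromQL, the arithmetic behind `max(federate_samples - federate_filtered_samples)` can be sketched in Python. This is only a rough model under stated assumptions: the binary operator pairs series with identical label sets, the label keys and sample values below are made up, and the function name `selected_series_count` is hypothetical.

```python
def selected_series_count(federate_samples, federate_filtered_samples):
    """Model of max(federate_samples - federate_filtered_samples).

    Both arguments map a label set (here a tuple) to a sample value.
    Only label sets present on both sides are paired, as in PromQL
    one-to-one vector matching.
    """
    diffs = [
        total - federate_filtered_samples[labels]
        for labels, total in federate_samples.items()
        if labels in federate_filtered_samples
    ]
    # An empty result in PromQL is modeled as None here.
    return max(diffs) if diffs else None


# Hypothetical values for two telemeter-client pods.
scraped = {("pod-a",): 520.0, ("pod-b",): 500.0}
filtered = {("pod-a",): 20.0, ("pod-b",): 10.0}

print(selected_series_count(scraped, filtered))
```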