Bug 2057832 - expr for record rule: "cluster:telemetry_selected_series:count" is improper
Summary: expr for record rule: "cluster:telemetry_selected_series:count" is improper
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-24 06:44 UTC by Junqi Zhao
Modified: 2022-08-10 10:51 UTC (History)
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1646 0 None open Bug 2057832: Updates recording rule cluster:telemetry_selected_series:count 2022-05-16 16:39:00 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:51:34 UTC

Description Junqi Zhao 2022-02-24 06:44:29 UTC
Description of problem:
This bug was found while reviewing https://github.com/openshift/openshift-docs/pull/42163: the expression of the recording rule "cluster:telemetry_selected_series:count" is wrong and should be updated to the correct expr once the doc is approved.
# oc get prometheusrules telemetry -n openshift-monitoring -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-02-23T23:20:35Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "20147"
  uid: 979901ae-b88d-4ccc-8f94-734ca846a1d2
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|cluster_version_available_updates|cluster_operator_up|cluster_operator_conditions|cluster_version_payload|cluster_installer|cluster_infrastructure_provider|cluster_feature_set|instance:etcd_object_counts:sum|ALERTS|code:apiserver_request_total:rate:sum|cluster:capacity_cpu_cores:sum|cluster:capacity_memory_bytes:sum|cluster:cpu_usage_cores:sum|cluster:memory_usage_bytes:sum|openshift:cpu_usage_cores:sum|openshift:memory_usage_bytes:sum|workload:cpu_usage_cores:sum|workload:memory_usage_bytes:sum|cluster:virt_platform_nodes:sum|cluster:node_instance_type_count:sum|cnv:vmi_status_running:count|cluster:vmi_request_cpu_cores:sum|node_role_os_version_machine:cpu_capacity_cores:sum|node_role_os_version_machine:cpu_capacity_sockets:sum|subscription_sync_total|olm_resolution_duration_seconds|csv_succeeded|csv_abnormal|cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum|cluster:kubelet_volume_stats_used_bytes:provisioner:sum|ceph_cluster_total_bytes|ceph_cluster_total_used_raw_bytes|ceph_health_status|job:ceph_osd_metadata:count|job:kube_pv:count|job:ceph_pools_iops:total|job:ceph_pools_iops_bytes:total|job:ceph_versions_running:count|job:noobaa_total_unhealthy_buckets:sum|job:noobaa_bucket_count:sum|job:noobaa_total_object_count:sum|noobaa_accounts_num|noobaa_total_usage|console_url|cluster:network_attachment_definition_instances:max|cluster:network_attachment_definition_enabled_instance_up:max|cluster:ingress_controller_aws_nlb_active:sum|insightsclient_request_send_total|cam_app_workload_migrations|cluster:apiserver_current_inflight_requests:sum:max_over_time:2m|cluster:alertmanager_integrations:max|cluster:telemetry_selected_series:count|openshift:prometheus_tsdb_head_series:sum|openshift:prometheus_tsdb_head_samples_appended_total:sum|monitoring:container_memory_working_set_bytes:sum|namespace_job:scrape_series_added:topk3_sum1h|namespace_job:scrape_samples_post_metric_relabeling:topk3|monitoring:haproxy_server_http_responses_total:sum|rhmi_status|cluster_legacy_scheduler_policy|cluster_master_schedulable|che_workspace_status|che_workspace_started_total|che_workspace_failure_total|che_workspace_start_time_seconds_sum|che_workspace_start_time_seconds_count|cco_credentials_mode|cluster:kube_persistentvolume_plugin_type_counts:sum|visual_web_terminal_sessions_total|acm_managed_cluster_info|cluster:vsphere_vcenter_info:sum|cluster:vsphere_esxi_version_total:sum|cluster:vsphere_node_hw_version_total:sum|openshift:build_by_strategy:sum|rhods_aggregate_availability|rhods_total_users|instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_bytes:sum|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile|jaeger_operator_instances_storage_types|jaeger_operator_instances_strategies|jaeger_operator_instances_agent_strategies|appsvcs:cores_by_product:sum|nto_custom_profiles:count|openshift_csi_share_configmap|openshift_csi_share_secret|openshift_csi_share_mount_failures_total|openshift_csi_share_mount_requests_total",alertstate=~"firing|",quantile=~"0.99|0.99|0.99|"})
      record: cluster:telemetry_selected_series:count


reason:
The ,quantile=~"0.99|0.99|0.99|" selector in the query applies to all the metrics, which is redundant; it should apply only to these 3 metrics:
instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile
instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile
instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile

this is defined in telemetry-config configmap
# oc -n openshift-monitoring get cm telemetry-config -o jsonpath="{.data.metrics\.yaml}" | grep {__name__= 
...
- '{__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}'
...
- '{__name__="olm_resolution_duration_seconds"}'
...

If we use this expr, the query only keeps series with quantile="0.99" (or with no quantile label at all), for example:
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.99", service="catalog-operator-metrics"}  3.022626623

Actually, we also have quantile="0.9" and quantile="0.95" metrics; those series would then be missing from the result.
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.9", service="catalog-operator-metrics"}   0.400797379
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.95", service="catalog-operator-metrics"}  0.424998395
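The label-matching behavior described above can be illustrated with a small Python sketch (my illustration, not part of the bug report): PromQL regex label matchers are fully anchored, and a missing label behaves like an empty string, so quantile=~"0.99|0.99|0.99|" keeps quantile="0.99" series and series without a quantile label, but silently drops quantile="0.9" and quantile="0.95".

```python
import re

# PromQL regex matchers are anchored at both ends, which is what
# re.fullmatch does; a missing label matches as the empty string "".
pattern = re.compile("0.99|0.99|0.99|")

def survives_matcher(quantile_value: str) -> bool:
    """True if a series with this quantile label value passes the matcher."""
    return pattern.fullmatch(quantile_value) is not None

print(survives_matcher("0.99"))  # True  - kept
print(survives_matcher(""))      # True  - the empty alternative keeps unlabeled series
print(survives_matcher("0.9"))   # False - dropped, hence the missing olm series
print(survives_matcher("0.95"))  # False - dropped
```

This is why applying the quantile selector globally changes the count for metrics like olm_resolution_duration_seconds that legitimately expose 0.9 and 0.95 quantiles.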
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-02-22-093600

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:

Expected results:


Additional info:

Comment 2 Simon Pasquier 2022-03-16 10:42:45 UTC
Indeed the rule expression isn't completely accurate since it may count more series than what is actually sent to telemeter. Something like this would work:

# series with 1 label selector on the metric name only
count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|..."})
+
# series with more than one label selector  
count({__name__=~"ALERTS",alertstate="firing"}) + count({__name__=~"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}) + ...
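The shape of that fix, counting each selector group separately and summing the results instead of applying every label selector to every metric, can be modeled with a toy Python sketch (the series and selector groups below are made up for illustration, not the real telemeter allow-list):

```python
# Each series is a (metric name, labels) pair; real telemeter matchers are
# PromQL selectors, but plain dict comparisons are enough to show the idea.
series = [
    ("cluster_version", {}),
    ("ALERTS", {"alertstate": "firing"}),
    ("ALERTS", {"alertstate": "pending"}),  # filtered out, must not be counted
    ("instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",
     {"quantile": "0.99"}),
    ("olm_resolution_duration_seconds", {"quantile": "0.9"}),  # name-only selector keeps it
]

def count(names, **labels):
    """Count series whose name is in `names` and whose labels match exactly."""
    return sum(
        1 for name, lbls in series
        if name in names and all(lbls.get(k) == v for k, v in labels.items())
    )

# Group 1: metrics selected by name only; groups 2 and 3: extra label selectors.
name_only = {"cluster_version", "olm_resolution_duration_seconds"}
total = (
    count(name_only)
    + count({"ALERTS"}, alertstate="firing")
    + count({"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile"},
            quantile="0.99")
)
print(total)  # 4
```

Summing per-group counts keeps the quantile="0.9" olm series (selected by name only) while still excluding the pending alert, which a single global selector cannot do.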

Comment 3 Alexander Demicev 2022-04-05 11:19:08 UTC
Hi,
It looks like the issue appears in a bunch of techpreview jobs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-cap[…]operator-main-e2e-aws-capi-techpreview/1510920053355188224
This is currently blocking our CI.

Comment 4 Simon Pasquier 2022-04-05 12:40:23 UTC
an example with a full link: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-capi-operator/46/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview/1511259398029185024

I suspect that even if we fix the recording rule, the CI would still fail. The test has a hardcoded value of 600 for the maximum number of series expected to be sent to telemeter, and that value probably needs an update since we're continuously adding new metrics.

Comment 5 Simon Pasquier 2022-04-07 09:30:49 UTC
Thinking more about the recording rule itself, it would be less error-prone to use the metrics from telemeter-client, which capture exactly the number of samples being sent:
max(federate_samples - federate_filtered_samples)

Obviously this would have to be revisited once/if we switch to Prometheus remote write but there's no concrete date for now.

Comment 8 Junqi Zhao 2022-06-13 06:52:22 UTC
Checked with 4.11.0-0.nightly-2022-06-11-120123: the expression has been changed to the one below, and the result is the same as with the previous correct expression.
# oc -n openshift-monitoring get PrometheusRule telemetry -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-06-13T03:02:13Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "21091"
  uid: 1b3bb6fb-1d3d-4983-9cb1-82660c126b92
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: max(federate_samples - federate_filtered_samples)
      record: cluster:telemetry_selected_series:count

Comment 13 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

