Bug 2057832

Summary:	expr for record rule: "cluster:telemetry_selected_series:count" is improper
Product:	OpenShift Container Platform	Reporter:	Junqi Zhao <juzhao>
Component:	Monitoring	Assignee:	Joao Marcal <jmarcal>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	low	Docs Contact:
Priority:	low
Version:	4.10	CC:	ademicev, amuller, anpicker, hongyli, spasquie, wking
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 10:51:15 UTC	Type:	Bug

Description Junqi Zhao 2022-02-24 06:44:29 UTC

Description of problem:
This bug was found while reviewing https://github.com/openshift/openshift-docs/pull/42163. The expr of the recording rule "cluster:telemetry_selected_series:count" is wrong and should be updated to the correct expr after the doc is approved.
# oc get prometheusrules telemetry -n openshift-monitoring -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-02-23T23:20:35Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "20147"
  uid: 979901ae-b88d-4ccc-8f94-734ca846a1d2
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|cluster_version_available_updates|cluster_operator_up|cluster_operator_conditions|cluster_version_payload|cluster_installer|cluster_infrastructure_provider|cluster_feature_set|instance:etcd_object_counts:sum|ALERTS|code:apiserver_request_total:rate:sum|cluster:capacity_cpu_cores:sum|cluster:capacity_memory_bytes:sum|cluster:cpu_usage_cores:sum|cluster:memory_usage_bytes:sum|openshift:cpu_usage_cores:sum|openshift:memory_usage_bytes:sum|workload:cpu_usage_cores:sum|workload:memory_usage_bytes:sum|cluster:virt_platform_nodes:sum|cluster:node_instance_type_count:sum|cnv:vmi_status_running:count|cluster:vmi_request_cpu_cores:sum|node_role_os_version_machine:cpu_capacity_cores:sum|node_role_os_version_machine:cpu_capacity_sockets:sum|subscription_sync_total|olm_resolution_duration_seconds|csv_succeeded|csv_abnormal|cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum|cluster:kubelet_volume_stats_used_bytes:provisioner:sum|ceph_cluster_total_bytes|ceph_cluster_total_used_raw_bytes|ceph_health_status|job:ceph_osd_metadata:count|job:kube_pv:count|job:ceph_pools_iops:total|job:ceph_pools_iops_bytes:total|job:ceph_versions_running:count|job:noobaa_total_unhealthy_buckets:sum|job:noobaa_bucket_count:sum|job:noobaa_total_object_count:sum|noobaa_accounts_num|noobaa_total_usage|console_url|cluster:network_attachment_definition_instances:max|cluster:network_attachment_definition_enabled_instance_up:max|cluster:ingress_controller_aws_nlb_active:sum|insightsclient_request_send_total|cam_app_workload_migrations|cluster:apiserver_current_inflight_requests:sum:max_over_time:2m|cluster:alertmanager_integrations:max|cluster:telemetry_selected_series:count|openshift:prometheus_tsdb_head_series:sum|openshift:prometheus_tsdb_head_samples_appended_total:sum|monitoring:container_memory_working_set_bytes:sum|namespace_job:scrape_series_added:topk3_sum1h|namespace_job:scrape_samples_
post_metric_relabeling:topk3|monitoring:haproxy_server_http_responses_total:sum|rhmi_status|cluster_legacy_scheduler_policy|cluster_master_schedulable|che_workspace_status|che_workspace_started_total|che_workspace_failure_total|che_workspace_start_time_seconds_sum|che_workspace_start_time_seconds_count|cco_credentials_mode|cluster:kube_persistentvolume_plugin_type_counts:sum|visual_web_terminal_sessions_total|acm_managed_cluster_info|cluster:vsphere_vcenter_info:sum|cluster:vsphere_esxi_version_total:sum|cluster:vsphere_node_hw_version_total:sum|openshift:build_by_strategy:sum|rhods_aggregate_availability|rhods_total_users|instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_bytes:sum|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile|jaeger_operator_instances_storage_types|jaeger_operator_instances_strategies|jaeger_operator_instances_agent_strategies|appsvcs:cores_by_product:sum|nto_custom_profiles:count|openshift_csi_share_configmap|openshift_csi_share_secret|openshift_csi_share_mount_failures_total|openshift_csi_share_mount_requests_total",alertstate=~"firing|",quantile=~"0.99|0.99|0.99|"})
      record: cluster:telemetry_selected_series:count


Reason:
The matcher ,quantile=~"0.99|0.99|0.99|" in the query applies to all the metrics and is redundant (0.99 is repeated three times). It should apply only to these 3 metrics:
instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile
instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile
instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile

This is defined in the telemetry-config configmap:
# oc -n openshift-monitoring get cm telemetry-config -o jsonpath="{.data.metrics\.yaml}" | grep {__name__= 
...
- '{__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}'
...
- '{__name__="olm_resolution_duration_seconds"}'
...
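
A minimal sketch of how a merged selector like the one in the rule could arise. It assumes (this is not confirmed anywhere in this report) that the per-metric match rules are folded into a single selector by joining the values seen for each label with "|" and adding an empty alternative when some rules do not set that label. The helper name merge_selectors and the trimmed rule list are hypothetical.

```python
import re

# Trimmed, illustrative subset of the match rules from telemetry-config.
RULES = [
    '{__name__="ALERTS",alertstate="firing"}',
    '{__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}',
    '{__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}',
    '{__name__="instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}',
    '{__name__="olm_resolution_duration_seconds"}',
]

LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def merge_selectors(rules):
    """Hypothetical merge: one regex matcher per label across all rules."""
    parsed = [dict(LABEL_RE.findall(rule)) for rule in rules]
    # Collect label names in first-seen order.
    label_names = []
    for labels in parsed:
        for name in labels:
            if name not in label_names:
                label_names.append(name)
    matchers = []
    for name in label_names:
        values = [labels[name] for labels in parsed if name in labels]
        if len(values) < len(parsed):
            # Some rules do not set this label: allow the empty value too.
            values.append("")
        matchers.append('%s=~"%s"' % (name, "|".join(values)))
    return "count({" + ",".join(matchers) + "})"


merged = merge_selectors(RULES)
print(merged)
```

With this merge, the quantile matcher that was meant for the three etcd metrics ends up attached to every metric name in the selector, including olm_resolution_duration_seconds.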

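If we use this expr, then for any metric that carries a quantile label the query keeps only the quantile="0.99" series (series without a quantile label still match because of the trailing empty alternative). Example: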
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.99", service="catalog-operator-metrics"}  3.022626623

However, we also have quantile="0.9" and quantile="0.95" series for this metric, and these would not be in the query result:
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.9", service="catalog-operator-metrics"}   0.400797379
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.95", service="catalog-operator-metrics"}  0.424998395
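
The effect of the quantile matcher can be reproduced outside Prometheus with a small sketch. It relies on two properties of Prometheus regex matchers: the pattern is anchored at both ends, and a missing label is treated as the empty string. Python's re module stands in for RE2 here, which is equivalent for this simple pattern; the series list is trimmed and illustrative.

```python
import re

QUANTILE_PATTERN = "0.99|0.99|0.99|"


def label_matches(pattern, labels, name):
    # Anchored match; a label that is absent is matched as "".
    return re.fullmatch(pattern, labels.get(name, "")) is not None


series = [
    {"__name__": "olm_resolution_duration_seconds", "quantile": "0.9"},
    {"__name__": "olm_resolution_duration_seconds", "quantile": "0.95"},
    {"__name__": "olm_resolution_duration_seconds", "quantile": "0.99"},
    {"__name__": "cluster_version"},  # no quantile label at all
]

kept = [s for s in series if label_matches(QUANTILE_PATTERN, s, "quantile")]
for s in kept:
    print(s)
```

Only the quantile="0.99" series and the series without a quantile label survive, so the recording rule would miss the 0.9 and 0.95 series even though the match rule for olm_resolution_duration_seconds does not restrict the quantile.
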
Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. see the description
2.
3.

Actual results:
4.10.0-0.nightly-2022-02-22-093600

Expected results:


Additional info:

Comment 1 hongyan li 2022-02-24 07:18:27 UTC

The following docs also need to be updated:
https://github.com/openshift/cluster-monitoring-operator/Documentation/telemeter_query
https://github.com/openshift/cluster-monitoring-operator/Documentation/sample-metrics.md

Comment 2 Simon Pasquier 2022-03-16 10:42:45 UTC

Indeed the rule expression isn't completely accurate since it may count more series than what is effectively sent to telemeter. Something like this would work:

# series with 1 label selector on the metric name only
count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|..."})
+
# series with more than one label selector  
count({__name__=~"ALERTS",alertstate="firing"}) + count({__name__=~"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}) + ...
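
The split suggested in this comment can be sketched as a small generator that groups the match rules by their extra label matchers and emits one count() per group. This is only an illustration under stated assumptions: the helper name build_expr is hypothetical, the rule list is trimmed, and only equality matchers are handled for labels other than __name__.

```python
import re

# Accepts both = and =~ so that name patterns such as cluster:usage:.* parse.
LABEL_RE = re.compile(r'(\w+)=~?"([^"]*)"')

RULES = [
    '{__name__=~"cluster:usage:.*"}',
    '{__name__="cluster_version"}',
    '{__name__="ALERTS",alertstate="firing"}',
    '{__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}',
    '{__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}',
]


def build_expr(rules):
    """Emit one count() per distinct set of extra label matchers."""
    groups = {}  # extra-label tuple -> metric names (dicts keep insertion order)
    for rule in rules:
        labels = dict(LABEL_RE.findall(rule))
        name = labels.pop("__name__")
        key = tuple(sorted(labels.items()))
        groups.setdefault(key, []).append(name)
    terms = []
    for extra, names in groups.items():
        matchers = ['__name__=~"%s"' % "|".join(names)]
        matchers += ['%s="%s"' % (k, v) for k, v in extra]
        terms.append("count({%s})" % ",".join(matchers))
    return " + ".join(terms)


expr = build_expr(RULES)
print(expr)
```

One caveat with summing counts this way: in PromQL, count() over an empty selection returns no sample rather than 0, so a single empty term makes the whole sum empty. Appending "or vector(0)" to each term is a common workaround.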

Comment 3 Alexander Demicev 2022-04-05 11:19:08 UTC

Hi,
It looks like the issue appears in a bunch of techpreview jobs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-cap[…]operator-main-e2e-aws-capi-techpreview/1510920053355188224
This is currently blocking our CI.

Comment 4 Simon Pasquier 2022-04-05 12:40:23 UTC

An example with a full link: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-capi-operator/46/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview/1511259398029185024

I suspect that even if we fix the recording rule, the CI would still be failing. The test has a hardcoded value of 600 for what is considered to be the maximum number of series sent to telemeter. But it probably needs an update since we're continuously adding new metrics.

Comment 5 Simon Pasquier 2022-04-07 09:30:49 UTC

Thinking more about the recording rule itself, it would be less error prone to use the metrics from telemeter-client which capture exactly the number of samples being sent:
max(federate_samples - federate_filtered_samples)

Obviously this would have to be revisited once/if we switch to Prometheus remote write but there's no concrete date for now.
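
The arithmetic behind the expression proposed in this comment can be illustrated with a small sketch. The instance names and sample values below are made up; the sketch only mimics how a PromQL binary operation pairs series with identical label sets before max() aggregates the result.

```python
def subtract(lhs, rhs):
    # One-to-one vector matching: keep label sets present on both sides.
    return {labels: lhs[labels] - rhs[labels] for labels in lhs if labels in rhs}


# Hypothetical values, keyed by a made-up instance label.
federate_samples = {("telemeter-client-a",): 520.0, ("telemeter-client-b",): 515.0}
federate_filtered_samples = {("telemeter-client-a",): 20.0, ("telemeter-client-b",): 30.0}

sent_per_instance = subtract(federate_samples, federate_filtered_samples)
result = max(sent_per_instance.values())
print(result)
```

Because the value comes from what telemeter-client reports rather than from a selector rebuilt from the match rules, it cannot drift when the rule list changes.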

Comment 8 Junqi Zhao 2022-06-13 06:52:22 UTC

Checked with 4.11.0-0.nightly-2022-06-11-120123: the expression has changed to the one below, and the result is the same as with the previous correct expression.
# oc -n openshift-monitoring get PrometheusRule telemetry -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-06-13T03:02:13Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "21091"
  uid: 1b3bb6fb-1d3d-4983-9cb1-82660c126b92
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: max(federate_samples - federate_filtered_samples)
      record: cluster:telemetry_selected_series:count

Comment 13 errata-xmlrpc 2022-08-10 10:51:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069