Bug 2057832 - expr for record rule: "cluster:telemetry_selected_series:count" is improper
Summary: expr for record rule: "cluster:telemetry_selected_series:count" is improper
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-24 06:44 UTC by Junqi Zhao
Modified: 2022-08-10 10:51 UTC (History)
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1646 0 None open Bug 2057832: Updates recording rule cluster:telemetry_selected_series:count 2022-05-16 16:39:00 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:51:34 UTC

Description Junqi Zhao 2022-02-24 06:44:29 UTC
Description of problem:
This bug was found while reviewing https://github.com/openshift/openshift-docs/pull/42163: the expression of the recording rule "cluster:telemetry_selected_series:count" is wrong and should be updated to the correct expr once the doc is approved.
# oc get prometheusrules telemetry -n openshift-monitoring -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-02-23T23:20:35Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "20147"
  uid: 979901ae-b88d-4ccc-8f94-734ca846a1d2
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|cluster_version_available_updates|cluster_operator_up|cluster_operator_conditions|cluster_version_payload|cluster_installer|cluster_infrastructure_provider|cluster_feature_set|instance:etcd_object_counts:sum|ALERTS|code:apiserver_request_total:rate:sum|cluster:capacity_cpu_cores:sum|cluster:capacity_memory_bytes:sum|cluster:cpu_usage_cores:sum|cluster:memory_usage_bytes:sum|openshift:cpu_usage_cores:sum|openshift:memory_usage_bytes:sum|workload:cpu_usage_cores:sum|workload:memory_usage_bytes:sum|cluster:virt_platform_nodes:sum|cluster:node_instance_type_count:sum|cnv:vmi_status_running:count|cluster:vmi_request_cpu_cores:sum|node_role_os_version_machine:cpu_capacity_cores:sum|node_role_os_version_machine:cpu_capacity_sockets:sum|subscription_sync_total|olm_resolution_duration_seconds|csv_succeeded|csv_abnormal|cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum|cluster:kubelet_volume_stats_used_bytes:provisioner:sum|ceph_cluster_total_bytes|ceph_cluster_total_used_raw_bytes|ceph_health_status|job:ceph_osd_metadata:count|job:kube_pv:count|job:ceph_pools_iops:total|job:ceph_pools_iops_bytes:total|job:ceph_versions_running:count|job:noobaa_total_unhealthy_buckets:sum|job:noobaa_bucket_count:sum|job:noobaa_total_object_count:sum|noobaa_accounts_num|noobaa_total_usage|console_url|cluster:network_attachment_definition_instances:max|cluster:network_attachment_definition_enabled_instance_up:max|cluster:ingress_controller_aws_nlb_active:sum|insightsclient_request_send_total|cam_app_workload_migrations|cluster:apiserver_current_inflight_requests:sum:max_over_time:2m|cluster:alertmanager_integrations:max|cluster:telemetry_selected_series:count|openshift:prometheus_tsdb_head_series:sum|openshift:prometheus_tsdb_head_samples_appended_total:sum|monitoring:container_memory_working_set_bytes:sum|namespace_job:scrape_series_added:topk3_sum1h|namespace_job:scrape_samples_post_metric_relabeling:topk3|monitoring:haproxy_server_http_responses_total:sum|rhmi_status|cluster_legacy_scheduler_policy|cluster_master_schedulable|che_workspace_status|che_workspace_started_total|che_workspace_failure_total|che_workspace_start_time_seconds_sum|che_workspace_start_time_seconds_count|cco_credentials_mode|cluster:kube_persistentvolume_plugin_type_counts:sum|visual_web_terminal_sessions_total|acm_managed_cluster_info|cluster:vsphere_vcenter_info:sum|cluster:vsphere_esxi_version_total:sum|cluster:vsphere_node_hw_version_total:sum|openshift:build_by_strategy:sum|rhods_aggregate_availability|rhods_total_users|instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_bytes:sum|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile|jaeger_operator_instances_storage_types|jaeger_operator_instances_strategies|jaeger_operator_instances_agent_strategies|appsvcs:cores_by_product:sum|nto_custom_profiles:count|openshift_csi_share_configmap|openshift_csi_share_secret|openshift_csi_share_mount_failures_total|openshift_csi_share_mount_requests_total",alertstate=~"firing|",quantile=~"0.99|0.99|0.99|"})
      record: cluster:telemetry_selected_series:count


reason:
The ,quantile=~"0.99|0.99|0.99|" selector in the query applies to all the metrics, which is redundant; it should apply only to these 3 metrics:
instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile
instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile
instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile

this is defined in telemetry-config configmap
# oc -n openshift-monitoring get cm telemetry-config -o jsonpath="{.data.metrics\.yaml}" | grep {__name__= 
...
- '{__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}'
...
- '{__name__="olm_resolution_duration_seconds"}'
...

If we use this expr, the query only keeps series with quantile="0.99" (or with no quantile label at all), for example:
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.99", service="catalog-operator-metrics"}  3.022626623

Actually, we also have quantile="0.9" and quantile="0.95" metrics; those series would then be missing from the result.
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.9", service="catalog-operator-metrics"}   0.400797379
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.95", service="catalog-operator-metrics"}  0.424998395
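The label-matching behavior described above can be illustrated with a small Python sketch (my illustration, not part of the bug report): PromQL regex label matchers are fully anchored, and a missing label behaves like an empty string, so quantile=~"0.99|0.99|0.99|" keeps quantile="0.99" series and series without a quantile label, but silently drops quantile="0.9" and quantile="0.95".

```python
import re

# PromQL regex matchers are anchored at both ends, which is what
# re.fullmatch does; a missing label matches as the empty string "".
pattern = re.compile("0.99|0.99|0.99|")

def survives_matcher(quantile_value: str) -> bool:
    """True if a series with this quantile label value passes the matcher."""
    return pattern.fullmatch(quantile_value) is not None

print(survives_matcher("0.99"))  # True  - kept
print(survives_matcher(""))      # True  - the empty alternative keeps unlabeled series
print(survives_matcher("0.9"))   # False - dropped, hence the missing olm series
print(survives_matcher("0.95"))  # False - dropped
```

This is why applying the quantile selector globally changes the count for metrics like olm_resolution_duration_seconds that legitimately expose 0.9 and 0.95 quantiles.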
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-02-22-093600

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:

Expected results:


Additional info:

Comment 2 Simon Pasquier 2022-03-16 10:42:45 UTC
Indeed the rule expression isn't completely accurate since it may count more series than what is actually sent to telemeter. Something like this would work:

# series with 1 label selector on the metric name only
count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|..."})
+
# series with more than one label selector  
count({__name__=~"ALERTS",alertstate="firing"}) + count({__name__=~"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}) + ...
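The shape of that fix, counting each selector group separately and summing the results instead of applying every label selector to every metric, can be modeled with a toy Python sketch (the series and selector groups below are made up for illustration, not the real telemeter allow-list):

```python
# Each series is a (metric name, labels) pair; real telemeter matchers are
# PromQL selectors, but plain dict comparisons are enough to show the idea.
series = [
    ("cluster_version", {}),
    ("ALERTS", {"alertstate": "firing"}),
    ("ALERTS", {"alertstate": "pending"}),  # filtered out, must not be counted
    ("instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",
     {"quantile": "0.99"}),
    ("olm_resolution_duration_seconds", {"quantile": "0.9"}),  # name-only selector keeps it
]

def count(names, **labels):
    """Count series whose name is in `names` and whose labels match exactly."""
    return sum(
        1 for name, lbls in series
        if name in names and all(lbls.get(k) == v for k, v in labels.items())
    )

# Group 1: metrics selected by name only; groups 2 and 3: extra label selectors.
name_only = {"cluster_version", "olm_resolution_duration_seconds"}
total = (
    count(name_only)
    + count({"ALERTS"}, alertstate="firing")
    + count({"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile"},
            quantile="0.99")
)
print(total)  # 4
```

Summing per-group counts keeps the quantile="0.9" olm series (selected by name only) while still excluding the pending alert, which a single global selector cannot do.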

Comment 3 Alexander Demicev 2022-04-05 11:19:08 UTC
Hi,
It looks like the issue appears in a bunch of techpreview jobs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-cap[…]operator-main-e2e-aws-capi-techpreview/1510920053355188224
This is currently blocking our CI.

Comment 4 Simon Pasquier 2022-04-05 12:40:23 UTC
an example with a full link: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-capi-operator/46/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview/1511259398029185024

I suspect that even if we fix the recording rule, the CI would still fail. The test has a hardcoded value of 600 for the maximum number of series expected to be sent to telemeter, and that value probably needs an update since we're continuously adding new metrics.

Comment 5 Simon Pasquier 2022-04-07 09:30:49 UTC
Thinking more about the recording rule itself, it would be less error-prone to use the metrics from telemeter-client, which capture exactly the number of samples being sent:
max(federate_samples - federate_filtered_samples)

Obviously this would have to be revisited once/if we switch to Prometheus remote write but there's no concrete date for now.

Comment 8 Junqi Zhao 2022-06-13 06:52:22 UTC
Checked with 4.11.0-0.nightly-2022-06-11-120123: the expression has been changed to the one below, and the result is the same as with the previous correct expression.
# oc -n openshift-monitoring get PrometheusRule telemetry -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-06-13T03:02:13Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "21091"
  uid: 1b3bb6fb-1d3d-4983-9cb1-82660c126b92
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: max(federate_samples - federate_filtered_samples)
      record: cluster:telemetry_selected_series:count

Comment 13 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

