Description of problem:

The bug was found while reviewing https://github.com/openshift/openshift-docs/pull/42163: the recording rule "cluster:telemetry_selected_series:count" is wrong and should be updated to the correct expr once the doc is approved.

# oc get prometheusrules telemetry -n openshift-monitoring -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-02-23T23:20:35Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "20147"
  uid: 979901ae-b88d-4ccc-8f94-734ca846a1d2
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|cluster_version_available_updates|cluster_operator_up|cluster_operator_conditions|cluster_version_payload|cluster_installer|cluster_infrastructure_provider|cluster_feature_set|instance:etcd_object_counts:sum|ALERTS|code:apiserver_request_total:rate:sum|cluster:capacity_cpu_cores:sum|cluster:capacity_memory_bytes:sum|cluster:cpu_usage_cores:sum|cluster:memory_usage_bytes:sum|openshift:cpu_usage_cores:sum|openshift:memory_usage_bytes:sum|workload:cpu_usage_cores:sum|workload:memory_usage_bytes:sum|cluster:virt_platform_nodes:sum|cluster:node_instance_type_count:sum|cnv:vmi_status_running:count|cluster:vmi_request_cpu_cores:sum|node_role_os_version_machine:cpu_capacity_cores:sum|node_role_os_version_machine:cpu_capacity_sockets:sum|subscription_sync_total|olm_resolution_duration_seconds|csv_succeeded|csv_abnormal|cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum|cluster:kubelet_volume_stats_used_bytes:provisioner:sum|ceph_cluster_total_bytes|ceph_cluster_total_used_raw_bytes|ceph_health_status|job:ceph_osd_metadata:count|job:kube_pv:count|job:ceph_pools_iops:total|job:ceph_pools_iops_bytes:total|job:ceph_versions_running:count|job:noobaa_total_unhealthy_buckets:sum|job:noobaa_bucket_count:sum|job:noobaa_total_object_count:sum|noobaa_accounts_num|noobaa_total_usage|console_url|cluster:network_attachment_definition_instances:max|cluster:network_attachment_definition_enabled_instance_up:max|cluster:ingress_controller_aws_nlb_active:sum|insightsclient_request_send_total|cam_app_workload_migrations|cluster:apiserver_current_inflight_requests:sum:max_over_time:2m|cluster:alertmanager_integrations:max|cluster:telemetry_selected_series:count|openshift:prometheus_tsdb_head_series:sum|openshift:prometheus_tsdb_head_samples_appended_total:sum|monitoring:container_memory_working_set_bytes:sum|namespace_job:scrape_series_added:topk3_sum1h|namespace_job:scrape_samples_post_metric_relabeling:topk3|monitoring:haproxy_server_http_responses_total:sum|rhmi_status|cluster_legacy_scheduler_policy|cluster_master_schedulable|che_workspace_status|che_workspace_started_total|che_workspace_failure_total|che_workspace_start_time_seconds_sum|che_workspace_start_time_seconds_count|cco_credentials_mode|cluster:kube_persistentvolume_plugin_type_counts:sum|visual_web_terminal_sessions_total|acm_managed_cluster_info|cluster:vsphere_vcenter_info:sum|cluster:vsphere_esxi_version_total:sum|cluster:vsphere_node_hw_version_total:sum|openshift:build_by_strategy:sum|rhods_aggregate_availability|rhods_total_users|instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_bytes:sum|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile|jaeger_operator_instances_storage_types|jaeger_operator_instances_strategies|jaeger_operator_instances_agent_strategies|appsvcs:cores_by_product:sum|nto_custom_profiles:count|openshift_csi_share_configmap|openshift_csi_share_secret|openshift_csi_share_mount_failures_total|openshift_csi_share_mount_requests_total",alertstate=~"firing|",quantile=~"0.99|0.99|0.99|"})
      record: cluster:telemetry_selected_series:count

Reason: the ,quantile=~"0.99|0.99|0.99|" matcher in the query applies to all of the metrics, which is redundant; it should apply only to these 3 metrics:

instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile
instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile
instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile

This is defined in the telemetry-config configmap:

# oc -n openshift-monitoring get cm telemetry-config -o jsonpath="{.data.metrics\.yaml}" | grep {__name__=
...
- '{__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}'
- '{__name__="instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}'
...
- '{__name__="olm_resolution_duration_seconds"}'
...

With this expr, the query only keeps the quantile="0.99" series, for example:

olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.99", service="catalog-operator-metrics"} 3.022626623

We actually also have quantile="0.9" and quantile="0.95" series of this metric, and these would then not be in the result:

olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.9", service="catalog-operator-metrics"} 0.400797379
olm_resolution_duration_seconds{container="catalog-operator", endpoint="https-metrics", instance="10.128.0.94:8443", job="catalog-operator-metrics", namespace="openshift-operator-lifecycle-manager", outcome="failed", pod="catalog-operator-665c654dc4-2drrw", quantile="0.95", service="catalog-operator-metrics"} 0.424998395

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-02-22-093600

How reproducible:
always

Steps to Reproduce:
1. see the description
2.
3.

Actual results:

Expected results:

Additional info:
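To illustrate the filtering effect described above, a hedged PromQL sketch (for illustration only, run against the in-cluster Prometheus, using olm_resolution_duration_seconds as the example metric):

# counts all series of the metric, including the quantile="0.9" and quantile="0.95" ones
count(olm_resolution_duration_seconds)
# mimics the matcher from the recording rule: only series with quantile="0.99"
# (or with no quantile label at all) survive, so this count is expected to be lower
count(olm_resolution_duration_seconds{quantile=~"0.99|0.99|0.99|"})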
The following docs also need to be updated:
https://github.com/openshift/cluster-monitoring-operator/Documentation/telemeter_query
https://github.com/openshift/cluster-monitoring-operator/Documentation/sample-metrics.md
Indeed, the rule expression isn't completely accurate since it may count more series than what is effectively sent to telemeter. Something like this would work:

# series with 1 label selector on the metric name only
count({__name__=~"cluster:usage:.*|count:up0|count:up1|cluster_version|..."})
+
# series with more than one label selector
count({__name__=~"ALERTS",alertstate="firing"})
+
count({__name__=~"instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"})
+
...
Hi, it looks like the issue appears in a bunch of techpreview jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-cap[…]operator-main-e2e-aws-capi-techpreview/1510920053355188224

This is currently blocking our CI.
An example with a full link:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-capi-operator/46/pull-ci-openshift-cluster-capi-operator-main-e2e-aws-capi-techpreview/1511259398029185024

I suspect that even if we fix the recording rule, the CI would still fail. The test has a hardcoded value of 600 for what is considered the maximum number of series sent to telemeter, but that probably needs an update since we're continuously adding new metrics.
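For reference, a hedged sketch of the kind of assertion the e2e test makes (the exact query and time window live in the test code and are assumptions here; 600 is the hardcoded limit mentioned above):

# hypothetical check, not the actual test code: the comparison drops the series
# (empty result) whenever the recorded series count exceeds the 600 limit
max_over_time(cluster:telemetry_selected_series:count[2h]) <= 600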
Thinking more about the recording rule itself, it would be less error-prone to use the metrics from telemeter-client, which capture exactly the number of samples being sent:

max(federate_samples - federate_filtered_samples)

Obviously this would have to be revisited if/when we switch to Prometheus remote write, but there's no concrete date for that yet.
Checked with 4.11.0-0.nightly-2022-06-11-120123: the expression has now been changed to the one below, and the result is the same as with the previous correct expression.

# oc -n openshift-monitoring get PrometheusRule telemetry -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-06-13T03:02:13Z"
  generation: 1
  name: telemetry
  namespace: openshift-monitoring
  resourceVersion: "21091"
  uid: 1b3bb6fb-1d3d-4983-9cb1-82660c126b92
spec:
  groups:
  - name: telemeter.rules
    rules:
    - expr: max(federate_samples - federate_filtered_samples)
      record: cluster:telemetry_selected_series:count
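One way to double-check this (a hedged sketch, assuming the default prometheus-k8s-0 pod name and that curl is available in the prometheus container) is to query both the recorded series and the raw expression and compare the returned values:

# value produced by the new recording rule
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=cluster:telemetry_selected_series:count'
# raw expression the rule is based on; the values should match
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=max(federate_samples - federate_filtered_samples)'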
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069