Bug 2034192 - Prometheus fails to insert reporting metrics when the sample limit is met
Summary: Prometheus fails to insert reporting metrics when the sample limit is met
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-20 11:33 UTC by Simon Pasquier
Modified: 2022-03-10 16:35 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if reporting metrics failed due to reaching the configured sample limit, the metrics target would still appear with a status of `Up` in the web console UI even though the metrics were missing. With this release, Prometheus bypasses the sample limit setting for reporting metrics, and the metrics now appear regardless of the sample limit setting.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:35:23 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
- Github openshift cluster-monitoring-operator pull 1522: "Bug 2034192: [bot] Automated dependencies version update" (open, last updated 2021-12-21 07:27:58 UTC)
- Github openshift prometheus pull 117: "Bug 2034192: [bot] Bump openshift/prometheus to v2.32.1" (open, last updated 2021-12-20 13:14:29 UTC)
- Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:35:35 UTC)

Description Simon Pasquier 2021-12-20 11:33:31 UTC
Description of problem:
When a service monitor defines a sample limit (which is possible for user-workload monitoring), the reporting metrics (up, scrape_samples_scraped, ...) may not be inserted by Prometheus if the number of samples exposed by the target is close to the limit.
See https://github.com/prometheus/prometheus/issues/9990 for the details.
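
A quick way to observe the symptom from inside the cluster (a minimal sketch, assuming $token holds a bearer token authorized to query Thanos Querier; comment 5 uses the same pattern for verification):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=up%7Bnamespace%3D%22ns1%22%7D' | jq

On an affected version the query returns an empty result vector for the ns1 target even though the scrape itself succeeds.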

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always

Steps to Reproduce:
1. Follow the OCP documentation to deploy the sample application which exposes only one metric.
https://docs.openshift.com/container-platform/4.9/monitoring/managing-metrics.html#setting-up-metrics-collection-for-user-defined-projects_managing-metrics
2. Add a sample limit of 1 to the application's service monitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: prometheus-example-monitor
  name: prometheus-example-monitor
  namespace: ns1
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  sampleLimit: 1
  selector:
    matchLabels:
      app: prometheus-example-app
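
To apply the change (a sketch; the manifest filename is hypothetical), save the manifest above as servicemonitor.yaml, apply it, and confirm the limit shows up in the generated scrape configuration:

# oc apply -f servicemonitor.yaml
# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep sample_limit

The second command should print "sample_limit: 1" once the config reloader has propagated the change.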


Actual results:
The target is scraped successfully (no target down) but the up metric is missing, as are the other reporting metrics.

Expected results:
Reporting metrics should be present.

Additional info:
Fixed upstream in Prometheus v2.32.1: reporting metrics are no longer counted against the configured sample limit.
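
To check which Prometheus version a cluster is running (a sketch, assuming the prometheus binary is on the container's PATH, as in the upstream image):

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- prometheus --version

Any version at or above v2.32.1 contains the upstream fix; the startup log line quoted in comment 5 carries the same information.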

Comment 5 Junqi Zhao 2021-12-22 03:51:53 UTC
Tested with 4.10.0-0.nightly-2021-12-21-130047, following the steps in comment 0; the up metric is now visible.
# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
ts=2021-12-21T23:54:33.692Z caller=main.go:532 level=info msg="Starting Prometheus" version="(version=2.32.1, branch=rhaos-4.10-rhel-8, revision=2003b6cb83d933ad154a6dcd6bc6b497488b8501)"

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml
scrape_configs:
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
  ...
  - source_labels:
    - __tmp_hash
    regex: 0
    action: keep
  sample_limit: 1
  metric_relabel_configs:
  - target_label: namespace
    replacement: ns1

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=up%7Bnamespace%3D%22ns1%22%7D' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "endpoint": "web",
          "instance": "10.131.0.152:8080",
          "job": "prometheus-example-app",
          "namespace": "ns1",
          "pod": "prometheus-example-app-8659789999-nwh2k",
          "prometheus": "openshift-user-workload-monitoring/user-workload",
          "service": "prometheus-example-app"
        },
        "value": [
          1640145070.945,
          "1"
        ]
      }
    ]
  }
}
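
The other reporting metrics can be spot-checked the same way, for example scrape_samples_scraped (a sketch reusing the $token from the query above; the query string is the URL-encoded form of scrape_samples_scraped{namespace="ns1"}):

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=scrape_samples_scraped%7Bnamespace%3D%22ns1%22%7D' | jq

With the fix in place this should report the target's sample count (1 for the example application, which exposes a single metric).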

Comment 8 errata-xmlrpc 2022-03-10 16:35:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

