Bug 2073112 - Prometheus (uwm) externalLabels not showing always in alerts.
Summary: Prometheus (uwm) externalLabels not showing always in alerts.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks: 2118303
 
Reported: 2022-04-07 16:19 UTC by German Parente
Modified: 2022-08-26 13:29 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, UWM users would sometimes not see external labels in their alerts even though they had configured the UWM Prometheus instance to add them. This was caused by the configuration not being propagated to Thanos Ruler, so an alert whose expression queried a metric not provided by the UWM Prometheus instance would not carry the external labels. With this update, CMO properly propagates the external labels configured for UWM Prometheus to Thanos Ruler, which resolves the issue.
Clone Of:
Environment:
Last Closed: 2022-08-10 11:05:18 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift cluster-monitoring-operator pull 1645 | 0 | None | open | Bug 2073112: Adds UWM extrenalLabels from Prometheus to ThanosRuler labels | 2022-04-21 10:10:52 UTC
Red Hat Product Errata RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:05:41 UTC

Description German Parente 2022-04-07 16:19:19 UTC
Description of problem:

Some clarification is needed for the following situation:

1) define externalLabels at UWM level:

oc get cm user-workload-monitoring-config -n openshift-user-workload-monitoring -o yaml
apiVersion: v1
data:
  config.yaml: |
    prometheus:
      externalLabels:
        labelmy: test
kind: ConfigMap
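
For reference, a complete manifest for this ConfigMap looks roughly like the following (a sketch; the name and namespace match the oc get command above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      # external label pair that the UWM Prometheus instance is expected
      # to attach to the alerts and series it produces
      externalLabels:
        labelmy: test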

2) define PrometheusRules as this one:


oc get PrometheusRules -n ns1 -o yaml 
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    creationTimestamp: "2022-04-01T09:25:32Z"
    generation: 1
    name: example-alert
    namespace: ns1
    resourceVersion: "492473"
    uid: a8f58819-1131-40bb-995a-eafc62978cc5
  spec:
    groups:
    - name: oneexample
      rules:
      - alert: VersionAlert
        expr: version{job="prometheus-example-app"} == 1
        labels:
          mylabel: nada
          severity: critical

3) once the former alert is firing, check the alert labels:

oc exec alertmanager-main-0 -- amtool --alertmanager.url http://localhost:9093 alert query VersionAlert --output=json  | jq

we can see both the Prometheus-level external label and the rule-level labels:

    "labels": {
      "alertname": "VersionAlert",
      "endpoint": "web",
      "instance": "10.128.2.111:8080",
      "job": "prometheus-example-app",
      "labelmy": "test",
      "mylabel": "nada",
      "namespace": "ns1",
      "pod": "prometheus-example-app-7ffcdd457c-4b5hm",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app",
      "severity": "critical",
      "version": "v0.1.0"
    }

4) use an aggregating expression like this one in an alerting rule:

sum by (endpoint,instance,job,namespace,pod,prometheus,service) (up{job="prometheus-example-app"}) ==1

we see the resulting labels as follows (note that labelmy is missing):


    "labels": {
      "alertname": "AlertTestTest",
      "endpoint": "web",
      "instance": "10.128.2.111:8080",
      "job": "prometheus-example-app",
      "mylabel": "nada",
      "namespace": "ns1",
      "pod": "prometheus-example-app-7ffcdd457c-4b5hm",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app",
      "severity": "critical"

So, the externalLabels configured at the Prometheus level are not shown.

It seems there's a documentation bug upstream reflecting this:

https://github.com/openshift/openshift-docs/issues/44324

We need to clarify whether this is indeed a documentation bug, explain why it happens, and describe the cases in which the external labels do not appear consistently.

Version-Release number of selected component (if applicable): 4.10

Comment 1 Joao Marcal 2022-04-12 15:23:09 UTC
After investigating, we have discovered that the customer can update their PrometheusRule resources to include, in the "by" aggregation, the external label that they want to see in the alert.

Change from this:
sum by (endpoint,instance,job,namespace,pod,prometheus,service) (up{job="prometheus-example-app"}) == 1

To this:
sum by (endpoint,instance,job,namespace,pod,prometheus,service,labelmy) (up{job="prometheus-example-app"}) == 1

The "by" aggregation is discarding the external label. By reading the documentation this behavior is indeed misleading as one would expect the external labels to always show if configured.
This is a potential area of improvement for the monitoring stack.
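
As an aside (not part of the proposed fix), aggregating with "without" instead of "by" also keeps the external label, since "without" drops only the labels it names. A hypothetical rule snippet, assuming the same example app:

    rules:
    - alert: AlertTestTest
      # "without" removes only pod and instance from the result; every other
      # label on the series, including the external label labelmy, is kept
      expr: sum without (pod, instance) (up{job="prometheus-example-app"}) == 1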

Also good to know: external labels will only show on an alert if the alert uses metrics that come from a Prometheus instance configured to add that external label.
For instance, if I configure UWM to add the label "labelmy: test", this label will only appear in alerts that query the UWM Prometheus instance, such as "up{job="prometheus-example-app"} == 1".
An alert with the expression "kube_deployment_status_replicas{job="prometheus-example-app"} == 1" will not show the external labels configured for UWM, since the data for that query is provided by the in-cluster Prometheus instance.

TL;DR: update the rule expression to include the external label, since "by" strips it away.
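
Putting that together, the updated PrometheusRule from the description would look roughly like this (a sketch; the metadata is taken from the resource shown above):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert
  namespace: ns1
spec:
  groups:
  - name: oneexample
    rules:
    - alert: AlertTestTest
      # labelmy is now listed in the "by" clause, so the external label
      # survives the aggregation and shows up on the alert
      expr: sum by (endpoint,instance,job,namespace,pod,prometheus,service,labelmy) (up{job="prometheus-example-app"}) == 1
      labels:
        mylabel: nada
        severity: critical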

Comment 5 hongyan li 2022-04-22 05:56:56 UTC
Test with payload 4.11.0-0.nightly-2022-04-22-002610

Enable user workload monitoring
Deploy the example app
Configure an external label for the user-workload Prometheus
Create an alert rule whose expression uses data provided by the in-cluster Prometheus
For the configuration YAML, see the attachment (a hypothetical sketch follows below)
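
The attachment itself is not reproduced here; a hypothetical reconstruction of such a rule, based on the alert shown below (the rule and group names are made up), might look like:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-alert
  namespace: ns1
spec:
  groups:
  - name: example
    rules:
    - alert: KubeAlert
      # kube_deployment_status_replicas comes from kube-state-metrics via the
      # in-cluster Prometheus, so before the fix the UWM external label would
      # not have appeared on this alert
      expr: kube_deployment_status_replicas{deployment="prometheus-example-app"} == 1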

Query the alert; the external label can be seen:

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq |grep -A10 KubeAlert
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5701    0  5701    0     0   428k      0 --:--:-- --:--:-- --:--:--  428k
        "alertname": "KubeAlert",
        "container": "kube-rbac-proxy-main",
        "deployment": "prometheus-example-app",
        "endpoint": "https-main",
        "job": "kube-state-metrics",
        "namespace": "ns1",
        "prometheus": "openshift-monitoring/k8s",
        "service": "kube-state-metrics"
      },
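
The fix (PR 1645) propagates the UWM externalLabels onto the ThanosRuler resource, so alerts evaluated by Thanos Ruler carry them as well. The rendered resource is expected to look roughly like this (a sketch; the resource name follows the UWM deployment, and the field is spec.labels per the ThanosRuler CRD):

apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: user-workload
  namespace: openshift-user-workload-monitoring
spec:
  labels:
    # label pairs that Thanos Ruler attaches to the alerts it evaluates;
    # after the fix, CMO populates this from the UWM Prometheus externalLabels
    labelmy: test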

Comment 7 hongyan li 2022-04-26 08:44:06 UTC
Added test case
OCP-50241 - Prometheus (uwm) externalLabels not showing always in alerts

Comment 12 errata-xmlrpc 2022-08-10 11:05:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 14 Joao Marcal 2022-08-23 13:17:24 UTC
The backport to 4.10 was merged today: https://github.com/openshift/cluster-monitoring-operator/pull/1742

