Bug 2073112

Summary: Prometheus (uwm) externalLabels not showing always in alerts.
Product: OpenShift Container Platform Reporter: German Parente <gparente>
Component: MonitoringAssignee: Joao Marcal <jmarcal>
Status: CLOSED ERRATA QA Contact: hongyan li <hongyli>
Severity: low Docs Contact: Brian Burt <bburt>
Priority: medium    
Version: 4.10CC: amuller, anpicker, aos-bugs, bburt, clasohm, cruhm, gekis, hongyli, jfajersk, jmarcal, juzhao, spasquie
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Before this update, UWM users would sometimes not see certain external labels even though they had configured the UWM Prometheus instance to add them. The cause was that this configuration was not propagated to Thanos Querier, so if a user queried a metric not provided by the UWM Prometheus instance, the external label was missing. With this update, CMO now properly propagates the external labels configured in UWM Prometheus to Thanos Ruler, which resolves the issue.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 11:05:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 2118303    

Description German Parente 2022-04-07 16:19:19 UTC
Description of problem:

Some clarification should be needed in the following situation:

1) define externalLabels at UWM level:

oc get cm user-workload-monitoring-config -n openshift-user-workload-monitoring -o yaml
apiVersion: v1
data:
  config.yaml: |
    prometheus:
      externalLabels:
        labelmy: test
kind: ConfigMap

2) define PrometheusRules as this one:

oc get PrometheusRules -n ns1 -o yaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    creationTimestamp: "2022-04-01T09:25:32Z"
    generation: 1
    name: example-alert
    namespace: ns1
    resourceVersion: "492473"
    uid: a8f58819-1131-40bb-995a-eafc62978cc5
  spec:
    groups:
    - name: oneexample
      rules:
      - alert: VersionAlert
        expr: version{job="prometheus-example-app"} == 1
        labels:
          mylabel: nada
          severity: critical

3) once the former alert is firing, check the alert labels:

oc exec alertmanager-main-0 -- amtool --alertmanager.url http://localhost:9093 alert query VersionAlert --output=json  | jq

we can see the labels (at Prometheus level and rule level):

    "labels": {
      "alertname": "VersionAlert",
      "endpoint": "web",
      "instance": "",
      "job": "prometheus-example-app",
      "labelmy": "test",
      "mylabel": "nada",
      "namespace": "ns1",
      "pod": "prometheus-example-app-7ffcdd457c-4b5hm",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app",
      "severity": "critical",
      "version": "v0.1.0"

4) use an expression like this:

sum by (endpoint,instance,job,namespace,pod,prometheus,service) (up{job="prometheus-example-app"}) ==1

we can see the labels as:

    "labels": {
      "alertname": "AlertTestTest",
      "endpoint": "web",
      "instance": "",
      "job": "prometheus-example-app",
      "mylabel": "nada",
      "namespace": "ns1",
      "pod": "prometheus-example-app-7ffcdd457c-4b5hm",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app",
      "severity": "critical"

So, the externalLabels configured at the Prometheus level are not shown.

It seems there's a documentation bug upstream reflecting this:


We need to clarify whether this is indeed a documentation bug, explain the reason, and determine in which cases the external labels are not applied consistently.

Version-Release number of selected component (if applicable): 4.10

Comment 1 Joao Marcal 2022-04-12 15:23:09 UTC
After investigating, we have discovered that the customer can update their PrometheusRule resources to include, in the "by" aggregation, the external label that they want to see in the alert.

Change from this:
sum by (endpoint,instance,job,namespace,pod,prometheus,service) (up{job="prometheus-example-app"}) == 1

To this:
sum by (endpoint,instance,job,namespace,pod,prometheus,service,labelmy) (up{job="prometheus-example-app"}) == 1

The "by" aggregation is discarding the external label. By reading the documentation this behavior is indeed misleading as one would expect the external labels to always show if configured.
This is a potential area of improvement for the monitoring stack.

Also good to know: external labels only appear on an alert if the alert uses metrics that come from a Prometheus instance configured to add that external label.
For instance, if I configure UWM to add the label "labelmy: test", this label will only appear in alerts that query the UWM Prometheus instance, such as "up{job="prometheus-example-app"} == 1".
An alert with an expression such as "kube_deployment_status_replicas{job="prometheus-example-app"} == 1" will not show the external labels configured for UWM, since the data for this query is provided by the in-cluster Prometheus instance.

TL;DR: Update the rule expression to include the external label, since the "by" aggregation drops it.
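The label-dropping behavior described above can be sketched outside Prometheus. The following Python snippet is an illustration, not Prometheus code; the helper sum_by and the sample series are hypothetical, but they mimic how only the labels listed in the grouping clause survive a "sum by" aggregation:

```python
from collections import defaultdict

def sum_by(samples, group_labels):
    """Mimic PromQL's `sum by (...)`: sum values, keeping only group_labels."""
    grouped = defaultdict(float)
    for labels, value in samples:
        # Only labels named in the grouping clause are kept on the result.
        key = tuple(sorted((k, v) for k, v in labels.items() if k in group_labels))
        grouped[key] += value
    return [(dict(key), total) for key, total in grouped.items()]

# One series carrying the external label "labelmy", as in the bug report.
series = [
    ({"job": "prometheus-example-app", "instance": "", "labelmy": "test"}, 1.0),
]

# Grouping without "labelmy" discards the external label from the result.
without = sum_by(series, {"job", "instance"})

# Adding "labelmy" to the grouping clause preserves it, which is the fix
# suggested in this comment.
with_label = sum_by(series, {"job", "instance", "labelmy"})
```

Here `without[0][0]` contains only "job" and "instance", while `with_label[0][0]` also carries "labelmy": "test", matching the behavior seen in the two amtool outputs above.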

Comment 5 hongyan li 2022-04-22 05:56:56 UTC
Test with payload 4.11.0-0.nightly-2022-04-22-002610

Enable user workload monitoring
Deploy the example app
Configure an external label for the user-workload Prometheus
Create an alert rule whose expression uses data provided by the in-cluster Prometheus
For the configuration YAML, see the attachment

Query the alert; the external label can be seen

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main.openshift-monitoring.svc:9094/api/v1/alerts' | jq |grep -A10 KubeAlert
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5701    0  5701    0     0   428k      0 --:--:-- --:--:-- --:--:--  428k
        "alertname": "KubeAlert",
        "container": "kube-rbac-proxy-main",
        "deployment": "prometheus-example-app",
        "endpoint": "https-main",
        "job": "kube-state-metrics",
        "namespace": "ns1",
        "prometheus": "openshift-monitoring/k8s",
        "service": "kube-state-metrics"

Comment 7 hongyan li 2022-04-26 08:44:06 UTC
Added test case
OCP-50241 - Prometheus (uwm) externalLabels not showing always in alerts

Comment 12 errata-xmlrpc 2022-08-10 11:05:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 14 Joao Marcal 2022-08-23 13:17:24 UTC
Backport was merged today to 4.10 https://github.com/openshift/cluster-monitoring-operator/pull/1742