Bug 1853264
| Summary: | Metrics produce high unbound cardinality | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lili Cosic <lcosic> |
| Component: | Machine Config Operator | Assignee: | Sinny Kumari <skumari> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | high | CC: | amurdaca, aos-bugs, jerzhang, mkrejci, skumari, sregidor, wking |
| Version: | 4.4 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.13.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-05-17 22:46:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Lili Cosic
2020-07-02 09:57:09 UTC
Let me know if something is not clear! If you can't reproduce this in a cluster, you can see it very clearly if you query the alerts via telemetry. A related PR (but not sufficient to close this bug): https://github.com/openshift/machine-config-operator/pull/2044

*** Bug 1957421 has been marked as a duplicate of this bug. ***

Some of the final fixes for this bug were done in WIP PR https://github.com/openshift/machine-config-operator/pull/2394, which was closed by the bot. That PR needs a rebase, probably some minor fixes, and a review from the monitoring team.

Verified using IPI on AWS, version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2022-12-22-120609 True False 26m Cluster version is 4.13.0-0.nightly-2022-12-22-120609
1) KubeletHealthState alert and mcd_kubelet_state metric
To trigger the error we execute "systemctl stop kubelet.service" on a worker node.
$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath='{.items[0].metadata.name}') -- chroot /host sh -c "systemctl stop kubelet.service; sleep 600; systemctl start kubelet.service"
A KubeletHealthState alert is raised, and the mcd_kubelet_state metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state' | jq
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "mcd_kubelet_state",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.0.139.227:9001",
"job": "machine-config-daemon",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-139-227.us-east-2.compute.internal",
"pod": "machine-config-daemon-kwqzp",
"service": "machine-config-daemon"
},
"value": [
1671704093.511,
"26"
]
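As a complementary, illustrative check (not part of the original verification; the alertstate filter is an assumption), the firing alert itself can be queried from the same Prometheus endpoint:
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -G -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --data-urlencode "query=ALERTS{alertname=\"KubeletHealthState\",alertstate=\"firing\"}" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq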
2) MCCDrainError alert and the mcc_drain_err metric
To trigger this alert we follow the steps in test case: "OCP-56706 - [MCO][MCO-420] Move MCD drain alert into the MCC, revisit error modes"
An MCCDrainError alert is triggered, and the mcc_drain_err metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcc_drain_err' | jq
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "mcc_drain_err",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.128.0.77:9001",
"job": "machine-config-controller",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-223-51.us-east-2.compute.internal",
"pod": "machine-config-controller-5468769874-xnrx2",
"service": "machine-config-controller"
},
"value": [
1671711418.660,
"1"
]
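The individual steps of OCP-56706 are not reproduced here. As a rough, illustrative sketch only (namespace, deployment name, and image are made up), a drain error can typically be forced by protecting a workload with a PodDisruptionBudget that forbids eviction and then applying any MachineConfig to the worker pool (see the example MC in step 3 below); the controller keeps retrying the drain and eventually reports the failure:
$ oc create namespace drain-test
$ oc -n drain-test create deployment blocker --image=quay.io/openshift/origin-hello-openshift --replicas=1
$ oc -n drain-test create poddisruptionbudget blocker-pdb --selector=app=blocker --max-unavailable=0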
3) MCDRebootError alert and mcd_reboots_failed_total metric
To trigger this alert we execute the following commands on a worker node and then apply an MC (an illustrative example MC is sketched below):
$ mount -o remount,rw /usr
$ mv /usr/bin/systemd-run /usr/bin/systemd-run2
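The MC mentioned above can be any change that forces the worker pool to roll out; a minimal illustrative example (the MC name and file path are made up for this sketch):
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-reboot-error
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/mco-test-reboot-error
          mode: 420
          contents:
            source: data:,test
EOF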
An MCDRebootError alert is triggered ONLY FOR 15 MINUTES (then the alert is removed), and the mcd_reboots_failed_total metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_reboots_failed_total' | jq
{
"metric": {
"__name__": "mcd_reboots_failed_total",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.0.151.175:9001",
"job": "machine-config-daemon",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-151-175.us-east-2.compute.internal",
"pod": "machine-config-daemon-dzckk",
"service": "machine-config-daemon"
},
"value": [
1671723733.324,
"1"
]
4) MCDPivotError alert and mcd_pivot_errors_total metric
To trigger this alert we replace the rpm-ostree executable on a worker node, following these steps:
$ mount -o remount,rw /usr
$ mv /usr/bin/rpm-ostree /usr/bin/rpm-ostree2
$ vi /usr/bin/rpm-ostree
The content of the new rpm-ostree file should be:
#!/bin/bash
# Fail any "rebase" invocation to simulate a pivot error;
# pass every other subcommand through to the real binary.
if [ "$1" == "rebase" ]; then
    exit 1
else
    /usr/bin/rpm-ostree2 "$@"
fi
exit $?
$ chmod +x /usr/bin/rpm-ostree
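As an optional sanity check of the wrapper (the "dummy" argument below is only there to exercise the failure branch, it is not a real ostree ref):
$ /usr/bin/rpm-ostree status                      # unrelated subcommands still pass through to the real binary
$ /usr/bin/rpm-ostree rebase dummy; echo "rc=$?"  # the rebase path now fails, which is what produces the pivot error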
A MCDPivotError alert is triggered, and the mcd_pivot_errors_total metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_pivot_errors_total' | jq
{
"metric": {
"__name__": "mcd_pivot_errors_total",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.0.150.111:9001",
"job": "machine-config-daemon",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-150-111.us-east-2.compute.internal",
"pod": "machine-config-daemon-pgxgm",
"service": "machine-config-daemon"
},
"value": [
1671787167.657,
"9"
]
All alerts are triggered, and all metrics are reported without labels that could cause high unbound cardinality.
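As an additional, illustrative sanity check (this query is not part of the original verification), the number of series per MCO metric can be counted directly; with bounded labels it stays in the order of one series per machine-config-daemon or machine-config-controller pod:
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -G -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --data-urlencode "query=count by (__name__)({__name__=~\"mcd_.+|mcc_drain_err\"})" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq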
We would like to note that MCDRebootError now fires ONLY FOR 15 MINUTES after the reboot error; once the 15 minutes are over, the alert is removed. This differs from the previous behavior, where the alert stayed active and was only removed once the node had actually rebooted. Since this is not related to the metric's cardinality, we will move this BZ to VERIFIED status, check the new MCDRebootError behavior with the devs (@skumari), and open a new bug if the new behavior is not intended.
We move this BZ to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:1326