Description of problem:

Your alerts, and in turn the underlying metrics, have very high unbounded cardinality. This can cause bad alerts to fire multiple times, which means users won't trust the alert and will just silence it. It can also cause Prometheus, which stores all of these metrics (series), to "explode" due to unbounded high cardinality.

I noticed the following problems:

- The drain_time label on the mcd_drain metric embeds a timestamp, which produces a unique series for every increment. Please remove it; the timestamp is not used in the alert itself, so it does not seem to be needed. Docs on this are here: https://prometheus.io/docs/practices/naming/#labels

- Many metrics carry an err label, which can produce high-cardinality series because the error message keeps changing. The set of errors/reasons/messages should be bounded; the goal is a predictable number of series.

- The alerting here amounts to an almost bespoke monitoring system; much of it comes down to evaluating > 1 or > 0. These alerts are going to wake users up at night, so you need to make sure they are valid. https://prometheus.io/docs/practices/alerting/#online-serving-systems

- mcd_kubelet_state: is there any reason why you can't use the kubelet's own metrics?

- Minor problems such as not following the best practices around metric naming: https://prometheus.io/docs/practices/naming/#metric-names https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/metrics.go#L35

Some examples of alerts we ship: https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml#L1058-L1145

Feel free to ask for guidance on best practices, happy to help out here!

Version-Release number of selected component (if applicable):
4.4+

How reproducible:
Every time the metrics are incremented.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
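To make the drain_time problem concrete, here is a minimal sketch (plain Python, not MCO code; all names are invented) of why a timestamp-valued label grows the series count without bound:

```python
# Hypothetical sketch: why a timestamp label explodes series count.
# A Prometheus series is identified by its metric name plus its full label
# set, so any label value that changes per event creates a brand-new series.
import time

def series_key(name, labels):
    """Identify a series the way Prometheus does: name + sorted label pairs."""
    return (name,) + tuple(sorted(labels.items()))

def record(store, name, labels):
    """Increment the counter for the series identified by name + labels."""
    key = series_key(name, labels)
    store[key] = store.get(key, 0) + 1

bad, good = {}, {}
for i in range(5):
    # Anti-pattern: a drain_time label carries a unique timestamp per increment.
    record(bad, "mcd_drain", {"drain_time": str(time.time() + i)})
    # Fixed: drop the timestamp; the sample's own timestamp already covers it.
    record(good, "mcd_drain", {})

print(len(bad))   # 5 series: one per unique timestamp, unbounded over time
print(len(good))  # 1 series: cardinality stays constant
```

Prometheus stores a timestamp with every sample anyway, so encoding one in a label only multiplies series.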
Let me know if something is not clear! If you can't reproduce this in a cluster, you can see it very clearly by querying the alerts via telemetry.
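One common way to bound the err label mentioned in the description is to classify errors into a small fixed set of reasons before labeling, instead of storing the raw message. A minimal sketch, with a hypothetical classify helper and invented reason names:

```python
# Hypothetical sketch (names invented): bound the "err" label to a fixed
# reason set instead of the raw, ever-changing error message.
KNOWN_REASONS = ("timeout", "pdb_violation", "eviction_failed")

def classify(err_msg):
    """Map an arbitrary error string onto a small, predictable label set."""
    msg = err_msg.lower()
    for reason in KNOWN_REASONS:
        # Match either the snake_case reason or its spaced form.
        if reason in msg or reason.replace("_", " ") in msg:
            return reason
    return "other"  # every unknown error collapses into one bucket

# Raw messages vary endlessly, but the label space stays at
# len(KNOWN_REASONS) + 1 possible values.
print(classify("drain timeout after 600s"))         # timeout
print(classify("cannot evict pod: PDB violation"))  # pdb_violation
print(classify("disk on fire"))                     # other
```

With this shape, the number of series per metric is known in advance, which is exactly the "predictable number of metrics" goal from the description.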
A related PR (but not sufficient to close): https://github.com/openshift/machine-config-operator/pull/2044
*** Bug 1957421 has been marked as a duplicate of this bug. ***
Some of the final fixes for this bug were done in the WIP PR https://github.com/openshift/machine-config-operator/pull/2394, which was closed by the bot. That PR needs a rebase, probably some minor fixes, and a review from the monitoring team.
Verified using IPI on AWS.

Version:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         26m     Cluster version is 4.13.0-0.nightly-2022-12-22-120609

1) KubeletHealthState alert and the mcd_kubelet_state metric

To trigger the error we execute "systemctl stop kubelet.service" on a worker node:

$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath='{.items[0].metadata.name}') -- chroot /host sh -c "systemctl stop kubelet.service; sleep 600; systemctl start kubelet.service"

A KubeletHealthState alert is raised, and the mcd_kubelet_state metric does not contain any label with error messages or dates that could affect the metric's cardinality:

$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcd_kubelet_state",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.0.139.227:9001",
          "job": "machine-config-daemon",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-139-227.us-east-2.compute.internal",
          "pod": "machine-config-daemon-kwqzp",
          "service": "machine-config-daemon"
        },
        "value": [ 1671704093.511, "26" ]

2) MCCDrainError alert and the mcc_drain_err metric

To trigger this alert we follow the steps in test case "OCP-56706 - [MCO][MCO-420] Move MCD drain alert into the MCC, revisit error modes".

An MCCDrainError alert is triggered, and the mcc_drain_err metric does not contain any label with error messages or dates that could affect the metric's cardinality:

$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcc_drain_err'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcc_drain_err",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.128.0.77:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-223-51.us-east-2.compute.internal",
          "pod": "machine-config-controller-5468769874-xnrx2",
          "service": "machine-config-controller"
        },
        "value": [ 1671711418.660, "1" ]

3) MCDRebootError alert and the mcd_reboots_failed_total metric

To trigger this alert we execute the following commands on a worker node and then apply a MachineConfig:

$ mount -o remount,rw /usr
$ mv /usr/bin/systemd-run /usr/bin/systemd-run2

An MCDRebootError alert is triggered ONLY FOR 15 MINUTES (then the alert is removed), and the mcd_reboots_failed_total metric does not contain any label with error messages or dates that could affect the metric's cardinality:

$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_reboots_failed_total' | jq
{
  "metric": {
    "__name__": "mcd_reboots_failed_total",
    "container": "oauth-proxy",
    "endpoint": "metrics",
    "instance": "10.0.151.175:9001",
    "job": "machine-config-daemon",
    "namespace": "openshift-machine-config-operator",
    "node": "ip-10-0-151-175.us-east-2.compute.internal",
    "pod": "machine-config-daemon-dzckk",
    "service": "machine-config-daemon"
  },
  "value": [ 1671723733.324, "1" ]

4) MCDPivotError alert and the mcd_pivot_errors_total metric

To trigger this alert we replace the rpm-ostree executable on a worker node, following these steps:

$ mount -o remount,rw /usr
$ mv /usr/bin/rpm-ostree /usr/bin/rpm-ostree2
$ vi /usr/bin/rpm-ostree

The content of the new rpm-ostree file should be:

#!/bin/bash
if [ "$1" == "rebase" ]; then
  exit 1
else
  /usr/bin/rpm-ostree2 "$@"
fi
exit $?

$ chmod +x /usr/bin/rpm-ostree

An MCDPivotError alert is triggered, and the mcd_pivot_errors_total metric does not contain any label with error messages or dates that could affect the metric's cardinality:

$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_pivot_errors_total' | jq
{
  "metric": {
    "__name__": "mcd_pivot_errors_total",
    "container": "oauth-proxy",
    "endpoint": "metrics",
    "instance": "10.0.150.111:9001",
    "job": "machine-config-daemon",
    "namespace": "openshift-machine-config-operator",
    "node": "ip-10-0-150-111.us-east-2.compute.internal",
    "pod": "machine-config-daemon-pgxgm",
    "service": "machine-config-daemon"
  },
  "value": [ 1671787167.657, "9" ]
},

All alerts are triggered, and all metrics are reported without high unbounded cardinality.

We would like to remark that the MCDRebootError alert is now triggered ONLY DURING the 15 MINUTES after the reboot error; once those 15 minutes are over, the alert is removed. This is not the previous behavior: before these changes the alert was triggered and was only removed once the node had been rebooted, never before. Since this is not related to the metric's cardinality, we will move this BZ to VERIFIED status and will check the new MCDRebootError behavior with the devs (@skumari); if the new behavior is not intended, we will open a new bug.

We move this BZ to VERIFIED status.
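The manual label inspection above (checking that no label carries error messages or timestamps) could be scripted against the /api/v1/query response shape. A rough sketch; the function name and heuristics are invented and not part of any MCO test suite:

```python
# Hypothetical helper: scan a Prometheus /api/v1/query result for label
# values that look volatile (unix timestamps or long free-text messages).
import re

def volatile_labels(query_result):
    """Return (label, value) pairs whose values look timestamp- or message-like."""
    suspects = []
    for series in query_result["data"]["result"]:
        for label, value in series["metric"].items():
            if re.fullmatch(r"\d{10}(\.\d+)?", value):   # unix-epoch-looking value
                suspects.append((label, value))
            elif " " in value and len(value) > 40:       # long free-text message
                suspects.append((label, value))
    return suspects

# Trimmed-down sample mirroring the mcd_kubelet_state query result above.
sample = {"data": {"result": [{
    "metric": {"__name__": "mcd_kubelet_state", "job": "machine-config-daemon"},
    "value": [1671704093.511, "26"],
}]}}

print(volatile_labels(sample))  # [] -- no volatile labels, cardinality is bounded
```

A non-empty result would flag exactly the kind of drain_time/err labels this bug was about.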
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:1326