Bug 1853264
| Summary: | Metrics produce high unbound cardinality | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lili Cosic <lcosic> |
| Component: | Machine Config Operator | Assignee: | Sinny Kumari <skumari> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | high | CC: | amurdaca, aos-bugs, jerzhang, mkrejci, skumari, sregidor, wking |
| Version: | 4.4 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.13.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-05-17 22:46:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Lili Cosic
2020-07-02 09:57:09 UTC
Let me know if something is not clear! If you can't reproduce this in a cluster, you can see it very clearly if you query the alerts via telemetry. A related PR (but not sufficient to close this bug): https://github.com/openshift/machine-config-operator/pull/2044

*** Bug 1957421 has been marked as a duplicate of this bug. ***

Some of the final fixes for this bug were done in WIP PR https://github.com/openshift/machine-config-operator/pull/2394, which was closed by the bot. That PR needs a rebase, probably some minor fixes, and a review from the monitoring team.

Verified using IPI on AWS, version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2022-12-22-120609 True False 26m Cluster version is 4.13.0-0.nightly-2022-12-22-120609
1) KubeletHealthState alert and mcd_kubelet_state metric
To trigger the error we execute "systemctl stop kubelet.service" on a worker node.
$ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath='{.items[0].metadata.name}') -- chroot /host sh -c "systemctl stop kubelet.service; sleep 600; systemctl start kubelet.service"
A KubeletHealthState alert is raised, and the mcd_kubelet_state metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_kubelet_state' | jq
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "mcd_kubelet_state",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.0.139.227:9001",
"job": "machine-config-daemon",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-139-227.us-east-2.compute.internal",
"pod": "machine-config-daemon-kwqzp",
"service": "machine-config-daemon"
},
"value": [
1671704093.511,
"26"
]
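As a complementary, illustrative check (not part of the original verification; the alertstate filter is an assumption), the firing alert itself can be queried from the same Prometheus endpoint:
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -G -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --data-urlencode "query=ALERTS{alertname=\"KubeletHealthState\",alertstate=\"firing\"}" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq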
2) MCCDrainError alert and the mcc_drain_err metric
To trigger this alert we follow the steps in test case: "OCP-56706 - [MCO][MCO-420] Move MCD drain alert into the MCC, revisit error modes"
An MCCDrainError alert is triggered, and the mcc_drain_err metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcc_drain_err' | jq
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "mcc_drain_err",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.128.0.77:9001",
"job": "machine-config-controller",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-223-51.us-east-2.compute.internal",
"pod": "machine-config-controller-5468769874-xnrx2",
"service": "machine-config-controller"
},
"value": [
1671711418.660,
"1"
]
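The individual steps of OCP-56706 are not reproduced here. As a rough, illustrative sketch only (namespace, deployment name, and image are made up), a drain error can typically be forced by protecting a workload with a PodDisruptionBudget that forbids eviction and then applying any MachineConfig to the worker pool (see the example MC in step 3 below); the controller keeps retrying the drain and eventually reports the failure:
$ oc create namespace drain-test
$ oc -n drain-test create deployment blocker --image=quay.io/openshift/origin-hello-openshift --replicas=1
$ oc -n drain-test create poddisruptionbudget blocker-pdb --selector=app=blocker --max-unavailable=0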
3) MCDRebootError alert and mcd_reboots_failed_total metric
To trigger this alert we execute the following commands on a worker node and then apply an MC (an illustrative example MC is sketched below):
$ mount -o remount,rw /usr
$ mv /usr/bin/systemd-run /usr/bin/systemd-run2
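The MC mentioned above can be any change that forces the worker pool to roll out; a minimal illustrative example (the MC name and file path are made up for this sketch):
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-reboot-error
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/mco-test-reboot-error
          mode: 420
          contents:
            source: data:,test
EOF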
An MCDRebootError alert is triggered ONLY FOR 15 MINUTES (then the alert is removed), and the mcd_reboots_failed_total metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_reboots_failed_total' | jq
{
"metric": {
"__name__": "mcd_reboots_failed_total",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.0.151.175:9001",
"job": "machine-config-daemon",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-151-175.us-east-2.compute.internal",
"pod": "machine-config-daemon-dzckk",
"service": "machine-config-daemon"
},
"value": [
1671723733.324,
"1"
]
4) MCDPivotError alert and mcd_pivot_errors_total metric
To trigger this alert we replace the rpm-ostree executable on a worker node, following these steps:
$ mount -o remount,rw /usr
$ mv /usr/bin/rpm-ostree /usr/bin/rpm-ostree2
$ vi /usr/bin/rpm-ostree
The content of the new rpm-ostree file should be:
#!/bin/bash
# Fail any "rebase" invocation to simulate a pivot error;
# pass every other subcommand through to the real binary.
if [ "$1" == "rebase" ]; then
    exit 1
else
    /usr/bin/rpm-ostree2 "$@"
fi
exit $?
$ chmod +x /usr/bin/rpm-ostree
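As an optional sanity check of the wrapper (the "dummy" argument below is only there to exercise the failure branch, it is not a real ostree ref):
$ /usr/bin/rpm-ostree status                      # unrelated subcommands still pass through to the real binary
$ /usr/bin/rpm-ostree rebase dummy; echo "rc=$?"  # the rebase path now fails, which is what produces the pivot error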
A MCDPivotError alert is triggered, and the mcd_pivot_errors_total metric does not contain any labels with error messages or dates that could affect the metric's cardinality.
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=mcd_pivot_errors_total' | jq
{
"metric": {
"__name__": "mcd_pivot_errors_total",
"container": "oauth-proxy",
"endpoint": "metrics",
"instance": "10.0.150.111:9001",
"job": "machine-config-daemon",
"namespace": "openshift-machine-config-operator",
"node": "ip-10-0-150-111.us-east-2.compute.internal",
"pod": "machine-config-daemon-pgxgm",
"service": "machine-config-daemon"
},
"value": [
1671787167.657,
"9"
]
All alerts are triggered, and all metrics are reported without labels that could cause high unbound cardinality.
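As an additional, illustrative sanity check (this query is not part of the original verification), the number of series per MCO metric can be counted directly; with bounded labels it stays in the order of one series per machine-config-daemon or machine-config-controller pod:
$ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -G -s -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --data-urlencode "query=count by (__name__)({__name__=~\"mcd_.+|mcc_drain_err\"})" https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' | jq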
We would like to note that MCDRebootError now fires ONLY FOR 15 MINUTES after the reboot error; once the 15 minutes are over, the alert is removed. This differs from the previous behavior, where the alert stayed active and was only removed once the node had actually rebooted. Since this is not related to the metric's cardinality, we will move this BZ to VERIFIED status, check the new MCDRebootError behavior with the devs (@skumari), and open a new bug if the new behavior is not intended.
We move this BZ to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:1326