Bug 2056802
| Summary: | "enforcedLabelLimit\|enforcedLabelNameLengthLimit\|enforcedLabelValueLengthLimit" do not take effect | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | Jayapriya Pai <janantha> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.11 | CC: | amuller, anpicker, aos-bugs, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:50:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Junqi Zhao
2022-02-22 04:51:02 UTC
Tested with a PR launch of openshift/cluster-monitoring-operator#1350, using the same steps as in comment 0: still the same result as in the bug.

Launched with "launch 4.11.0-0.ci-2022-03-07-013215,openshift/cluster-monitoring-operator#1350"; https://github.com/openshift/prometheus/pull/121 is in this cluster.

1. The user endpoint exposes only one metric, version{version="v0.4.0"}, with value 1, so the label number is 1. Checked from the thanos-querier API, there are 9 labels in total from thanos-querier (including "__name__").

    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=version' | jq
    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          {
            "metric": {
              "__name__": "version",
              "endpoint": "web",
              "instance": "10.128.2.20:8080",
              "job": "prometheus-example-app",
              "namespace": "ns1",
              "pod": "prometheus-example-app-676776dcb9-fllq8",
              "prometheus": "openshift-user-workload-monitoring/user-workload",
              "service": "prometheus-example-app",
              "version": "v0.4.0"
            },
            "value": [
              1645500602.075,
              "1"
            ]
          }
        ]
      }
    }

Tested with each parameter; the settings are shown below.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 1
    enforcedLabelNameLengthLimit: 1
    enforcedLabelValueLengthLimit: 1

The error in the /targets API shows "label_limit exceeded (metric: version, number of label: 8, limit: 1)". The message is wrong: since there are 9 labels in total, it should report "label_limit exceeded (metric: version, number of label: 9, limit: 1)". It does not count the label "version".

Error in the /targets API:

    "labels": {
      "endpoint": "web",
      "instance": "10.131.0.21:8080",
      "job": "prometheus-example-app",
      "namespace": "ns1",
      "pod": "prometheus-example-app-676776dcb9-llg28",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app"
    },
    "scrapePool": "serviceMonitor/ns1/prometheus-example-monitor/0",
    "scrapeUrl": "http://10.131.0.21:8080/metrics",
    "globalUrl": "http://10.131.0.21:8080/metrics",
    "lastError": "label_limit exceeded (metric: version, number of label: 8, limit: 1)",
    "lastScrape": "2022-03-07T07:44:20.443436825Z",
    "lastScrapeDuration": 0.002481978,
    "health": "down"
    },

2. Update enforcedLabelLimit to 8: there is also an error in the targets API.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 8
    enforcedLabelNameLengthLimit: 1
    enforcedLabelValueLengthLimit: 1

    "labels": {
      "endpoint": "web",
      "instance": "10.131.0.21:8080",
      "job": "prometheus-example-app",
      "namespace": "ns1",
      "pod": "prometheus-example-app-676776dcb9-llg28",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app"
    },
    "scrapePool": "serviceMonitor/ns1/prometheus-example-monitor/0",
    "scrapeUrl": "http://10.131.0.21:8080/metrics",
    "globalUrl": "http://10.131.0.21:8080/metrics",
    "lastError": "label_name_length_limit exceeded (metric: version, label: {__name__ version}, name length: 8, limit: 1)",
    "lastScrape": "2022-03-07T09:40:20.443332988Z",
    "lastScrapeDuration": 0.003412566,
    "health": "down"
    },

The error should be "label_limit exceeded (metric: version, number of label: 9, limit: 1)".

3. Update enforcedLabelLimit to 9.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 9
    enforcedLabelNameLengthLimit: 1
    enforcedLabelValueLengthLimit: 1

The error in the targets API is "label_name_length_limit exceeded (metric: version, label: {__name__ version}, name length: 8, limit: 1)". Changing it to "label_name_length_limit exceeded (metric: version, label: {__name__}, name length: 8, limit: 1)" would be better.

    "labels": {
      "endpoint": "web",
      "instance": "10.131.0.21:8080",
      "job": "prometheus-example-app",
      "namespace": "ns1",
      "pod": "prometheus-example-app-676776dcb9-llg28",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app"
    },
    "scrapePool": "serviceMonitor/ns1/prometheus-example-monitor/0",
    "scrapeUrl": "http://10.131.0.21:8080/metrics",
    "globalUrl": "http://10.131.0.21:8080/metrics",
    "lastError": "label_name_length_limit exceeded (metric: version, label: {__name__ version}, name length: 8, limit: 1)",
    "lastScrape": "2022-03-07T09:20:50.44352258Z",
    "lastScrapeDuration": 0.003263413,
    "health": "down"
    },

4. Update enforcedLabelNameLengthLimit to 8.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 9
    enforcedLabelNameLengthLimit: 8
    enforcedLabelValueLengthLimit: 1

The error is "label_value_length_limit exceeded (metric: version, label: {__name__ version}, value length: 7, limit: 1)". Changing it to "label_value_length_limit exceeded (metric: version, label: {__name__}, value:{version}, value length: 7, limit: 1)" would be better.

    "labels": {
      "endpoint": "web",
      "instance": "10.131.0.21:8080",
      "job": "prometheus-example-app",
      "namespace": "ns1",
      "pod": "prometheus-example-app-676776dcb9-llg28",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app"
    },
    "scrapePool": "serviceMonitor/ns1/prometheus-example-monitor/0",
    "scrapeUrl": "http://10.131.0.21:8080/metrics",
    "globalUrl": "http://10.131.0.21:8080/metrics",
    "lastError": "label_value_length_limit exceeded (metric: version, label: {__name__ version}, value length: 7, limit: 1)",
    "lastScrape": "2022-03-07T09:27:37.12488813Z",
    "lastScrapeDuration": 0.003259243,
    "health": "down"
    }

5. Update enforcedLabelValueLengthLimit to 7.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 9
    enforcedLabelNameLengthLimit: 8
    enforcedLabelValueLengthLimit: 7

The error message "label: {instance 10.131.0.21:8080}" is also confusing; "label: {instance}, value: {10.131.0.21:8080}" would be better.

    "labels": {
      "endpoint": "web",
      "instance": "10.131.0.21:8080",
      "job": "prometheus-example-app",
      "namespace": "ns1",
      "pod": "prometheus-example-app-676776dcb9-llg28",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app"
    },
    "scrapePool": "serviceMonitor/ns1/prometheus-example-monitor/0",
    "scrapeUrl": "http://10.131.0.21:8080/metrics",
    "globalUrl": "http://10.131.0.21:8080/metrics",
    "lastError": "label_value_length_limit exceeded (metric: version, label: {instance 10.131.0.21:8080}, value length: 16, limit: 7)",
    "lastScrape": "2022-03-07T09:32:50.443931585Z",
    "lastScrapeDuration": 0.003819909,
    "health": "down"
    },

6. The longest label value is "openshift-user-workload-monitoring/user-workload", length is 48. Update enforcedLabelValueLengthLimit to 48.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 9
    enforcedLabelNameLengthLimit: 8
    enforcedLabelValueLengthLimit: 48

The error is:

    "labels": {
      "endpoint": "web",
      "instance": "10.131.0.21:8080",
      "job": "prometheus-example-app",
      "namespace": "ns1",
      "pod": "prometheus-example-app-676776dcb9-llg28",
      "prometheus": "openshift-user-workload-monitoring/user-workload",
      "service": "prometheus-example-app"
    },
    "scrapePool": "serviceMonitor/ns1/prometheus-example-monitor/0",
    "scrapeUrl": "http://10.131.0.21:8080/metrics",
    "globalUrl": "http://10.131.0.21:8080/metrics",
    "lastError": "label_name_length_limit exceeded (metric: version, label: {namespace ns1}, name length: 9, limit: 8)",
    "lastScrape": "2022-03-07T09:36:07.125599191Z",
    "lastScrapeDuration": 0.001726719,
    "health": "down"
    },

Since enforcedLabelNameLengthLimit and enforcedLabelValueLengthLimit are checked sequentially for each metric, the error "(metric: version, label: {namespace ns1}, name length: 9, limit: 8)" makes sense, but "(metric: version, label: {namespace}, name length: 9, limit: 8)" would be better.

7. The longest label name is prometheus, length is 9. Update enforcedLabelNameLengthLimit to 9.

    # oc -n openshift-user-workload-monitoring get prometheus user-workload -oyaml | grep -E "enforcedLabelLimit|enforcedLabelNameLengthLimit|enforcedLabelValueLengthLimit"
    enforcedLabelLimit: 9
    enforcedLabelNameLengthLimit: 9
    enforcedLabelValueLengthLimit: 48

No error; this is expected.

In summary:
1. We should confirm whether the enforcedLabelLimit setting is applied only to the labels added by thanos-querier and does not include the labels exposed by user metrics.
2. The error messages should be changed so that they are easier to understand.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
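
The sequence of errors seen in steps 1 to 7 can be reproduced with a small simulation. The following is a minimal sketch, not the Prometheus source: the function name `verify_label_limits` and the `LABELS` dictionary are hypothetical, and the sketch assumes (a) the label count is checked first, (b) labels are then visited in name order with the name length checked before the value length, and (c) the `prometheus` label returned by thanos-querier is not part of the label set checked at scrape time, which would explain the count of 8 rather than 9.

```python
from typing import Dict, Optional

# Assumed scrape-time label set for the "version" metric (8 labels).
# The "prometheus" label seen via thanos-querier is deliberately left out;
# that it is not checked at scrape time is an assumption of this sketch.
LABELS: Dict[str, str] = {
    "__name__": "version",
    "endpoint": "web",
    "instance": "10.131.0.21:8080",
    "job": "prometheus-example-app",
    "namespace": "ns1",
    "pod": "prometheus-example-app-676776dcb9-llg28",
    "service": "prometheus-example-app",
    "version": "v0.4.0",
}


def verify_label_limits(
    labels: Dict[str, str],
    label_limit: int = 0,
    name_length_limit: int = 0,
    value_length_limit: int = 0,
) -> Optional[str]:
    """Return the first limit violation as a message, or None if all pass.

    A limit of 0 means "no limit". Message wording mirrors the lastError
    strings quoted in the report.
    """
    metric = labels.get("__name__", "")

    # 1. Total number of labels.
    if label_limit > 0 and len(labels) > label_limit:
        return (
            f"label_limit exceeded (metric: {metric}, "
            f"number of label: {len(labels)}, limit: {label_limit})"
        )

    # 2. Per label, in name order: name length first, then value length.
    for name in sorted(labels):
        value = labels[name]
        if name_length_limit > 0 and len(name) > name_length_limit:
            return (
                f"label_name_length_limit exceeded (metric: {metric}, "
                f"label: {{{name} {value}}}, name length: {len(name)}, "
                f"limit: {name_length_limit})"
            )
        if value_length_limit > 0 and len(value) > value_length_limit:
            return (
                f"label_value_length_limit exceeded (metric: {metric}, "
                f"label: {{{name} {value}}}, value length: {len(value)}, "
                f"limit: {value_length_limit})"
            )
    return None


# (enforcedLabelLimit, enforcedLabelNameLengthLimit, enforcedLabelValueLengthLimit)
# as set in steps 1 to 7 of the report.
STEPS = [(1, 1, 1), (8, 1, 1), (9, 1, 1), (9, 8, 1), (9, 8, 7), (9, 8, 48), (9, 9, 48)]

for limits in STEPS:
    print(limits, "->", verify_label_limits(LABELS, *limits))
```

Under these assumptions the sketch yields the same messages as the `lastError` fields above, including no error for the final settings, where the longest name in the assumed set is `namespace` (9 characters). It also shows why fixing one limit only surfaces the next violation: the function returns at the first failing check.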