Bug 1858991
| Field | Value |
|---|---|
| Summary | invalid syntax error to list PrometheusRule/ServiceMonitor |
| Product | OpenShift Container Platform |
| Reporter | Junqi Zhao <juzhao> |
| Component | Monitoring |
| Assignee | Sergiusz Urbaniak <surbania> |
| Status | CLOSED ERRATA |
| QA Contact | Junqi Zhao <juzhao> |
| Severity | urgent |
| Docs Contact | |
| Priority | urgent |
| Version | 4.6 |
| CC | aabhishe, alegrand, anpicker, dtaylor, erooth, kakkoyun, lcosic, mf.flip, mloibl, oarribas, pkrupa, spasquie, surbania, vrutkovs, wking |
| Target Milestone | --- |
| Keywords | Reopened |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1891815 |
| Environment | [sig-instrumentation] Prometheus when installed on the cluster should have a AlertmanagerReceiversNotConfigured alert in firing state; [sig-instrumentation] Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics; [sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present; [sig-instrumentation][Late] Alerts should have a Watchdog alert in firing state the entire cluster run; test: [sig-instrumentation] Prometheus when installed on the cluster should have a AlertmanagerReceiversNotConfigured alert in firing state; test: [sig-instrumentation] Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics |
| Last Closed | 2020-10-27 16:16:18 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1870287, 1891815 |
| Attachments | |

Hm, strange, I have never seen this in my testing. Can you tell me about your environment so I can try to reproduce it? How long had it been running? Is there anything else specific about the setup? This never shows up in our CI. Is the stack functioning otherwise? Are targets up and metrics scraped?

Tested with 4.6.0-0.nightly-2020-07-21-200036; the error no longer appears, so closing.

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "invalid syntax"
    no result

Tested with 4.6.0-0.nightly-2020-07-23-220427 on a fresh cluster; the issue is reproduced.

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "invalid syntax" | head -n 3
    E0724 02:41:22.544393 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "48466/48467": strconv.ParseUint: parsing "48466/48467": invalid syntax
    E0724 02:41:22.876746 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "48487/48488": strconv.ParseUint: parsing "48487/48488": invalid syntax
    E0724 02:41:24.497890 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "48466/48467": strconv.ParseUint: parsing "48466/48467": invalid syntax

For the prometheus-operator logs, see prometheus-operator/logs/prometheus-operator-54df945b6d-n7ffd-prometheus-operator.log in the dump file. For the alerts, see alerts/alerts.txt in the same file.

Created attachment 1702295: monitoring dump file

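To see at a glance which resource kinds and resourceVersion values are affected, the grep above can be extended to tally the errors. The following is a minimal sketch, assuming the log line format quoted in this comment; the function name `summarize_invalid_syntax` is hypothetical (it is not part of `oc` or prometheus-operator) and it reads the operator logs from standard input.

```shell
#!/bin/sh
# Hypothetical helper, not part of oc or prometheus-operator.
# Reads prometheus-operator logs on stdin and prints one line per
# (resource kind, rejected resourceVersion) pair, followed by how
# many times that pair occurred.
summarize_invalid_syntax() {
    grep 'invalid syntax' |
        # Keep only the kind (e.g. PrometheusRule) and the quoted value
        # that follows "Invalid value:".
        sed -n 's/.*Failed to list \*v1\.\([A-Za-z]*\): resourceVersion: Invalid value: "\([^"]*\)".*/\1 \2/p' |
        sort | uniq -c |
        # uniq -c prints the count first; move it to the end.
        awk '{print $2, $3, $1}'
}
```

It could be fed from the same `oc ... logs ... -c prometheus-operator` command used above, piped into `summarize_invalid_syntax` instead of `grep`.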
I have seen this happen once or twice as well. It seems to occur when the List function of the multilistwatch package gets two resource versions.

Currently in progress but no time to finish it this sprint, due to other higher priority bugzillas. Moving to next sprint.

Currently in progress but no time to finish it this sprint, due to other higher priority bugzillas. Moving to next sprint.

    E0803 02:35:11.273843 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "100350/100352/100353": strconv.ParseUint: parsing "100350/100352/100353": invalid syntax
    E0803 02:35:39.898425 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "100300/100301": strconv.ParseUint: parsing "100300/100301": invalid syntax

When the error shows, the following alerts are triggered:

- KubeAPIErrorsHigh: API server is returning errors for 85.71% of requests for LIST servicemonitors.
- KubeAPIErrorsHigh: API server is returning errors for 100% of requests for LIST prometheusrules.
- PrometheusOperatorListErrors: Errors while performing List operations in controller prometheus in openshift-monitoring namespace.
- PrometheusOperatorListErrors: Errors while performing List operations in controller thanos in openshift-monitoring namespace.

Copy-pasting here, as this issue is more specific than https://bugzilla.redhat.com/show_bug.cgi?id=1856189:

We now have high confidence that, time-wise, rewriting multilistwatcher is a bigger effort than anticipated. Hence we now have the following strategy:

- We continue to observe failures in CI.
- We prepare another hotfix, where cluster-monitoring-operator watches prometheus-operator reconcile errors and restarts the pod (with a max count). We observed that restarting prometheus-operator fixes things. We will merge that hotfix only if we don't make the upstream fix in time.
- We work in parallel on an upstream fix in prometheus-operator. Once upstream is ready, do a 0.40.z release and merge that one into a 4.6.z release.

The upstream fix is merged: https://github.com/prometheus-operator/prometheus-operator/pull/3440

The downstream backport is now ready to be reviewed: https://github.com/openshift/prometheus-operator/pull/86

@junqi, just a quick verification point from a recent e2e run:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_prometheus-operator/86/pull-ci-openshift-prometheus-operator-master-e2e-aws/1303218507856482304/artifacts/e2e-aws/gather-extra/pods/openshift-monitoring_prometheus-operator-b6f8b657-9lbbm_prometheus-operator.log

There are no more "invalid syntax" log entries. Note that "E0908 07:17:40.535604 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)" is not related to this and is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1856189.

Tested with 4.6.0-0.nightly-2020-09-09-173545; there are no "invalid syntax" log entries.

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "invalid syntax"
    no result

The remaining "Failed to list" errors should be related to bug 1856189:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to list"
    E0909 23:41:41.327457 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-service-ca-operator"
    E0909 23:41:41.327497 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-machine-api"
    E0909 23:41:41.327524 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-machine-config-operator"
    E0909 23:41:41.327546 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-marketplace"

*** Bug 1890293 has been marked as a duplicate of this bug. ***

*** Bug 1890857 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

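The "Failed to list" entries quoted in the verification comment are permission errors ("is forbidden") rather than parse errors, which is why they are tracked separately in bug 1856189. As a rough sketch for triaging them, assuming the log format shown in that comment, the hypothetical helper `forbidden_namespaces` below (not part of `oc` or prometheus-operator) lists the distinct namespaces named in such errors from logs on standard input.

```shell
#!/bin/sh
# Hypothetical helper, not part of oc or prometheus-operator.
# Reads prometheus-operator logs on stdin and prints, one per line and
# without repeats, the namespaces in which a list call was forbidden.
forbidden_namespaces() {
    grep 'is forbidden' |
        # The namespace is the quoted value after 'in the namespace'.
        sed -n 's/.*in the namespace "\([^"]*\)".*/\1/p' |
        sort -u
}
```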
Created attachment 1701831: prometheus-operator container logs

Description of problem:

    # oc -n openshift-monitoring logs prometheus-operator-67fb57bc5c-jxlf6 -c prometheus-operator
    ...
    E0721 00:14:29.696630 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "43462/43463": strconv.ParseUint: parsing "43462/43463": invalid syntax
    E0721 00:14:32.384560 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "43462/43463": strconv.ParseUint: parsing "43462/43463": invalid syntax
    ...

For details, see the attached log file.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-20-183524
Prometheus Operator version '0.40.0'

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
invalid syntax error to list PrometheusRule

Expected results:
no error

Additional info:
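
The error text in the description shows `strconv.ParseUint` rejecting a value such as "43462/43463": each part on its own is a plain number, but the "/"-joined string is not. The shell sketch below only approximates that check (a digits-only test, not Go's actual parser, and it ignores range limits); the helper names `is_uint` and `split_resource_version` are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Succeeds (exit 0) if $1 is a non-empty string of decimal digits,
# which approximates what an unsigned-integer parser accepts.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Prints each "/"-separated component of a composite value on its own line.
split_resource_version() {
    printf '%s\n' "$1" | tr '/' '\n'
}
```

With these, `is_uint "43462/43463"` fails, while every line printed by `split_resource_version "43462/43463"` passes `is_uint`.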