Bug 1891815 - invalid syntax error to list PrometheusRule/ServiceMonitor
Summary: invalid syntax error to list PrometheusRule/ServiceMonitor
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard: non-multi-arch
Duplicates: 1892594 1897352 (view as bug list)
Depends On: 1858991
Blocks:
 
Reported: 2020-10-27 12:48 UTC by Andrei Neagoe
Modified: 2021-02-10 15:34 UTC (History)
CC: 36 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1858991
Environment:
[sig-instrumentation] Prometheus when installed on the cluster should have a AlertmanagerReceiversNotConfigured alert in firing state
[sig-instrumentation] Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics
[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present
[sig-instrumentation][Late] Alerts should have a Watchdog alert in firing state the entire cluster run
test: [sig-instrumentation] Prometheus when installed on the cluster should have a AlertmanagerReceiversNotConfigured alert in firing state
test: [sig-instrumentation] Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics
[sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics
[sig-instrumentation][sig-builds][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics
[sig-instrumentation] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2020-12-08 18:26:14 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift prometheus-operator pull 99 0 None closed Bug 1891815: Revert "Bug 1881077:go.sum,go.mod: Bump client-go & co. to fix known bug" 2021-02-18 12:23:17 UTC
Red Hat Knowledge Base (Solution) 5644661 0 None None None 2020-12-13 13:02:51 UTC
Red Hat Product Errata RHBA-2020:5250 0 None None None 2020-12-08 18:26:20 UTC

Internal Links: 1881077

Description Andrei Neagoe 2020-10-27 12:48:22 UTC
If feasible, please backport to 4.5

+++ This bug was initially created as a clone of Bug #1858991 +++

Description of problem:
# oc -n openshift-monitoring logs prometheus-operator-67fb57bc5c-jxlf6 -c prometheus-operator
...
E0721 00:14:29.696630       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "43462/43463": strconv.ParseUint: parsing "43462/43463": invalid syntax
E0721 00:14:32.384560       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "43462/43463": strconv.ParseUint: parsing "43462/43463": invalid syntax
...

For details, see the attached log file.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-20-183524
Prometheus Operator version '0.40.0'

How reproducible:
always

Steps to Reproduce:
1. see the description
2.
3.

Actual results:
invalid syntax error to list PrometheusRule

Expected results:
no error

Additional info:

--- Additional comment from Lili Cosic on 2020-07-21 07:16:58 UTC ---

Hm, strange, I have never seen this in my testing. Can you tell me about your environment so I can try to reproduce? How long had the cluster been running? Any other specifics would help, as this never shows up in our CI.

Is the stack functioning otherwise? Are targets up and metrics scraped?

--- Additional comment from Junqi Zhao on 2020-07-22 02:35:29 UTC ---

Tested with 4.6.0-0.nightly-2020-07-21-200036; no such error now, so closing it.
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "invalid syntax"
no result

--- Additional comment from Junqi Zhao on 2020-07-24 04:13:58 UTC ---

Tested with 4.6.0-0.nightly-2020-07-23-220427 on a fresh cluster; the issue is reproduced:
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "invalid syntax" | head -n 3
E0724 02:41:22.544393       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "48466/48467": strconv.ParseUint: parsing "48466/48467": invalid syntax
E0724 02:41:22.876746       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "48487/48488": strconv.ParseUint: parsing "48487/48488": invalid syntax
E0724 02:41:24.497890       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "48466/48467": strconv.ParseUint: parsing "48466/48467": invalid syntax

The prometheus-operator logs are in prometheus-operator/logs/prometheus-operator-54df945b6d-n7ffd-prometheus-operator.log of the dump file; the alerts are in alerts/alerts.txt.

--- Additional comment from Junqi Zhao on 2020-07-24 04:14:26 UTC ---



--- Additional comment from Lili Cosic on 2020-07-29 09:11:21 UTC ---

I have seen this happen once or twice as well; it seems to occur when the multilistwatch package's List function gets two resource versions.
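Consistent with that diagnosis, a list-watcher that fans out over multiple namespaces and naively concatenates each per-namespace List result's resourceVersion would produce exactly the values in these logs. A hypothetical sketch (the function name is invented for illustration; this is not the actual multilistwatch code):

```go
package main

import (
	"fmt"
	"strings"
)

// combineResourceVersions illustrates the failure mode: joining
// per-namespace resourceVersions with "/" yields a value such as
// "43462/43463" that the API machinery cannot parse as an integer.
func combineResourceVersions(rvs []string) string {
	return strings.Join(rvs, "/")
}

func main() {
	fmt.Println(combineResourceVersions([]string{"43462", "43463"}))
	// The later 4.5.17 logs show the same number repeated ~40 times,
	// i.e. one identical resourceVersion per watched namespace,
	// joined without deduplication.
}
```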

--- Additional comment from Lili Cosic on 2020-07-31 11:17:28 UTC ---

Currently in progress but no time to finish it this sprint, due to other higher priority bugzillas. Moving to next sprint.

--- Additional comment from Junqi Zhao on 2020-08-03 02:50:59 UTC ---

E0803 02:35:11.273843       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "100350/100352/100353": strconv.ParseUint: parsing "100350/100352/100353": invalid syntax
E0803 02:35:39.898425       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "100300/100301": strconv.ParseUint: parsing "100300/100301": invalid syntax

When the error shows up, the following alerts are triggered:
*********************************
KubeAPIErrorsHigh
API server is returning errors for 85.71% of requests for LIST servicemonitors.

KubeAPIErrorsHigh
API server is returning errors for 100% of requests for LIST prometheusrules.

PrometheusOperatorListErrors
Errors while performing List operations in controller prometheus in openshift-monitoring namespace.

PrometheusOperatorListErrors
Errors while performing List operations in controller thanos in openshift-monitoring namespace.
*********************************

--- Additional comment from Sergiusz Urbaniak on 2020-08-12 12:51:07 UTC ---

Copy-pasting, as this issue is more specific than https://bugzilla.redhat.com/show_bug.cgi?id=1856189:

We now have high confidence that, time-wise, rewriting multilistwatcher is a bigger effort than anticipated.

Hence we have the following strategy:

- We continue to observe failures in CI.
- We prepare another hotfix, where cluster-monitoring-operator watches prometheus-operator reconcile errors and restarts the pod (with a max count). We observed that restarting prometheus-operator fixes things. We will merge that hotfix only if we don't make the upstream fix in time.
- We work in parallel on an upstream fix in prometheus-operator. Once upstream is ready, we do a 0.40.z release and merge it into a 4.6.z release.

--- Additional comment from Sergiusz Urbaniak on 2020-09-08 06:46:59 UTC ---

Upstream is merged https://github.com/prometheus-operator/prometheus-operator/pull/3440.

Downstream backport is now ready to be reviewed at https://github.com/openshift/prometheus-operator/pull/86.

--- Additional comment from Sergiusz Urbaniak on 2020-09-08 10:08:29 UTC ---

@junqi, just a quick verification point from a recent e2e run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_prometheus-operator/86/pull-ci-openshift-prometheus-operator-master-e2e-aws/1303218507856482304/artifacts/e2e-aws/gather-extra/pods/openshift-monitoring_prometheus-operator-b6f8b657-9lbbm_prometheus-operator.log

There are no more "invalid syntax" log entries.

Note that "E0908 07:17:40.535604       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)"

is not related to this and is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1856189.

--- Additional comment from Junqi Zhao on 2020-09-10 06:00:34 UTC ---

Tested with 4.6.0-0.nightly-2020-09-09-173545; there are no "invalid syntax" log entries.
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "invalid syntax"
no result

The "Failed to list" errors should be related to bug 1856189:
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to list"
E0909 23:41:41.327457       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-service-ca-operator"
E0909 23:41:41.327497       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-machine-api"
E0909 23:41:41.327524       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-machine-config-operator"
E0909 23:41:41.327546       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "prometheusrules" in API group "monitoring.coreos.com" in the namespace "openshift-marketplace"

--- Additional comment from Sergiusz Urbaniak on 2020-10-22 06:30:40 UTC ---



--- Additional comment from Simon Pasquier on 2020-10-23 07:45:51 UTC ---

Comment 3 Simon Pasquier 2020-10-29 11:54:12 UTC
*** Bug 1892594 has been marked as a duplicate of this bug. ***

Comment 4 Junqi Zhao 2020-10-30 01:47:25 UTC
Same as bug 1890857.

Comment 13 Daniel Del Ciancio 2020-11-10 21:35:03 UTC
I've just upgraded to 4.5.17 and hit the issue as well. I'm seeing these alerts firing:

KubeAPIErrorsHigh (Warning, firing since Nov 9, 5:25 pm)
API server is returning errors for 83.33% of requests for LIST servicemonitors.

KubeAPIErrorsHigh (Warning, firing since Nov 9, 5:23 pm)
API server is returning errors for 100% of requests for LIST podmonitors.

Further investigation into the prometheus-operator logs shows:

E1110 02:32:38.430139       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:485: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505": strconv.ParseUint: parsing "68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505/68562505": invalid syntax
E1110 02:33:46.831448       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:486: Failed to list *v1.PodMonitor: resourceVersion: Invalid value: "68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517": strconv.ParseUint: parsing "68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517/68612517": invalid syntax


The above errors repeat regularly.

Comment 14 Scott Dodson 2020-11-12 19:20:14 UTC
According to the CCX team, this alert is firing on almost all 4.5.16+ clusters; this should be marked urgent.

Comment 16 Simon Pasquier 2020-11-13 08:35:31 UTC
*** Bug 1897352 has been marked as a duplicate of this bug. ***

Comment 17 Sergiusz Urbaniak 2020-11-13 09:06:57 UTC
UpcomingSprint: We don't have enough capacity to tackle this one in the next sprint (193).

Comment 32 Junqi Zhao 2020-11-26 03:52:57 UTC
Based on Comment 31, changing to VERIFIED.

Comment 36 errata-xmlrpc 2020-12-08 18:26:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.22 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5250

