Bug 1856189 - Failed to watch errors in prometheus-operator container logs
Summary: Failed to watch errors in prometheus-operator container logs
Keywords:
Status: CLOSED DUPLICATE of bug 1863011
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1833189 1867230 (view as bug list)
Depends On:
Blocks:
 
Reported: 2020-07-13 06:33 UTC by Junqi Zhao
Modified: 2020-09-17 07:08 UTC (History)
13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-17 07:08:39 UTC
Target Upstream Version:


Attachments (Terms of Use)
prometheus-operator container logs (73.78 KB, text/plain)
2020-07-13 06:33 UTC, Junqi Zhao
no flags Details
cluster-monitoring-operator logs (598.60 KB, text/plain)
2020-07-13 10:21 UTC, Junqi Zhao
no flags Details
monitoring dump file (1.46 MB, application/gzip)
2020-07-13 10:35 UTC, Junqi Zhao
no flags Details
all "Failed to watch" errors in prometheus-operator container (49.61 KB, text/plain)
2020-07-14 12:36 UTC, Junqi Zhao
no flags Details
faulty prometheus-operator log (44.82 KB, text/plain)
2020-08-07 10:05 UTC, Sergiusz Urbaniak
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift prometheus-operator pull 82 0 None closed Bug 1856189: pkg/listwatch: err out with multiple resource versions in List 2021-02-15 12:49:14 UTC

Description Junqi Zhao 2020-07-13 06:33:07 UTC
Created attachment 1700771 [details]
prometheus-operator container logs

Description of problem:
Failed to watch *v1.PrometheusRule/ServiceMonitor/ConfigMap/ThanosRuler/Secret errors appear in the prometheus-operator container logs; for details, see the attached log file.
# oc -n openshift-monitoring logs prometheus-operator-78d6c7dd7b-5nqrx -c prometheus-operator
...
level=info ts=2020-07-12T23:33:24.412694678Z caller=operator.go:1920 component=prometheusoperator msg="CRD updated" crd=ServiceMonitor
level=info ts=2020-07-12T23:33:24.439504477Z caller=operator.go:1920 component=prometheusoperator msg="CRD updated" crd=PrometheusRule
...
E0712 23:36:13.300168       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0712 23:36:13.301891       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0712 23:41:32.523357       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to watch *v1.ServiceMonitor: expected resource version to have 37 parts to match the number of ListerWatchers
...
E0713 00:22:30.566988       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:318: Failed to watch *v1.ConfigMap: unknown (get configmaps)
E0713 00:31:54.324789       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:317: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
E0713 02:46:14.565044       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)


# oc explain PrometheusRule
KIND:     PrometheusRule
VERSION:  monitoring.coreos.com/v1
...

# oc explain ServiceMonitor
KIND:     ServiceMonitor
VERSION:  monitoring.coreos.com/v1
...

# oc explain ConfigMap
KIND:     ConfigMap
VERSION:  v1
...

# oc explain ThanosRuler
KIND:     ThanosRuler
VERSION:  monitoring.coreos.com/v1
...

# oc explain Secret
KIND:     Secret
VERSION:  v1
...

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-12-204440
Prometheus Operator version '0.38.1'

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
Failed to watch *v1.PrometheusRule/ServiceMonitor/ConfigMap/ThanosRuler/Secret

Expected results:
no error

Additional info:

Comment 1 Lili Cosic 2020-07-13 08:17:22 UTC
> E0713 02:46:14.565044       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)

We have noticed this in kube-state-metrics as well, IIRC. Does the error go away at some point?

Comment 2 Lili Cosic 2020-07-13 09:47:25 UTC
Can you share the logs of cluster-monitoring-operator as well?

On nightly I found the following in prom-operator logs:
> E0713 08:56:44.860949       1 operator.go:996] Sync "openshift-monitoring/k8s" failed: configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "configmaps" in API group "" in the namespace "openshift-monitoring"


And then found this in CMO logs:

I0713 08:57:57.775984       1 operator.go:345] Updating ClusterOperator status to failed. Err: running task Updating Prometheus Operator failed: waiting for Prometheus CRDs to become available failed: Get https://172.30.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/prometheuses.monitoring.coreos.com: unexpected EOF
E0713 08:57:57.789011       1 operator.go:277] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0713 08:57:57.791658       1 operator.go:278] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus Operator failed: waiting for Prometheus CRDs to become available failed: Get https://172.30.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/prometheuses.monitoring.coreos.com: unexpected EOF
W0713 08:57:57.796838       1 operator.go:396] No Cluster Monitoring ConfigMap was found. Using defaults.

Can you confirm it's the same for you in the cluster-monitoring-operator logs? Thanks!

Comment 3 Junqi Zhao 2020-07-13 10:20:49 UTC
The errors are only in prometheus-operator and they never go away:
E0713 09:36:21.163592       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:41:57.888675       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:477: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)
E0713 09:42:30.113598       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:44:26.245619       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:51:53.197237       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:53:24.330961       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:01:05.417023       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:01:11.282930       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:06:19.422937       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:09:23.496442       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:12:41.504035       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:15:47.574989       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers

There is no such error in the cluster-monitoring-operator logs; see the attachment.

Comment 4 Junqi Zhao 2020-07-13 10:21:26 UTC
Created attachment 1700806 [details]
cluster-monitoring-operator logs

Comment 5 Junqi Zhao 2020-07-13 10:35:26 UTC
Created attachment 1700811 [details]
monitoring dump file

You can find all pods' logs and resource files in it.

Comment 6 Lili Cosic 2020-07-13 12:11:43 UTC
@Frederic I found the following while trying to see if it's a similar problem to the prom-operator 0.40 one:
- I can reproduce it, but the problem goes away after the prom-operator pod is restarted, so my suspicion is a race condition?
- We did recently touch the CRD creation, which might explain the above logs, so we should possibly create those before cluster-monitoring-operator?

Comment 7 Lili Cosic 2020-07-13 13:15:16 UTC
Also, for what it's worth, this problem only started happening roughly two days ago; I did not see it on Friday at all.

Comment 8 Simon Pasquier 2020-07-13 13:26:25 UTC
@Lili something to note: we have rolled back the operator to v0.38.1 but we don't have the change [1] that removed the `--manage-crds=false` flag. Hence both CMO and prometheus operator try to create CRDs now.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/843

Comment 9 Lili Cosic 2020-07-13 13:33:28 UTC
Seems like even in the 0.40 bump we see these errors:

> E0713 12:07:29.489513       1 operator.go:999] Sync "openshift-monitoring/k8s" failed: configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "configmaps" in API group "" in the namespace "openshift-monitoring"


So most likely unrelated, but still worth exploring.

Comment 10 Lili Cosic 2020-07-14 11:23:01 UTC
@Junqi I cannot see this in the latest nightly:

> E0713 00:31:54.324789       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:317: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)

Can you confirm it's still present? Thanks!

Comment 11 Junqi Zhao 2020-07-14 12:35:42 UTC
(In reply to Lili Cosic from comment #10)
> @Junqi I cannot see this in the latest nightly:
> 
> > E0713 00:31:54.324789       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:317: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
> 
> Can you confirm it's still present? Thanks!

Yes, I can confirm there is no longer an error for "Failed to watch *v1.ThanosRuler"; see all the "Failed to watch" errors in the prometheus-operator container in the attachment.

Comment 12 Junqi Zhao 2020-07-14 12:36:36 UTC
Created attachment 1701019 [details]
all "Failed to watch" errors in prometheus-operator container

Comment 13 Lili Cosic 2020-07-14 14:31:55 UTC
Strange, I did not see the "unknown" errors in my nightly.

@Frederic The messages "expected resource version to have 37 parts to match the number of ListerWatchers" are explained here -> https://github.com/coreos/prometheus-operator/issues/3218, so that part we can ignore for now, as we want to remove the multi list watcher. I am more worried about the "unknown" errors and the forbidden ones.

Comment 14 Lili Cosic 2020-07-23 06:50:37 UTC
I have not seen these problems since we bumped to 0.40 release, have you Junqi?

Comment 15 Junqi Zhao 2020-07-24 04:19:54 UTC
(In reply to Lili Cosic from comment #14)
> I have not seen these problems since we bumped to 0.40 release, have you
> Junqi?

The "Failed to watch" errors are still there, with Prometheus Operator version '0.40.0':
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to watch"
E0724 00:15:30.223920       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 1 part, got 2
E0724 01:16:03.943670       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)
E0724 02:46:09.091550       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:216: Failed to watch *v1.Alertmanager: unknown (get alertmanagers.monitoring.coreos.com)
E0724 03:45:24.961676       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:483: Failed to watch *v1.StatefulSet: unknown (get statefulsets.apps)

Comment 16 Frederic Branczyk 2020-07-31 14:13:31 UTC
This will need another look, previous attempts did not seem to fix this fully. Moving to next sprint.

Comment 17 Junqi Zhao 2020-08-07 03:59:00 UTC
tested with 4.6.0-0.nightly-2020-08-06-131904
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to"
E0807 00:41:37.429177       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:311: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
E0807 01:11:02.742601       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0807 03:11:37.208029       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:481: Failed to watch *v1.ConfigMap: unknown (get configmaps)
E0807 03:11:38.041840       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0807 03:11:38.130261       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:481: Failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "configmaps" in API group "" in the namespace "openshift-monitoring"

Comment 18 Sergiusz Urbaniak 2020-08-07 10:05:08 UTC
Created attachment 1710779 [details]
faulty prometheus-operator log

We have yet stronger indicators that clusters that:

a) don't export all alerts after installation
b) don't export all recording rules after installation

correlate with the log outputs here. Today I was investigating a (cluster-bot) cluster which showed symptoms a) and b).

The prometheus-operator log revealed the log lines reported here (see attached file):

E0807 08:05:26.132193       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: expected resource version to have 1 part, got 2
level=info ts=2020-08-07T08:05:26.171112359Z caller=operator.go:256 component=thanosoperator msg="successfully synced all caches"
E0807 08:05:26.176190       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 1 part, got 2

and many

E0807 08:05:27.313755       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to list *v1.PodMonitor: resourceVersion: Invalid value: "31998/31999": strconv.ParseUint: parsing "31998/31999": invalid syntax
E0807 08:05:27.522495       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "32000/32001": strconv.ParseUint: parsing "32000/32001": invalid syntax

entries.

A restart of prometheus-operator solves the issue.

My hypothesis:
a) In some cases, when prometheus-operator starts initially, it gets reported multiple resource versions, causing errors [1] and subsequent cascading failures [2]. The failure case [2] is especially interesting, as it is triggered by some code path in client-go (most likely [3]), but I am still investigating.

b) In many other cases, when prometheus-operator starts initially, it gets reported only one resource version and then functions correctly.

[1] https://github.com/prometheus-operator/prometheus-operator/blob/2bb654e5ca930afc60a1a7af309816ad2d78c923/pkg/listwatch/listwatch.go#L177
[2] https://github.com/prometheus-operator/prometheus-operator/blob/ace1ce936b6ac2ccc717a8c87912b76bb11c290c/vendor/k8s.io/client-go/tools/cache/reflector.go#L278
[3] https://github.com/prometheus-operator/prometheus-operator/blob/b66baffe2e65ccd0671fc0b63e5009992304655b/vendor/k8s.io/client-go/tools/cache/mutation_cache.go#L237
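
Both the "expected resource version to have N parts" errors and the `strconv.ParseUint: parsing "31998/31999"` failures stem from the multi-namespace ListerWatcher joining the per-namespace resource versions into one synthetic, "/"-separated value. The following is an illustrative Python sketch of that join/split behavior; the function names are hypothetical and the real implementation is the Go code in pkg/listwatch/listwatch.go [1]:

```python
# Illustrative sketch (not the actual Go implementation) of how the
# multi-namespace ListerWatcher combines per-namespace resource versions.

def combine_resource_versions(per_namespace_versions):
    # Each per-namespace List call returns its own resourceVersion;
    # the multi ListerWatcher joins them into one synthetic version.
    return "/".join(per_namespace_versions)

def split_resource_version(combined, expected_parts):
    # On Watch, the synthetic version must split back into exactly one
    # part per ListerWatcher, else the reported error is raised.
    parts = combined.split("/")
    if len(parts) != expected_parts:
        raise ValueError(
            f"expected resource version to have {expected_parts} parts "
            f"to match the number of ListerWatchers"
        )
    return parts

combined = combine_resource_versions(["31998", "31999"])
print(combined)  # 31998/31999

# A plain integer parse of the combined value fails, just like
# strconv.ParseUint does in the "Failed to list" log lines above:
try:
    int(combined)
except ValueError as err:
    print("parse error:", err)
```

This also explains why a component expecting a single-part resource version chokes on "31998/31999": the synthetic value is only meaningful to the multi ListerWatcher itself.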

Comment 19 Sergiusz Urbaniak 2020-08-07 10:05:42 UTC
Raising severity to high as the observed symptoms are missing recording and alerting rules.

Comment 20 Sergiusz Urbaniak 2020-08-10 05:50:56 UTC
*** Bug 1867230 has been marked as a duplicate of this bug. ***

Comment 21 Sergiusz Urbaniak 2020-08-10 05:51:44 UTC
Some more investigation e2e runs: https://bugzilla.redhat.com/show_bug.cgi?id=1867230

Potential follow-up fix in upstream: https://github.com/prometheus-operator/prometheus-operator/pull/3396

Comment 22 David Eads 2020-08-10 12:21:54 UTC
trying to make sippy find this

test:
[sig-instrumentation] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics 

test: [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel]

Comment 23 David Eads 2020-08-10 13:37:11 UTC
also linking to 

test: [sig-instrumentation] Prometheus when installed on the cluster should have a AlertmanagerReceiversNotConfigured alert in firing state
test: [sig-instrumentation] Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics

based on https://app.slack.com/client/T027F3GAJ/C0VMT03S5/thread/C0VMT03S5-1597066515.468000

Comment 27 Junqi Zhao 2020-08-11 03:01:11 UTC
tested with 4.6.0-0.nightly-2020-08-10-180431, still see the errors
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to watch" 
E0810 23:33:43.891001       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to watch *v1.ServiceMonitor: expected resource version to have 1 part, got 3
E0810 23:58:50.771826       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:216: Failed to watch *v1.Alertmanager: unknown (get alertmanagers.monitoring.coreos.com)
E0811 02:03:49.478946       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:217: Failed to watch *v1.StatefulSet: unknown (get statefulsets.apps)


# oc api-resources | grep -E "Alertmanager|StatefulSet"
NAME                                  SHORTNAMES       APIGROUP                              NAMESPACED   KIND
statefulsets                          sts              apps                                  true         StatefulSet
alertmanagers                                          monitoring.coreos.com                 true         Alertmanager

Comment 28 Sergiusz Urbaniak 2020-08-11 12:04:14 UTC
@junqi the errors are expected. the question is if the cluster converges against a stable state and if all prometheus rules/recording/alerting rules/service monitors are available.

However, the newest patch proved not to fix the situation.

Reassigning to Lili as she is working on an overhaul of the multilistwatcher code in prometheus-operator. Initially we wanted to land the overhaul in 4.7 but since a hotfix is not possible, we are prioritizing it for 4.6.

Comment 29 Sergiusz Urbaniak 2020-08-12 09:45:07 UTC
We are now fairly confident that, time-wise, Lili's patch is a bigger effort than anticipated.

Hence we are having the following strategy now:

- We continue to observe failures in CI.
- We prepare another hotfix, where cluster-monitoring-operator watches prometheus-operator reconcile errors and restarts the pod (with a max count). We observed that restarting prometheus-operator fixes things. We will merge that hotfix only if we don't land the upstream fix in time.
- We work in parallel on an upstream fix in prometheus-operator. Once upstream is ready, we do a 0.40.z release and merge that one into a 4.6.z release.
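
The proposed hotfix (restart prometheus-operator on reconcile errors, bounded by a max count) can be sketched as a simple supervision loop. This is a hypothetical illustration in Python, not cluster-monitoring-operator code; `has_reconcile_errors` and `restart_pod` stand in for the real CMO logic:

```python
# Hypothetical sketch of the proposed hotfix: restart the pod while
# reconcile errors persist, up to a maximum number of restarts.

def supervise(has_reconcile_errors, restart_pod, max_restarts=3):
    restarts = 0
    while has_reconcile_errors():
        if restarts >= max_restarts:
            # Give up; in the real operator this would surface as a
            # degraded ClusterOperator condition instead of silence.
            return restarts
        restart_pod()
        restarts += 1
    return restarts
```

For example, if the error condition clears after one restart (as observed in this bug), the loop performs exactly one restart and stops.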

Comment 30 Sergiusz Urbaniak 2020-08-12 12:53:45 UTC
Lowering severity to low, as from now on this bug should focus only on the

E0713 00:22:30.566988       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:318: Failed to watch *v1.ConfigMap: unknown (get configmaps)

errors. So far, they have shown no sign of functionality degradation.

The multilistwatcher issues and missing reconcilation issues in prometheus-operator are being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1858991 from now on.

Comment 31 Junqi Zhao 2020-08-12 14:40:53 UTC
(In reply to Sergiusz Urbaniak from comment #28)
> @junqi the errors are expected. the question is if the cluster converges
> against a stable state and if all prometheus rules/recording/alerting
> rules/service monitors are available.
> 
> However, the newest patch proved not to fix the situation.
> 
> Reassigning to Lili as she is working on an overhaul of the
> multilistwatcher code in prometheus-operator. Initially we wanted to land
> the overhaul in 4.7 but since a hotfix is not possible, we are
> prioritizing it for 4.6.

A stable cluster, and all prometheus rules/recording/alerting rules/service monitors are available, but there are more errors:
# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to watch" 
E0812 09:38:15.163395       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: unknown (get podmonitors.monitoring.coreos.com)
E0812 09:38:15.249811       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)
E0812 11:23:55.900809       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: unknown (get podmonitors.monitoring.coreos.com)
E0812 12:23:16.346716       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0812 13:10:40.510172       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:477: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)
E0812 13:36:55.019911       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0812 13:36:55.020090       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: unknown (get podmonitors.monitoring.coreos.com)
E0812 13:36:55.021541       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0812 13:36:55.021573       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to watch *v1.ServiceMonitor: unknown (get servicemonitors.monitoring.coreos.com)

Comment 35 Sergiusz Urbaniak 2020-08-25 07:34:06 UTC
Using a temporary verbosity increase of prometheus-operator in https://github.com/openshift/cluster-monitoring-operator/pull/914 revealed the following:

- It seems the rolebindings necessary to execute watch calls against the kube-apiserver are not effective yet.
- An initial watch/get/list call fails and manifests differently in the log files, sometimes explicitly with "forbidden" messages and sometimes with 'unknown get' errors as shown below.
- Effectively it is all the same issue, namely not-yet-effective rolebindings.
- It is a bit unfortunate, as the default error message for this edge case is "unknown" as per [1]. In case of a 403 response, client-go simply reuses the "unknown" message, masking the actual underlying permission issue, see [2].

[1] https://github.com/prometheus-operator/prometheus-operator/blob/ace1ce936b6ac2ccc717a8c87912b76bb11c290c/vendor/k8s.io/client-go/rest/request.go#L1111
[2] https://github.com/prometheus-operator/prometheus-operator/blob/7515d115eb0e0ba694bd3a0e48b496e331442e1e/vendor/k8s.io/apimachinery/pkg/api/errors/errors.go#L415-L418

From the prometheus-operator logs in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/914/pull-ci-openshift-cluster-monitoring-operator-master-e2e/1297945391496433664/artifacts/e2e/gather-extra/pods/ we can see:

```
I0824 18:38:19.364535       1 round_trippers.go:443] GET https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?allowWatchBookmarks=true&resourceVersion=33986&timeout=6m4s&timeoutSeconds=364&watch=true 403 Forbidden in 1 milliseconds
I0824 18:38:19.364621       1 round_trippers.go:449] Response Headers:
I0824 18:38:19.364639       1 round_trippers.go:452]     Audit-Id: 967cc52f-a6cd-4395-b9c3-9fa803e8b7f8
I0824 18:38:19.364646       1 round_trippers.go:452]     Cache-Control: no-cache, private
I0824 18:38:19.364652       1 round_trippers.go:452]     Content-Type: application/json
I0824 18:38:19.364657       1 round_trippers.go:452]     X-Content-Type-Options: nosniff
I0824 18:38:19.364662       1 round_trippers.go:452]     Content-Length: 420
I0824 18:38:19.364667       1 round_trippers.go:452]     Date: Mon, 24 Aug 2020 18:38:19 GMT
I0824 18:38:19.364689       1 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"prometheuses.monitoring.coreos.com is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-operator\" cannot watch resource \"prometheuses\" in API group \"monitoring.coreos.com\" in the namespace \"openshift-monitoring\"","reason":"Forbidden","details":{"group":"monitoring.coreos.com","kind":"prometheuses"},"code":403}
E0824 18:38:19.364723       1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:477: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)
```

client-go then retries and a subsequent watch call succeeds once the role binding is effective:

```
I0824 18:38:20.611119       1 round_trippers.go:443] GET https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?resourceVersion=33986 200 OK in 1 milliseconds
I0824 18:38:20.611137       1 round_trippers.go:449] Response Headers:
I0824 18:38:20.611144       1 round_trippers.go:452]     X-Kubernetes-Pf-Flowschema-Uid: d7d2f0ee-90c7-4117-aa53-a11a8baea206
I0824 18:38:20.611150       1 round_trippers.go:452]     X-Kubernetes-Pf-Prioritylevel-Uid: d860e2a8-f121-4056-a625-f74d2bdcc776
I0824 18:38:20.611156       1 round_trippers.go:452]     Date: Mon, 24 Aug 2020 18:38:20 GMT
I0824 18:38:20.611163       1 round_trippers.go:452]     Audit-Id: a40f496a-78bd-4fa5-849b-7060477bcfcd
I0824 18:38:20.611168       1 round_trippers.go:452]     Cache-Control: no-cache, private
I0824 18:38:20.611174       1 round_trippers.go:452]     Content-Type: application/json
I0824 18:38:20.611575       1 request.go:1068] Response Body: {"apiVersion":"monitoring.coreos.com/v1","items":[{"apiVersion":"monitoring.coreos.com/v1","kind":"Prometheus","metadata":{"creationTimestamp":"2020-08-24T17:51:29Z"...
```

We need to find a way to initiate deployment of assets if and only if role bindings are effective.
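
The masking described above can be sketched in a few lines. This is an illustrative Python model of the behavior the comment attributes to client-go (not the actual Go code in [1]/[2]): when the error's status carries no usable message, the fallback string is literally "unknown", so a 403 watch failure logs as "unknown (get <resource>)":

```python
# Illustrative sketch: how an empty Status message ends up logged as
# "unknown (get <resource>)", hiding the underlying 403/forbidden cause.

def reason_for_error(status_message, verb, resource):
    # client-go renders "<message> (<verb> <resource>)"; with an empty
    # message, the HTTP 403 detail is masked behind "unknown".
    message = status_message or "unknown"
    return f"{message} ({verb} {resource})"

# An empty message reproduces the log lines seen throughout this bug:
print(reason_for_error("", "get", "prometheuses.monitoring.coreos.com"))
# unknown (get prometheuses.monitoring.coreos.com)
```

When the Status message survives (as in the `Failed to list *v1.ConfigMap: configmaps is forbidden: ...` lines), the real permission error is visible instead.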

Comment 36 Sergiusz Urbaniak 2020-08-25 07:36:22 UTC
The good news, though: the error messages in the log files are transient; the role bindings eventually become effective in the API.

Comment 37 Sergiusz Urbaniak 2020-08-25 07:36:31 UTC
*** Bug 1833189 has been marked as a duplicate of this bug. ***

Comment 38 Sergiusz Urbaniak 2020-08-25 07:38:53 UTC
Setting target release to 4.7.0 as this is a good tech debt item which could be resolved/implemented in cluster-monitoring-operator.

Comment 39 Sergiusz Urbaniak 2020-09-11 13:15:05 UTC
Further discussion with the API machinery team indicates this seems to be caused by the removal of the kube-apiserver informer sync logic at startup.

The investigation is ongoing.

Comment 40 Sergiusz Urbaniak 2020-09-15 16:00:44 UTC
Further investigation showed that cluster-monitoring-operator recreates those role bindings every few minutes:

$ kubectl -n openshift-monitoring get clusterrolebinding prometheus-operator -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2020-09-15T15:38:51Z"
...

<delay>

$ kubectl -n openshift-monitoring get clusterrolebinding prometheus-operator -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2020-09-15T15:53:52Z"
...

We need to be smarter in the apply logic of cluster-monitoring-operator and prevent recreation.
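
The recreation above is detectable because `metadata.creationTimestamp` changes when an object is deleted and recreated, but not when it is merely updated. A small hypothetical Python helper illustrating that check (the timestamps are the ones from the `kubectl` output above):

```python
# Hypothetical helper: a changed creationTimestamp between two
# observations means the ClusterRoleBinding was deleted and recreated
# (CMO hot-looping), not just updated in place.

from datetime import datetime

def was_recreated(first_ts, second_ts):
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # Kubernetes RFC 3339 timestamp layout
    return datetime.strptime(second_ts, fmt) != datetime.strptime(first_ts, fmt)

print(was_recreated("2020-09-15T15:38:51Z", "2020-09-15T15:53:52Z"))  # True
```

In practice one would feed it two samples of `kubectl -n openshift-monitoring get clusterrolebinding prometheus-operator -o jsonpath='{.metadata.creationTimestamp}'` taken a few minutes apart.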

Comment 41 W. Trevor King 2020-09-16 05:09:13 UTC
From [1]:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.6.0-fc.6-x86_64
  Extracted release payload from digest sha256:933f3d6f61ddec9f3b88a0932b47c438d7dfc15ff1873ab176284b66c9cff76e created at 2020-09-14T21:50:05Z
  $ diff -u manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
  --- manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml	2020-09-12 05:33:59.000000000 -0700
  +++ manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml	2020-09-12 05:33:59.000000000 -0700
  @@ -9,19 +9,11 @@
     groups:
       - name: cluster-version
         rules:
  -        - alert: PodDisruptionBudgetAtLimit
  +        - alert: KubeControllerManagerDown
             annotations:
  -            message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
  +            message: KubeControllerManager has disappeared from Prometheus target discovery.
             expr: |
  -            max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods == kube_poddisruptionbudget_status_desired_healthy)
  -          for: 15m
  -          labels:
  -            severity: warning
  -        - alert: PodDisruptionBudgetLimit
  -          annotations:
  -            message: The pod disruption budget is below the minimum number allowed pods.
  -          expr: |
  -            max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
  +            absent(up{job="kube-controller-manager"} == 1)
             for: 15m
             labels:
               severity: critical

is that what's going on here?  I don't understand why [2,3] are using the same kind/namespace/name with different spec.groups; maybe that's ok for PrometheusRule?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184#c2
[2]: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/9773980cbca12bfb0d5e719c13fb81b0de352efb/manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
[3]: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/9773980cbca12bfb0d5e719c13fb81b0de352efb/manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml

Comment 42 Sergiusz Urbaniak 2020-09-17 07:08:39 UTC
@trevor: no, your manifests refer to a different operator. What we found happening here is that CMO hot-loops, recreating those role bindings every few minutes.

closing out as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1863011 as it is more specific.

*** This bug has been marked as a duplicate of bug 1863011 ***

