Summary: Failed to watch errors in prometheus-operator container logs
Product: OpenShift Container Platform
Component: Monitoring
Status: CLOSED DUPLICATE
Severity: low
Priority: high
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Reporter: Junqi Zhao <juzhao>
Assignee: Sergiusz Urbaniak <surbania>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, deads, dtaylor, erooth, jerzhang, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania, wking
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-09-17 07:08:39 UTC
Description (Junqi Zhao, 2020-07-13 06:33:07 UTC)
> E0713 02:46:14.565044 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)
We have noticed this in kube-state-metrics as well, IIRC. Does the error go away at some point?
Can you share the logs of cluster-monitoring-operator as well? On nightly I found the following in prom-operator logs:

> E0713 08:56:44.860949 1 operator.go:996] Sync "openshift-monitoring/k8s" failed: configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "configmaps" in API group "" in the namespace "openshift-monitoring"

And then found this in CMO logs:

I0713 08:57:57.775984 1 operator.go:345] Updating ClusterOperator status to failed. Err: running task Updating Prometheus Operator failed: waiting for Prometheus CRDs to become available failed: Get https://172.30.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/prometheuses.monitoring.coreos.com: unexpected EOF
E0713 08:57:57.789011 1 operator.go:277] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0713 08:57:57.791658 1 operator.go:278] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus Operator failed: waiting for Prometheus CRDs to become available failed: Get https://172.30.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/prometheuses.monitoring.coreos.com: unexpected EOF
W0713 08:57:57.796838 1 operator.go:396] No Cluster Monitoring ConfigMap was found. Using defaults.

Can you confirm it's the same for you in the cluster-monitoring-operator logs? Thanks

The errors are only in prometheus-operator and they never go away:

E0713 09:36:21.163592 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:41:57.888675 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:477: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)
E0713 09:42:30.113598 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:44:26.245619 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:51:53.197237 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 09:53:24.330961 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:01:05.417023 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:01:11.282930 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:06:19.422937 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:09:23.496442 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:12:41.504035 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers
E0713 10:15:47.574989 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: expected resource version to have 37 parts to match the number of ListerWatchers

There is no such error in cluster-monitoring-operator logs, see the attached.

Created attachment 1700806 [details]
cluster-monitoring-operator logs
Created attachment 1700811 [details]
monitoring dump file
You can find all pods' logs and resource files in it.
@Frederic I found the following while trying to see if it's a similar problem to the prom-operator 0.40 one:

- I can reproduce it, but the problem goes away after the prom-operator pod is restarted, so my suspicion is a race condition?
- We did recently touch the CRD creation, which might explain the above logs, so we should possibly create those before cluster-monitoring-operator?

Also, for what it's worth, this problem only started to happen roughly two days ago; I did not see it on Friday at all.

@Lili something to note: we have rolled back the operator to v0.38.1, but we don't have the change [1] that removed the `--manage-crds=false` flag. Hence both CMO and prometheus-operator try to create CRDs now.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/843

It seems like even in the 0.40 bump we see these errors:
> E0713 12:07:29.489513 1 operator.go:999] Sync "openshift-monitoring/k8s" failed: configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "configmaps" in API group "" in the namespace "openshift-monitoring"
So most likely unrelated, but still worth exploring.
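One way to explore the forbidden message independently of prometheus-operator is to ask the API server directly whether the service account currently holds the permission. The following is a hypothetical diagnostic sketch (not part of either operator) using the standard client-go SubjectAccessReview API; the kubeconfig handling is an assumption:

```
package main

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumes cluster-admin credentials).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask the API server: can the prometheus-operator service account list
	// configmaps in openshift-monitoring right now?
	sar := &authv1.SubjectAccessReview{
		Spec: authv1.SubjectAccessReviewSpec{
			User: "system:serviceaccount:openshift-monitoring:prometheus-operator",
			ResourceAttributes: &authv1.ResourceAttributes{
				Namespace: "openshift-monitoring",
				Verb:      "list",
				Resource:  "configmaps",
			},
		},
	}
	res, err := client.AuthorizationV1().SubjectAccessReviews().Create(context.TODO(), sar, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("allowed=%v reason=%q\n", res.Status.Allowed, res.Status.Reason)
}
```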
@Junqi I cannot see this in the latest nightly:
> E0713 00:31:54.324789 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:317: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
Can you confirm it's still present? Thanks!
(In reply to Lili Cosic from comment #10)
> @Junqi I cannot see this in the latest nightly:
>
> > E0713 00:31:54.324789 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:317: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
>
> Can you confirm it's still present? Thanks!

Yes, there is no error for "Failed to watch *v1.ThanosRuler" anymore; see all the errors in the prometheus-operator container in the attachment.

Created attachment 1701019 [details]
all "Failed to watch" errors in prometheus-operator container
Strange, I did not see the "unknown" errors in my nightly.

@Frederic The "expected resource version to have 37 parts to match the number of ListerWatchers" messages are explained here -> https://github.com/coreos/prometheus-operator/issues/3218 So that part we can ignore for now, as we want to remove the multi list watcher. I am more worried about the "unknown" errors and the forbidden ones.

I have not seen these problems since we bumped to the 0.40 release, have you Junqi?

(In reply to Lili Cosic from comment #14)
> I have not seen these problems since we bumped to 0.40 release, have you
> Junqi?

There are still "Failed to watch" errors now, with Prometheus Operator version '0.40.0':

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to watch"
E0724 00:15:30.223920 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 1 part, got 2
E0724 01:16:03.943670 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)
E0724 02:46:09.091550 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:216: Failed to watch *v1.Alertmanager: unknown (get alertmanagers.monitoring.coreos.com)
E0724 03:45:24.961676 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:483: Failed to watch *v1.StatefulSet: unknown (get statefulsets.apps)

This will need another look, previous attempts did not seem to fix this fully. Moving to next sprint.

Tested with 4.6.0-0.nightly-2020-08-06-131904:

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to"
E0807 00:41:37.429177 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:311: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
E0807 01:11:02.742601 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0807 03:11:37.208029 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:481: Failed to watch *v1.ConfigMap: unknown (get configmaps)
E0807 03:11:38.041840 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0807 03:11:38.130261 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:481: Failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot list resource "configmaps" in API group "" in the namespace "openshift-monitoring"

Created attachment 1710779 [details]
faulty prometheus-operator log

We have yet more strong indicators that clusters that:
a) don't export all alerts after installation
b) don't export all recording rules after installation
correlate with the log outputs here. Today I was investigating a (cluster-bot) cluster which showed symptoms a) and b). The prometheus-operator log revealed the log lines reported here (see attached file):

E0807 08:05:26.132193 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: expected resource version to have 1 part, got 2
level=info ts=2020-08-07T08:05:26.171112359Z caller=operator.go:256 component=thanosoperator msg="successfully synced all caches"
E0807 08:05:26.176190 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: expected resource version to have 1 part, got 2

and many

E0807 08:05:27.313755 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to list *v1.PodMonitor: resourceVersion: Invalid value: "31998/31999": strconv.ParseUint: parsing "31998/31999": invalid syntax
E0807 08:05:27.522495 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "32000/32001": strconv.ParseUint: parsing "32000/32001": invalid syntax

entries. A restart of prometheus-operator solves the issue.

My hypothesis:
a) In some cases, when prometheus-operator starts initially, it is reported multiple resource versions, causing errors [1] and subsequent cascading failures [2]. The failure case [2] is especially interesting, as it is triggered by some code path in client-go (most likely [3]), but I am still investigating.
b) In many other cases, when prometheus-operator starts initially, it is reported only one resource version and then functions correctly.

[1] https://github.com/prometheus-operator/prometheus-operator/blob/2bb654e5ca930afc60a1a7af309816ad2d78c923/pkg/listwatch/listwatch.go#L177
[2] https://github.com/prometheus-operator/prometheus-operator/blob/ace1ce936b6ac2ccc717a8c87912b76bb11c290c/vendor/k8s.io/client-go/tools/cache/reflector.go#L278
[3] https://github.com/prometheus-operator/prometheus-operator/blob/b66baffe2e65ccd0671fc0b63e5009992304655b/vendor/k8s.io/client-go/tools/cache/mutation_cache.go#L237

Raising severity to high as the observed symptoms are missing recording and alerting rules.

*** Bug 1867230 has been marked as a duplicate of this bug. ***
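The resource-version part of the hypothesis above can be illustrated with a minimal, self-contained sketch (standard library only, not the operator's actual code). The joining scheme is an assumption inferred from the `Invalid value: "31998/31999"` list failures quoted above: one resource version per watched namespace, merged into a single string that client-go later cannot parse as one integer.

```
// Minimal illustration of the multi-namespace list-watch hypothesis: each
// per-namespace List returns its own resource version, the merged value is
// joined with "/", and parsing it as a single resourceVersion fails.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	perNamespaceRVs := []string{"31998", "31999"} // one resource version per watched namespace (assumed)
	merged := strings.Join(perNamespaceRVs, "/")  // hypothetical merged value, e.g. "31998/31999"

	// Roughly what happens when the merged value is treated as one resourceVersion:
	if _, err := strconv.ParseUint(merged, 10, 64); err != nil {
		fmt.Printf("Failed to list: resourceVersion: Invalid value: %q: %v\n", merged, err)
	}
}
```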
Some more investigation e2e runs: https://bugzilla.redhat.com/show_bug.cgi?id=1867230

Potential follow-up fix upstream: https://github.com/prometheus-operator/prometheus-operator/pull/3396

Trying to make sippy find this:

test: [sig-instrumentation] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics
test: [sig-instrumentation] Prometheus when installed on the cluster should have important platform topology metrics [Suite:openshift/conformance/parallel]

Also linking to:

test: [sig-instrumentation] Prometheus when installed on the cluster should have a AlertmanagerReceiversNotConfigured alert in firing state
test: [sig-instrumentation] Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics

based on https://app.slack.com/client/T027F3GAJ/C0VMT03S5/thread/C0VMT03S5-1597066515.468000

Prometheus tests also fail quite a bit on Azure:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1292725561579278336
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1292679726086754304
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1291866204109017088
Just cross-linking here to check if these have the same cause.

Tested with 4.6.0-0.nightly-2020-08-10-180431, still see the errors:

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to watch"
E0810 23:33:43.891001 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to watch *v1.ServiceMonitor: expected resource version to have 1 part, got 3
E0810 23:58:50.771826 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:216: Failed to watch *v1.Alertmanager: unknown (get alertmanagers.monitoring.coreos.com)
E0811 02:03:49.478946 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:217: Failed to watch *v1.StatefulSet: unknown (get statefulsets.apps)

# oc api-resources | grep -E "Alertmanager|StatefulSet"
NAME            SHORTNAMES   APIGROUP                NAMESPACED   KIND
statefulsets    sts          apps                    true         StatefulSet
alertmanagers                monitoring.coreos.com   true         Alertmanager

@junqi the errors are expected. The question is whether the cluster converges to a stable state and whether all Prometheus recording/alerting rules and service monitors are available.

However, the newest patch proved not to fix the situation.

Reassigning to Lili as she is working on an overhaul of the multilistwatcher code in prometheus-operator. Initially we wanted to land the overhaul in 4.7, but since a hotfix is not possible, we are prioritizing it for 4.6.

We have high confidence now that, time wise, Lili's patch is a bigger effort than anticipated. Hence we have the following strategy now:
- We continue to observe failures in CI.
- We prepare another hotfix, where cluster-monitoring-operator watches prometheus-operator reconcile errors and restarts the pod (with a max count), as sketched below. We observed that restarting prometheus-operator fixes things. We will merge that hotfix only if we don't make the upstream fix in time.
- We work in parallel on an upstream fix in prometheus-operator. Once upstream is ready, do a 0.40.z release and merge that one into a 4.6.z release.
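A rough sketch of what such a restart hotfix could look like, assuming a client-go client and an error signal supplied by the caller. The namespace is taken from this report; the label selector, threshold, and function shape are illustrative assumptions, not the actual cluster-monitoring-operator change:

```
// Hypothetical sketch of the restart hotfix described above: when reconcile
// errors are observed, delete the prometheus-operator pod (its Deployment
// recreates it), bounded by a maximum restart count.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const maxRestarts = 3 // assumed cap

func maybeRestartPrometheusOperator(ctx context.Context, client kubernetes.Interface, restarts *int, sawReconcileError bool) error {
	if !sawReconcileError || *restarts >= maxRestarts {
		return nil
	}
	pods, err := client.CoreV1().Pods("openshift-monitoring").List(ctx, metav1.ListOptions{
		// Assumed label; the real pod labels may differ.
		LabelSelector: "app.kubernetes.io/name=prometheus-operator",
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		// Deleting the pod lets the Deployment controller bring up a fresh one,
		// which was observed in this bug to clear the stuck list/watch state.
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	*restarts++
	fmt.Printf("restarted prometheus-operator (%d/%d)\n", *restarts, maxRestarts)
	return nil
}
```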
Lowering severity here to low, as we should focus here only on the

E0713 00:22:30.566988 1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:318: Failed to watch *v1.ConfigMap: unknown (get configmaps)

errors. So far, they showed no sign of functionality degradation. The multilistwatcher issues and missing reconciliation issues in prometheus-operator are being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1858991 from now on.

(In reply to Sergiusz Urbaniak from comment #28)
> @junqi the errors are expected. The question is whether the cluster converges
> to a stable state and whether all Prometheus recording/alerting rules and
> service monitors are available.
>
> However, the newest patch proved not to fix the situation.
>
> Reassigning to Lili as she is working on an overhaul of the multilistwatcher
> code in prometheus-operator. Initially we wanted to land the overhaul in 4.7,
> but since a hotfix is not possible, we are prioritizing it for 4.6.

It is a stable cluster, and all Prometheus recording/alerting rules and service monitors are available, but there are more errors:

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "Failed to watch"
E0812 09:38:15.163395 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: unknown (get podmonitors.monitoring.coreos.com)
E0812 09:38:15.249811 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:482: Failed to watch *v1.Secret: unknown (get secrets)
E0812 11:23:55.900809 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: unknown (get podmonitors.monitoring.coreos.com)
E0812 12:23:16.346716 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0812 13:10:40.510172 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:477: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)
E0812 13:36:55.019911 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:313: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0812 13:36:55.020090 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:479: Failed to watch *v1.PodMonitor: unknown (get podmonitors.monitoring.coreos.com)
E0812 13:36:55.021541 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:480: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
E0812 13:36:55.021573 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:478: Failed to watch *v1.ServiceMonitor: unknown (get servicemonitors.monitoring.coreos.com)

Hi, we are noticing many Alerting pull-ci-openshift-console-master-e2e-gcp-console test flakes:
https://search.ci.openshift.org/?search=Monitoring%3A+Alerts+creates+and+expires+a+Silence&maxAge=48h&context=3&type=bug%2Bjunit&name=e2e-gcp-console&maxMatches=5&maxBytes=20971520&groupBy=job
Is this bug contributing to this?
I ask because I see in the comments that one of the symptoms was some alerts and rules not initially loading, which seems to be the case:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console/6236/pull-ci-openshift-console-master-e2e-gcp-console/1295850715876429824/artifacts/e2e-gcp-console/gui_test_screenshots/cypress/screenshots/monitoring/monitoring.spec.ts/Monitoring%20Alerts%20--%20displays%20and%20filters%20the%20Alerts%20list%20page,%20links%20to%20detail%20pages%20%28failed%29.png

Using a temporary verbosity increase of prometheus-operator in https://github.com/openshift/cluster-monitoring-operator/pull/914 revealed the following:
- It seems the rolebindings which are necessary to execute watch calls against the kube-apiserver are not effective yet.
- An initial watch/get/list call fails and manifests differently in the log files, sometimes explicitly with "forbidden" messages and sometimes with "unknown (get ...)" errors as shown below.
- Effectively it is all the same issue, namely not-yet-effective rolebindings.
- It is a bit unfortunate, as the default error message for this edge case is "unknown" as per [1]. In case of a 403 response as shown below, client-go simply reuses the "unknown" message, masking the actual underlying permission issue, see [2].

[1] https://github.com/prometheus-operator/prometheus-operator/blob/ace1ce936b6ac2ccc717a8c87912b76bb11c290c/vendor/k8s.io/client-go/rest/request.go#L1111
[2] https://github.com/prometheus-operator/prometheus-operator/blob/7515d115eb0e0ba694bd3a0e48b496e331442e1e/vendor/k8s.io/apimachinery/pkg/api/errors/errors.go#L415-L418

From the prometheus-operator logs in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/914/pull-ci-openshift-cluster-monitoring-operator-master-e2e/1297945391496433664/artifacts/e2e/gather-extra/pods/ we can see:

```
I0824 18:38:19.364535 1 round_trippers.go:443] GET https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?allowWatchBookmarks=true&resourceVersion=33986&timeout=6m4s&timeoutSeconds=364&watch=true 403 Forbidden in 1 milliseconds
I0824 18:38:19.364621 1 round_trippers.go:449] Response Headers:
I0824 18:38:19.364639 1 round_trippers.go:452] Audit-Id: 967cc52f-a6cd-4395-b9c3-9fa803e8b7f8
I0824 18:38:19.364646 1 round_trippers.go:452] Cache-Control: no-cache, private
I0824 18:38:19.364652 1 round_trippers.go:452] Content-Type: application/json
I0824 18:38:19.364657 1 round_trippers.go:452] X-Content-Type-Options: nosniff
I0824 18:38:19.364662 1 round_trippers.go:452] Content-Length: 420
I0824 18:38:19.364667 1 round_trippers.go:452] Date: Mon, 24 Aug 2020 18:38:19 GMT
I0824 18:38:19.364689 1 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"prometheuses.monitoring.coreos.com is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-operator\" cannot watch resource \"prometheuses\" in API group \"monitoring.coreos.com\" in the namespace \"openshift-monitoring\"","reason":"Forbidden","details":{"group":"monitoring.coreos.com","kind":"prometheuses"},"code":403}
E0824 18:38:19.364723 1 reflector.go:382] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:477: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)
```

client-go then retries and a subsequent watch call succeeds once the role binding is effective:

```
I0824 18:38:20.611119 1 round_trippers.go:443] GET https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/prometheuses?resourceVersion=33986 200 OK in 1 milliseconds
I0824 18:38:20.611137 1 round_trippers.go:449] Response Headers:
I0824 18:38:20.611144 1 round_trippers.go:452] X-Kubernetes-Pf-Flowschema-Uid: d7d2f0ee-90c7-4117-aa53-a11a8baea206
I0824 18:38:20.611150 1 round_trippers.go:452] X-Kubernetes-Pf-Prioritylevel-Uid: d860e2a8-f121-4056-a625-f74d2bdcc776
I0824 18:38:20.611156 1 round_trippers.go:452] Date: Mon, 24 Aug 2020 18:38:20 GMT
I0824 18:38:20.611163 1 round_trippers.go:452] Audit-Id: a40f496a-78bd-4fa5-849b-7060477bcfcd
I0824 18:38:20.611168 1 round_trippers.go:452] Cache-Control: no-cache, private
I0824 18:38:20.611174 1 round_trippers.go:452] Content-Type: application/json
I0824 18:38:20.611575 1 request.go:1068] Response Body: {"apiVersion":"monitoring.coreos.com/v1","items":[{"apiVersion":"monitoring.coreos.com/v1","kind":"Prometheus","metadata":{"creationTimestamp":"2020-08-24T17:51:29Z"...
```

We need to find a way to initiate deployment of assets if and only if role bindings are effective. The good news here is that the error messages in the log files are transient; the role bindings are eventually effective in the API.

*** Bug 1833189 has been marked as a duplicate of this bug. ***

Setting target release to 4.7.0 as this is a good tech debt item which could be resolved/implemented in cluster-monitoring-operator.

Further discussion with the API machinery team indicates this seems to be caused by the removal of kube-apiserver informer sync logic at startup. The investigation is ongoing.

Further investigation showed that cluster-monitoring-operator recreates those role bindings every few minutes:

$ kubectl -n openshift-monitoring get clusterrolebinding prometheus-operator -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2020-09-15T15:38:51Z"
...

<delay>

$ kubectl -n openshift-monitoring get clusterrolebinding prometheus-operator -o yaml
GET:https://api.devint.openshiftknativedemo.org:6443/apis/logging.openshift.io/v1?timeout=32s
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2020-09-15T15:53:52Z"
...

We need to be smarter in the apply logic of cluster-monitoring-operator and prevent recreation; see the apply-logic sketch at the end of this report.

From [1]:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.6.0-fc.6-x86_64
Extracted release payload from digest sha256:933f3d6f61ddec9f3b88a0932b47c438d7dfc15ff1873ab176284b66c9cff76e created at 2020-09-14T21:50:05Z
$ diff -u manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
--- manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml	2020-09-12 05:33:59.000000000 -0700
+++ manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml	2020-09-12 05:33:59.000000000 -0700
@@ -9,19 +9,11 @@
   groups:
   - name: cluster-version
     rules:
-    - alert: PodDisruptionBudgetAtLimit
+    - alert: KubeControllerManagerDown
       annotations:
-        message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
+        message: KubeControllerManager has disappeared from Prometheus target discovery.
       expr: |
-        max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods == kube_poddisruptionbudget_status_desired_healthy)
-      for: 15m
-      labels:
-        severity: warning
-    - alert: PodDisruptionBudgetLimit
-      annotations:
-        message: The pod disruption budget is below the minimum number allowed pods.
-      expr: |
-        max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
+        absent(up{job="kube-controller-manager"} == 1)
       for: 15m
       labels:
         severity: critical

Is that what's going on here? I don't understand why [2,3] are using the same kind/namespace/name with different spec.groups; maybe that's ok for PrometheusRule?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184#c2
[2]: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/9773980cbca12bfb0d5e719c13fb81b0de352efb/manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
[3]: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/9773980cbca12bfb0d5e719c13fb81b0de352efb/manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml

@trevor: no, your manifests refer to a different operator. What we found happening here is that CMO hotloops in recreating role bindings every few minutes.

Closing out as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1863011 as it is more specific.

*** This bug has been marked as a duplicate of bug 1863011 ***
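For reference, the "smarter apply logic" mentioned above could look roughly like the sketch below: only create the ClusterRoleBinding when it is missing, and leave it alone when the desired state already matches what is on the cluster, so the binding (and the permissions it grants) is never torn down and recreated. This is a hypothetical sketch over standard client-go APIs, not the actual cluster-monitoring-operator code:

```
// Hypothetical sketch: apply a ClusterRoleBinding without recreating it when
// nothing changed, to avoid the hotloop that keeps resetting creationTimestamp.
package main

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func applyClusterRoleBinding(ctx context.Context, client kubernetes.Interface, desired *rbacv1.ClusterRoleBinding) error {
	existing, err := client.RbacV1().ClusterRoleBindings().Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Not there yet: create it once.
		_, err = client.RbacV1().ClusterRoleBindings().Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	// Already present and semantically identical: do nothing.
	if equality.Semantic.DeepEqual(existing.RoleRef, desired.RoleRef) &&
		equality.Semantic.DeepEqual(existing.Subjects, desired.Subjects) {
		return nil
	}
	// Otherwise update in place instead of delete+create. (RoleRef is immutable;
	// a change there would require recreation, which is out of scope for this sketch.)
	existing.Subjects = desired.Subjects
	_, err = client.RbacV1().ClusterRoleBindings().Update(ctx, existing, metav1.UpdateOptions{})
	return err
}
```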