Created attachment 1795925 [details]
must-gather logs of openshift-monitoring

Description of problem:
During single node installations we hit a case where two clusters failed to finish installation because the monitoring operator is in Degraded state. The error we see is:

Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: retrieving Alertmanager status failed: retrieving stateful set failed: statefulsets.apps "alertmanager-main" not found

There is no alertmanager-main pod running and it never tried to start.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Monitoring operator is degraded.

Expected results:
Monitoring operator is available.

Additional info:
If needed I will send the full must-gather logs; the file is too large and I can't attach it. Attaching the openshift-monitoring namespace from the must-gather.
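A quick way to confirm the degraded state and the missing StatefulSet (illustrative commands; the resource names assume the default openshift-monitoring layout):

# oc get clusteroperator monitoring
# oc -n openshift-monitoring get statefulset alertmanager-main
# oc -n openshift-monitoring get pods | grep alertmanager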
We encountered this issue in rc.0 during scale testing of SNO deployment. It's new behavior that we didn't encounter in fc.9 or earlier builds.
I tried to reproduce the issue on a fresh 4.8.0-rc.0 SNO cluster and both Prometheus and Alertmanager were correctly created, so it doesn't seem to happen every time. So far, from the monitoring logs alone, I haven't been able to find the root cause, since all the errors seem unrelated to the problem we are seeing, namely that neither Prometheus nor Alertmanager is created. So would you mind sharing the full must-gather? We might be able to get more information from it.
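If helpful, the full must-gather can be collected with something like the following (the destination directory is arbitrary):

# oc adm must-gather --dest-dir=./must-gather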
Logs were attached through Slack.
Thank you for sharing the must-gather, Igal. After looking a bit more into the logs, it seems that the issue comes from a bug in prometheus-operator, but it is not a regression from 4.8.0-fc.9 since we haven't merged any code in the project between both versions. That said, a week prior to fc.9 we bumped prometheus-operator to v0.48.0, which might have introduced the bug, since we hadn't seen it before.

The issue seems to be that prometheus-operator enters a particular code path which stops the sync loop but keeps the operator running as if nothing were wrong, so the operator can't sync the new Prometheus and Alertmanager instances and create the corresponding StatefulSets. So far I haven't found the exact code path that makes this happen, but it might be related to the following log lines, which are pretty unusual:

2021-06-25T00:00:40.096571055Z level=error ts=2021-06-25T00:00:40.096464439Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:475: Failed to watch *v1.Namespace: unknown (get namespaces)"
2021-06-25T00:00:40.096688152Z level=error ts=2021-06-25T00:00:40.096671045Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.ServiceMonitor: unknown (get servicemonitors.monitoring.coreos.com)"
2021-06-25T00:00:40.096764012Z level=error ts=2021-06-25T00:00:40.096748738Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/thanos/operator.go:334: Failed to watch *v1.Namespace: unknown (get namespaces)"
2021-06-25T00:00:40.096847154Z level=error ts=2021-06-25T00:00:40.096834494Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)"

I will investigate this further, but in the meantime you should be able to mitigate the issue by restarting the prometheus-operator pod in the openshift-monitoring namespace.
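As a sketch of that mitigation (assuming the default deployment name prometheus-operator in openshift-monitoring), the operator can be restarted by bouncing its deployment, which recreates the pod:

# oc -n openshift-monitoring rollout restart deployment prometheus-operator

Deleting the prometheus-operator pod directly has the same effect, since the deployment recreates it.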
A fix has been merged in upstream prometheus-operator and will be backported to 4.8.
tested with 4.9.0-0.nightly-2021-07-06-205913, prometheus-operator version is 0.49.0

# oc -n openshift-monitoring logs prometheus-operator-5565bf44c4-4gqb8 -c prometheus-operator | head
level=info ts=2021-07-07T06:14:54.459669957Z caller=main.go:295 msg="Starting Prometheus Operator" version="(version=0.49.0, branch=rhaos-4.9-rhel-8, revision=9564308)"

but the resource label for prometheus-operator is still 0.48.1, example:

# oc -n openshift-monitoring get deploy prometheus-operator -oyaml
...
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.48.1

# oc -n openshift-monitoring get servicemonitor prometheus-operator -oyaml | head -n 30
...
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.48.1
  name: prometheus-operator
Seems like changes from https://github.com/openshift/cluster-monitoring-operator/pull/1267 didn't get into the last nightly build. Junqi, can you please try with the next build?
checked with 4.9.0-0.nightly-2021-07-07-030430, prometheus-operator version is 0.49.0, resource label for prometheus-operator is also 0.49.0

# oc -n openshift-monitoring logs prometheus-operator-7897867998-qhqst -c prometheus-operator | head -n 2
level=info ts=2021-07-07T09:38:22.97437722Z caller=main.go:295 msg="Starting Prometheus Operator" version="(version=0.49.0, branch=rhaos-4.9-rhel-8, revision=9564308)"
level=info ts=2021-07-07T09:38:22.974624033Z caller=main.go:296 build_context="(go=go1.16.4, user=root, date=20210706-19:50:49)"

# oc -n openshift-monitoring get deploy prometheus-operator -oyaml
...
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.49.0
  name: prometheus-operator
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759