Bug 1977435 - SNO - monitoring operator is not available cause failed: waiting for Alertmanager openshift-monitoring/main
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Damien Grisonnet
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1979575
 
Reported: 2021-06-29 18:01 UTC by Igal Tsoiref
Modified: 2021-12-16 20:23 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1979575
Environment:
Last Closed: 2021-10-18 17:37:18 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather logs of openshift-monitor (3.35 MB, application/gzip)
2021-06-29 18:01 UTC, Igal Tsoiref
no flags


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1267 0 None open Bug 1977435: jsonnet: bump prometheus-operator to v0.49.0 2021-07-06 16:20:29 UTC
Github openshift prometheus-operator pull 131 0 None open Bug 1977435: Bump prometheus-operator to v0.49.0 2021-07-06 16:15:11 UTC
Github prometheus-operator prometheus-operator pull 4143 0 None closed Add timeout to the informers cache synchronization 2021-07-06 13:02:52 UTC
Red Hat Knowledge Base (Solution) 6598471 0 None None None 2021-12-16 20:23:01 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:37:38 UTC

Description Igal Tsoiref 2021-06-29 18:01:15 UTC
Created attachment 1795925 [details]
must-gather logs of openshift-monitor

Description of problem:
On single-node installations we hit a case where two clusters failed to finish installation because the monitoring operator was in a Degraded state.

The error we saw is:
Failed to rollout the stack. Error: running task Updating Alertmanager
      failed: waiting for Alertmanager object changes failed: waiting for Alertmanager
      openshift-monitoring/main: retrieving Alertmanager status failed: retrieving
      stateful set failed: statefulsets.apps "alertmanager-main" not found


There is no alertmanager-main pod running and it never tried to start.
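
For anyone triaging a similar report, generic commands along these lines (not taken from the original report) show the symptom: the ClusterOperator is degraded and the Alertmanager custom resource exists, but no statefulset or pod was ever created for it:

# oc get clusteroperator monitoring
# oc -n openshift-monitoring get alertmanager main
# oc -n openshift-monitoring get statefulset,pods | grep alertmanager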


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Monitoring operator is degraded

Expected results:
Monitoring operator is available

Additional info:

If needed I will send the full must-gather logs; the file is too large and I can't attach it. Attaching the openshift-monitor namespace from the must-gather.

Comment 1 Rom Freiman 2021-06-29 18:41:28 UTC
We encountered this issue in rc.0 during scale testing of SNO deployments.

It's new behavior that we didn't encounter in fc.9 or earlier builds.

Comment 2 Damien Grisonnet 2021-06-30 15:36:53 UTC
I tried to reproduce the issue on a fresh 4.8.0-rc.0 SNO cluster and both Prometheus and Alertmanager were correctly created, so the problem doesn't seem to happen every time.

So far, from the monitoring logs alone, I haven't been able to find the root cause: all of the errors seem unrelated to the problem we are seeing, namely that neither Prometheus nor Alertmanager is being created.

So would you mind sharing the full must-gather? We might be able to get more information from it.

Comment 3 Igal Tsoiref 2021-06-30 17:38:55 UTC
Logs were shared via Slack.

Comment 6 Damien Grisonnet 2021-07-01 10:09:50 UTC
Thank you for sharing the must-gather Igal.

After looking a bit more into the logs, it seems that the issue comes from a bug in prometheus-operator, but it is not a regression from 4.8.0-fc.9 since we haven't merged any code in the project between the two versions. That said, a week prior to fc.9, we bumped prometheus-operator to v0.48.0, which might have introduced the bug, since we hadn't seen it before.

The issue seems to be that prometheus-operator enters a particular code path that stops the sync loop but keeps the operator running as if nothing were wrong, so the operator can't sync the new Prometheus and Alertmanager instances and create the corresponding statefulsets.
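
A rough way to confirm this state (my own suggestion, not from the report) is to look at the operator's recent log output; an operator whose sync loop has stopped goes quiet apart from errors like the ones quoted below:

# oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator --since=15m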

So far I haven't found the exact code path that makes this happen, but it might be related to the following log lines, which are pretty unusual:

2021-06-25T00:00:40.096571055Z level=error ts=2021-06-25T00:00:40.096464439Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/alertmanager/operator.go:475: Failed to watch *v1.Namespace: unknown (get namespaces)"
2021-06-25T00:00:40.096688152Z level=error ts=2021-06-25T00:00:40.096671045Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.ServiceMonitor: unknown (get servicemonitors.monitoring.coreos.com)"
2021-06-25T00:00:40.096764012Z level=error ts=2021-06-25T00:00:40.096748738Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/thanos/operator.go:334: Failed to watch *v1.Namespace: unknown (get namespaces)"
2021-06-25T00:00:40.096847154Z level=error ts=2021-06-25T00:00:40.096834494Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to watch *v1.Prometheus: unknown (get prometheuses.monitoring.coreos.com)"
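
Watch failures like "unknown (get namespaces)" can also stem from transient RBAC or API-availability hiccups; a quick sanity check (my suggestion, assuming the operator runs under the prometheus-operator service account) is:

# oc auth can-i watch namespaces --as=system:serviceaccount:openshift-monitoring:prometheus-operator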

I will further investigate this, but in the meantime, you should be able to mitigate this issue by restarting the prometheus-operator pod in the openshift-monitoring namespace.
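
A minimal sketch of that mitigation, assuming the app.kubernetes.io/name label shown on the deployment later in this bug (the deployment immediately recreates the pod):

# oc -n openshift-monitoring delete pod -l app.kubernetes.io/name=prometheus-operator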

Comment 7 Damien Grisonnet 2021-07-06 13:02:55 UTC
A fix has been merged in upstream prometheus-operator and will be backported to 4.8.

Comment 9 Junqi Zhao 2021-07-07 07:06:25 UTC
tested with 4.9.0-0.nightly-2021-07-06-205913, prometheus-operator version is 0.49.0
# oc -n openshift-monitoring logs prometheus-operator-5565bf44c4-4gqb8 -c prometheus-operator | head
level=info ts=2021-07-07T06:14:54.459669957Z caller=main.go:295 msg="Starting Prometheus Operator" version="(version=0.49.0, branch=rhaos-4.9-rhel-8, revision=9564308)"

but the resource label for prometheus-operator is still 0.48.1, for example:
# oc -n openshift-monitoring get deploy prometheus-operator -oyaml
...
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.48.1

# oc -n openshift-monitoring get servicemonitor prometheus-operator -oyaml | head -n 30
...
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.48.1
  name: prometheus-operator

Comment 10 Damien Grisonnet 2021-07-07 08:24:43 UTC
Seems like changes from https://github.com/openshift/cluster-monitoring-operator/pull/1267 didn't get into the last nightly build. Junqi, can you please try with the next build?
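
As a sketch, one way to check whether a given PR's commit made it into a nightly payload (the release image pullspec here is illustrative):

# oc adm release info registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-07-07-030430 --commits | grep cluster-monitoring-operator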

Comment 11 Junqi Zhao 2021-07-07 10:22:01 UTC
checked with 4.9.0-0.nightly-2021-07-07-030430: the prometheus-operator version is 0.49.0, and the resource label for prometheus-operator is now also 0.49.0
# oc -n openshift-monitoring logs prometheus-operator-7897867998-qhqst -c prometheus-operator | head -n 2
level=info ts=2021-07-07T09:38:22.97437722Z caller=main.go:295 msg="Starting Prometheus Operator" version="(version=0.49.0, branch=rhaos-4.9-rhel-8, revision=9564308)"
level=info ts=2021-07-07T09:38:22.974624033Z caller=main.go:296 build_context="(go=go1.16.4, user=root, date=20210706-19:50:49)"

# oc -n openshift-monitoring get deploy prometheus-operator -oyaml
...
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.49.0
  name: prometheus-operator

Comment 22 errata-xmlrpc 2021-10-18 17:37:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

