Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1865896

Summary: Monitoring Operator degraded after upgrade to 4.5.3 with user workload tech preview enabled
Product: OpenShift Container Platform
Reporter: Rob Szumski <rszumski>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED WONTFIX
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: unspecified
Version: 4.5
CC: akhaire, alegrand, anowak, anpicker, erooth, jonas.demoor, kakkoyun, lcosic, mbarrett, mloibl, pkrupa, spasquie, surbania
Target Milestone: ---
Keywords: Reopened
Target Release: 4.6.0
Hardware: x86_64
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2020-09-23 07:19:57 UTC
Type: Bug
Regression: ---

Description Rob Szumski 2020-08-04 13:24:39 UTC
Description of problem:
Monitoring Operator reports degraded with the following message:

```
Failed to rollout the stack. Error: running task Updating User Workload Thanos Ruler failed: reconciling ThanosRuler object failed: updating Thanos Ruler object failed: ThanosRuler.monitoring.coreos.com "user-workload" is invalid: spec.queryEndpoints: Required value
```

I am unsure of what the value of this field should be, or why it was not defaulted to the correct value during the upgrade.
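For reference, the validation error refers to a required field on the ThanosRuler custom resource. The sketch below is hypothetical: in a real cluster this object is owned and reconciled by the cluster-monitoring-operator rather than edited by hand, and the exact endpoint value shown is an assumption.

```yaml
# Hypothetical sketch of the field the validation error complains about.
# Do not apply this by hand; the cluster-monitoring-operator manages it.
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: user-workload
  namespace: openshift-user-workload-monitoring
spec:
  # Required by the ThanosRuler CRD: addresses of query endpoints
  # (e.g. Thanos Querier) that rule evaluation queries are sent to.
  queryEndpoints:
    - dnssrv+_web._tcp.thanos-querier.openshift-monitoring.svc
```

The bug is that the operator failed to populate this required field during the upgrade, so the ThanosRuler update was rejected by CRD validation.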

Version-Release number of selected component (if applicable):
Cluster is currently running 4.5.3, but the user workload tech preview was enabled on 4.4.z prior to upgrading to 4.5.

While investigating the Prometheus Operator in the user workload namespace, I do see a lot of this as well:

```
E0804 13:03:03.626790       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:317: Failed to watch *v1.ThanosRuler: unknown (get thanosrulers.monitoring.coreos.com)
level=error ts=2020-08-04T13:05:21.861743798Z caller=operator.go:609 component=prometheusoperator msg="syncing nodes into Endpoints object failed" err="listing nodes failed: nodes is forbidden: User \"system:serviceaccount:openshift-user-workload-monitoring:prometheus-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"
level=error ts=2020-08-04T13:17:21.863918423Z caller=operator.go:609 component=prometheusoperator msg="syncing nodes into Endpoints object failed" err="listing nodes failed: nodes is forbidden: User \"system:serviceaccount:openshift-user-workload-monitoring:prometheus-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"
E0804 13:20:24.562423       1 reflector.go:280] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:319: Failed to watch *v1.PrometheusRule: unknown (get prometheusrules.monitoring.coreos.com)
```

How reproducible:
4.4.z -> 4.5.3

Comment 2 Pawel Krupa 2020-08-11 08:16:47 UTC
I cannot reproduce this when doing 4.4.16 -> 4.5.5 upgrade. Can I have a must-gather from the affected cluster?

Comment 3 jonas.demoor 2020-08-14 16:55:17 UTC
Hello,

We're a customer running OpenShift Container Platform 4.5.5 on vSphere and we're hitting this issue as well on our development cluster with the user workload tech preview.
When I configure a PrometheusRule object for alerting, the cluster monitoring operator becomes degraded with the above error.

This issue didn't appear when we were on 4.4.x. I'd be happy to provide more details, like a must-gather.

Kind regards,
Jonas De Moor

Comment 6 Simon Pasquier 2020-08-24 13:49:08 UTC
(In reply to jonas.demoor from comment #3)
> Hello,
> 
> We're a customer running OpenShift Container Platform 4.5.5 on vSphere and
> we're hitting this issue as well on our development cluster with the user
> workload tech preview.
> When I configure a PrometheusRule object for alerting, the cluster
> monitoring operator becomes degraded with the above error.
> 
> This issue didn't appear when we were on 4.4.x. I'd be happy to provide more
> details, like a must-gather.

@Jonas that would be great if you could!

> 
> Kind regards,
> Jonas De Moor

Comment 7 Simon Pasquier 2020-08-24 13:53:17 UTC
@Rob I've checked the must-gather but I don't see the openshift-user-workload-monitoring namespace being listed there.

Comment 8 Rob Szumski 2020-08-24 14:00:47 UTC
Ah crap, I forgot that I went back and disabled it in order to upgrade.

Comment 9 Simon Pasquier 2020-08-25 09:57:12 UTC
I tried to reproduce but couldn't either...

Comment 10 Sergiusz Urbaniak 2020-08-25 12:00:50 UTC
let's close this out then as worksforme.

Comment 11 Simon Pasquier 2020-08-25 12:08:01 UTC
We can still reopen if/when Jonas De Moor provides a must-gather.

Comment 12 jonas.demoor 2020-08-28 10:20:34 UTC
(In reply to Simon Pasquier from comment #11)
> We can still reopen if/when Jonas De Moor provides a must-gather.

The must-gather is ready. Can I safely attach it to this report? I noticed there are quite a few secrets included in the must-gather.

Comment 13 Simon Pasquier 2020-08-28 12:20:03 UTC
@Jonas try adding it as a private attachment, I might be able to download it.

Comment 14 jonas.demoor 2020-08-31 08:32:49 UTC
(In reply to Simon Pasquier from comment #13)
> @Jonas try adding it as a private attachment, I might be able to download it.

When I add the attachment, there doesn't seem to be an option to add it as private. Or is this done with the flags drop-down menu (options are: +, -, ?) ?

Comment 15 Simon Pasquier 2020-08-31 13:30:59 UTC
(In reply to jonas.demoor from comment #14)
> When I add the attachment, there doesn't seem to be an option to add it as
> private. Or is this done with the flags drop-down menu (options are: +, -,
> ?) ?

Ah private comments/attachments work only for Red Hat employees. If you have access to Red Hat support, you can create a case there. Otherwise copy the must-gather file wherever you want and send me the information to my RH email.

Comment 18 Rob Szumski 2020-09-17 12:40:06 UTC
The workaround for this bug is to unset the user workload setting, which will cause the Operator to go healthy. You are then free to upgrade as needed.
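The workaround above amounts to removing the tech-preview flag from the cluster monitoring ConfigMap. A minimal sketch, assuming the 4.5-era key name (the key name is an assumption; check the documentation for your exact version):

```yaml
# Hypothetical sketch of the workaround: edit the ConfigMap with
#   oc -n openshift-monitoring edit configmap cluster-monitoring-config
# and remove or disable the tech-preview user workload flag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    techPreviewUserWorkload:
      enabled: false
```

Once the flag is unset, the cluster monitoring operator stops reconciling the user-workload Thanos Ruler stack and should report healthy again, after which the upgrade can proceed.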

Comment 26 Simon Pasquier 2020-09-23 07:19:57 UTC
Closing as WONTFIX. We've documented in 4.5 that enabling user workload monitoring isn't compatible with a custom Prometheus Operator installation or with installing the Prometheus Operator from OLM:
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.5/html/monitoring/monitoring-your-own-services