Bug 1978662

Summary:	monitoring operator needs to indicate non-durable data
Product:	OpenShift Container Platform	Reporter:	David Eads <deads>
Component:	Monitoring	Assignee:	Filip Petkovski <fpetkovs>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.9	CC:	amuller, anpicker, aos-bugs, erooth, spasquie
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:	The Cluster Monitoring Operator will now set a message for the Degraded condition when persistent storage is not configured for Prometheus.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-10-18 17:38:00 UTC	Type:	Bug

Description David Eads 2021-07-02 12:41:05 UTC

Without persistent storage configured, the monitoring operator uses emptyDir.  While this is better than not running at all, there needs to be an obvious signal to a cluster-admin that their metrics data will disappear on configuration changes, node reboots, and upgrades.

We spoke via Slack and there are two good ways to present this data.  We should pursue both.
1. Keep Degraded=False, but add a message indicating that no storage is configured and data loss will occur.  This helps anyone writing layers on top that consume `oc` output and kube-apiserver data.
2. Add an info (warning?) alert indicating that no storage is configured and data loss will occur.  This allows for collection from the field.
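
A minimal sketch of how a layer on top could consume option 1 with `oc`. It assumes the operator reports the reason `PrometheusDataPersistenceNotConfigured` on the Degraded condition, as shown in the verification output later in this bug. The function name `check_monitoring_storage` and the output strings are hypothetical.

```shell
# Sketch: report whether the monitoring ClusterOperator flags missing Prometheus storage.
# Assumes `oc` is on PATH and logged in to the cluster; the function name is hypothetical.
check_monitoring_storage() {
  # Read the reason attached to the Degraded condition (empty when none is set).
  reason=$(oc get clusteroperator monitoring \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].reason}') || return 2

  if [ "$reason" = "PrometheusDataPersistenceNotConfigured" ]; then
    echo "WARNING: Prometheus has no persistent storage; metrics data can be lost"
    return 1
  fi
  echo "OK: no storage warning on the monitoring operator"
  return 0
}
```

Calling `check_monitoring_storage` returns 1 when the storage warning is present, 0 when it is absent, and 2 when the `oc` call itself fails, so a script can branch on the exit code without parsing the message text.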

Comment 1 Christian Heidenreich 2021-07-02 14:27:37 UTC

I would vote for option 1. I know a lot of customers who simply ignore "info" alerts anyway, so I would question the usefulness of such an alert.

Comment 2 Simon Pasquier 2021-07-02 15:06:48 UTC

To add to Christian's comment, if the alert is to understand how many clusters use persistent storage for prometheus/alertmanager, we can create a telemetry metric to record this information.

Comment 3 David Eads 2021-07-08 12:59:05 UTC

> To add to Christian's comment, if the alert is to understand how many clusters use persistent storage for prometheus/alertmanager, we can create a telemetry metric to record this information.

That is the first goal of that alert.  Depending on how many are in this situation, we can decide what to do next.  Losing historical metrics data is a problem.

Comment 4 Filip Petkovski 2021-07-20 08:19:18 UTC

In the PR linked with this BZ we set a `PrometheusDataPersistenceNotConfigured` reason for the Degraded condition when there is no metrics storage. All operator conditions are already exported to telemetry, so we will be able to see how many clusters are in this state.

Comment 7 Junqi Zhao 2021-07-21 03:09:07 UTC

Tested with 4.9.0-0.nightly-2021-07-20-221331, with no persistent volumes for monitoring:
# oc -n openshift-monitoring get pvc
No resources found in openshift-monitoring namespace.

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded

The message links to the 4.8 docs. Since the docs for the new release are completed very close to the GA date, linking to the previous version is fine.

Also tested with PVCs bound for monitoring; no warning message is shown:
# oc get co monitoring -oyaml
...
status:
  conditions:
  - lastTransitionTime: "2021-07-21T01:57:30Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-21T02:59:58Z"
    status: "False"
    type: Degraded
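
For reference, a hedged sketch of one way to reach the PVC-backed state, by setting a `volumeClaimTemplate` for `prometheusK8s` in the `cluster-monitoring-config` ConfigMap. The function name, the default storage class `standard`, and the default size `40Gi` are placeholders and were not taken from this bug. Note that `oc apply` with this manifest replaces any existing `config.yaml` in that ConfigMap.

```shell
# Sketch: give Prometheus a persistent volume through the cluster-monitoring-config ConfigMap.
# The function name, storage class, and size are placeholders, not values from this bug.
configure_prometheus_storage() {
  storage_class=${1:-standard}   # assumed storage class name
  size=${2:-40Gi}                # assumed volume size
  # Feed the manifest to `oc apply` on stdin; the shell expands the two variables.
  oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: ${storage_class}
          resources:
            requests:
              storage: ${size}
EOF
}
```

For example, `configure_prometheus_storage gp2 100Gi` would request a 100Gi volume from a storage class named `gp2`, assuming such a class exists on the cluster. Afterwards, `oc -n openshift-monitoring get pvc` should list the claims.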

Comment 16 errata-xmlrpc 2021-10-18 17:38:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759