Bug 1978662 - monitoring operator needs to indicate non-durable data
Summary: monitoring operator needs to indicate non-durable data
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Filip Petkovski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-02 12:41 UTC by David Eads
Modified: 2021-10-18 17:38 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
The Cluster Monitoring Operator will now set a message for the Degraded condition when persistent storage is not configured for Prometheus.
Clone Of:
Environment:
Last Closed: 2021-10-18 17:38:00 UTC
Target Upstream Version:




Links
- GitHub: openshift/cluster-monitoring-operator pull 1270 — "Bug 1978662: Set a degraded message when persistent storage is not configured" (open; last updated 2021-07-20 08:16:43 UTC)
- Red Hat Product Errata: RHSA-2021:3759 (last updated 2021-10-18 17:38:19 UTC)

Description David Eads 2021-07-02 12:41:05 UTC
Without persistent storage configured, the monitoring operator uses emptyDir.  While this is better than not running at all, there needs to be an obvious signal to a cluster-admin that their data will disappear on configuration changes, node reboots, and upgrades.

We spoke via Slack and there are two good ways to present this data.  We should pursue both.
1. Keep degraded=false, but add a message indicating that no storage is configured and data loss will occur.  This makes the signal available to layers built on top that consume `oc` and kube-apiserver data.
2. Add an info (warning?) alert indicating that no storage is configured and data loss will occur.  This allows for collection from the field.

Comment 1 Christian Heidenreich 2021-07-02 14:27:37 UTC
I would vote for option 1. I know a lot of customers who simply ignore "info" alerts anyway, which makes me question how useful one would be here.

Comment 2 Simon Pasquier 2021-07-02 15:06:48 UTC
To add to Christian's comment: if the purpose of the alert is to understand how many clusters use persistent storage for Prometheus/Alertmanager, we can create a telemetry metric to record this information.

Comment 3 David Eads 2021-07-08 12:59:05 UTC
> To add to Christian's comment: if the purpose of the alert is to understand how many clusters use persistent storage for Prometheus/Alertmanager, we can create a telemetry metric to record this information.

That is the first goal of that alert.  Depending on how many clusters are in this situation, we can decide what to do next.  Losing historical metrics data is a problem.

Comment 4 Filip Petkovski 2021-07-20 08:19:18 UTC
In the PR linked to this BZ, we set a `PrometheusDataPersistenceNotConfigured` reason for the Degraded condition when there is no metrics storage. All operator conditions are already exported to telemetry, so we will be able to see how many clusters are in this state.
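As an illustrative sketch of the logic described above (this is not the actual cluster-monitoring-operator code; the type and function names here are assumptions for illustration), the key point is that Degraded remains "False" while the reason and message carry the storage warning:

```go
package main

import "fmt"

// DegradedCondition mirrors the shape of a ClusterOperator status
// condition (type/status/reason/message). Illustrative only; the real
// operator uses the OpenShift config/v1 API types.
type DegradedCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// degradedForStorage sketches the condition the operator reports. When
// persistent storage is missing, Degraded stays "False" but carries a
// reason and message so clients and telemetry can observe the state.
func degradedForStorage(persistentStorageConfigured bool) DegradedCondition {
	c := DegradedCondition{Type: "Degraded", Status: "False"}
	if !persistentStorageConfigured {
		c.Reason = "PrometheusDataPersistenceNotConfigured"
		c.Message = "Prometheus is running without persistent storage which can lead to data loss during upgrades and cluster disruptions."
	}
	return c
}

func main() {
	fmt.Println(degradedForStorage(false).Reason)
}
```

Because the status never flips to "True", the cluster is not reported as degraded; only the reason/message pair changes, which is what makes the signal safe to export to telemetry.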

Comment 7 Junqi Zhao 2021-07-21 03:09:07 UTC
Tested with 4.9.0-0.nightly-2021-07-20-221331; no persistent volumes for monitoring:
# oc -n openshift-monitoring get pvc
No resources found in openshift-monitoring namespace.

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded

The doc link points to 4.8; since the completion time for the docs is very near the GA date, using a previous version is fine.
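Per option 1 in the description, a layer built on top can read this signal from the kube-apiserver data. A minimal sketch in Go, assuming JSON output as produced by `oc get co monitoring -o json` (the `storageWarning` helper is hypothetical, not part of any shipped tool):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition models one entry of .status.conditions as returned by
// `oc get co monitoring -o json`.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// storageWarning scans the conditions and returns the Degraded message
// when the no-persistent-storage reason is set, or "" otherwise.
func storageWarning(conditions []condition) string {
	for _, c := range conditions {
		if c.Type == "Degraded" && c.Reason == "PrometheusDataPersistenceNotConfigured" {
			return c.Message
		}
	}
	return ""
}

func main() {
	// Abbreviated sample of the condition shown in the QA output above.
	raw := `[{"type":"Degraded","status":"False",
	  "reason":"PrometheusDataPersistenceNotConfigured",
	  "message":"Prometheus is running without persistent storage"}]`
	var conds []condition
	if err := json.Unmarshal([]byte(raw), &conds); err != nil {
		panic(err)
	}
	fmt.Println(storageWarning(conds))
}
```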

Also tested with bound PVCs for monitoring; no warning message:
# oc get co monitoring -oyaml
...
status:
  conditions:
  - lastTransitionTime: "2021-07-21T01:57:30Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-21T02:59:58Z"
    status: "False"
    type: Degraded

Comment 16 errata-xmlrpc 2021-10-18 17:38:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

