Without persistent storage configured, the monitoring operator uses emptyDir. While this is better than not running at all, there needs to be an obvious signal to a cluster-admin that their data will disappear on configuration changes, node reboots, and upgrades. We spoke via Slack and there are two good ways to present this information; we should pursue both:

1. Keep Degraded=False, but add a message indicating that no storage is configured and data loss will occur. This helps build layers on top that can consume the data via `oc` and the kube-apiserver.
2. Add an info (warning?) alert indicating that no storage is configured and data loss will occur. This allows for collection from the field (see the sketch after this list).
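For option 2, a rough sketch of what such a rule could look like. The rule name, alert name, and expression are illustrative only (not what the operator ships) and assume kube-state-metrics' `kube_persistentvolumeclaim_info` is available:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-storage-info        # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: storage.rules
    rules:
    # Fires when no PVC exists in openshift-monitoring at all,
    # i.e. Prometheus is running on emptyDir.
    - alert: PrometheusPersistentStorageNotConfigured   # hypothetical name
      expr: absent(kube_persistentvolumeclaim_info{namespace="openshift-monitoring"})
      for: 1h
      labels:
        severity: info
      annotations:
        message: >-
          Prometheus in openshift-monitoring is running without persistent
          storage; metrics data will be lost on configuration changes,
          node reboots, and upgrades.
```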
I would vote for option 1. I know a lot of customers who simply ignore "info" alerts anyway, and I would question the usefulness of such an alert because of that.
To add to Christian's comment, if the alert is meant to gauge how many clusters use persistent storage for Prometheus/Alertmanager, we can create a telemetry metric to record this information.
> To add to Christian's comment, if the alert is meant to gauge how many clusters use persistent storage for Prometheus/Alertmanager, we can create a telemetry metric to record this information.

That is the first goal of that alert. Depending on how many clusters are in this situation, we can decide what to do next. Losing historical metrics data is a problem.
In the PR linked to this BZ we set a `PrometheusDataPersistenceNotConfigured` reason on the Degraded condition when no metrics storage is configured. All operator conditions are already exported to telemetry, so we will be able to see how many clusters are in this state.
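For example, assuming the CVO's `cluster_operator_conditions` metric keeps its `reason` label through the telemetry pipeline, a query along the lines of `count(cluster_operator_conditions{name="monitoring", condition="Degraded", reason="PrometheusDataPersistenceNotConfigured"})` should show how many reporting clusters run without persistent storage. This is a sketch of the query shape, not a verified Telemeter expression.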
Tested with 4.9.0-0.nightly-2021-07-20-221331, no persistent volumes for monitoring:

```
# oc -n openshift-monitoring get pvc
No resources found in openshift-monitoring namespace.

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official
      documentation to see how to configure storage for Prometheus:
      https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
```

The doc link points to 4.8; since the completion time for the docs is very close to the GA date, using the previous version is fine.

Also tested with bound PVCs for monitoring; no warning message appears:

```
# oc get co monitoring -oyaml
...
status:
  conditions:
  - lastTransitionTime: "2021-07-21T01:57:30Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2021-07-21T02:06:17Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-21T02:59:58Z"
    status: "False"
    type: Degraded
```
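For reference, persistent storage is configured through the `cluster-monitoring-config` ConfigMap, as described in the linked documentation. A minimal sketch; the storage class name and size below are placeholders, not recommendations:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2   # placeholder: use a storage class available in the cluster
          resources:
            requests:
              storage: 40Gi       # placeholder size
```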
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759