Bug 1834440

Summary: Update and improve description of OCS storage utilization alerts
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Martin Bukatovic <mbukatov>
Component: documentation Assignee: Anjana Suparna Sriram <asriram>
Status: NEW --- QA Contact: Elad <ebenahar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.3   
Target Milestone: ---   
Target Release: OCS 4.3.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1809248    
Bug Blocks:    

Description Martin Bukatovic 2020-05-11 17:39:12 UTC
Document URL
============

Red Hat OpenShift Container Storage 4.3
Troubleshooting OpenShift Container Storage

Section Number and Name
=======================

Chapter 5. Troubleshooting alerts and errors in OpenShift Container Storage
5.1. Resolving alerts and errors

Describe the issue
==================

In the list of OCS alerts, I see entries for CephClusterCriticallyFull and
CephClusterNearFull, but their descriptions are insufficient and lack a clear
and precise meaning. What will happen when no action is taken is not discussed.

Suggestions for improvement
===========================

For all storage utilization alerts (such as CephClusterNearFull and
CephClusterCriticallyFull), we should provide the following details in a clear
way:

- Exact definition of the alert, and how to understand it with respect to
  cluster state. What is it based on? How does it relate to cluster vs usable
  storage? Does it mean I will be able to write 25% more data until hitting an
  out of space issue when the alert states that utilization crossed 75%?
- What is going to happen when the alert is not acted upon (include the worst
  case scenario).
- Impact on OCP Prometheus monitoring when its storage is backed by OCS.

We should also make sure that all storage utilization alerts are listed.
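
To illustrate why the 75% question in the first suggestion needs an explicit
answer, here is a minimal sketch of the arithmetic. It is not a statement of
how OCS behaves. The function name, the device sizes, the 3-way replication
and the 0.85 ratio at which writes are assumed to stop are all placeholders
of mine; only the 75% figure comes from this report.

```python
def writable_before_full(raw_total_gib, used_fraction, full_ratio, replicas):
    """Estimate usable GiB that can still be written before writes stop.

    All inputs are hypothetical; none are confirmed OCS values.
    """
    # Raw headroom is the gap between current utilization and the assumed
    # ratio at which the cluster stops accepting writes.
    raw_headroom_gib = max(0.0, full_ratio - used_fraction) * raw_total_gib
    # With replication, every usable GiB consumes `replicas` raw GiB.
    return raw_headroom_gib / replicas


# Placeholder numbers chosen only to illustrate the question.
raw_total_gib = 3 * 512      # three 512 GiB devices (assumption)
used_fraction = 0.75         # utilization quoted in this report
full_ratio = 0.85            # assumed point where writes stop (not confirmed)
replicas = 3                 # assumed 3-way replication

naive_gib = (1.0 - used_fraction) * raw_total_gib
estimate_gib = writable_before_full(
    raw_total_gib, used_fraction, full_ratio, replicas
)
print(f"naive reading of 'crossed 75%': {naive_gib:.1f} GiB left")
print(f"estimate under the assumptions: {estimate_gib:.1f} GiB left")
```

With these placeholder values, the naive reading gives 384.0 GiB while the
estimate gives 51.2 GiB. The size of that gap is why the docs should state
what the percentage is measured against.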

Additional information
======================

Exact content depends on the engineering resolution for BZ 1809248. Please
reach out to the dev team when BZ 1809248 is resolved so that doc changes can
be drafted.

The action items for the admin to follow, as listed in the Procedure section,
also need to be revisited if changes in the eng. BZs make it necessary.

Other related eng. bugs include BZ 1818736 and BZ 1775432.

Comment 2 Martin Bukatovic 2020-05-11 17:42:52 UTC
Marking BZ 1818736 as a blocker for this doc bug, as discussed in the Additional information section above.

Comment 3 Martin Bukatovic 2020-05-11 17:44:20 UTC
Fixing copy-paste typo in a blocker bug.