Bug 1870083 - CephClusterNearFull and CephClusterCriticallyFull alerts are at the cluster level, which can lead to not informing the user in time that it is running out of space
Summary: CephClusterNearFull and CephClusterCriticallyFull alerts are on cluster leve...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Assignee: Anmol Sachan
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-19 10:30 UTC by Filip Balák
Modified: 2023-09-18 00:22 UTC (History)
10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-11 08:04:36 UTC
Embargoed:



Description Filip Balák 2020-08-19 10:30:56 UTC
Description of problem:
There should be alerts for Ceph pool utilization.

There are currently alerts for cluster utilization (CephClusterNearFull and CephClusterCriticallyFull), but in the extreme situation where a user utilizes only one pool, the cluster-level alerts may not inform the user in time.
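A pool-level alert could, for example, be expressed as a Prometheus rule over the `ceph_pool_stored` and `ceph_pool_max_avail` metrics exposed by the ceph-mgr prometheus module. The rule below is only a sketch: the alert name, threshold, and labels are illustrative, not the shipped OCS rules.

```yaml
groups:
  - name: ceph-pool-utilization      # hypothetical rule group name
    rules:
      - alert: CephPoolNearFull      # illustrative, not an existing OCS alert
        expr: |
          (ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail)) > 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "Ceph pool {{ $labels.pool_id }} is over 75% utilized."
```

Note that `ceph_pool_max_avail` already accounts for the pool's replication factor, which is part of why a per-pool view differs from the raw-capacity cluster alerts discussed below in the thread.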

Comment 3 Nishanth Thomas 2020-08-20 08:23:02 UTC
Not a blocker for 4.6; this is an RFE.
Also, pool management is coming as part of 4.7, hence moving this out to 4.7.

Comment 4 Martin Bukatovic 2020-08-20 08:40:21 UTC
> Also pool management coming in as part of 4.7

Could you provide a reference to a BZ or JIRA concerned with pool management for OCS 4.7?

Comment 6 Anmol Sachan 2020-08-23 16:30:59 UTC
Note: Even if we have pool management, pool-level alerts can only be created when there is functionality for setting quotas on Ceph pools through OCS; otherwise the size of a pool remains dynamic according to storage consumption.

Comment 7 Anmol Sachan 2020-11-19 06:33:36 UTC
@Eran Is there a requirement from the customer side for pool-level alerts? Also, IMO this should be supplemented with the UI visualization through which we want to surface the alerts.

Comment 13 Eran Tamir 2020-12-01 15:22:46 UTC
I'm not sure why a pool-level alert is more precise, as the pool's free space is the shared free space of the entire cluster.
@Elad, pool management moved to OCS 4.8.

Comment 14 Nishanth Thomas 2020-12-01 17:42:29 UTC
@etamir, it's not clear to me what fix you are looking for here. Can you elaborate?

Comment 17 Martin Bukatovic 2020-12-07 09:41:05 UTC
All usable storage space information from the user's perspective (how much data the Ceph cluster is still able to receive and store from Ceph clients) needs to take the Ceph pool configuration into account.

Comment 29 Eran Tamir 2021-01-03 09:56:52 UTC
@Martin - If I understand you correctly, you are saying that only on a pool level, we will be able to show free namespace without the overhead. Is that the current motivation? 

If so, this value can be calculated and used. @Anat, please keep me honest here.

Comment 30 Orit Wasserman 2021-01-04 10:12:11 UTC
(In reply to Eran Tamir from comment #29)
> @Martin - If I understand you correctly, you are saying that only on a pool
> level, we will be able to show free namespace without the overhead. Is that
> the current motivation? 

If the pools share the OSDs, then per-pool usable free space is confusing and misleading to the user.
Let's say we have two pools, replica 2 and replica 3, and 300G of available raw capacity.
Pool 1 then has 150G of free space and Pool 2 has 100G; I think this will be very confusing for the user.
It will be even more confusing if you add compression.
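The arithmetic in the example above can be checked directly. A minimal sketch, assuming plain replicated pools drawing on the same shared raw capacity (compression and metadata overhead ignored):

```python
def pool_usable_free(raw_free_g, replication_factor):
    """Usable free space a replicated pool would report: the shared raw
    free space divided by the pool's replication factor."""
    return raw_free_g / replication_factor

RAW_FREE_G = 300  # shared raw capacity from the example

# Both numbers describe the SAME 300G of raw space, which is why
# showing them side by side as independent "free space" is misleading.
print(pool_usable_free(RAW_FREE_G, 2))  # replica 2 pool -> 150.0
print(pool_usable_free(RAW_FREE_G, 3))  # replica 3 pool -> 100.0
```

Filling either pool consumes raw capacity that shrinks the other pool's reported free space as well, so the two figures cannot be added or tracked independently.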

As for Ceph's near-full and full alerts, those are calculated on the raw capacity and don't take into account the pool replication factor.
They are for the cluster admin, to inform them when to expand the cluster or free up space.
The values that were chosen allow them enough time to handle the situation. This is especially true with small clusters, as we don't have much spare capacity and the cluster can fill up quickly.
The reason for moving to read-only in case of a full cluster is that deletion of data requires additional space for the metadata, and we don't want to get into a situation where it is impossible to delete data.
This can happen at an OSD level as well; for that we have the OSD full alert.
This is more likely in small clusters and/or with small-capacity OSDs.
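The cluster-level check described here can be sketched as a simple ratio of raw bytes used to total raw bytes, compared against the near-full and full thresholds. The 0.85/0.95 values below match Ceph's usual `mon_osd_nearfull_ratio` and `mon_osd_full_ratio` defaults, but treat the exact values as an assumption; they are configurable per cluster.

```python
NEARFULL_RATIO = 0.85  # assumed Ceph default mon_osd_nearfull_ratio
FULL_RATIO = 0.95      # assumed Ceph default mon_osd_full_ratio

def cluster_fullness_state(raw_used_bytes, raw_total_bytes):
    """Return the alert state implied by raw (pre-replication) utilization.

    Replication factor is deliberately ignored: this mirrors how the
    cluster-level near-full/full checks work on raw capacity only."""
    ratio = raw_used_bytes / raw_total_bytes
    if ratio >= FULL_RATIO:
        return "full"      # cluster stops accepting writes
    if ratio >= NEARFULL_RATIO:
        return "nearfull"  # time to expand the cluster or free space
    return "ok"

print(cluster_fullness_state(200, 300))  # ~67% raw used -> "ok"
```

Because the ratio is computed over raw capacity, a single heavily-used pool can be nearly out of usable space while the cluster-level state still reads "ok", which is the gap this bug describes.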

There is work in Ceph Pacific to calculate these thresholds automatically, to support smaller clusters better.
> 
> If so, this value can be calculated and used. @Anat, please keep me honest
> here.

Comment 31 Anat Eyal 2021-01-07 08:47:54 UTC
(In reply to Eran Tamir from comment #29)
> @Martin - If I understand you correctly, you are saying that only on a pool
> level, we will be able to show free namespace without the overhead. Is that
> the current motivation? 
> 
> If so, this value can be calculated and used. @Anat, please keep me honest
> here.

The question has been addressed by Orit in comment #30. Clearing the needinfo request.

Comment 40 Michael Adam 2021-06-10 14:28:36 UTC
clearing stale needinfo.

Comment 42 Red Hat Bugzilla 2023-09-18 00:22:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

