Bug 1848798 - Cluster Full/NearFull status not accurately reported
Summary: Cluster Full/NearFull status not accurately reported
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-19 01:56 UTC by Jean-Charles Lopez
Modified: 2021-01-11 18:44 UTC
CC List: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-11 18:44:46 UTC
Embargoed:


Attachments
Output of Ceph commands (19.04 KB, application/rtf)
2020-06-19 01:56 UTC, Jean-Charles Lopez
ceph cluster nearfull UI alert (304.09 KB, image/png)
2020-11-24 06:50 UTC, krishnaram Karthick
must_gather for comment#14 (2.32 MB, application/x-xz)
2020-11-24 10:16 UTC, krishnaram Karthick
Near full Ceph condition matches OCP UI alerts (437.55 KB, image/png)
2020-12-22 22:40 UTC, Jean-Charles Lopez
Backfill too full Ceph condition matches OCP UI alerts (389.94 KB, image/png)
2020-12-22 23:08 UTC, Jean-Charles Lopez

Description Jean-Charles Lopez 2020-06-19 01:56:10 UTC
Created attachment 1698028 [details]
Output of Ceph commands

Description of problem (please be as detailed as possible and provide log
snippets):
We filled up a test cluster using a set of containers looping to write to an RBD-based PVC and a CephFS-based PVC. As the cluster filled up, nearfull OSDs were never reported (threshold set to 75%), and when many OSDs reached the full threshold, only one of them was reported.

Version of all relevant components (if applicable):
Client Version: 4.4.7
Server Version: 4.4.3
Kubernetes Version: v1.17.1

oc get csv
NAME                            DISPLAY                       VERSION   REPLACES   PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                Succeeded
ocs-operator.v4.4.0             OpenShift Container Storage   4.4.0                Succeeded


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Can this issue reproducible?
Have not tried

Can this issue reproduce from the UI?
Have not tried

If this is a regression, please provide more details to justify this:
Remember that in previous Ceph versions nearfull OSDs would be reported, and for both full and nearfull the list of OSDs crossing the thresholds would be listed.

Steps to Reproduce:
1. Create a loop pod writing to a PVC sized as big as the cluster's usable storage (see the sketch below these steps)
2. Wait for the PVC to fill up the cluster
3.
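
For reference, a minimal sketch of step 1 (object names, the PVC size, and the storage class are assumptions, not taken from the original test; the original run also wrote to a CephFS-based PVC in the same way):

cat <<'EOF' | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fill-test-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ocs-storagecluster-ceph-rbd   # assumed default OCS RBD storage class
  resources:
    requests:
      storage: 2Ti                                # placeholder; size to roughly the cluster's usable capacity
---
apiVersion: v1
kind: Pod
metadata:
  name: fill-test-writer
spec:
  restartPolicy: Never
  containers:
  - name: writer
    image: registry.access.redhat.com/ubi8/ubi
    command: ["/bin/bash", "-c"]
    # Keep appending 1 GiB files until the volume (and eventually the cluster) fills up
    args:
    - 'i=0; while dd if=/dev/zero of=/data/file-$i bs=1M count=1024; do i=$((i+1)); done; sleep infinity'
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fill-test-pvc
EOF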


Actual results:


Expected results:


Additional info:
I collected a must-gather for OCP and for OCS if you need them, but they are too big to be attached to the BZ.
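
For reference, the must-gathers were collected along these lines (the OCS must-gather image and tag are assumptions, not taken from this report):

# OCP must-gather
oc adm must-gather

# OCS must-gather (image/tag assumed)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.4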

Comment 6 Josh Durgin 2020-06-25 21:51:10 UTC
Moving to 4.6 since this is not a blocker for 4.5.

Comment 9 Mudit Agarwal 2020-09-30 15:51:17 UTC
Moving it to 4.7 as we don't have sufficient data at the moment.

Comment 11 Raz Tamir 2020-11-22 06:41:24 UTC
Hi Neha,

It is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1896959#c6, and the reproduction succeeded.

Comment 12 Raz Tamir 2020-11-23 14:40:27 UTC
Karthick,

As part of the reproduction step of https://bugzilla.redhat.com/show_bug.cgi?id=1896959#c6 , did you notice the warning of the 75% threshold?

Comment 13 krishnaram Karthick 2020-11-24 03:02:28 UTC
(In reply to Raz Tamir from comment #12)
> Karthick,
> 
> As part of the reproduction step of
> https://bugzilla.redhat.com/show_bug.cgi?id=1896959#c6 , did you notice the
> warning of the 75% threshold?

I did not really look for the warning at the 75% threshold. I'll try once again and update the bug.

Comment 15 krishnaram Karthick 2020-11-24 06:50:17 UTC
Created attachment 1732850 [details]
ceph cluster nearfull UI alert

Comment 17 krishnaram Karthick 2020-11-24 10:16:00 UTC
Created attachment 1732929 [details]
must_gather for comment#14

Comment 18 Raz Tamir 2020-11-24 14:35:46 UTC
Hi JC,

Based on QE's tests, the alert fires at the 75% threshold.
Could you please elaborate on the scenario that caused this issue?
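
For reference, the threshold the alert fires at can be inspected in the Prometheus rules shipped with OCS; a minimal sketch, assuming the rules live in the openshift-storage namespace:

# List the PrometheusRule objects and look at the near-full alert expressions
oc -n openshift-storage get prometheusrules
oc -n openshift-storage get prometheusrules -o yaml | grep -i -B2 -A6 nearfull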

Comment 19 Jean-Charles Lopez 2020-11-24 22:51:07 UTC
In previous versions I tested, the underlying RHCS cluster would show a backfill too full warning rather than an OSD nearfull error, so I think this is more of a Ceph problem than an OCS problem.

I'll try to reproduce with 4.6.0 RC3 to confirm whether the problem is still there.

Comment 20 Jean-Charles Lopez 2020-12-22 18:42:31 UTC
Retesting today to verify. Will update later.

Comment 21 Jean-Charles Lopez 2020-12-22 22:39:08 UTC
So changing the backfill_too_full_ratio to the default of 0.80 has indeed solved the first problem.

Now the Ceph cluster is correctly reporting only nearfull when we cross the default osd_nearfull_ratio of 0.75.
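
For anyone retesting, the ratios in effect can be checked (and adjusted) from the rook-ceph-tools pod; a minimal sketch, assuming the toolbox is deployed in the openshift-storage namespace:

TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -1)

# Show the nearfull/backfillfull/full ratios stored in the OSDMap
oc -n openshift-storage rsh $TOOLS ceph osd dump | grep ratio

# Set them explicitly if needed (values used here: 0.75 / 0.80)
oc -n openshift-storage rsh $TOOLS ceph osd set-nearfull-ratio 0.75
oc -n openshift-storage rsh $TOOLS ceph osd set-backfillfull-ratio 0.80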

Capture below
Tue Dec 22 14:36:03 PST 2020
HEALTH_WARN 3 nearfull osd(s); 3 pool(s) nearfull
OSD_NEARFULL 3 nearfull osd(s)
    osd.0 is near full
    osd.1 is near full
    osd.2 is near full
POOL_NEARFULL 3 pool(s) nearfull
    pool 'ocs-storagecluster-cephblockpool' is nearfull
    pool 'ocs-storagecluster-cephfilesystem-metadata' is nearfull
    pool 'ocs-storagecluster-cephfilesystem-data0' is nearfull
5.20TiB-1.20TiB-3.90TiB
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   ssd 1.72800  1.00000 1.7 TiB 1.3 TiB 1.3 TiB 6.9 MiB 2.8 GiB 424 GiB 76.01 1.00  96     up
 2   ssd 1.72800  1.00000 1.7 TiB 1.3 TiB 1.3 TiB 6.9 MiB 2.7 GiB 424 GiB 76.01 1.00  96     up
 1   ssd 1.72800  1.00000 1.7 TiB 1.3 TiB 1.3 TiB 6.9 MiB 2.8 GiB 424 GiB 76.01 1.00  96     up
                    TOTAL 5.2 TiB 3.9 TiB 3.9 TiB  21 MiB 8.2 GiB 1.2 TiB 76.01
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

The OCP UI does report a nearfull condition as expected. See attached nearfull-test-1-75pct-reached.png

Comment 22 Jean-Charles Lopez 2020-12-22 22:40:47 UTC
Created attachment 1741459 [details]
Near full Ceph condition matches OCP UI alerts

Comment 23 Jean-Charles Lopez 2020-12-22 23:08:06 UTC
Then, when we cross the backfill_too_full_ratio, we get a separate message.

The cluster reports the following:
Tue Dec 22 14:50:15 PST 2020
HEALTH_WARN 3 backfillfull osd(s); 3 pool(s) backfillfull
OSD_BACKFILLFULL 3 backfillfull osd(s)
    osd.0 is backfill full
    osd.1 is backfill full
    osd.2 is backfill full
POOL_BACKFILLFULL 3 pool(s) backfillfull
    pool 'ocs-storagecluster-cephblockpool' is backfillfull
    pool 'ocs-storagecluster-cephfilesystem-metadata' is backfillfull
    pool 'ocs-storagecluster-cephfilesystem-data0' is backfillfull
5.20TiB-1.00TiB-4.20TiB
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   ssd 1.72800  1.00000 1.7 TiB 1.4 TiB 1.4 TiB 6.9 MiB 2.6 GiB 346 GiB 80.47 1.00  96     up
 2   ssd 1.72800  1.00000 1.7 TiB 1.4 TiB 1.4 TiB 7.0 MiB 2.5 GiB 346 GiB 80.46 1.00  96     up
 1   ssd 1.72800  1.00000 1.7 TiB 1.4 TiB 1.4 TiB 6.9 MiB 2.6 GiB 346 GiB 80.47 1.00  96     up
                    TOTAL 5.2 TiB 4.2 TiB 4.2 TiB  21 MiB 7.7 GiB 1.0 TiB 80.47
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

The OCP UI reports the 80% threshold was crossed as expected. See nearfull-test-1-80pct-reached.png attached

I consider this issue fixed thanks to the change in the default osd_backfill_toofull ratio.

Comment 24 Jean-Charles Lopez 2020-12-22 23:08:55 UTC
Created attachment 1741460 [details]
Backfill too full Ceph condition matches OCP UI alerts

