Created attachment 1698028 [details]
Output of Ceph commands

Description of problem (please be detailed as possible and provide log snippets):

We filled up a test cluster using a set of containers looping to write to an RBD-based PVC and a CephFS-based PVC. As the cluster filled up, the nearfull OSDs were never reported (threshold set to 75%), and when many OSDs reached the full threshold only one of them was reported.

Version of all relevant components (if applicable):

Client Version: 4.4.7
Server Version: 4.4.3
Kubernetes Version: v1.17.1

oc get csv
NAME                            DISPLAY                       VERSION   REPLACES   PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                Succeeded
ocs-operator.v4.4.0             OpenShift Container Storage   4.4.0                Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Have not tried

Can this issue be reproduced from the UI?
Have not tried

If this is a regression, please provide more details to justify this:
In previous Ceph versions, nearfull OSDs would be reported, and for both the full and nearfull conditions the list of OSDs crossing the thresholds would be listed.

Steps to Reproduce:
1. Create a loop pod writing to a PVC sized as big as the cluster's usable storage (a minimal sketch of this step is at the end of this comment)
2. Wait for the PVC to fill up the cluster
3.

Actual results:
Nearfull OSDs are never reported at the 75% threshold, and only one OSD is reported when several of them reach the full threshold.

Expected results:
All OSDs crossing the nearfull and full thresholds are reported.

Additional info:
I collected a must-gather for OCP and for OCS if you need them, but they are too big to be attached to the BZ.
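For reference, here is a minimal sketch of the loop pod used in step 1. All names, the PVC size and the storage class below are illustrative placeholders, not the exact objects used in this test:

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fill-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 5Ti
---
apiVersion: v1
kind: Pod
metadata:
  name: fill-test-writer
spec:
  restartPolicy: Never
  containers:
  - name: writer
    image: busybox
    # Keep appending 1 GiB files until writes fail because the PVC (and the cluster) is full
    command: ["sh", "-c", "i=0; while dd if=/dev/zero of=/data/file-$i bs=1M count=1024; do i=$((i+1)); done"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fill-test-pvc
EOF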
Moving to 4.6 since this is not a blocker for 4.5.
Moving it to 4.7 as we don't have sufficient data at the moment.
Hi Neha,

It is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1896959#c6 and the reproduction succeeded.
Karthick,

As part of the reproduction step of https://bugzilla.redhat.com/show_bug.cgi?id=1896959#c6 , did you notice the warning of the 75% threshold?
(In reply to Raz Tamir from comment #12)
> Karthick,
>
> As part of the reproduction step of
> https://bugzilla.redhat.com/show_bug.cgi?id=1896959#c6 , did you notice the
> warning of the 75% threshold?

I did not really look for the warning at the 75% threshold. I'll try once again and update the bug.
Created attachment 1732850 [details]
Ceph cluster nearfull UI alert
Created attachment 1732929 [details]
must_gather for comment #14
Hi JC,

Based on QE's tests, the alert is fired at the 75% threshold. Could you please elaborate on the scenario that caused this issue?
In the previous versions I tested, the underlying RHCS cluster would show a backfill-too-full warning rather than an OSD nearfull warning, so I think this is more of a Ceph problem than an OCS problem. I'll try to reproduce with 4.6.0 RC3 to confirm whether the problem is still here.
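For anyone retesting, the thresholds the cluster is actually using can be checked from the rook-ceph toolbox. A sketch, assuming the toolbox pod is deployed with the usual app=rook-ceph-tools label:

# Open a shell in the toolbox pod
oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

# Show the nearfull / backfillfull / full ratios recorded in the OSD map
ceph osd dump | grep -E 'full_ratio'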
Retesting today to verify. Will update later.
So changing the backfill_too_full_ratio to the default of 0.80 has indeed solved the first problem.

Now the Ceph cluster correctly reports only a nearfull condition when we cross the default 0.75 of the osd_nearfull_ratio.

Capture below:

Tue Dec 22 14:36:03 PST 2020
HEALTH_WARN 3 nearfull osd(s); 3 pool(s) nearfull
OSD_NEARFULL 3 nearfull osd(s)
    osd.0 is near full
    osd.1 is near full
    osd.2 is near full
POOL_NEARFULL 3 pool(s) nearfull
    pool 'ocs-storagecluster-cephblockpool' is nearfull
    pool 'ocs-storagecluster-cephfilesystem-metadata' is nearfull
    pool 'ocs-storagecluster-cephfilesystem-data0' is nearfull
5.20TiB-1.20TiB-3.90TiB
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   ssd 1.72800  1.00000 1.7 TiB 1.3 TiB 1.3 TiB 6.9 MiB 2.8 GiB 424 GiB 76.01 1.00  96     up
 2   ssd 1.72800  1.00000 1.7 TiB 1.3 TiB 1.3 TiB 6.9 MiB 2.7 GiB 424 GiB 76.01 1.00  96     up
 1   ssd 1.72800  1.00000 1.7 TiB 1.3 TiB 1.3 TiB 6.9 MiB 2.8 GiB 424 GiB 76.01 1.00  96     up
                    TOTAL 5.2 TiB 3.9 TiB 3.9 TiB  21 MiB 8.2 GiB 1.2 TiB 76.01
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

The OCP UI does report a nearfull condition as expected. See attached nearfull-test-1-75pct-reached.png.
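For completeness, if these ratios ever need to be adjusted manually on an existing cluster, they can be set at runtime from the toolbox. A sketch using the thresholds discussed here (0.75 nearfull, 0.80 backfillfull); run inside the toolbox pod:

# Warn at 75% usage, raise the backfillfull condition at 80%
ceph osd set-nearfull-ratio 0.75
ceph osd set-backfillfull-ratio 0.80

# Verify the values recorded in the OSD map
ceph osd dump | grep -E 'full_ratio'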
Created attachment 1741459 [details]
Nearfull Ceph condition matches OCP UI alerts
Then, when we cross the backfill_too_full_ratio, we get a separate message.

The cluster reports the following:

Tue Dec 22 14:50:15 PST 2020
HEALTH_WARN 3 backfillfull osd(s); 3 pool(s) backfillfull
OSD_BACKFILLFULL 3 backfillfull osd(s)
    osd.0 is backfill full
    osd.1 is backfill full
    osd.2 is backfill full
POOL_BACKFILLFULL 3 pool(s) backfillfull
    pool 'ocs-storagecluster-cephblockpool' is backfillfull
    pool 'ocs-storagecluster-cephfilesystem-metadata' is backfillfull
    pool 'ocs-storagecluster-cephfilesystem-data0' is backfillfull
5.20TiB-1.00TiB-4.20TiB
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   ssd 1.72800  1.00000 1.7 TiB 1.4 TiB 1.4 TiB 6.9 MiB 2.6 GiB 346 GiB 80.47 1.00  96     up
 2   ssd 1.72800  1.00000 1.7 TiB 1.4 TiB 1.4 TiB 7.0 MiB 2.5 GiB 346 GiB 80.46 1.00  96     up
 1   ssd 1.72800  1.00000 1.7 TiB 1.4 TiB 1.4 TiB 6.9 MiB 2.6 GiB 346 GiB 80.47 1.00  96     up
                    TOTAL 5.2 TiB 4.2 TiB 4.2 TiB  21 MiB 7.7 GiB 1.0 TiB 80.47
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

The OCP UI reports that the 80% threshold was crossed, as expected. See nearfull-test-1-80pct-reached.png attached.

I consider this issue fixed thanks to the change of the default backfill_too_full_ratio.
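For anyone who wants to watch the transition from nearfull to backfillfull while the cluster fills up, a simple polling loop like the one below does the job (a sketch, run inside the toolbox pod; the interval is arbitrary):

while true; do
  date
  ceph health detail   # shows OSD_NEARFULL / OSD_BACKFILLFULL with per-OSD and per-pool detail
  ceph osd df          # per-OSD %USE, to see how close each OSD is to the thresholds
  sleep 60
done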
Created attachment 1741460 [details]
Backfill too full Ceph condition matches OCP UI alerts