Description of problem (please be as detailed as possible and provide log snippets):

The status given below is not reporting the correct value of "usage". The value of 29 GiB shown in the "objects" line is correct. This is the status after creating an RBD PVC and a file of size 30 GB. With replica 3, the usage should be more than 90 GiB.

$ ceph status
  cluster:
    id:     0ec022be-cf05-479e-9778-ed6d8664139e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 69m)
    mgr: a(active, since 69m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 68m), 3 in (since 69m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 8.16k objects, 29 GiB
    usage:   3.8 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     97 active+clean

  io:
    client: 1.2 KiB/s rd, 14 KiB/s wr, 2 op/s rd, 1 op/s wr

$ ceph df
--- RAW STORAGE ---
CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    1.5 TiB  1.5 TiB  3.8 GiB   3.8 GiB        0.25
TOTAL  1.5 TiB  1.5 TiB  3.8 GiB   3.8 GiB        0.25

--- POOLS ---
POOL                                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
ocs-storagecluster-cephblockpool             1   32  1.0 GiB    8.14k  3.1 GiB   0.24    434 GiB
device_health_metrics                        2    1      0 B        0      0 B      0    434 GiB
ocs-storagecluster-cephfilesystem-metadata   3   32   55 KiB       24  252 KiB      0    434 GiB
ocs-storagecluster-cephfilesystem-data0      4   32    158 B        1   12 KiB      0    434 GiB

PVC used for testing:

NAMESPACE                                  NAME                                       STATUS  VOLUME                                     CAPACITY  ACCESS MODES  STORAGECLASS                 AGE
namespace-test-c8bc8f4401e14155a310e7502   pvc-test-facfa9bc318542abac0483139092447   Bound   pvc-917fdbc6-a9b7-4881-9887-2436876905ed   70Gi      RWO           ocs-storagecluster-ceph-rbd  36m

The UI is also not reporting the correct usage.

The cluster is not used for any other testing, so only one test PVC is created in the cluster. No data was deleted.

logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-23nov/jijoy-23nov_20211123T132419/logs/deployment_1637682325/ocs_must_gather/

The value of "usage" in ceph status is increasing very slowly while the cluster is idle, so in the must-gather logs the usage is "usage: 4.1 GiB used, 1.5 TiB / 1.5 TiB avail".

Tested on the AWS platform.

Version of all relevant components (if applicable):
ceph version 16.2.0-143.el8cp
odf-operator.v4.9.0
OCP 4.9.0-0.nightly-2021-11-23-041617

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Usage is not displayed correctly in the CLI and GUI.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes (2/2)

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Usage was showing correctly in previous versions.

Steps to Reproduce:
1. Create an RBD PVC (e.g. 70 GiB) and attach it to an app pod.
2. Run I/O to create a file (e.g. a 30 GiB file).
3. Check ceph status to see the usage.
4. Check the storage system overview in the UI to see the usage.

Actual results:
Usage is not reported correctly in steps 3 and 4.

Expected results:
Usage should be reported correctly. (In this test, space reclamation issues do not come into the picture because no data was deleted.)

Additional info:
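As a rough sketch (not the exact commands used for this report), the usage for step 3 can be cross-checked from the rook-ceph toolbox pod; the namespace and pod label below are the usual ODF defaults and may differ in other setups:

$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
$ oc -n openshift-storage exec $TOOLS_POD -- ceph df detail
# Expected raw usage for a fully flushed 30 GiB file with 3x replication:
# 30 GiB x 3 = 90 GiB raw used (plus a small metadata overhead),
# versus the 3.8 GiB reported above.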
Tested in version: ODF full_version: 4.9.0-244.ci
Neha, PTAL. This looks similar to a recent bug which you reviewed.
Hi,
Can you provide details on how the data was written to the file?
Thanks
Jiju, in what release did this change?
(In reply to Orit Wasserman from comment #5)
> Hi,
> Can you provide details on how the data was written to the file?

Used fio. The size given was 30G, so a single file of size 30G was created.

> Thanks
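The exact fio command used (see comment #14) is not quoted here; an invocation of roughly this shape, run inside the app pod against the mounted RBD volume, would create a single 30G file. The mount path and job options below are illustrative only:

$ fio --name=test-job --filename=/mnt/rbd-mount/testfile --size=30G \
      --rw=write --bs=4M --ioengine=libaio --end_fsync=1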
(In reply to Scott Ostapovicz from comment #6)
> Jiju, in what release did this change?

In recent builds. I found this ceph status output from an ODF 4.9.0-189 cluster - https://bugzilla.redhat.com/show_bug.cgi?id=2014279#c0. This output shows the usage correctly.
(In reply to Jilju Joy from comment #8)
> (In reply to Scott Ostapovicz from comment #6)
> > Jiju, in what release did this change?
>
> In recent builds. I found this ceph status output from an ODF 4.9.0-189 cluster - https://bugzilla.redhat.com/show_bug.cgi?id=2014279#c0. This output shows the usage correctly.

Hey Jilju, did you use similar data and file size as seen in the description and confirm this?
(In reply to Neha Berry from comment #9)
> (In reply to Jilju Joy from comment #8)
> > (In reply to Scott Ostapovicz from comment #6)
> > > Jiju, in what release did this change?
> >
> > In recent builds. I found this ceph status output from an ODF 4.9.0-189 cluster - https://bugzilla.redhat.com/show_bug.cgi?id=2014279#c0. This output shows the usage correctly.
>
> Hey Jilju, did you use similar data and file size as seen in the description and confirm this?

I used fio in both cases, but I don't remember the file size or the number of files created in 2014279#c0. I was trying to get the ceph status output of a 4.9 cluster for comparison and found 2014279#c0.
Reproduced in version (comment #c14):
ODF 4.9.0-249.ci
OCP 4.9.0-0.nightly-2021-12-01-050136
ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
I replaced fio with dd. Usage is now showing correctly. Not sure how this is dependent on the application.
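For comparison, a dd run of the same size could look like the following (the mount path is illustrative; conv=fsync flushes the file data before dd exits):

$ dd if=/dev/zero of=/mnt/rbd-mount/testfile bs=1M count=30720 conv=fsync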
(In reply to Jilju Joy from comment #17)
> I replaced fio with dd. Usage is now showing correctly. Not sure how this is dependent on the application.

It seems the data was not synced and is still cached in an upper layer, and was not written to the Ceph cluster.
Was there any change in the fio command?
You can call fsync after the fio command to force the data to be synced.
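A minimal sketch of forcing the flush after an fio run and re-checking, assuming the file sits on the mounted RBD volume:

$ sync        # inside the app pod: flush dirty page-cache data to the RBD device
$ ceph df     # from the toolbox pod: pool STORED/USED should then reflect the flushed data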
As discussed in the engineering weekly, not a 4.9 blocker. Moving it out.
(In reply to Orit Wasserman from comment #18)
> (In reply to Jilju Joy from comment #17)
> > I replaced fio with dd. Usage is now showing correctly. Not sure how this is dependent on the application.
>
> It seems the data was not synced and is still cached in an upper layer, and was not written to the Ceph cluster.
> Was there any change in the fio command?
> You can call fsync after the fio command to force the data to be synced.

The fio command given in comment #14 is used in many automated test cases in ocs-ci; just the size and runtime differ in each test. The --end_fsync=1 parameter performs a sync after the completion of I/O. I also ran a sync manually.
(In reply to Jilju Joy from comment #20)
> (In reply to Orit Wasserman from comment #18)
> > (In reply to Jilju Joy from comment #17)
> > > I replaced fio with dd. Usage is now showing correctly. Not sure how this is dependent on the application.
> >
> > It seems the data was not synced and is still cached in an upper layer, and was not written to the Ceph cluster.
> > Was there any change in the fio command?
> > You can call fsync after the fio command to force the data to be synced.
>
> The fio command given in comment #14 is used in many automated test cases in ocs-ci; just the size and runtime differ in each test. The --end_fsync=1 parameter performs a sync after the completion of I/O. I also ran a sync manually.

Did you see the same issue when manually running fsync after fio?
(In reply to Orit Wasserman from comment #21)
> (In reply to Jilju Joy from comment #20)
> > (In reply to Orit Wasserman from comment #18)
> > > (In reply to Jilju Joy from comment #17)
> > > > I replaced fio with dd. Usage is now showing correctly. Not sure how this is dependent on the application.
> > >
> > > It seems the data was not synced and is still cached in an upper layer, and was not written to the Ceph cluster.
> > > Was there any change in the fio command?
> > > You can call fsync after the fio command to force the data to be synced.
> >
> > The fio command given in comment #14 is used in many automated test cases in ocs-ci; just the size and runtime differ in each test. The --end_fsync=1 parameter performs a sync after the completion of I/O. I also ran a sync manually.
>
> Did you see the same issue when manually running fsync after fio?

The sync command was executed. The used size did not change to the actual value even after running sync.
This doesn't look like an issue with Ceph, as it works with dd but fails with fio even with fsync.