Bug 2026207

Summary: Ceph status is not reporting the actual usage
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Jilju Joy <jijoy>
Component: ceph
Assignee: Neha Ojha <nojha>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: high
Priority: unspecified
Version: 4.9
CC: assingh, bniver, ebenahar, etamir, madam, mmuench, muagarwa, nberry, nojha, ocs-bugs, odf-bz-bot, owasserm, sostapov
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-02-02 16:29:51 UTC
Type: Bug

Description Jilju Joy 2021-11-24 04:58:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The status given below does not report the correct value of "usage". The value of "objects" (8.16k objects, 29 GiB) is correct.
This is the status after creating an RBD PVC and writing a file of size 30 GB. With replica 3, the usage should be more than 90 GiB (3 copies x 30 GiB).


$ ceph status
  cluster:
    id:     0ec022be-cf05-479e-9778-ed6d8664139e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 69m)
    mgr: a(active, since 69m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 68m), 3 in (since 69m)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 8.16k objects, 29 GiB
    usage:   3.8 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     97 active+clean
 
  io:
    client:   1.2 KiB/s rd, 14 KiB/s wr, 2 op/s rd, 1 op/s wr


$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
ssd    1.5 TiB  1.5 TiB  3.8 GiB   3.8 GiB       0.25
TOTAL  1.5 TiB  1.5 TiB  3.8 GiB   3.8 GiB       0.25
 
--- POOLS ---
POOL                                        ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
ocs-storagecluster-cephblockpool             1   32  1.0 GiB    8.14k  3.1 GiB   0.24    434 GiB
device_health_metrics                        2    1      0 B        0      0 B      0    434 GiB
ocs-storagecluster-cephfilesystem-metadata   3   32   55 KiB       24  252 KiB      0    434 GiB
ocs-storagecluster-cephfilesystem-data0      4   32    158 B        1   12 KiB      0    434 GiB



PVC used for testing:
NAMESPACE                                  NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
namespace-test-c8bc8f4401e14155a310e7502   pvc-test-facfa9bc318542abac0483139092447    Bound    pvc-917fdbc6-a9b7-4881-9887-2436876905ed   70Gi       RWO            ocs-storagecluster-ceph-rbd   36m

The UI is also not reporting the correct usage.

The cluster was not used for any other testing, so only one test PVC was created in the cluster. No data was deleted.

logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-23nov/jijoy-23nov_20211123T132419/logs/deployment_1637682325/ocs_must_gather/

The value of "usage" in ceph status increases very slowly while the cluster is idle. So in the must-gather logs, the usage is "usage: 4.1 GiB used, 1.5 TiB / 1.5 TiB avail".

Tested on the AWS platform.


Version of all relevant components (if applicable):
ceph version 16.2.0-143.el8cp
odf-operator.v4.9.0
OCP 4.9.0-0.nightly-2021-11-23-041617

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Usage is not displayed correctly in the CLI and GUI.

Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes 2/2

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Usage was reported correctly in previous versions.

Steps to Reproduce:
1. Create an RBD PVC (e.g., 70 GiB) and attach it to an app pod.
2. Run I/O to create a file (e.g., a 30 GiB file); a representative fio command is sketched after these steps.
3. Check ceph status to see the reported usage.
4. Check the storage system overview in the UI to see the usage.
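
For illustration only (the exact ocs-ci command is in comment #14, which is not reproduced here), a fio invocation of this general shape writes a single large file and syncs it when the I/O completes; the job name, mount path, and size are placeholders:

$ fio --name=bz2026207 --filename=/mnt/fio/testfile --size=30G \
      --rw=write --bs=1M --ioengine=libaio --end_fsync=1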


Actual results:
Usage is not reported correctly in steps 3 and 4.

Expected results:
Usage should be reported correctly. (In this test, space reclamation issues do not come into the picture because no data was deleted.)

Additional info:

Comment 2 Jilju Joy 2021-11-24 05:14:06 UTC
Tested in version:
ODF full_version: 4.9.0-244.ci

Comment 4 Mudit Agarwal 2021-11-24 06:24:49 UTC
Neha, PTAL. This looks similar to a recent bug which you reviewed.

Comment 5 Orit Wasserman 2021-11-29 14:13:35 UTC
Hi,
Can you provide details on how the data was written to the file?
Thanks

Comment 6 Scott Ostapovicz 2021-11-30 15:15:25 UTC
Jilju, in what release did this change?

Comment 7 Jilju Joy 2021-12-01 05:35:13 UTC
(In reply to Orit Wasserman from comment #5)
> Can you provide details on how the data was written to the file?
Used fio. The size given was 30G, so a single file of size 30G was created.

Comment 8 Jilju Joy 2021-12-01 05:55:11 UTC
(In reply to Scott Ostapovicz from comment #6)
> Jilju, in what release did this change?
In recent builds. I found this ceph status output from an ODF 4.9.0-189 cluster - https://bugzilla.redhat.com/show_bug.cgi?id=2014279#c0. That output shows the usage correctly.

Comment 9 Neha Berry 2021-12-01 07:38:03 UTC
(In reply to Jilju Joy from comment #8)
> In recent builds. I found this ceph status output from an ODF 4.9.0-189
> cluster - https://bugzilla.redhat.com/show_bug.cgi?id=2014279#c0. That
> output shows the usage correctly.

Hey Jilju, did you use data and file sizes similar to those in the description and confirm this?

Comment 13 Jilju Joy 2021-12-01 10:26:53 UTC
(In reply to Neha Berry from comment #9)
> Hey Jilju, did you use data and file sizes similar to those in the
> description and confirm this?
I used fio in both cases, but I don't remember the file size or the number of files created in 2014279#c0. I was trying to find the ceph status output of a 4.9 cluster for comparison and found 2014279#c0.

Comment 16 Jilju Joy 2021-12-01 10:48:41 UTC
Reproduced in the following versions (see comment #14):
ODF 4.9.0-249.ci
OCP 4.9.0-0.nightly-2021-12-01-050136
ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)

Comment 17 Jilju Joy 2021-12-01 11:17:09 UTC
I replaced fio with dd, and the usage is now shown correctly. I am not sure how this depends on the application.
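
(For comparison, a dd invocation of this general shape is a typical replacement; the path and size are placeholders, and conv=fsync, an assumption here rather than the flag that was actually used, makes dd sync the output file before exiting:)

$ dd if=/dev/zero of=/mnt/fio/testfile bs=1M count=30720 conv=fsync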

Comment 18 Orit Wasserman 2021-12-06 14:34:39 UTC
(In reply to Jilju Joy from comment #17)
> I replaced fio with dd, and the usage is now shown correctly. I am not sure
> how this depends on the application.

It seems the data was not synced; it is still cached in an upper layer and was not written to the Ceph cluster.
Was there any change in the fio command?
You can call fsync after the fio command to force the data to be synced.
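
(To illustrate the suggestion, with generic commands rather than the exact ones used in this test: run sync inside the app pod once fio finishes, then re-check the reported usage from the toolbox pod.)

$ oc rsh <app-pod> sync    # flush the pod's dirty page cache down to the RBD image
$ ceph status              # run in the toolbox pod; raw usage should now grow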

Comment 19 Mudit Agarwal 2021-12-06 15:36:28 UTC
As discussed in the engineering weekly, not a 4.9 blocker. Moving it out.

Comment 20 Jilju Joy 2021-12-06 17:49:26 UTC
(In reply to Orit Wasserman from comment #18)
> It seems the data was not synced; it is still cached in an upper layer and
> was not written to the Ceph cluster.
> Was there any change in the fio command?
> You can call fsync after the fio command to force the data to be synced.

The fio command given in comment #14 is used in many automated test cases in ocs-ci; just the size and runtime differ in each test. The parameter --end_fsync=1 performs a sync after the completion of the I/O. I also ran the sync manually.

Comment 21 Orit Wasserman 2021-12-07 13:22:09 UTC
(In reply to Jilju Joy from comment #20)
> The fio command given in comment #14 is used in many automated test cases
> in ocs-ci; just the size and runtime differ in each test. The parameter
> --end_fsync=1 performs a sync after the completion of the I/O. I also ran
> the sync manually.

Did you see the same issue when manually running fsync after fio?

Comment 22 Jilju Joy 2021-12-08 16:35:56 UTC
(In reply to Orit Wasserman from comment #21)
> Did you see the same issue when manually running fsync after fio?

The sync command was executed. The used size did not change to the actual value even after running sync.

Comment 23 Mudit Agarwal 2022-02-02 16:29:51 UTC
Doesn't look like there is an issue with Ceph, as the usage is reported correctly with dd but not with fio, even after an fsync.