Description of problem:
I am running a performance test (OCP 4.11.0-rc.4). I started roughly 800 VMs and reached 100% storage utilization (the storage is ODF on top of local storage).
After that I deleted VMs so that the system could recover, but storage utilization is still stuck at 100%.
In addition, a number of PVCs are stuck in the Terminating state.
The Ceph global recovery event has been running for 3 days and does not complete.
oc get dv -A | grep -c scale-test
599
oc get vm -A | grep -c scale-test
599
oc get pvc -A | grep -c scale-test
812
oc get pvc -A | grep -c Terminating
213
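To see what is holding the 213 Terminating PVCs, their finalizers can be listed like this (a rough sketch; it assumes the default oc get pvc -A column order, with STATUS in the third column):

# print every Terminating PVC together with the finalizers that are holding it
oc get pvc -A --no-headers | awk '$3 == "Terminating" {print $1, $2}' | \
  while read ns name; do
    echo -n "$ns/$name: "
    oc get pvc -n "$ns" "$name" -o jsonpath='{.metadata.finalizers}{"\n"}'
  done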
sh-4.4$ ceph -s
  cluster:
    id:     4ed4381b-ae07-4d20-950b-94a140676dee
    health: HEALTH_ERR
            15 backfillfull osd(s)
            1 full osd(s)
            8 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
            11 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 6d)
    mgr: a(active, since 6d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 57 osds: 57 up (since 6d), 57 in (since 6d); 1 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 1102 pgs
    objects: 2.18M objects, 7.6 TiB
    usage:   18 TiB used, 7.1 TiB / 25 TiB avail
    pgs:     1101 active+clean
             1    active+remapped+backfill_toofull

  io:
    client: 14 MiB/s rd, 528 op/s rd, 0 op/s wr

  progress:
    Global Recovery Event (3d)
      [===========================.] (remaining: 4m)
sh-4.4$ ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 25 TiB 7.1 TiB 18 TiB 18 TiB 71.41
TOTAL 25 TiB 7.1 TiB 18 TiB 18 TiB 71.41
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 6.9 MiB 0 B 6.9 MiB 57 21 MiB 0 B 21 MiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephblockpool 2 512 5.9 TiB 5.9 TiB 562 KiB 2.18M 18 TiB 18 TiB 1.6 MiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.meta 3 8 11 KiB 3.9 KiB 7.5 KiB 16 191 KiB 168 KiB 23 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index 4 8 27 KiB 0 B 27 KiB 22 82 KiB 0 B 82 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.control 5 8 0 B 0 B 0 B 8 0 B 0 B 0 B 0 0 B N/A N/A N/A 0 B 0 B
.rgw.root 6 8 4.9 KiB 4.9 KiB 0 B 16 180 KiB 180 KiB 0 B 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 7 8 0 B 0 B 0 B 0 0 B 0 B 0 B 0 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.log 8 8 172 KiB 23 KiB 149 KiB 340 2.3 MiB 1.9 MiB 446 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephfilesystem-metadata 9 32 45 KiB 2.3 KiB 43 KiB 22 226 KiB 96 KiB 130 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data 10 253 1 KiB 1 KiB 0 B 1 12 KiB 12 KiB 0 B 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephfilesystem-data0 11 256 0 B 0 B 0 B 0 0 B 0 B 0 B 0 0 B N/A N/A N/A 0 B 0 B
sh-4.4$
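The block pool still reports 18 TiB used after the VM deletions, so it may also be worth checking from the toolbox whether the deleted RBD images are still sitting in the RBD trash and which full ratios the OSDs enforce. This is only a diagnostic sketch; the pool name is taken from the ceph df output above:

sh-4.4$ ceph health detail                                    # shows which OSDs are full/backfillfull/nearfull
sh-4.4$ ceph osd dump | grep ratio                            # full_ratio, backfillfull_ratio, nearfull_ratio
sh-4.4$ rbd trash ls --pool ocs-storagecluster-cephblockpool  # images pending deferred deletion, if any
sh-4.4$ rbd du --pool ocs-storagecluster-cephblockpool        # per-image usage inside the block pool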
Version-Release number of selected component (if applicable):
sh-4.4$ ceph -v
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
sh-4.4$ ceph versions
{
"mon": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
},
"mgr": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
},
"osd": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 57
},
"mds": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 2
},
"rgw": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
},
"overall": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 64
}
}
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.11.0-rc.4
2. Create ODF storage on top of local storage
3. Create VMs until storage utilization reaches 100%
4. Delete the VMs together with their disks (a deletion example follows this list)
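A minimal sketch of how a VM in step 4 was removed (assuming the VMs were created with dataVolumeTemplates, so each DataVolume and its PVC are owned by the VM; <vm-name> and <namespace> are placeholders):

# deleting the VM is expected to cascade to its DataVolume and PVC via owner references
oc delete vm <vm-name> -n <namespace>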
Actual results:
Storage stays in an error state (HEALTH_ERR) and does not recover; used capacity is not reclaimed after the VMs are deleted.
Expected results:
Storage utilization should drop from 100% after the VMs are deleted and the cluster should return to a healthy state.
Additional info:
will be added