Bug 2112853 - Cluster doesn't recover from storage filling up to 100%
Summary: Cluster doesn't recover from storage filling up to 100%
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 7.0
Assignee: Prashant Dhange
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-01 10:15 UTC by guy chen
Modified: 2023-07-06 18:07 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-06 18:07:21 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-4976 (last updated 2022-08-01 10:17:37 UTC)

Description guy chen 2022-08-01 10:15:41 UTC
Description of problem:
I am running a performance test (OCP 4.11.0-rc.4) in which I started roughly 800 VMs and reached 100% storage utilization (the storage is ODF on top of local storage).
After that I deleted VMs so the system should recover, but storage usage is still stuck at 100%.
I also see PVCs stuck in the Terminating state.
The Global Recovery event has been running for 3 days and is not finishing.


oc get dv -A | grep -c scale-test
599
oc get vm -A | grep -c scale-test
599
oc get pvc -A | grep -c scale-test
812
oc get pvc -A | grep -c Terminating 
213
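
(Diagnostic sketch, not part of the original report: PVCs stuck in Terminating are normally held back by a finalizer, which can be inspected per PVC; <namespace> and <pvc-name> below are placeholders, not names from this cluster.)

# list finalizers holding a Terminating PVC
oc get pvc <pvc-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
# same information via describe
oc describe pvc <pvc-name> -n <namespace> | grep -A 3 Finalizers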



sh-4.4$ ceph -s 
  cluster:
    id:     4ed4381b-ae07-4d20-950b-94a140676dee
    health: HEALTH_ERR
            15 backfillfull osd(s)
            1 full osd(s)
            8 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
            11 pool(s) full
 
  services:
    mon: 3 daemons, quorum a,b,c (age 6d)
    mgr: a(active, since 6d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 57 osds: 57 up (since 6d), 57 in (since 6d); 1 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 1102 pgs
    objects: 2.18M objects, 7.6 TiB
    usage:   18 TiB used, 7.1 TiB / 25 TiB avail
    pgs:     1101 active+clean
             1    active+remapped+backfill_toofull
 
  io:
    client:   14 MiB/s rd, 528 op/s rd, 0 op/s wr
 
  progress:
    Global Recovery Event (3d)
      [===========================.] (remaining: 4m)
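
(Diagnostic sketch, not part of the original output: which OSDs are behind the full/backfillfull/nearfull warnings, and the configured thresholds, can be listed from the toolbox pod.)

# per-OSD detail for the HEALTH_ERR warnings above
ceph health detail
# configured nearfull/backfillfull/full thresholds
ceph osd dump | grep -i ratio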

sh-4.4$ ceph df detail
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    25 TiB  7.1 TiB  18 TiB    18 TiB      71.41
TOTAL  25 TiB  7.1 TiB  18 TiB    18 TiB      71.41
 
--- POOLS ---
POOL                                                   ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics                                   1    1  6.9 MiB      0 B  6.9 MiB       57   21 MiB      0 B   21 MiB  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephblockpool                        2  512  5.9 TiB  5.9 TiB  562 KiB    2.18M   18 TiB   18 TiB  1.6 MiB  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.meta             3    8   11 KiB  3.9 KiB  7.5 KiB       16  191 KiB  168 KiB   23 KiB  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index    4    8   27 KiB      0 B   27 KiB       22   82 KiB      0 B   82 KiB  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.control          5    8      0 B      0 B      0 B        8      0 B      0 B      0 B       0        0 B            N/A          N/A    N/A         0 B          0 B
.rgw.root                                               6    8  4.9 KiB  4.9 KiB      0 B       16  180 KiB  180 KiB      0 B  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B      0 B      0 B        0      0 B      0 B      0 B       0        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.log              8    8  172 KiB   23 KiB  149 KiB      340  2.3 MiB  1.9 MiB  446 KiB  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephfilesystem-metadata              9   32   45 KiB  2.3 KiB   43 KiB       22  226 KiB   96 KiB  130 KiB  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data    10  253    1 KiB    1 KiB      0 B        1   12 KiB   12 KiB      0 B  100.00        0 B            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephfilesystem-data0                11  256      0 B      0 B      0 B        0      0 B      0 B      0 B       0        0 B            N/A          N/A    N/A         0 B          0 B
sh-4.4$
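
(Sketch, not in the original report: per-OSD utilization shows where the space is actually consumed; the pools above all reporting 100% USED appears to follow from MAX AVAIL being 0 B while OSDs sit at the full ratio, rather than from each pool individually holding that much data.)

# per-OSD size, raw use, %use and PG count
ceph osd df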


Version-Release number of selected component (if applicable):

sh-4.4$ ceph -v
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 57
    },
    "mds": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 64
    }
}



How reproducible:
Always

Steps to Reproduce:
1. Build OCP 4.11-rc4
2. Create ODF storage
3. Create VMs until storage is 100% full
4. Erase VMs with their disks

Actual results:
Storage is in an error state (HEALTH_ERR) and does not recover

Expected results:
Storage utilization should drop from 100% and the cluster should return to a healthy state
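
(Sketch of a follow-up check, not in the original report: whether the deleted VM disks were actually removed from the RBD pool can be verified from the toolbox; the pool name is taken from the ceph df output above.)

# per-image provisioned vs. used space in the ODF block pool
rbd du -p ocs-storagecluster-cephblockpool
# images that were deleted but are still pending purge in the trash
rbd trash ls -p ocs-storagecluster-cephblockpool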

Additional info:
will be added

Comment 7 Elad 2022-08-17 13:36:30 UTC
Hi Guy, 

About "4.Erase VMS with their disks " -

Does that mean PVCs were also deleted? If so, are the PVCs of reclaimPolicy=delete?
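
(A sketch of how to check this, not part of the original comment; ocs-storagecluster-ceph-rbd is the usual ODF RBD storage class name and is assumed here, not confirmed from this cluster.)

# reclaim policy of the (assumed) RBD storage class
oc get sc ocs-storagecluster-ceph-rbd -o jsonpath='{.reclaimPolicy}'
# reclaim policy of every PV
oc get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy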

