Description of problem:
I am running a performance test (OCP 4.11.0-rc.4). I started roughly 800 VMs and reached 100% storage utilization (the storage is ODF on top of local storage).
After that I deleted VMs so that the system could recover, but storage utilization is still stuck at 100%.
In addition, a number of PVCs are stuck in the Terminating state.
The Ceph global recovery event has been running for 3 days and does not complete.
oc get dv -A | grep -c scale-test
599
oc get vm -A | grep -c scale-test
599
oc get pvc -A | grep -c scale-test
812
oc get pvc -A | grep -c Terminating
213
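To see what is holding the 213 Terminating PVCs, their finalizers can be listed like this (a rough sketch; it assumes the default oc get pvc -A column order, with STATUS in the third column):

# print every Terminating PVC together with the finalizers that are holding it
oc get pvc -A --no-headers | awk '$3 == "Terminating" {print $1, $2}' | \
  while read ns name; do
    echo -n "$ns/$name: "
    oc get pvc -n "$ns" "$name" -o jsonpath='{.metadata.finalizers}{"\n"}'
  done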
sh-4.4$ ceph -s
  cluster:
    id:     4ed4381b-ae07-4d20-950b-94a140676dee
    health: HEALTH_ERR
            15 backfillfull osd(s)
            1 full osd(s)
            8 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
            11 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 6d)
    mgr: a(active, since 6d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 57 osds: 57 up (since 6d), 57 in (since 6d); 1 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 1102 pgs
    objects: 2.18M objects, 7.6 TiB
    usage:   18 TiB used, 7.1 TiB / 25 TiB avail
    pgs:     1101 active+clean
             1    active+remapped+backfill_toofull

  io:
    client: 14 MiB/s rd, 528 op/s rd, 0 op/s wr

  progress:
    Global Recovery Event (3d)
      [===========================.] (remaining: 4m)
sh-4.4$ ceph df detail
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 25 TiB 7.1 TiB 18 TiB 18 TiB 71.41
TOTAL 25 TiB 7.1 TiB 18 TiB 18 TiB 71.41
--- POOLS ---
POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
device_health_metrics 1 1 6.9 MiB 0 B 6.9 MiB 57 21 MiB 0 B 21 MiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephblockpool 2 512 5.9 TiB 5.9 TiB 562 KiB 2.18M 18 TiB 18 TiB 1.6 MiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.meta 3 8 11 KiB 3.9 KiB 7.5 KiB 16 191 KiB 168 KiB 23 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index 4 8 27 KiB 0 B 27 KiB 22 82 KiB 0 B 82 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.control 5 8 0 B 0 B 0 B 8 0 B 0 B 0 B 0 0 B N/A N/A N/A 0 B 0 B
.rgw.root 6 8 4.9 KiB 4.9 KiB 0 B 16 180 KiB 180 KiB 0 B 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 7 8 0 B 0 B 0 B 0 0 B 0 B 0 B 0 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.log 8 8 172 KiB 23 KiB 149 KiB 340 2.3 MiB 1.9 MiB 446 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephfilesystem-metadata 9 32 45 KiB 2.3 KiB 43 KiB 22 226 KiB 96 KiB 130 KiB 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data 10 253 1 KiB 1 KiB 0 B 1 12 KiB 12 KiB 0 B 100.00 0 B N/A N/A N/A 0 B 0 B
ocs-storagecluster-cephfilesystem-data0 11 256 0 B 0 B 0 B 0 0 B 0 B 0 B 0 0 B N/A N/A N/A 0 B 0 B
sh-4.4$
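The block pool still reports 18 TiB used after the VM deletions, so it may also be worth checking from the toolbox whether the deleted RBD images are still sitting in the RBD trash and which full ratios the OSDs enforce. This is only a diagnostic sketch; the pool name is taken from the ceph df output above:

sh-4.4$ ceph health detail                                    # shows which OSDs are full/backfillfull/nearfull
sh-4.4$ ceph osd dump | grep ratio                            # full_ratio, backfillfull_ratio, nearfull_ratio
sh-4.4$ rbd trash ls --pool ocs-storagecluster-cephblockpool  # images pending deferred deletion, if any
sh-4.4$ rbd du --pool ocs-storagecluster-cephblockpool        # per-image usage inside the block pool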
Version-Release number of selected component (if applicable):
sh-4.4$ ceph -v
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
sh-4.4$ ceph versions
{
"mon": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
},
"mgr": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
},
"osd": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 57
},
"mds": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 2
},
"rgw": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 1
},
"overall": {
"ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 64
}
}
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.11.0-rc.4
2. Create ODF storage on top of local storage
3. Create VMs until storage utilization reaches 100%
4. Delete the VMs together with their disks (a deletion example follows this list)
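A minimal sketch of how a VM in step 4 was removed (assuming the VMs were created with dataVolumeTemplates, so each DataVolume and its PVC are owned by the VM; <vm-name> and <namespace> are placeholders):

# deleting the VM is expected to cascade to its DataVolume and PVC via owner references
oc delete vm <vm-name> -n <namespace>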
Actual results:
Storage stays in an error state (HEALTH_ERR) and does not recover; used capacity is not reclaimed after the VMs are deleted.
Expected results:
Storage utilization should drop from 100% after the VMs are deleted and the cluster should return to a healthy state.
Additional info:
will be added