Bug 2228072
| Summary: | FS is not accepting IOs when 1 OSD is full | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Amarnath <amk> |
| Component: | RADOS | Assignee: | Radoslaw Zarzynski <rzarzyns> |
| Status: | NEW --- | QA Contact: | Pawan <pdhiran> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | bhubbard, ceph-eng-bugs, cephqe-warriors, nojha, rzarzyns, vumrao |
| Target Milestone: | --- | | |
| Target Release: | 7.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description of problem: FS is not accepting IOs when 1 OSD is full

Test Steps Followed:

1. Created a cluster and started IOs to fill it to 100% (size of the cluster is 189 TB).

2. After the cluster reached 45% (83 TB), the FS stopped accepting IOs and all pools were marked as 100% used:

        [root@extensa019 cephfs_io_94zkeqg8cf_1]# ceph df
        --- RAW STORAGE ---
        CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
        hdd      186 TiB  102 TiB  84 TiB   84 TiB    45.20
        ssd      745 GiB  743 GiB  1.9 GiB  1.9 GiB   0.25
        TOTAL    186 TiB  102 TiB  84 TiB   84 TiB    45.02

        --- POOLS ---
        POOL                     ID  PGS  STORED   OBJECTS  USED     %USED   MAX AVAIL
        .mgr                     1   1    18 MiB   6        54 MiB   100.00  0 B
        cephfs.cephfs.meta       2   16   396 MiB  142      1.2 GiB  100.00  0 B
        cephfs.cephfs.data       3   32   1.2 GiB  50.26k   3.5 GiB  100.00  0 B
        .nfs                     4   32   1.5 KiB  0        4.5 KiB  100.00  0 B
        cephfs.cephfs_io_1.meta  7   16   15 GiB   933.58k  45 GiB   100.00  0 B
        cephfs.cephfs_io_1.data  8   190  28 TiB   28.91M   84 TiB   100.00  0 B

        [root@extensa019 cephfs_io_94zkeqg8cf_1]# ceph -s
          cluster:
            id:     16659610-2bb3-11ee-885d-ac1f6bb270c6
            health: HEALTH_ERR
                    1 full osd(s)
                    4 nearfull osd(s)
                    Low space hindering backfill (add storage if this doesn't resolve itself): 12 pgs backfill_toofull
                    6 pool(s) full

          services:
            mon: 5 daemons, quorum extensa001,extensa015,extensa004,extensa014,extensa003 (age 4d)
            mgr: extensa003.rgqsss(active, since 4d), standbys: extensa015.hvzezm
            mds: 3/3 daemons up, 3 standby
            osd: 53 osds: 53 up (since 4d), 53 in (since 4d); 14 remapped pgs

          data:
            volumes: 2/2 healthy
            pools:   6 pools, 287 pgs
            objects: 29.89M objects, 28 TiB
            usage:   84 TiB used, 102 TiB / 186 TiB avail
            pgs:     4918834/89671746 objects misplaced (5.485%)
                     271 active+clean
                     8   active+remapped+backfill_wait+backfill_toofull
                     4   active+remapped+backfill_toofull
                     1   active+remapped+backfilling
                     1   active+clean+scrubbing+deep
                     1   active+clean+scrubbing
                     1   active+remapped+backfill_wait

          io:
            recovery: 8.1 MiB/s, 8 objects/s

          progress:
            Global Recovery Event (4d)
              [==========================..] (remaining: 5h)

3. Set the reweight of the OSD that was full to 0 so its data would be pushed elsewhere:

        ceph osd reweight osd.43 0

4. After this, the FS started accepting IOs for a while; then one more OSD became full and we repeated step 3 for that OSD (see the triage sketch right after this list for an alternative way to relieve a full OSD).

5. After a few hours of IOs, the cluster is now in HEALTH_WARN (1 MDSs report slow metadata IOs; 1 MDSs behind on trimming) and IOs are not going into the cluster. See the `ceph health detail` output below.
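A note on why a single full OSD stalls the whole FS: per-pool MAX AVAIL in `ceph df` is derived from the fullest OSD that the pool's CRUSH rule can write to, and once any OSD crosses the full_ratio the affected pools are flagged full and client writes are blocked, which is why every pool reports 100% used even though raw usage is only about 45%. The commands below are a minimal triage sketch for this situation, not the exact procedure that was run here; the ratio values are illustrative and should only be raised temporarily while data is rebalanced.

    # Per-OSD utilization (%USE); the fullest OSD caps MAX AVAIL for every pool it serves.
    ceph osd df tree

    # Thresholds currently in effect (defaults: nearfull 0.85, backfillfull 0.90, full 0.95).
    ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'

    # Temporary relief so client IO and backfill can resume (illustrative values, revert afterwards).
    ceph osd set-full-ratio 0.97
    ceph osd set-backfillfull-ratio 0.92

    # Rebalance instead of (or after) 'ceph osd reweight osd.<id> 0'; the balancer may already be enabled.
    ceph balancer mode upmap
    ceph balancer on

    # Watch the full OSD drain as backfill proceeds.
    ceph -s
    ceph osd df tree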
    [root@extensa013 ~]# ceph health detail
    HEALTH_WARN 2 failed cephadm daemon(s); 1 MDSs report slow metadata IOs; 1 MDSs behind on trimming; 1 backfillfull osd(s); 3 nearfull osd(s); Reduced data availability: 3 pgs inactive; Low space hindering backfill (add storage if this doesn't resolve itself): 13 pgs backfill_toofull; Degraded data redundancy: 2637108/101031984 objects degraded (2.610%), 18 pgs degraded, 18 pgs undersized; 10 pool(s) backfillfull
    [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
        daemon osd.43 on extensa011 is in error state
        daemon osd.41 on extensa015 is in error state
    [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
        mds.cephfs_io_1.extensa009.uagafl(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 54870 secs
    [WRN] MDS_TRIM: 1 MDSs behind on trimming
        mds.cephfs_io_1.extensa009.uagafl(mds.0): Behind on trimming (397/128) max_segments: 128, num_segments: 397
    [WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
        osd.24 is backfill full
    [WRN] OSD_NEARFULL: 3 nearfull osd(s)
        osd.14 is near full
        osd.21 is near full
        osd.29 is near full
    [WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive
        pg 8.14 is stuck inactive for 15h, current state undersized+degraded+remapped+backfilling+peered, last acting [24]
        pg 8.54 is stuck inactive for 15h, current state undersized+degraded+remapped+backfill_wait+backfill_toofull+peered, last acting [24]
        pg 8.94 is stuck inactive for 15h, current state undersized+degraded+remapped+backfill_wait+backfill_toofull+peered, last acting [24]
    [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 13 pgs backfill_toofull
        pg 8.15 is active+remapped+backfill_toofull, acting [29,26,42]
        pg 8.53 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [38,42]
        pg 8.54 is undersized+degraded+remapped+backfill_wait+backfill_toofull+peered, acting [24]
        pg 8.5e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [0,51]
        pg 8.6c is active+undersized+degraded+remapped+backfill_toofull, acting [14,25]
        pg 8.74 is active+remapped+backfill_toofull, acting [48,38,7]
        pg 8.77 is active+remapped+backfill_toofull, acting [7,8,36]
        pg 8.7a is active+remapped+backfill_wait+backfill_toofull, acting [24,32,46]
        pg 8.7f is active+remapped+backfill_toofull, acting [4,10,14]
        pg 8.94 is undersized+degraded+remapped+backfill_wait+backfill_toofull+peered, acting [24]
        pg 8.95 is active+remapped+backfill_toofull, acting [29,26,42]
        pg 8.9e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [0,51]
        pg 8.ac is active+undersized+degraded+remapped+backfill_toofull, acting [14,25]
    [WRN] PG_DEGRADED: Degraded data redundancy: 2637108/101031984 objects degraded (2.610%), 18 pgs degraded, 18 pgs undersized
        pg 8.13 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [38,42]
        pg 8.14 is stuck undersized for 15h, current state undersized+degraded+remapped+backfilling+peered, last acting [24]
        pg 8.1e is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [0,51]
        pg 8.28 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [0,6]
        pg 8.42 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [47,0]
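The three inactive PGs above are left with a single acting OSD (osd.24, which is itself backfillfull), while osd.43 and osd.41 are in error state from cephadm's point of view, so both data and metadata IO back up behind them. A short sketch of how the stuck PGs and failed daemons could be inspected further (pg 8.14 and the OSD ids are taken from the output above):

    # PGs that are stuck inactive/undersized (same set as reported by 'ceph health detail').
    ceph pg dump_stuck inactive
    ceph pg dump_stuck undersized

    # Ask one of the inactive PGs why it cannot go active.
    ceph pg 8.14 query | grep -A 20 recovery_state

    # Daemons that cephadm considers failed, and their hosts.
    ceph orch ps | grep -i error

    # Fullness of the OSDs involved, to see whether osd.24 has room to accept the backfill.
    ceph osd df tree | grep -E 'osd\.(24|41|43) '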
        pg 8.44 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfilling, last acting [34,32]
        pg 8.4c is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfilling, last acting [28,25]
        pg 8.53 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [38,42]
        pg 8.54 is stuck undersized for 15h, current state undersized+degraded+remapped+backfill_wait+backfill_toofull+peered, last acting [24]
        pg 8.5e is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [0,51]
        pg 8.68 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfilling, last acting [0,6]
        pg 8.6c is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [14,25]
        pg 8.7e is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [0,51]
        pg 8.93 is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [38,42]
        pg 8.94 is stuck undersized for 15h, current state undersized+degraded+remapped+backfill_wait+backfill_toofull+peered, last acting [24]
        pg 8.9e is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [0,51]
        pg 8.ac is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [14,25]
        pg 8.be is stuck undersized for 15h, current state active+undersized+degraded+remapped+backfill_wait, last acting [0,51]
    [WRN] POOL_BACKFILLFULL: 10 pool(s) backfillfull
        pool '.mgr' is backfillfull
        pool 'cephfs.cephfs.meta' is backfillfull
        pool 'cephfs.cephfs.data' is backfillfull
        pool '.nfs' is backfillfull
        pool 'cephfs.cephfs_io_1.meta' is backfillfull
        pool 'cephfs.cephfs_io_1.data' is backfillfull
        pool '.rgw.root' is backfillfull
        pool 'default.rgw.log' is backfillfull
        pool 'default.rgw.control' is backfillfull
        pool 'default.rgw.meta' is backfillfull
    [root@extensa013 ~]#

    [root@extensa013 ~]# ceph fs status
    cephfs - 9 clients
    ======
    RANK  STATE   MDS                       ACTIVITY    DNS    INOS   DIRS  CAPS
    0     active  cephfs.extensa004.dwgfpl  Reqs: 0 /s  48.2k  48.2k  8211  546
    1     active  cephfs.extensa003.otrrap  Reqs: 0 /s  12.1k  12.1k  2101  21
    POOL                TYPE      USED   AVAIL
    cephfs.cephfs.meta  metadata  1271M  3151G
    cephfs.cephfs.data  data      3724M  3151G
    cephfs_io_1 - 4 clients
    ===========
    RANK  STATE   MDS                            ACTIVITY    DNS    INOS   DIRS   CAPS
    0     active  cephfs_io_1.extensa009.uagafl  Reqs: 0 /s  1344k  1344k  71.2k  4546
    POOL                     TYPE      USED   AVAIL
    cephfs.cephfs_io_1.meta  metadata  57.7G  3151G
    cephfs.cephfs_io_1.data  data      99.5T  3238G
    STANDBY MDS
    cephfs_io_1.extensa007.ppjqjk
    cephfs.extensa014.vetifx
    cephfs.extensa011.ktzine
    MDS version: ceph version 17.2.6-99.el9cp (6869830013a8878a3930e23c75d8b990f6b0c491) quincy (stable)
    [root@extensa013 ~]#

Client Node: extensa013.ceph.redhat.com (root/passwd)

Version-Release number of selected component (if applicable):

    [root@extensa013 ~]# ceph versions
    {
        "mon": {
            "ceph version 17.2.6-99.el9cp (6869830013a8878a3930e23c75d8b990f6b0c491) quincy (stable)": 5
        },
        "mgr": {
            "ceph version 17.2.6-99.el9cp (6869830013a8878a3930e23c75d8b990f6b0c491) quincy (stable)": 2
        },
        "osd": {
            "ceph version 17.2.6-99.el9cp (6869830013a8878a3930e23c75d8b990f6b0c491) quincy (stable)": 51
        },
        "mds": {
            "ceph version 17.2.6-99.el9cp (6869830013a8878a3930e23c75d8b990f6b0c491) quincy (stable)": 6
        },
        "overall": {
            "ceph version 17.2.6-99.el9cp (6869830013a8878a3930e23c75d8b990f6b0c491) quincy (stable)": 64
        }
    }
    [root@extensa013 ~]#
How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
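A possible follow-up check (sketch only, not something that was run here): the MDS_SLOW_METADATA_IO and MDS_TRIM warnings are consistent with the metadata pool's writes being blocked behind the full/backfillfull OSDs, so the MDS journal cannot be trimmed. The daemon names below come from the `ceph fs status` output above; it is assumed that `ceph tell` can forward admin-socket commands in this build, otherwise the same commands can be run with `ceph daemon` on the host carrying the daemon.

    # Ops currently queued inside the active MDS of cephfs_io_1.
    ceph tell mds.cephfs_io_1.extensa009.uagafl ops

    # Blocked ops on the backfillfull OSD that is the sole acting member of the inactive PGs.
    ceph tell osd.24 dump_blocked_ops

    # Health checks that should clear once the full/backfillfull condition is resolved.
    ceph health detail | grep -E 'MDS_SLOW_METADATA_IO|MDS_TRIM|OSD_BACKFILLFULL|POOL_BACKFILLFULL'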