Created attachment 2018018 [details] ceph.log Description of problem: Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
[1] cluster stuck in backfilling for more than 24 hours # ceph -s cluster: id: d41b35fc-20ef-11ef-bda4-000af7995d6c health: HEALTH_WARN noscrub,nodeep-scrub flag(s) set Degraded data redundancy: 313071902/1477314630 objects degraded (21.192%), 3732 pgs degraded, 3732 pgs undersized services: mon: 3 daemons, quorum f28-h21-000-r630,f28-h22-000-r630,f28-h23-000-r630 (age 43h) mgr: f28-h23-000-r630.nbspst(active, since 2d), standbys: f28-h22-000-r630.kaarli, f28-h21-000-r630.eksini osd: 192 osds: 192 up (since 38h), 192 in (since 38h); 3953 remapped pgs flags noscrub,nodeep-scrub rgw: 8 daemons active (8 hosts, 1 zones) data: pools: 7 pools, 4769 pgs objects: 246.22M objects, 16 TiB usage: 36 TiB used, 325 TiB / 361 TiB avail pgs: 313071902/1477314630 objects degraded (21.192%) 38533716/1477314630 objects misplaced (2.608%) 3731 active+undersized+degraded+remapped+backfill_wait 816 active+clean 221 active+remapped+backfill_wait 1 active+undersized+degraded+remapped+backfilling io: client: 31 KiB/s rd, 0 B/s wr, 31 op/s rd, 20 op/s wr [2] PG 7.109 stuck in backfilling for more than 24 hours. Due to this more than 3000 PGs are stuck in backfill_wait. # ceph pg ls backfilling PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP LAST_SCRUB_DURATION SCRUB_SCHEDULING 7.109 60223 120446 0 0 4162054144 0 0 1343 3000 active+undersized+degraded+remapped+backfilling 35h 1907'68103 2411:157724 [103,146,142,156,180,89]p103 [NONE,146,142,NONE,180,89]p146 2024-06-02T15:03:35.735370+0000 2024-06-02T15:03:35.735370+0000 0 queued for scrub [3] the backfill_targets are 103(0) and 156(3) which are missing all 60223 objects # cat pg_7.109.query { "snap_trimq": "[]", "snap_trimq_len": 0, "state": "active+undersized+degraded+remapped+backfilling", "epoch": 2250, "up": [ 103, 146, 142, 156, 180, 89 ], "acting": [ 2147483647, 146, 142, 2147483647, 180, 89 ], "backfill_targets": [ "103(0)", "156(3)" ], "acting_recovery_backfill": [ "89(5)", "103(0)", "142(2)", "146(1)", "156(3)", "180(4)" ], .... "up": [ 103, 146, 142, 156, 180, 89 ], "acting": [ 2147483647, 146, 142, 2147483647, 180, 89 ], "avail_no_missing": [ "146(1)", "89(5)", "142(2)", "180(4)" ], "object_location_counts": [ { "shards": "89(5),142(2),146(1),180(4)", "objects": 60223 } ], "blocked_by": [], "up_primary": 103, "acting_primary": 146, "purged_snaps": [] }, [4] The 7.109 PG started backfiling for shard0 from osd.1(0) to osd.103(0) and for shard3 from osd.125(3) to osd.156(3). # grep -r "7\.109" /var/log/ceph/d41b35fc-20ef-11ef-bda4-000af7995d6c/ceph.log 2024-06-02T18:46:07.109482+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 6806 : cluster 0 pgmap v6810: 4769 pgs: 4769 active+clean; 10 TiB data, 28 TiB used, 333 TiB / 361 TiB avail; 586 MiB/s rd, 406 MiB/s wr, 190.59k op/s 2024-06-02T19:43:17.109450+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 8524 : cluster 0 pgmap v8508: 4769 pgs: 4769 active+clean; 11 TiB data, 30 TiB used, 331 TiB / 361 TiB avail; 428 MiB/s rd, 302 MiB/s wr, 140.31k op/s 2024-06-02T21:25:46.258120+0000 osd.146 (osd.146) 30 : cluster 0 7.109s1 starting backfill to osd.1(0) from (0'0,0'0] MAX to 620'50964 2024-06-02T21:53:27.109562+0000 osd.115 (osd.115) 1508 : cluster 3 10 slow requests (by type [ 'queued for pg' : 10 ] most affected pool [ 'default.rgw.buckets.data' : 10 ]) 2024-06-02T23:34:19.734731+0000 osd.146 (osd.146) 1128 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (615'50002,1471'51574] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1471'51574 2024-06-02T23:44:25.878502+0000 osd.146 (osd.146) 1149 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (615'50602,1505'52181] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1505'52181 2024-06-02T23:44:25.878518+0000 osd.146 (osd.146) 1150 : cluster 0 7.109s1 starting backfill to osd.125(3) from (0'0,0'0] MAX to 1505'52181 2024-06-03T00:30:47.109852+0000 osd.40 (osd.40) 225 : cluster 3 14 slow requests (by type [ 'queued for pg' : 14 ] most affected pool [ 'default.rgw.buckets.data' : 14 ]) 2024-06-03T02:15:39.176352+0000 osd.146 (osd.146) 1194 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (1635'57260,1773'58638] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1773'58638 2024-06-03T02:15:39.176482+0000 osd.146 (osd.146) 1196 : cluster 0 7.109s1 starting backfill to osd.7(3) from (0'0,0'0] MAX to 1773'58638 2024-06-03T02:15:56.207617+0000 osd.146 (osd.146) 1263 : cluster 0 7.109s1 starting backfill to osd.15(0) from (0'0,0'0] MAX to 1786'58641 2024-06-03T02:15:56.221911+0000 osd.146 (osd.146) 1268 : cluster 0 7.109s1 starting backfill to osd.125(3) from (1635'57260,1773'58638] MIN to 1786'58641 2024-06-03T02:16:10.467454+0000 osd.146 (osd.146) 1346 : cluster 0 7.109s1 starting backfill to osd.13(3) from (0'0,0'0] MAX to 1798'58643 2024-06-03T02:16:10.482341+0000 osd.146 (osd.146) 1349 : cluster 1 7.109s1 continuing backfill to osd.15(0) from (1635'57260,1798'58643] MIN to 1798'58643 2024-06-03T02:16:13.583865+0000 osd.146 (osd.146) 1363 : cluster 0 7.109s1 starting backfill to osd.13(3) from (0'0,0'0] MAX to 1798'58643 2024-06-03T02:16:13.590155+0000 osd.146 (osd.146) 1364 : cluster 0 7.109s1 starting backfill to osd.103(0) from (611'47500,611'49542] MAX to 1798'58643 2024-06-03T02:16:15.288379+0000 osd.146 (osd.146) 1378 : cluster 1 7.109s1 continuing backfill to osd.103(0) from (1635'57260,1802'58644] MIN to 1802'58644 2024-06-03T02:16:15.288394+0000 osd.146 (osd.146) 1379 : cluster 0 7.109s1 starting backfill to osd.156(3) from (615'50002,1471'51574] MAX to 1802'58644 2024-06-03T10:34:17.109297+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 35064 : cluster 0 pgmap v36537: 4769 pgs: 1 active+undersized+degraded+remapped+backfilling, 221 active+remapped+backfill_wait, 3731 active+undersized+degraded+remapped+backfill_wait, 816 active+clean; 16 TiB data, 36 TiB used, 325 TiB / 361 TiB avail; 313071902/1477314627 objects degraded (21.192%); 38533716/1477314627 objects misplaced (2.608%) 2024-06-03T17:47:27.109839+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 47921 : cluster 0 pgmap v49427: 4769 pgs: 1 active+undersized+degraded+remapped+backfilling, 221 active+remapped+backfill_wait, 3731 active+undersized+degraded+remapped+backfill_wait, 816 active+clean; 16 TiB data, 36 TiB used, 325 TiB / 361 TiB avail; 313071902/1477314630 objects degraded (21.192%); 38533716/1477314630 objects misplaced (2.608%) [5] Nothing stands out from debug logs collected for osd.146(1) but osd.146 only reports osd.103(0) and osd.156(3) missing all the 60223 objects to be backfilled osd.146 : 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <> waiting <> waiting_peering {} 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291) queued 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291)> waiting <> waiting_peering {} 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291) pg 0x55eafb3f4000 2024-06-03T20:41:09.478+0000 7fd764cdd700 10 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] do_peering_event: epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s) 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.142 ruub 106835.609375000s 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.180 ruub 106835.609375000s 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.89 ruub 106835.609375000s 2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until readable_until[_ub] 106835.609375000s (sent 106835.609375000s) 2024-06-03T20:41:11.198+0000 7fd77ff16700 1 -- 172.17.42.221:0/3633088982 <== osd.111 v2:172.17.42.172:6910/1246231084 20624 ==== osd_ping(ping_reply e2291 up_from 1791 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 66337.109375000s delta_ub -40484.214843750s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafbcf1080 con 0x55eca23f0800 2024-06-03T20:41:11.198+0000 7fd77ff16700 1 -- 172.16.42.221:0/3633088982 <== osd.112 v2:172.16.43.90:6812/1726708053 23629 ==== osd_ping(ping_reply e2291 up_from 209 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 106857.109375000s delta_ub 35.789901733s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafbcf1080 con 0x55eafe0f7400 2024-06-03T20:41:11.198+0000 7fd77f715700 1 -- 172.17.42.221:0/3633088982 <== osd.112 v2:172.17.43.90:6814/1726708053 23629 ==== osd_ping(ping_reply e2291 up_from 209 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 106857.109375000s delta_ub 35.790058136s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafb3fa180 con 0x55eafe0f6800 2024-06-03T20:41:11.199+0000 7fd77f715700 1 -- 172.16.42.221:0/3633088982 <== osd.111 v2:172.16.42.172:6908/1246231084 20624 ==== osd_ping(ping_reply e2291 up_from 1791 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 66337.109375000s delta_ub -40484.214843750s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafb3fa180 con 0x55eca23f1800 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] scrubber<NotActive>: update_scrub_stats: period: 15s/145s-> 145s last:2024-06-03T20:38:46 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats actingset 89(5),142(2),146(1),180(4) upset 89(5),103(0),142(2),146(1),156(3),180(4) acting_recovery_backfill 89(5),103(0),142(2),146(1),156(3),180(4) 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 146(1) primary objects 60223 missing 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 1(0) 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 1(0) objects 53145 missing 7078 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 7(3) 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 7(3) objects 0 missing 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 13(3) 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 13(3) objects 0 missing 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 15(0) 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 15(0) objects 0 missing 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 89(5) objects 53150 missing 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 103(0) objects 0 missing 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 125(3) 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 125(3) objects 0 missing 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 142(2) objects 53150 missing 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 156(3) objects 0 missing 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 180(4) objects 53150 missing 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats object_location_counts {89(5),142(2),146(1),180(4)=60223} 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 89(5) missing= 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 142(2) missing= 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 146(1) missing= 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 180(4) missing= 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 103(0) missing= 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 156(3) missing= 60223 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats degraded 120446 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats misplaced 0 2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] prepare_stats_for_publish reporting purged_snaps [] 2024-06-03T20:41:11.667+0000 7fd770614700 15 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] publish_stats_to_osd 2291:157075 2024-06-03T20:41:13.022+0000 7fd77ff16700 1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] <== osd.69 172.17.43.86:0/1315089877 23513 ==== osd_ping(ping e2291 up_from 344 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106717.109375000s delta_ub -106.039131165s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55ecbb3b1b00 con 0x55eca23f1c00 2024-06-03T20:41:13.022+0000 7fd77ff16700 1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] --> 172.17.43.86:0/1315089877 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106823.148437500s delta_ub 106.045753479s) v5 -- 0x55ecbb3b1080 con 0x55eca23f1c00 2024-06-03T20:41:13.022+0000 7fd77f715700 1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] <== osd.69 172.16.43.86:0/1315089877 23513 ==== osd_ping(ping e2291 up_from 344 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106717.109375000s delta_ub -106.039131165s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55ecbb3b0480 con 0x55eafdd6f800 2024-06-03T20:41:13.022+0000 7fd77f715700 1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] --> 172.16.43.86:0/1315089877 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106823.148437500s delta_ub 106.045982361s) v5 -- 0x55ecbb3b1680 con 0x55eafdd6f800 2024-06-03T20:41:16.668+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] scrubber<NotActive>: update_scrub_stats: period: 15s/145s-> 145s last:2024-06-03T20:41:11 2024-06-03T20:41:16.975+0000 7fd780717700 1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] --> 172.16.43.67:0/2023387295 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:16.974985+0000/106766.507812500s send_stamp 106827.109375000s delta_ub 60.594390869s) v5 -- 0x55eafddac480 con 0x55eafab5d000 2024-06-03T20:41:16.975+0000 7fd77ff16700 1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] --> 172.17.43.67:0/2023387295 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:16.974985+0000/106766.507812500s send_stamp 106827.109375000s delta_ub 60.594425201s) v5 -- 0x55eafd2eb080 con 0x55eafbca1400 2024-06-03T20:41:17.477+0000 7fd765cdf700 15 osd.146 2291 enqueue_peering_evt 7.109s1 epoch_sent: 1803 epoch_requested: 1803 RenewLease 2024-06-03T20:41:17.477+0000 7fd765cdf700 20 osd.146 op_wq(0) _enqueue OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 1803 epoch_requested: 1803 RenewLease) class_id 2 prio 255 cost 10 e1803) 2024-06-03T20:41:17.477+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <> waiting <> waiting_peering {} We are yet to collect some additional debug logs with debug_osd=30, debug_ms=5 and debug_mclock=5 level from the up/acting set OSDs. Meanwhile I have asked Aishwarya to check for the possible deadlock from mclock side as I am suspecting a mclock issue something on a similare line compared to backfill stuck issue reported by some of the upstream users https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#RBR4PIODUSHH3C3WE2BEAJ4UPSHHDO5X
*** Bug 2212945 has been marked as a duplicate of this bug. ***