Bug 2265371

Summary: [RHCS 8.1] [Workload-DFG] pg stuck in backfill / recovery (bug 2212945)
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Tim Wilkinson <twilkins>
Component: RADOSAssignee: Prashant Dhange <pdhange>
Status: CLOSED ERRATA QA Contact: Harsh Kumar <hakumar>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.0CC: bhubbard, bkunal, ceph-eng-bugs, cephqe-warriors, mhackett, moagrawa, ngangadh, nojha, pdhange, rzarzyns, sseshasa, tserlin, vumrao
Target Milestone: ---   
Target Release: 8.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-19.2.1-72.el9cp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2025-06-26 12:10:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2351271, 2351272    

Description Tim Wilkinson 2024-02-21 17:09:11 UTC
Created attachment 2018018 [details]
ceph.log

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 15 Prashant Dhange 2024-06-04 18:35:44 UTC
[1] cluster stuck in backfilling for more than 24 hours
# ceph -s
  cluster:
    id:     d41b35fc-20ef-11ef-bda4-000af7995d6c
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 313071902/1477314630 objects degraded (21.192%), 3732 pgs degraded, 3732 pgs undersized
 
  services:
    mon: 3 daemons, quorum f28-h21-000-r630,f28-h22-000-r630,f28-h23-000-r630 (age 43h)
    mgr: f28-h23-000-r630.nbspst(active, since 2d), standbys: f28-h22-000-r630.kaarli, f28-h21-000-r630.eksini
    osd: 192 osds: 192 up (since 38h), 192 in (since 38h); 3953 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 8 daemons active (8 hosts, 1 zones)
 
  data:
    pools:   7 pools, 4769 pgs
    objects: 246.22M objects, 16 TiB
    usage:   36 TiB used, 325 TiB / 361 TiB avail
    pgs:     313071902/1477314630 objects degraded (21.192%)
             38533716/1477314630 objects misplaced (2.608%)
             3731 active+undersized+degraded+remapped+backfill_wait
             816  active+clean
             221  active+remapped+backfill_wait
             1    active+undersized+degraded+remapped+backfilling
 
  io:
    client:   31 KiB/s rd, 0 B/s wr, 31 op/s rd, 20 op/s wr


[2] PG 7.109 stuck in backfilling for more than 24 hours. Due to this more than 3000 PGs are stuck in backfill_wait.
# ceph pg ls backfilling
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES       OMAP_BYTES*  OMAP_KEYS*  LOG   LOG_DUPS  STATE                                            SINCE  VERSION     REPORTED     UP                            ACTING                          SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING
7.109    60223    120446          0        0  4162054144            0           0  1343      3000  active+undersized+degraded+remapped+backfilling    35h  1907'68103  2411:157724  [103,146,142,156,180,89]p103  [NONE,146,142,NONE,180,89]p146  2024-06-02T15:03:35.735370+0000  2024-06-02T15:03:35.735370+0000                    0  queued for scrub

[3] the backfill_targets are 103(0) and 156(3) which are missing all 60223 objects
# cat pg_7.109.query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "active+undersized+degraded+remapped+backfilling",
    "epoch": 2250,
    "up": [
        103,
        146,
        142,
        156,
        180,
        89
    ],
    "acting": [
        2147483647,
        146,
        142,
        2147483647,
        180,
        89
    ],
    "backfill_targets": [
        "103(0)",
        "156(3)"
    ],
    "acting_recovery_backfill": [
        "89(5)",
        "103(0)",
        "142(2)",
        "146(1)",
        "156(3)",
        "180(4)"
    ],
....
            "up": [
                103,
                146,
                142,
                156,
                180,
                89
            ],
            "acting": [
                2147483647,
                146,
                142,
                2147483647,
                180,
                89
            ],
            "avail_no_missing": [
                "146(1)",
                "89(5)",
                "142(2)",
                "180(4)"
            ],
            "object_location_counts": [
                {
                    "shards": "89(5),142(2),146(1),180(4)",
                    "objects": 60223
                }
            ],
            "blocked_by": [],
            "up_primary": 103,
            "acting_primary": 146,
            "purged_snaps": []
        },    
   
[4] The 7.109 PG started backfiling for shard0 from osd.1(0) to osd.103(0) and for shard3 from osd.125(3) to osd.156(3).
# grep -r "7\.109" /var/log/ceph/d41b35fc-20ef-11ef-bda4-000af7995d6c/ceph.log
2024-06-02T18:46:07.109482+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 6806 : cluster 0 pgmap v6810: 4769 pgs: 4769 active+clean; 10 TiB data, 28 TiB used, 333 TiB / 361 TiB avail; 586 MiB/s rd, 406 MiB/s wr, 190.59k op/s
2024-06-02T19:43:17.109450+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 8524 : cluster 0 pgmap v8508: 4769 pgs: 4769 active+clean; 11 TiB data, 30 TiB used, 331 TiB / 361 TiB avail; 428 MiB/s rd, 302 MiB/s wr, 140.31k op/s
2024-06-02T21:25:46.258120+0000 osd.146 (osd.146) 30 : cluster 0 7.109s1 starting backfill to osd.1(0) from (0'0,0'0] MAX to 620'50964
2024-06-02T21:53:27.109562+0000 osd.115 (osd.115) 1508 : cluster 3 10 slow requests (by type [ 'queued for pg' : 10 ] most affected pool [ 'default.rgw.buckets.data' : 10 ])
2024-06-02T23:34:19.734731+0000 osd.146 (osd.146) 1128 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (615'50002,1471'51574] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1471'51574
2024-06-02T23:44:25.878502+0000 osd.146 (osd.146) 1149 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (615'50602,1505'52181] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1505'52181
2024-06-02T23:44:25.878518+0000 osd.146 (osd.146) 1150 : cluster 0 7.109s1 starting backfill to osd.125(3) from (0'0,0'0] MAX to 1505'52181
2024-06-03T00:30:47.109852+0000 osd.40 (osd.40) 225 : cluster 3 14 slow requests (by type [ 'queued for pg' : 14 ] most affected pool [ 'default.rgw.buckets.data' : 14 ])
2024-06-03T02:15:39.176352+0000 osd.146 (osd.146) 1194 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (1635'57260,1773'58638] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1773'58638
2024-06-03T02:15:39.176482+0000 osd.146 (osd.146) 1196 : cluster 0 7.109s1 starting backfill to osd.7(3) from (0'0,0'0] MAX to 1773'58638
2024-06-03T02:15:56.207617+0000 osd.146 (osd.146) 1263 : cluster 0 7.109s1 starting backfill to osd.15(0) from (0'0,0'0] MAX to 1786'58641
2024-06-03T02:15:56.221911+0000 osd.146 (osd.146) 1268 : cluster 0 7.109s1 starting backfill to osd.125(3) from (1635'57260,1773'58638] MIN to 1786'58641
2024-06-03T02:16:10.467454+0000 osd.146 (osd.146) 1346 : cluster 0 7.109s1 starting backfill to osd.13(3) from (0'0,0'0] MAX to 1798'58643
2024-06-03T02:16:10.482341+0000 osd.146 (osd.146) 1349 : cluster 1 7.109s1 continuing backfill to osd.15(0) from (1635'57260,1798'58643] MIN to 1798'58643
2024-06-03T02:16:13.583865+0000 osd.146 (osd.146) 1363 : cluster 0 7.109s1 starting backfill to osd.13(3) from (0'0,0'0] MAX to 1798'58643
2024-06-03T02:16:13.590155+0000 osd.146 (osd.146) 1364 : cluster 0 7.109s1 starting backfill to osd.103(0) from (611'47500,611'49542] MAX to 1798'58643
2024-06-03T02:16:15.288379+0000 osd.146 (osd.146) 1378 : cluster 1 7.109s1 continuing backfill to osd.103(0) from (1635'57260,1802'58644] MIN to 1802'58644
2024-06-03T02:16:15.288394+0000 osd.146 (osd.146) 1379 : cluster 0 7.109s1 starting backfill to osd.156(3) from (615'50002,1471'51574] MAX to 1802'58644
2024-06-03T10:34:17.109297+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 35064 : cluster 0 pgmap v36537: 4769 pgs: 1 active+undersized+degraded+remapped+backfilling, 221 active+remapped+backfill_wait, 3731 active+undersized+degraded+remapped+backfill_wait, 816 active+clean; 16 TiB data, 36 TiB used, 325 TiB / 361 TiB avail; 313071902/1477314627 objects degraded (21.192%); 38533716/1477314627 objects misplaced (2.608%)
2024-06-03T17:47:27.109839+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 47921 : cluster 0 pgmap v49427: 4769 pgs: 1 active+undersized+degraded+remapped+backfilling, 221 active+remapped+backfill_wait, 3731 active+undersized+degraded+remapped+backfill_wait, 816 active+clean; 16 TiB data, 36 TiB used, 325 TiB / 361 TiB avail; 313071902/1477314630 objects degraded (21.192%); 38533716/1477314630 objects misplaced (2.608%)

[5] Nothing stands out from debug logs collected for osd.146(1) but osd.146 only reports osd.103(0) and osd.156(3) missing all the 60223 objects to be backfilled
osd.146 :
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <> waiting <> waiting_peering {}
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291) queued
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291)> waiting <> waiting_peering {}
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291) pg 0x55eafb3f4000
2024-06-03T20:41:09.478+0000 7fd764cdd700 10 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] do_peering_event: epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.142 ruub 106835.609375000s
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.180 ruub 106835.609375000s
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.89 ruub 106835.609375000s
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until readable_until[_ub] 106835.609375000s (sent 106835.609375000s)
2024-06-03T20:41:11.198+0000 7fd77ff16700  1 -- 172.17.42.221:0/3633088982 <== osd.111 v2:172.17.42.172:6910/1246231084 20624 ==== osd_ping(ping_reply e2291 up_from 1791 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 66337.109375000s delta_ub -40484.214843750s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafbcf1080 con 0x55eca23f0800
2024-06-03T20:41:11.198+0000 7fd77ff16700  1 -- 172.16.42.221:0/3633088982 <== osd.112 v2:172.16.43.90:6812/1726708053 23629 ==== osd_ping(ping_reply e2291 up_from 209 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 106857.109375000s delta_ub 35.789901733s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafbcf1080 con 0x55eafe0f7400
2024-06-03T20:41:11.198+0000 7fd77f715700  1 -- 172.17.42.221:0/3633088982 <== osd.112 v2:172.17.43.90:6814/1726708053 23629 ==== osd_ping(ping_reply e2291 up_from 209 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 106857.109375000s delta_ub 35.790058136s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafb3fa180 con 0x55eafe0f6800
2024-06-03T20:41:11.199+0000 7fd77f715700  1 -- 172.16.42.221:0/3633088982 <== osd.111 v2:172.16.42.172:6908/1246231084 20624 ==== osd_ping(ping_reply e2291 up_from 1791 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 66337.109375000s delta_ub -40484.214843750s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafb3fa180 con 0x55eca23f1800
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] scrubber<NotActive>: update_scrub_stats: period: 15s/145s-> 145s last:2024-06-03T20:38:46
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats actingset 89(5),142(2),146(1),180(4) upset 89(5),103(0),142(2),146(1),156(3),180(4) acting_recovery_backfill 89(5),103(0),142(2),146(1),156(3),180(4)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 146(1) primary objects 60223 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 1(0)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 1(0) objects 53145 missing 7078
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 7(3)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 7(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 13(3)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 13(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 15(0)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 15(0) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 89(5) objects 53150 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 103(0) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 125(3)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 125(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 142(2) objects 53150 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 156(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 180(4) objects 53150 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats object_location_counts {89(5),142(2),146(1),180(4)=60223}
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 89(5) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 142(2) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 146(1) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 180(4) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 103(0) missing= 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 156(3) missing= 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats degraded 120446
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats misplaced 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] prepare_stats_for_publish reporting purged_snaps []
2024-06-03T20:41:11.667+0000 7fd770614700 15 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] publish_stats_to_osd 2291:157075
2024-06-03T20:41:13.022+0000 7fd77ff16700  1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] <== osd.69 172.17.43.86:0/1315089877 23513 ==== osd_ping(ping e2291 up_from 344 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106717.109375000s delta_ub -106.039131165s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55ecbb3b1b00 con 0x55eca23f1c00
2024-06-03T20:41:13.022+0000 7fd77ff16700  1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] --> 172.17.43.86:0/1315089877 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106823.148437500s delta_ub 106.045753479s) v5 -- 0x55ecbb3b1080 con 0x55eca23f1c00
2024-06-03T20:41:13.022+0000 7fd77f715700  1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] <== osd.69 172.16.43.86:0/1315089877 23513 ==== osd_ping(ping e2291 up_from 344 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106717.109375000s delta_ub -106.039131165s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55ecbb3b0480 con 0x55eafdd6f800
2024-06-03T20:41:13.022+0000 7fd77f715700  1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] --> 172.16.43.86:0/1315089877 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106823.148437500s delta_ub 106.045982361s) v5 -- 0x55ecbb3b1680 con 0x55eafdd6f800
2024-06-03T20:41:16.668+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] scrubber<NotActive>: update_scrub_stats: period: 15s/145s-> 145s last:2024-06-03T20:41:11
2024-06-03T20:41:16.975+0000 7fd780717700  1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] --> 172.16.43.67:0/2023387295 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:16.974985+0000/106766.507812500s send_stamp 106827.109375000s delta_ub 60.594390869s) v5 -- 0x55eafddac480 con 0x55eafab5d000
2024-06-03T20:41:16.975+0000 7fd77ff16700  1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] --> 172.17.43.67:0/2023387295 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:16.974985+0000/106766.507812500s send_stamp 106827.109375000s delta_ub 60.594425201s) v5 -- 0x55eafd2eb080 con 0x55eafbca1400
2024-06-03T20:41:17.477+0000 7fd765cdf700 15 osd.146 2291 enqueue_peering_evt 7.109s1 epoch_sent: 1803 epoch_requested: 1803 RenewLease
2024-06-03T20:41:17.477+0000 7fd765cdf700 20 osd.146 op_wq(0) _enqueue OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 1803 epoch_requested: 1803 RenewLease) class_id 2 prio 255 cost 10 e1803)
2024-06-03T20:41:17.477+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <> waiting <> waiting_peering {}

We are yet to collect some additional debug logs with debug_osd=30, debug_ms=5 and debug_mclock=5 level from the up/acting set OSDs.

Meanwhile I have asked Aishwarya to check for the possible deadlock from mclock side as I am suspecting a mclock issue something on a similare line compared to backfill stuck issue reported by some of the upstream users https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#RBR4PIODUSHH3C3WE2BEAJ4UPSHHDO5X

Comment 17 Vikhyat Umrao 2024-06-05 14:50:35 UTC
*** Bug 2212945 has been marked as a duplicate of this bug. ***

Comment 63 errata-xmlrpc 2025-06-26 12:10:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775