Bug 2265371 - [RHCS 8.1] [Workload-DFG] pg stuck in backfill / recovery (bug 2212945)
Summary: [RHCS 8.1] [Workload-DFG] pg stuck in backfill / recovery (bug 2212945)
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 7.0
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
: 8.1
Assignee: Prashant Dhange
QA Contact: Harsh Kumar
URL:
Whiteboard:
: 2212945 2292743 (view as bug list)
Depends On:
Blocks: 2351272 2351271
TreeView+ depends on / blocked
 
Reported: 2024-02-21 17:09 UTC by Tim Wilkinson
Modified: 2025-04-28 02:56 UTC (History)
13 users (show)

Fixed In Version: ceph-19.2.1-72.el9cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 61594 0 None None None 2024-06-05 14:46:14 UTC
Red Hat Bugzilla 2212945 0 medium CLOSED [Workload-DFG] pg stuck in recovery 2024-06-05 14:54:51 UTC
Red Hat Issue Tracker RHCEPH-8359 0 None None None 2024-02-21 17:14:12 UTC

Description Tim Wilkinson 2024-02-21 17:09:11 UTC
Created attachment 2018018 [details]
ceph.log

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 15 Prashant Dhange 2024-06-04 18:35:44 UTC
[1] cluster stuck in backfilling for more than 24 hours
# ceph -s
  cluster:
    id:     d41b35fc-20ef-11ef-bda4-000af7995d6c
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 313071902/1477314630 objects degraded (21.192%), 3732 pgs degraded, 3732 pgs undersized
 
  services:
    mon: 3 daemons, quorum f28-h21-000-r630,f28-h22-000-r630,f28-h23-000-r630 (age 43h)
    mgr: f28-h23-000-r630.nbspst(active, since 2d), standbys: f28-h22-000-r630.kaarli, f28-h21-000-r630.eksini
    osd: 192 osds: 192 up (since 38h), 192 in (since 38h); 3953 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 8 daemons active (8 hosts, 1 zones)
 
  data:
    pools:   7 pools, 4769 pgs
    objects: 246.22M objects, 16 TiB
    usage:   36 TiB used, 325 TiB / 361 TiB avail
    pgs:     313071902/1477314630 objects degraded (21.192%)
             38533716/1477314630 objects misplaced (2.608%)
             3731 active+undersized+degraded+remapped+backfill_wait
             816  active+clean
             221  active+remapped+backfill_wait
             1    active+undersized+degraded+remapped+backfilling
 
  io:
    client:   31 KiB/s rd, 0 B/s wr, 31 op/s rd, 20 op/s wr


[2] PG 7.109 stuck in backfilling for more than 24 hours. Due to this more than 3000 PGs are stuck in backfill_wait.
# ceph pg ls backfilling
PG     OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES       OMAP_BYTES*  OMAP_KEYS*  LOG   LOG_DUPS  STATE                                            SINCE  VERSION     REPORTED     UP                            ACTING                          SCRUB_STAMP                      DEEP_SCRUB_STAMP                 LAST_SCRUB_DURATION  SCRUB_SCHEDULING
7.109    60223    120446          0        0  4162054144            0           0  1343      3000  active+undersized+degraded+remapped+backfilling    35h  1907'68103  2411:157724  [103,146,142,156,180,89]p103  [NONE,146,142,NONE,180,89]p146  2024-06-02T15:03:35.735370+0000  2024-06-02T15:03:35.735370+0000                    0  queued for scrub

[3] the backfill_targets are 103(0) and 156(3) which are missing all 60223 objects
# cat pg_7.109.query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "active+undersized+degraded+remapped+backfilling",
    "epoch": 2250,
    "up": [
        103,
        146,
        142,
        156,
        180,
        89
    ],
    "acting": [
        2147483647,
        146,
        142,
        2147483647,
        180,
        89
    ],
    "backfill_targets": [
        "103(0)",
        "156(3)"
    ],
    "acting_recovery_backfill": [
        "89(5)",
        "103(0)",
        "142(2)",
        "146(1)",
        "156(3)",
        "180(4)"
    ],
....
            "up": [
                103,
                146,
                142,
                156,
                180,
                89
            ],
            "acting": [
                2147483647,
                146,
                142,
                2147483647,
                180,
                89
            ],
            "avail_no_missing": [
                "146(1)",
                "89(5)",
                "142(2)",
                "180(4)"
            ],
            "object_location_counts": [
                {
                    "shards": "89(5),142(2),146(1),180(4)",
                    "objects": 60223
                }
            ],
            "blocked_by": [],
            "up_primary": 103,
            "acting_primary": 146,
            "purged_snaps": []
        },    
   
[4] The 7.109 PG started backfiling for shard0 from osd.1(0) to osd.103(0) and for shard3 from osd.125(3) to osd.156(3).
# grep -r "7\.109" /var/log/ceph/d41b35fc-20ef-11ef-bda4-000af7995d6c/ceph.log
2024-06-02T18:46:07.109482+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 6806 : cluster 0 pgmap v6810: 4769 pgs: 4769 active+clean; 10 TiB data, 28 TiB used, 333 TiB / 361 TiB avail; 586 MiB/s rd, 406 MiB/s wr, 190.59k op/s
2024-06-02T19:43:17.109450+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 8524 : cluster 0 pgmap v8508: 4769 pgs: 4769 active+clean; 11 TiB data, 30 TiB used, 331 TiB / 361 TiB avail; 428 MiB/s rd, 302 MiB/s wr, 140.31k op/s
2024-06-02T21:25:46.258120+0000 osd.146 (osd.146) 30 : cluster 0 7.109s1 starting backfill to osd.1(0) from (0'0,0'0] MAX to 620'50964
2024-06-02T21:53:27.109562+0000 osd.115 (osd.115) 1508 : cluster 3 10 slow requests (by type [ 'queued for pg' : 10 ] most affected pool [ 'default.rgw.buckets.data' : 10 ])
2024-06-02T23:34:19.734731+0000 osd.146 (osd.146) 1128 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (615'50002,1471'51574] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1471'51574
2024-06-02T23:44:25.878502+0000 osd.146 (osd.146) 1149 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (615'50602,1505'52181] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1505'52181
2024-06-02T23:44:25.878518+0000 osd.146 (osd.146) 1150 : cluster 0 7.109s1 starting backfill to osd.125(3) from (0'0,0'0] MAX to 1505'52181
2024-06-03T00:30:47.109852+0000 osd.40 (osd.40) 225 : cluster 3 14 slow requests (by type [ 'queued for pg' : 14 ] most affected pool [ 'default.rgw.buckets.data' : 14 ])
2024-06-03T02:15:39.176352+0000 osd.146 (osd.146) 1194 : cluster 1 7.109s1 continuing backfill to osd.1(0) from (1635'57260,1773'58638] 7:9085d2cc:::b5aa4475-a87c-41d2-a21a-65355e281454.19284.189_hX2YOVGF%2f77781.NmZhvuybxNiNp2xC.rnd:head to 1773'58638
2024-06-03T02:15:39.176482+0000 osd.146 (osd.146) 1196 : cluster 0 7.109s1 starting backfill to osd.7(3) from (0'0,0'0] MAX to 1773'58638
2024-06-03T02:15:56.207617+0000 osd.146 (osd.146) 1263 : cluster 0 7.109s1 starting backfill to osd.15(0) from (0'0,0'0] MAX to 1786'58641
2024-06-03T02:15:56.221911+0000 osd.146 (osd.146) 1268 : cluster 0 7.109s1 starting backfill to osd.125(3) from (1635'57260,1773'58638] MIN to 1786'58641
2024-06-03T02:16:10.467454+0000 osd.146 (osd.146) 1346 : cluster 0 7.109s1 starting backfill to osd.13(3) from (0'0,0'0] MAX to 1798'58643
2024-06-03T02:16:10.482341+0000 osd.146 (osd.146) 1349 : cluster 1 7.109s1 continuing backfill to osd.15(0) from (1635'57260,1798'58643] MIN to 1798'58643
2024-06-03T02:16:13.583865+0000 osd.146 (osd.146) 1363 : cluster 0 7.109s1 starting backfill to osd.13(3) from (0'0,0'0] MAX to 1798'58643
2024-06-03T02:16:13.590155+0000 osd.146 (osd.146) 1364 : cluster 0 7.109s1 starting backfill to osd.103(0) from (611'47500,611'49542] MAX to 1798'58643
2024-06-03T02:16:15.288379+0000 osd.146 (osd.146) 1378 : cluster 1 7.109s1 continuing backfill to osd.103(0) from (1635'57260,1802'58644] MIN to 1802'58644
2024-06-03T02:16:15.288394+0000 osd.146 (osd.146) 1379 : cluster 0 7.109s1 starting backfill to osd.156(3) from (615'50002,1471'51574] MAX to 1802'58644
2024-06-03T10:34:17.109297+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 35064 : cluster 0 pgmap v36537: 4769 pgs: 1 active+undersized+degraded+remapped+backfilling, 221 active+remapped+backfill_wait, 3731 active+undersized+degraded+remapped+backfill_wait, 816 active+clean; 16 TiB data, 36 TiB used, 325 TiB / 361 TiB avail; 313071902/1477314627 objects degraded (21.192%); 38533716/1477314627 objects misplaced (2.608%)
2024-06-03T17:47:27.109839+0000 mgr.f28-h23-000-r630.nbspst (mgr.14283) 47921 : cluster 0 pgmap v49427: 4769 pgs: 1 active+undersized+degraded+remapped+backfilling, 221 active+remapped+backfill_wait, 3731 active+undersized+degraded+remapped+backfill_wait, 816 active+clean; 16 TiB data, 36 TiB used, 325 TiB / 361 TiB avail; 313071902/1477314630 objects degraded (21.192%); 38533716/1477314630 objects misplaced (2.608%)

[5] Nothing stands out from debug logs collected for osd.146(1) but osd.146 only reports osd.103(0) and osd.156(3) missing all the 60223 objects to be backfilled
osd.146 :
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <> waiting <> waiting_peering {}
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291) queued
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291)> waiting <> waiting_peering {}
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 op_wq(0) _process OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)) class_id 2 prio 255 cost 10 e2291) pg 0x55eafb3f4000
2024-06-03T20:41:09.478+0000 7fd764cdd700 10 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] do_peering_event: epoch_sent: 2291 epoch_requested: 2291 MLeaseAck epoch 2291 from osd.180 pg_lease_ack(ruub 106835.609375000s)
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.142 ruub 106835.609375000s
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.180 ruub 106835.609375000s
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until peer osd.89 ruub 106835.609375000s
2024-06-03T20:41:09.478+0000 7fd764cdd700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] recalc_readable_until readable_until[_ub] 106835.609375000s (sent 106835.609375000s)
2024-06-03T20:41:11.198+0000 7fd77ff16700  1 -- 172.17.42.221:0/3633088982 <== osd.111 v2:172.17.42.172:6910/1246231084 20624 ==== osd_ping(ping_reply e2291 up_from 1791 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 66337.109375000s delta_ub -40484.214843750s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafbcf1080 con 0x55eca23f0800
2024-06-03T20:41:11.198+0000 7fd77ff16700  1 -- 172.16.42.221:0/3633088982 <== osd.112 v2:172.16.43.90:6812/1726708053 23629 ==== osd_ping(ping_reply e2291 up_from 209 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 106857.109375000s delta_ub 35.789901733s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafbcf1080 con 0x55eafe0f7400
2024-06-03T20:41:11.198+0000 7fd77f715700  1 -- 172.17.42.221:0/3633088982 <== osd.112 v2:172.17.43.90:6814/1726708053 23629 ==== osd_ping(ping_reply e2291 up_from 209 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 106857.109375000s delta_ub 35.790058136s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafb3fa180 con 0x55eafe0f6800
2024-06-03T20:41:11.199+0000 7fd77f715700  1 -- 172.16.42.221:0/3633088982 <== osd.111 v2:172.16.42.172:6908/1246231084 20624 ==== osd_ping(ping_reply e2291 up_from 1791 ping_stamp 2024-06-03T20:41:11.191449+0000/106821.320312500s send_stamp 66337.109375000s delta_ub -40484.214843750s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55eafb3fa180 con 0x55eca23f1800
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] scrubber<NotActive>: update_scrub_stats: period: 15s/145s-> 145s last:2024-06-03T20:38:46
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats actingset 89(5),142(2),146(1),180(4) upset 89(5),103(0),142(2),146(1),156(3),180(4) acting_recovery_backfill 89(5),103(0),142(2),146(1),156(3),180(4)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 146(1) primary objects 60223 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 1(0)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 1(0) objects 53145 missing 7078
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 7(3)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 7(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 13(3)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 13(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 15(0)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 15(0) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 89(5) objects 53150 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 103(0) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats no peer_missing found for 125(3)
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 125(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 142(2) objects 53150 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 156(3) objects 0 missing 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats shard 180(4) objects 53150 missing 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats object_location_counts {89(5),142(2),146(1),180(4)=60223}
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 89(5) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 142(2) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 146(1) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 180(4) missing= 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 103(0) missing= 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats missing shard 156(3) missing= 60223
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats degraded 120446
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] update_calc_stats misplaced 0
2024-06-03T20:41:11.667+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] prepare_stats_for_publish reporting purged_snaps []
2024-06-03T20:41:11.667+0000 7fd770614700 15 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] publish_stats_to_osd 2291:157075
2024-06-03T20:41:13.022+0000 7fd77ff16700  1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] <== osd.69 172.17.43.86:0/1315089877 23513 ==== osd_ping(ping e2291 up_from 344 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106717.109375000s delta_ub -106.039131165s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55ecbb3b1b00 con 0x55eca23f1c00
2024-06-03T20:41:13.022+0000 7fd77ff16700  1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] --> 172.17.43.86:0/1315089877 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106823.148437500s delta_ub 106.045753479s) v5 -- 0x55ecbb3b1080 con 0x55eca23f1c00
2024-06-03T20:41:13.022+0000 7fd77f715700  1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] <== osd.69 172.16.43.86:0/1315089877 23513 ==== osd_ping(ping e2291 up_from 344 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106717.109375000s delta_ub -106.039131165s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55ecbb3b0480 con 0x55eafdd6f800
2024-06-03T20:41:13.022+0000 7fd77f715700  1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] --> 172.16.43.86:0/1315089877 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:13.021291+0000/106717.109375000s send_stamp 106823.148437500s delta_ub 106.045982361s) v5 -- 0x55ecbb3b1680 con 0x55eafdd6f800
2024-06-03T20:41:16.668+0000 7fd770614700 20 osd.146 pg_epoch: 2291 pg[7.109s1( v 1907'68103 (1891'66760,1907'68103] local-lis/les=1803/1804 n=60223 ec=412/387 lis/c=1803/610 les/c/f=1804/611/0 sis=1803) [103,146,142,156,180,89]/[NONE,146,142,NONE,180,89]p146(1) backfill=[103(0),156(3)] r=1 lpr=1803 pi=[610,1803)/1 crt=1907'68103 lcod 1907'68102 mlcod 1907'68102 active+undersized+degraded+remapped+backfilling mbc={}] scrubber<NotActive>: update_scrub_stats: period: 15s/145s-> 145s last:2024-06-03T20:41:11
2024-06-03T20:41:16.975+0000 7fd780717700  1 -- [v2:172.16.42.221:6852/3633088982,v1:172.16.42.221:6853/3633088982] --> 172.16.43.67:0/2023387295 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:16.974985+0000/106766.507812500s send_stamp 106827.109375000s delta_ub 60.594390869s) v5 -- 0x55eafddac480 con 0x55eafab5d000
2024-06-03T20:41:16.975+0000 7fd77ff16700  1 -- [v2:172.17.42.221:6854/3633088982,v1:172.17.42.221:6855/3633088982] --> 172.17.43.67:0/2023387295 -- osd_ping(ping_reply e2291 up_from 245 ping_stamp 2024-06-03T20:41:16.974985+0000/106766.507812500s send_stamp 106827.109375000s delta_ub 60.594425201s) v5 -- 0x55eafd2eb080 con 0x55eafbca1400
2024-06-03T20:41:17.477+0000 7fd765cdf700 15 osd.146 2291 enqueue_peering_evt 7.109s1 epoch_sent: 1803 epoch_requested: 1803 RenewLease
2024-06-03T20:41:17.477+0000 7fd765cdf700 20 osd.146 op_wq(0) _enqueue OpSchedulerItem(7.109s1 PGPeeringEvent(epoch_sent: 1803 epoch_requested: 1803 RenewLease) class_id 2 prio 255 cost 10 e1803)
2024-06-03T20:41:17.477+0000 7fd764cdd700 20 osd.146 op_wq(0) _process 7.109s1 to_process <> waiting <> waiting_peering {}

We are yet to collect some additional debug logs with debug_osd=30, debug_ms=5 and debug_mclock=5 level from the up/acting set OSDs.

Meanwhile I have asked Aishwarya to check for the possible deadlock from mclock side as I am suspecting a mclock issue something on a similare line compared to backfill stuck issue reported by some of the upstream users https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#RBR4PIODUSHH3C3WE2BEAJ4UPSHHDO5X

Comment 17 Vikhyat Umrao 2024-06-05 14:50:35 UTC
*** Bug 2212945 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.