+++ This bug was initially created as a clone of Bug #2391692 +++
Description of problem:
When the osd sends a beacon to the monitor it includes superblock.last_purged_snaps_scrub, and when the monitor handles the beacon and sees that this value has changed it issues a new OSDMap. In large-scale environments with a lot of purged snaps, this can lead to a lot of new OSDMaps (a rate of 20 per minute has been recorded). That can exacerbate, or perhaps cause, issues such as https://tracker.ceph.com/issues/72337, where excessive lock contention leads to the manager missing the beacon timeout and being failed over by the monitor, with all the performance issues that entails.
At the moment it's not clear where this value in the OSDMap is referenced for any valuable work, so I question whether it needs to be included in the OSDMap at all. It seems to me that the mechanism for doing the purged_snaps scrub is entirely self-contained within src/osd/OSD.cc and all we are doing is reporting. Also, we seem to update superblock.last_purged_snaps_scrub every time we call scrub_purged_snaps(), which means the next beacon we send will carry a changed value and cause yet another OSDMap update. In environments where a lot of osds are regularly scrubbing, this adds up to constant OSDMap churn.
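To make the loop concrete, here is a minimal, self-contained model of the behaviour described above. This is an illustration only, not Ceph source: Beacon and Monitor::prepare_beacon merely mirror the shape of MOSDBeacon and OSDMonitor::prepare_beacon, and the real path commits a pending increment via Paxos rather than bumping a counter.

#include <cstdint>
#include <iostream>
#include <map>

struct Beacon {                        // stand-in for MOSDBeacon
  int osd_id;
  uint64_t last_purged_snaps_scrub;    // stand-in for the utime_t stamp
};

struct Monitor {                       // stand-in for OSDMonitor state
  uint64_t epoch = 1;
  std::map<int, uint64_t> per_osd_scrub_stamp;  // per-osd xinfo field

  // Returning true means "propose a pending increment", i.e. publish
  // a brand-new OSDMap epoch whose only change is this timestamp.
  bool prepare_beacon(const Beacon& b) {
    uint64_t& stored = per_osd_scrub_stamp[b.osd_id];
    if (b.last_purged_snaps_scrub > stored) {
      stored = b.last_purged_snaps_scrub;
      ++epoch;                         // new map for a mere report
      return true;
    }
    return false;                      // unchanged stamp: no new map
  }
};

int main() {
  Monitor mon;
  // Every scrub_purged_snaps() run bumps the osd's stamp, so every
  // following beacon forces another epoch; at scale, ~20 maps/minute.
  for (uint64_t stamp : {100, 200, 300}) {
    bool proposed = mon.prepare_beacon({586, stamp});
    std::cout << "beacon stamp=" << stamp << " proposed=" << proposed
              << " -> epoch " << mon.epoch << "\n";
  }
}

The monitor log excerpt below shows this path in practice: a beacon from osd.586 carrying a fresh last_purged_snaps_scrub reaches prepare_beacon, and the leader ends up setting a proposal timer for a new map: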
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).paxosservice(osdmap 1627448..1627998) dispatch 0x563b45da6780 osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586 v2:1XX.1XX.41.110:7032/2510700571 con 0x563b1815b880
2025-08-01T04:48:50.876+0000 7fc55280b640 5 XXX@0(leader).paxos(paxos active c 143178935..143179587) is_readable = 1 - now=2025-08-01T04:48:50.878132+0000 lease_expire=2025-08-01T04:48:55.578002+0000 has v0 lc 143179587
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).osd e1627998 preprocess_query osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586 v2:1XX.1XX.41.110:7032/2510700571
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader) e6 no_reply to osd.586 v2:1XX.1XX.41.110:7032/2510700571 via v2:1XX.1XX.40.2:3300/0 for request osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998)
2025-08-01T04:48:50.876+0000 7fc55280b640 20 is_capable service=osd command= exec addr v2:1XX.1XX.41.110:7032/2510700571 on cap allow profile osd
2025-08-01T04:48:50.876+0000 7fc55280b640 20 allow so far , doing grant allow profile osd
2025-08-01T04:48:50.876+0000 7fc55280b640 20 match
2025-08-01T04:48:50.876+0000 7fc55280b640 7 XXX@0(leader).osd e1627998 prepare_update osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586 v2:1XX.1XX.41.110:7032/2510700571
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).osd e1627998 prepare_beacon osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).osd e1627998 should_propose
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).paxosservice(osdmap 1627448..1627998) setting proposal_timer 0x563b27ffac80 with delay of 0.695339
Dumping consecutive OSDMaps with osdmaptool and diffing them confirms that the only content change between these epochs is a single osd's last_purged_snaps_scrub (plus the modified stamp):
$ bin/osdmaptool --print --dump json-pretty ./osdmap-6.bin >map6dump.out
$ bin/osdmaptool --print --dump json-pretty ./osdmap-7.bin >map7dump.out
$ diff map6dump.out map7dump.out
2c2
< "epoch": 1625346,
---
> "epoch": 1625347,
5c5
< "modified": "2025-07-31T03:41:23.283333+0000",
---
> "modified": "2025-07-31T03:41:37.461253+0000",
172377c172377
< "last_purged_snaps_scrub": "2025-07-29T23:08:47.986194+0000",
---
> "last_purged_snaps_scrub": "2025-07-31T03:41:37.377661+0000",
$ diff map5dump.out map6dump.out
2c2
< "epoch": 1625345,
---
> "epoch": 1625346,
5c5
< "modified": "2025-07-31T03:41:11.295001+0000",
---
> "modified": "2025-07-31T03:41:23.283333+0000",
186997c186997
< "last_purged_snaps_scrub": "2025-07-29T16:39:43.390592+0000",
---
> "last_purged_snaps_scrub": "2025-07-31T03:41:22.672210+0000",
Strongly related to https://bugzilla.redhat.com/show_bug.cgi?id=2359626
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Red Hat Ceph Storage 8.1 security, bug fix and enhancement updates), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2025:17047