Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read-only. Starting Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking management.

Bug 2392394

Summary: [8.1z backport] [GSS] last_purged_snaps_scrub updates can cause high frequency OSDMap updates leading to performance issues
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Bipin Kunal <bkunal>
Component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.1
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, jcaratza, nojha, pdhiran, rzarzyns, tserlin, vereddy, vumrao
Target Milestone: ---
Target Release: 8.1z3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-19.2.1-273
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2391692
Environment:
Last Closed: 2025-09-30 09:23:04 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2391692
Bug Blocks:

Description Bipin Kunal 2025-09-01 13:56:05 UTC
+++ This bug was initially created as a clone of Bug #2391692 +++

Description of problem:

When an OSD sends a beacon to the monitor, it includes superblock.last_purged_snaps_scrub. When the monitor handles the beacon and sees that this value has changed, it issues a new OSDMap. In large-scale environments with many purged snaps, this can produce a large number of new OSDMaps (20 per minute has been recorded). That can exacerbate, or perhaps cause, issues such as https://tracker.ceph.com/issues/72337, where excessive lock contention leads to the manager missing the beacon timeout and being failed over by the monitor, with all the performance problems that entails.

At the moment it is not clear where this value in the OSDMap is referenced for any valuable work, so I question whether it needs to be included in the OSDMap at all. The mechanism for scrubbing purged snaps appears to be entirely self-contained within src/osd/OSD.cc, and all we are doing is reporting. Also, superblock.last_purged_snaps_scrub is updated every time scrub_purged_snaps() is called, which means the next beacon will carry a changed value and cause an OSDMap update. In environments where many OSDs regularly scrub their purged snaps, this results in constant OSDMap churn:
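To make the feedback loop concrete, here is a minimal model (hypothetical names, not Ceph's actual classes or API) of the monitor-side check described above: any beacon carrying a changed last_purged_snaps_scrub timestamp forces a new OSDMap epoch for the whole cluster.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the monitor's prepare_beacon() decision: if the
# beacon's last_purged_snaps_scrub differs from what the current OSDMap
# records for that OSD, a new map epoch is proposed. Illustrative only.

@dataclass
class MonitorModel:
    epoch: int = 1
    last_scrub: dict = field(default_factory=dict)  # osd id -> timestamp string

    def handle_beacon(self, osd_id: int, last_purged_snaps_scrub: str) -> bool:
        """Return True when this beacon forces a new OSDMap epoch."""
        if self.last_scrub.get(osd_id) != last_purged_snaps_scrub:
            self.last_scrub[osd_id] = last_purged_snaps_scrub
            self.epoch += 1   # new OSDMap published to every OSD and client
            return True
        return False          # timestamp unchanged: no map churn

mon = MonitorModel()
first = mon.handle_beacon(586, "2025-08-01T04:48:50.877595+0000")   # changed -> epoch bump
repeat = mon.handle_beacon(586, "2025-08-01T04:48:50.877595+0000")  # unchanged -> no bump
```

In this model, a single OSD that keeps rescrubbing bumps the epoch on every beacon; with hundreds of OSDs doing the same, the rate adds up to the sort of churn recorded above.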

2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).paxosservice(osdmap 1627448..1627998) dispatch 0x563b45da6780 osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586 v2:1XX.1XX.41.110:7032/2510700571 con 0x563b1815b880
2025-08-01T04:48:50.876+0000 7fc55280b640  5 XXX@0(leader).paxos(paxos active c 143178935..143179587) is_readable = 1 - now=2025-08-01T04:48:50.878132+0000 lease_expire=2025-08-01T04:48:55.578002+0000 has v0 lc 143179587
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).osd e1627998 preprocess_query osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586 v2:1XX.1XX.41.110:7032/2510700571
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader) e6 no_reply to osd.586 v2:1XX.1XX.41.110:7032/2510700571 via v2:1XX.1XX.40.2:3300/0 for request osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998)
2025-08-01T04:48:50.876+0000 7fc55280b640 20 is_capable service=osd command= exec addr v2:1XX.1XX.41.110:7032/2510700571 on cap allow profile osd
2025-08-01T04:48:50.876+0000 7fc55280b640 20  allow so far , doing grant allow profile osd
2025-08-01T04:48:50.876+0000 7fc55280b640 20  match
2025-08-01T04:48:50.876+0000 7fc55280b640  7 XXX@0(leader).osd e1627998 prepare_update osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586 v2:1XX.1XX.41.110:7032/2510700571
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).osd e1627998 prepare_beacon osd_beacon(pgs [8.2d,3.17c,3.1103,2.2e0,3.3571,3.6c8,3.3418,3.3292,3.39f4,3.26d5,2.ab,3.20b4,3.32bb,3.1934,3.d95,3.2550,3.15e3,3.25f6,7.8,3.1f1b,3.16f6,3.28cc,3.1b34,3.207f] lec 1627997 last_purged_snaps_scrub 2025-08-01T04:48:50.877595+0000 osd_beacon_report_interval 300 v1627998) from osd.586
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).osd e1627998 should_propose
2025-08-01T04:48:50.876+0000 7fc55280b640 10 XXX@0(leader).paxosservice(osdmap 1627448..1627998)  setting proposal_timer 0x563b27ffac80 with delay of 0.695339

$ bin/osdmaptool --print --dump json-pretty ./osdmap-6.bin >map6dump.out
$ bin/osdmaptool --print --dump json-pretty ./osdmap-7.bin >map7dump.out
$ diff map6dump.out map7dump.out
2c2
<     "epoch": 1625346,
---
>     "epoch": 1625347,
5c5
<     "modified": "2025-07-31T03:41:23.283333+0000",
---
>     "modified": "2025-07-31T03:41:37.461253+0000",
172377c172377
<             "last_purged_snaps_scrub": "2025-07-29T23:08:47.986194+0000",
---
>             "last_purged_snaps_scrub": "2025-07-31T03:41:37.377661+0000",

$ diff map5dump.out map6dump.out 
2c2
<     "epoch": 1625345,
---
>     "epoch": 1625346,
5c5
<     "modified": "2025-07-31T03:41:11.295001+0000",
---
>     "modified": "2025-07-31T03:41:23.283333+0000",
186997c186997
<             "last_purged_snaps_scrub": "2025-07-29T16:39:43.390592+0000",
---
>             "last_purged_snaps_scrub": "2025-07-31T03:41:22.672210+0000",
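The diffs above can be reproduced mechanically. A small sketch (my own helper, not part of osdmaptool) that walks two map dumps, shaped like the `osdmaptool --dump json-pretty` output, and lists exactly which fields changed between consecutive epochs:

```python
def changed_fields(old: dict, new: dict, prefix: str = "") -> list:
    """Recursively collect dotted paths whose values differ between two dumps."""
    out = []
    for key in old.keys() | new.keys():
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            out.extend(changed_fields(a, b, path + "."))
        elif isinstance(a, list) and isinstance(b, list):
            for i, (x, y) in enumerate(zip(a, b)):
                if isinstance(x, dict) and isinstance(y, dict):
                    out.extend(changed_fields(x, y, f"{path}[{i}]."))
                elif x != y:
                    out.append(f"{path}[{i}]")
        elif a != b:
            out.append(path)
    return out

# Inline samples mirroring the real dumps above (heavily abridged).
map6 = {"epoch": 1625346, "modified": "2025-07-31T03:41:23.283333+0000",
        "osds": [{"osd": 586,
                  "last_purged_snaps_scrub": "2025-07-29T23:08:47.986194+0000"}]}
map7 = {"epoch": 1625347, "modified": "2025-07-31T03:41:37.461253+0000",
        "osds": [{"osd": 586,
                  "last_purged_snaps_scrub": "2025-07-31T03:41:37.377661+0000"}]}

diff = sorted(changed_fields(map6, map7))
```

On the abridged samples, only `epoch`, `modified`, and one OSD's `last_purged_snaps_scrub` differ, which matches the diffs above: the entire map epoch exists solely to carry one updated scrub timestamp.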

Strongly related to https://bugzilla.redhat.com/show_bug.cgi?id=2359626

Comment 12 errata-xmlrpc 2025-09-30 09:23:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.1 security, bug fix and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:17047