Bug 2349077

Summary: [8.x] [Read Balancer] Make rm-pg-upmap-primary able to remove mappings by force
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Laura Flores <lflores>
Component: RADOS
Assignee: Laura Flores <lflores>
Status: VERIFIED ---
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified    
Version: 7.1
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, ngangadh, nojha, pdhange, pdhiran, tserlin, vumrao, yhatuka
Target Milestone: ---
Flags: lflores: needinfo-
Target Release: 8.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-19.2.1-70.el9cp
Doc Type: Enhancement
Doc Text:
Feature: A new command, `ceph osd rm-pg-upmap-primary-all`, has been added that allows users to clear all pg-upmap-primary mappings in the osdmap. As with the existing command `ceph osd rm-pg-upmap-primary <pgid>`, the new command should be used with caution: it directly modifies primary PG mappings and can affect read performance, although it does not cause any data movement.

Reason: Users who want to remove all pg-upmap-primary mappings can now do so with a single command. The command can also be used to remove invalid mappings left over from a bug in which pg-upmap-primary entries remained in the osdmap after a pool was deleted.

Result: After running the new command, all pg-upmap-primary mappings, both valid and invalid, are removed from the cluster.
Story Points: ---
Clone Of:
: 2357063 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2357063    
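
As a hedged illustration of the documented behavior (the command names are taken from the Doc Text above; it is an assumption that pg-upmap-primary entries appear as "pg_upmap_primary" lines in `ceph osd dump` output), clearing all mappings might look like:

    $ ceph osd dump | grep pg_upmap_primary    # list any current primary mappings
    $ ceph osd rm-pg-upmap-primary-all         # clear every pg-upmap-primary mapping at once
    $ ceph osd dump | grep pg_upmap_primary    # expect no output once all mappings are removed

No data movement should result, but reads may shift back to the default primaries, which is the performance caveat noted in the Doc Text.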

Description Laura Flores 2025-02-28 20:35:36 UTC
Description of problem:

Corresponding upstream tracker here: https://tracker.ceph.com/issues/69760

Essentially, the user was running a v18.2.1 cluster and hit BZ#2290580, which occurs when clients older than Reef are erroneously allowed to connect to the cluster while pg_upmap_primary, a Reef-only feature, is in use.

The user also hit BZ#2348970, which occurs when a pool is deleted and "phantom" pg_upmap_primary entries for that pool are left in the OSDMap. Therefore, the user cannot remove the pg_upmap_primary entries prior to upgrading from the broken encoder to the fixed encoder, which is the suggested workaround for BZ#2290580.

The idea for a fix is to provide an option to force removal of a "phantom" pg_upmap_primary mapping, and potentially to relax the assertion in the OSDMap encoder.

The net effect: although fixes for BZ#2290580 are already included in v18.2.4, users who hit the crash still have difficulty when they try to upgrade.
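
For context, a rough sketch of how an operator could identify and clear a "phantom" mapping once force removal is available (the pool ID 7 and PG 7.a below are hypothetical, and pg-upmap-primary entries are assumed to show up as "pg_upmap_primary" lines in `ceph osd dump`):

    $ ceph osd pool ls detail                 # confirm the pool (e.g. pool 7) no longer exists
    $ ceph osd dump | grep pg_upmap_primary   # e.g. "pg_upmap_primary 7.a 3" left over from the deleted pool
    $ ceph osd rm-pg-upmap-primary 7.a        # remove the single leftover mapping
    $ ceph osd rm-pg-upmap-primary-all        # or clear every mapping, per the new command described above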

Version-Release number of selected component (if applicable):
v18.2.1

Comment 1 Storage PM bot 2025-02-28 20:35:48 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.