Bug 2349077

Summary: [8.x] [Read Balancer] Make rm-pg-upmap-primary able to remove mappings by force
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Laura Flores <lflores>
Component: RADOS
Assignee: Laura Flores <lflores>
Status: VERIFIED ---
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified    
Version: 7.1
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, ngangadh, nojha, pdhange, pdhiran, tserlin, vumrao, yhatuka
Target Milestone: ---
Flags: lflores: needinfo-
Target Release: 8.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-19.2.1-70.el9cp
Doc Type: Enhancement
Doc Text:
Feature: A new command, `ceph osd rm-pg-upmap-primary-all`, has been added that allows users to clear all pg-upmap-primary mappings in the osdmap. As with the existing command `ceph osd rm-pg-upmap-primary <pgid>`, the new command should be used with caution: it directly modifies primary PG mappings and can affect read performance, although it does not cause any data movement.

Reason: Users who want to remove all pg-upmap-primary mappings can now do so with a single command. The command can also be used to remove invalid mappings left over from a bug in which pg-upmap-primary entries remained in the osdmap after a pool was deleted.

Result: After running the new command, all pg-upmap-primary mappings, both valid and invalid, are removed from the cluster.
Story Points: ---
Clone Of:
: 2357063 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2357063    
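
As a hedged illustration of the documented behavior (the command names are taken from the Doc Text above; it is an assumption that pg-upmap-primary entries appear as "pg_upmap_primary" lines in `ceph osd dump` output), clearing all mappings might look like:

    $ ceph osd dump | grep pg_upmap_primary    # list any current primary mappings
    $ ceph osd rm-pg-upmap-primary-all         # clear every pg-upmap-primary mapping at once
    $ ceph osd dump | grep pg_upmap_primary    # expect no output once all mappings are removed

No data movement should result, but reads may shift back to the default primaries, which is the performance caveat noted in the Doc Text.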

Description Laura Flores 2025-02-28 20:35:36 UTC
Description of problem:

Corresponding upstream tracker here: https://tracker.ceph.com/issues/69760

Essentially, the user was running a v18.2.1 cluster and hit BZ#2290580, which occurs when clients older than Reef are erroneously allowed to connect to the cluster while pg_upmap_primary, a Reef-only feature, is in use.

The user also hit BZ#2348970, which occurs when a pool is deleted and "phantom" pg_upmap_primary entries for that pool are left in the OSDMap. Therefore, the user cannot remove the pg_upmap_primary entries prior to upgrading from the broken encoder to the fixed encoder, which is the suggested workaround for BZ#2290580.

The idea for a fix is to provide an option to force removal of a "phantom" pg_upmap_primary mapping, and potentially to relax the assertion in the OSDMap encoder.

The net effect: although fixes for BZ#2290580 are already included in v18.2.4, users who hit the crash still have difficulty when they try to upgrade.
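
For context, a rough sketch of how an operator could identify and clear a "phantom" mapping once force removal is available (the pool ID 7 and PG 7.a below are hypothetical, and pg-upmap-primary entries are assumed to show up as "pg_upmap_primary" lines in `ceph osd dump`):

    $ ceph osd pool ls detail                 # confirm the pool (e.g. pool 7) no longer exists
    $ ceph osd dump | grep pg_upmap_primary   # e.g. "pg_upmap_primary 7.a 3" left over from the deleted pool
    $ ceph osd rm-pg-upmap-primary 7.a        # remove the single leftover mapping
    $ ceph osd rm-pg-upmap-primary-all        # or clear every mapping, per the new command described above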

Version-Release number of selected component (if applicable):
v18.2.1

Comment 1 Storage PM bot 2025-02-28 20:35:48 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.