Bug 2253672

Summary:	[RHCS 5.3] [GSS] ceph_abort_msg("past_interval start interval mismatch")
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	kelwhite
Component:	RADOS	Assignee:	Matan Breizman <mbreizma>
Status:	CLOSED ERRATA	QA Contact:	Harsh Kumar <hakumar>
Severity:	urgent	Docs Contact:	Ranjini M N <rmandyam>
Priority:	urgent
Version:	5.3	CC:	bhubbard, bhull, bkunal, ceph-eng-bugs, cephqe-warriors, crwayman, fwissing, gjose, hakumar, jquinn, linuxkidd, mbreizma, nojha, olim, pdhange, pdhiran, rmandyam, rzarzyns, sshome, tserlin, vdas, vumrao
Target Milestone:	---
Target Release:	5.3z6
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	ceph-16.2.10-233.el8cp	Doc Type:	If docs needed, set a value
Doc Text:	.The `check_past_interval_bounds` function uses `max_oldest_map` to calculate the start interval. Previously, the oldest OSDMap used to calculate the past interval bounds was the one local to the OSD rather than the `max_oldest_map` received from its peers. A specific OSD’s `oldest_map` can lag behind the `max_oldest_map` across all peers for a while. As a result, an assert would be triggered in `check_past_interval_bounds`. With this fix, `check_past_interval_bounds` uses the `max_oldest_map` (renamed to `cluster_osdmap_trim_lower_bound`) to calculate the start interval. In addition, the option `osd_skip_check_past_interval_bounds` is introduced to allow OSDs to recover from this issue after applying the fix.	Story Points:	---
Clone Of:
Clones:	2254065 2256752		Environment:
Last Closed:	2024-02-08 16:57:09 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2254065, 2256752
Bug Blocks:	2258797
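
The Doc Text above can be illustrated with a small model. This is a simplified sketch, not the Ceph implementation: the function names, the epoch values, and the exact form of the comparison are assumptions made for illustration. It only shows why a lagging local `oldest_map` could make the start-interval check fail, while the cluster-wide lower bound lets it pass.

```python
# Simplified model of the start-interval check described in the Doc Text.
# All names and epoch values here are hypothetical; this is not Ceph source.

def required_start(last_epoch_clean: int, oldest_map: int) -> int:
    """Earliest epoch the recorded past intervals are expected to reach back to."""
    return max(last_epoch_clean, oldest_map)


def start_bound_ok(actual_start: int, last_epoch_clean: int, oldest_map: int) -> bool:
    """True when the recorded past intervals start no later than required."""
    return actual_start <= required_start(last_epoch_clean, oldest_map)


# Hypothetical epochs: this OSD's own oldest map lags behind its peers.
local_oldest_map = 100          # oldest OSDMap still stored on this OSD
cluster_trim_lower_bound = 150  # max_oldest_map learned from peers
last_epoch_clean = 90
actual_start = 150              # past intervals already trimmed to the cluster bound

# Before the fix (assumed): the local oldest map drives the check, so it fails.
before_fix = start_bound_ok(actual_start, last_epoch_clean, local_oldest_map)

# After the fix (assumed): the cluster-wide lower bound drives the check, so it passes.
after_fix = start_bound_ok(actual_start, last_epoch_clean, cluster_trim_lower_bound)

print(before_fix, after_fix)
```

In this model, the failing case corresponds to the reported "past_interval start interval mismatch" abort.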

Description kelwhite 2023-12-08 18:28:41 UTC

Description of problem:
After a power failure on one Ceph node, the node was brought back up, and the OSDs now fail to start with the abort message: ceph_abort_msg("past_interval start interval mismatch").

Upstream tracker https://tracker.ceph.com/issues/49689
Gist: https://gist.github.com/Matan-B/ca564b6789f6ae6fc2ebc6a5b7e2aa69

Version-Release number of selected component (if applicable):
RHCS 5.3

Comment 64 errata-xmlrpc 2024-02-08 16:57:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745

Comment 68 Michael J. Kidd 2024-03-07 16:57:51 UTC

*** Bug 2267818 has been marked as a duplicate of this bug. ***