Bug 2267818 - OSDs are failing with error "ceph_abort_msg("past_interval start interval mismatch")" - 15 PGs Incomplete Post Upgrade to Ceph v5.3z6
Summary: OSDs are failing with error "ceph_abort_msg("past_interval start interval mismatch")" - 15 PGs Incomplete Post Upgrade to Ceph v5.3z6
Keywords:
Status: CLOSED DUPLICATE of bug 2253672
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-03-04 21:08 UTC by Craig Wayman
Modified: 2024-09-05 10:59 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-07 16:57:51 UTC
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-8427 (last updated 2024-03-04 21:10:12 UTC)

Description Craig Wayman 2024-03-04 21:08:15 UTC
Description of problem:

  It seems the customer was hitting the same issue outlined in: 

https://bugzilla.redhat.com/show_bug.cgi?id=2253672

  We applied the fix by having the customer upgrade to Ceph v5.3z6, using the method outlined in: https://bugzilla.redhat.com/show_bug.cgi?id=2253672#c46

  The customer managed to bring up 108 of the 120 OSDs; however, they had issues bringing the following OSDs back up/in:

ID   CLASS  WEIGHT    TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         61.12427  root default
 -9         17.46460      host ceph01-ap-perf
  1    ssd   0.72769          osd.1              down         0  1.00000
 10    ssd   0.72769          osd.10             down         0  1.00000
 86    ssd   0.72769          osd.86             down         0  1.00000
 89    ssd   0.72769          osd.89             down         0  1.00000
-11         17.46460      host ceph02-ap-perf
107    ssd   0.72769          osd.107            down         0  1.00000
108    ssd   0.72769          osd.108            down         0  1.00000
109    ssd   0.72769          osd.109            down         0  1.00000
113    ssd   0.72769          osd.113            down         0  1.00000
117    ssd   0.72769          osd.117            down         0  1.00000
 -3         17.46460      host ceph03-ap-perf
 30    ssd   0.72769          osd.30             down         0  1.00000
 72    ssd   0.72769          osd.72             down         0  1.00000
 -7          4.36523      host ceph1
  5    ssd   0.18188          osd.5              down         0  1.00000


  We have sos-reports for the nodes above. We’ll most likely request the logs for all of those OSDs as well. 

  The following process brought those OSDs back up/in: 

-  $ ceph config set osd.<osdid> osd_skip_check_past_interval_bounds true

-  $ ceph orch daemon redeploy osd.<osdid> --image <image-name>

-  Verify that it is safe to re-enable the `check_past_interval_bounds` check:

    - Get `osdmap_first_committed` from `ceph report`. This value should be equal to the `cluster_osdmap_trim_lower_bound` reported by `ceph daemon osd.<osdid> status`:

 
        $ ceph report|grep osdmap_first_committed
        $ ceph daemon osd.<osdid> status | grep cluster_osdmap_trim_lower_bound

- Only if osd.<osdid> comes up fine, joins the cluster successfully, and the verification condition above is satisfied, re-enable the `check_past_interval_bounds` check by removing the `osd_skip_check_past_interval_bounds` config:

**NOTE:** Do not remove `osd_skip_check_past_interval_bounds` config if any of the above conditions fail.

- $ ceph config rm osd.<osdid> osd_skip_check_past_interval_bounds

- Repeat the steps above (from setting the skip config through removing it) for each remaining OSD; a rough scripted version of the loop is sketched below.
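
  For convenience, a minimal, untested shell sketch of that per-OSD loop follows. It assumes it is run on the host carrying the affected OSDs (the `ceph daemon` call needs the local admin socket) and that the image name and OSD ID list are filled in for the environment; the sleep is a crude stand-in for verifying the OSD is actually up/in, which should still be confirmed manually before the skip config is removed.

#!/usr/bin/env bash
# Hypothetical helper sketch for the per-OSD recovery loop described above.
set -euo pipefail

IMAGE="<image-name>"     # container image used for the redeploy (placeholder)
OSD_IDS=(1 10 86 89)     # affected OSDs on this host (example values)

for id in "${OSD_IDS[@]}"; do
    # Skip the past-interval bounds check for this OSD only.
    ceph config set "osd.${id}" osd_skip_check_past_interval_bounds true

    # Redeploy the daemon so it restarts with the new setting.
    ceph orch daemon redeploy "osd.${id}" --image "${IMAGE}"

    # Give the OSD time to start and join; confirm up/in manually as well.
    sleep 60

    # Compare osdmap_first_committed (from ceph report) with the OSD's
    # cluster_osdmap_trim_lower_bound (from its admin socket).
    first_committed=$(ceph report 2>/dev/null \
        | grep -o '"osdmap_first_committed": *[0-9]*' | grep -o '[0-9]*$')
    trim_lower_bound=$(ceph daemon "osd.${id}" status \
        | grep -o '"cluster_osdmap_trim_lower_bound": *[0-9]*' | grep -o '[0-9]*$')

    echo "osd.${id}: osdmap_first_committed=${first_committed}" \
         "cluster_osdmap_trim_lower_bound=${trim_lower_bound}"

    # Only re-enable the check when the values match AND the OSD is confirmed
    # up/in; otherwise leave the skip config in place.
    if [[ "${first_committed}" == "${trim_lower_bound}" ]]; then
        ceph config rm "osd.${id}" osd_skip_check_past_interval_bounds
        echo "osd.${id}: re-enabled check_past_interval_bounds"
    else
        echo "osd.${id}: bounds differ, leaving osd_skip_check_past_interval_bounds set" >&2
    fi
done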


Version-Release number of selected component (if applicable):

  The MDSs appear to still need to be upgraded; we are not sure whether they have been yet. I am still waiting on more outputs from the customer. The most recent output from the case is below:

$ ceph versions

{
    "mon": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 108
    },
    "mds": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 4 <--- Still needs upgrade
    },
    "overall": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 4,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 114
    }
}


Additional info:


  We're still waiting on additional logs and outputs. Once they are uploaded, we'll set this BZ to needinfo.


Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

