Bug 2267818

Summary: OSDs are failing with error "ceph_abort_msg("past_interval start interval mismatch")" - 15 PGs Incomplete Post Upgrade to Ceph v5.3z6
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Craig Wayman <crwayman>
Component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
Status: CLOSED DUPLICATE
QA Contact: Pawan <pdhiran>
Severity: urgent
Priority: unspecified
Version: 5.3
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, linuxkidd, ngangadh, nojha, rzarzyns, vumrao
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2024-03-07 16:57:51 UTC
Type: Bug

Description Craig Wayman 2024-03-04 21:08:15 UTC
Description of problem:

  It seems the customer was hitting the same issue outlined in: 

https://bugzilla.redhat.com/show_bug.cgi?id=2253672

  We applied the fix by having the customer upgrade to Ceph v5.3z6, using the method outlined in: https://bugzilla.redhat.com/show_bug.cgi?id=2253672#c46

  The customer managed to bring up 108 of the 120 OSDs but had issues bringing the following OSDs back up/in:

ID   CLASS  WEIGHT    TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         61.12427  root default
 -9         17.46460      host ceph01-ap-perf
  1    ssd   0.72769          osd.1              down         0  1.00000
 10    ssd   0.72769          osd.10             down         0  1.00000
 86    ssd   0.72769          osd.86             down         0  1.00000
 89    ssd   0.72769          osd.89             down         0  1.00000
-11         17.46460      host ceph02-ap-perf
107    ssd   0.72769          osd.107            down         0  1.00000
108    ssd   0.72769          osd.108            down         0  1.00000
109    ssd   0.72769          osd.109            down         0  1.00000
113    ssd   0.72769          osd.113            down         0  1.00000
117    ssd   0.72769          osd.117            down         0  1.00000
 -3         17.46460      host ceph03-ap-perf
 30    ssd   0.72769          osd.30             down         0  1.00000
 72    ssd   0.72769          osd.72             down         0  1.00000
 -7          4.36523      host ceph1
  5    ssd   0.18188          osd.5              down         0  1.00000
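
The per-OSD procedure needs the IDs of the OSDs that are still down. A minimal sketch for pulling them out of plain `ceph osd tree` output, assuming the column layout shown above (an `osd.<id>` name followed directly by the STATUS column); the function name `down_osd_ids` is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: print the numeric ID of every OSD whose STATUS is "down".
# Reads plain-text `ceph osd tree` output on stdin.
down_osd_ids() {
    awk '{
        # Look for an "osd.<id>" name whose next field is the status "down".
        for (i = 1; i < NF; i++) {
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") {
                sub(/^osd\./, "", $i)   # keep only the numeric ID
                print $i
            }
        }
    }'
}
```

Usage would look like `ceph osd tree | down_osd_ids`, which prints one ID per line for use in the steps below.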


  We have sos-reports for the nodes above. We’ll most likely request the logs for all of those OSDs as well. 

  The following process was able to get those OSDs back up/in: 

-  $ ceph config set osd.<osdid> osd_skip_check_past_interval_bounds true

-  $ ceph orch daemon redeploy osd.<osdid> --image <image-name>

-  Verify that it is safe to re-enable the `check_past_interval_bounds` check:

    - Get `osdmap_first_committed` from `ceph report`. It should be equal to the `cluster_osdmap_trim_lower_bound` value reported by `ceph daemon osd.<osdid> status`:

 
        $ ceph report|grep osdmap_first_committed
        $ ceph daemon osd.<osdid> status | grep cluster_osdmap_trim_lower_bound

- If osd.<osdid> comes up cleanly, joins the cluster successfully, and condition [3] is satisfied, only then re-enable the `check_past_interval_bounds` check by removing the `osd_skip_check_past_interval_bounds` config.

**NOTE:** Do not remove `osd_skip_check_past_interval_bounds` config if any of the above conditions fail.

- $ ceph config rm osd.<osdid> osd_skip_check_past_interval_bounds

- Repeat steps from [2] to [4] for the remaining OSDs.
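
The comparison in the verification step can be scripted so the override is only removed when the two values agree. This is a rough sketch, not the documented procedure: the function names are hypothetical, it assumes both commands print the key as `"key": <integer>` on one line, and it assumes `ceph daemon` is run where the OSD's admin socket is reachable. It does not check that the OSD is up and in; that still has to be confirmed separately.

```shell
#!/usr/bin/env bash
# Sketch only: compare osdmap_first_committed (from `ceph report`) with an
# OSD's cluster_osdmap_trim_lower_bound before removing the skip override.

# Print the first integer value found for the given key in text on stdin.
extract_value() {
    local key="$1"
    sed -n "s/.*\"${key}\": *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Succeed only when both values are present and numerically equal.
bounds_match() {
    local first_committed="$1" trim_lower_bound="$2"
    [ -n "$first_committed" ] && [ -n "$trim_lower_bound" ] \
        && [ "$first_committed" -eq "$trim_lower_bound" ]
}

# Hypothetical wrapper around the commands from the steps above, for one OSD.
check_and_reenable() {
    local osdid="$1" fc lb
    fc=$(ceph report 2>/dev/null | extract_value osdmap_first_committed)
    lb=$(ceph daemon "osd.${osdid}" status | extract_value cluster_osdmap_trim_lower_bound)
    if bounds_match "$fc" "$lb"; then
        ceph config rm "osd.${osdid}" osd_skip_check_past_interval_bounds
    else
        echo "osd.${osdid}: values differ (${fc:-?} vs ${lb:-?}); keeping override" >&2
        return 1
    fi
}
```

Calling `check_and_reenable 107` would then handle osd.107; on a mismatch it leaves `osd_skip_check_past_interval_bounds` in place, in line with the NOTE above.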


Version-Release number of selected component (if applicable):

  The MDSs still need to be upgraded. It is not yet clear whether they have been; I am still waiting on more outputs from the customer. The most recent output from the case is below:

$ ceph versions

{
    "mon": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 108
    },
    "mds": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 4 <--- Still needs upgrade
    },
    "overall": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 4,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 114
    }
}
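
To spot daemon types that are still on an older build, the `ceph versions` output can be filtered against the target version string. A small sketch, assuming the pretty-printed layout shown above (one section per daemon type, one version string per line); `stale_daemon_types` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch only: list daemon types that report any version other than the target.
# Reads pretty-printed `ceph versions` output on stdin.
stale_daemon_types() {
    local target="$1"
    awk -v target="$target" '
        # A section header such as:    "mds": {
        /^ *"[a-z]+": *\{/ { gsub(/[" :{]/, ""); section = $0; next }
        # A version line that does not contain the target build string.
        /ceph version/ && section != "overall" && index($0, target) == 0 { print section }
    ' | sort -u
}
```

For the output above, `ceph versions | stale_daemon_types 16.2.10-248.el8cp` would be expected to report only the MDS section.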


Additional info:


  We're still waiting on additional logs and outputs. Once they are uploaded, we'll set this BZ to needinfo.


Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA