Description of problem:

It seems the customer was hitting the same issue outlined in:

https://bugzilla.redhat.com/show_bug.cgi?id=2253672

We applied the fix by having the customer upgrade to Ceph 5.3z6, using the method outlined in:

https://bugzilla.redhat.com/show_bug.cgi?id=2253672#c46

The customer managed to bring 108 of the 120 OSDs back up; however, they had issues bringing the following OSDs back up/in:

ID   CLASS  WEIGHT    TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         61.12427  root default
 -9         17.46460      host ceph01-ap-perf
  1    ssd   0.72769          osd.1            down           0  1.00000
 10    ssd   0.72769          osd.10           down           0  1.00000
 86    ssd   0.72769          osd.86           down           0  1.00000
 89    ssd   0.72769          osd.89           down           0  1.00000
-11         17.46460      host ceph02-ap-perf
107    ssd   0.72769          osd.107          down           0  1.00000
108    ssd   0.72769          osd.108          down           0  1.00000
109    ssd   0.72769          osd.109          down           0  1.00000
113    ssd   0.72769          osd.113          down           0  1.00000
117    ssd   0.72769          osd.117          down           0  1.00000
 -3         17.46460      host ceph03-ap-perf
 30    ssd   0.72769          osd.30           down           0  1.00000
 72    ssd   0.72769          osd.72           down           0  1.00000
 -7          4.36523      host ceph1
  5    ssd   0.18188          osd.5            down           0  1.00000

We have sos reports for the nodes above. We will most likely request the logs for all of those OSDs as well.

The following process was able to get those OSDs back up/in:

1. $ ceph config set osd.<osdid> osd_skip_check_past_interval_bounds true

2. $ ceph orch daemon redeploy osd.<osdid> --image <image-name>

3. Verify that we are safe to enable the `check_past_interval_bounds` check: get `osdmap_first_committed` from the ceph report; it should be equal to the `cluster_osdmap_trim_lower_bound` reported by the OSD.

   $ ceph report | grep osdmap_first_committed
   $ ceph daemon osd.<osdid> status | grep cluster_osdmap_trim_lower_bound

4. Only if osd.<osdid> comes up fine && joins the cluster successfully && condition [3] is satisfied, enable the `check_past_interval_bounds` check by removing the `osd_skip_check_past_interval_bounds` config.

   **NOTE:** Do not remove the `osd_skip_check_past_interval_bounds` config if any of the above conditions fail.

   $ ceph config rm osd.<osdid> osd_skip_check_past_interval_bounds

5. Repeat steps [2] to [4] for the remaining OSDs.

(A rough scripted sketch of this per-OSD loop is appended at the end of this comment.)

Version-Release number of selected component (if applicable):

The MDS daemons still need to be upgraded; it is not yet confirmed whether that has happened. I am still waiting on more outputs from the customer. The most recent output from the case is below:

$ ceph versions
{
    "mon": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 108
    },
    "mds": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 4   <--- Still needs upgrade
    },
    "overall": {
        "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 4,
        "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 114
    }
}

Additional info:

We're still waiting on additional logs and outputs. Once they are uploaded, we'll set this BZ to needinfo.

Regards,

Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF)
Customer Experience and Engagement, NA
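
For reference, below is a rough Bash sketch of the per-OSD loop from steps [1] through [4] above. It only illustrates the order of operations and has not been validated against the customer's cluster: the OSD ID list and <image-name> are placeholders from this case, the 60-second wait is arbitrary, and the `ceph daemon` call has to run on the host that holds that OSD's admin socket (for example inside a cephadm shell on that node).

#!/bin/bash
# Illustrative sketch of the recovery sequence above -- not a validated script.
IMAGE="<image-name>"                               # placeholder: fixed container image
DOWN_OSDS="1 10 86 89 107 108 109 113 117 30 72 5" # OSDs still down in this case

for ID in ${DOWN_OSDS}; do
    # [1] Skip the past-interval bounds check so the OSD can boot
    ceph config set osd.${ID} osd_skip_check_past_interval_bounds true

    # [2] Redeploy the daemon with the fixed image
    ceph orch daemon redeploy osd.${ID} --image "${IMAGE}"

    # Give the OSD time to start and join the cluster (arbitrary wait)
    sleep 60

    # [3] Compare osdmap_first_committed with the OSD's trim lower bound.
    #     NOTE: 'ceph daemon' uses the local admin socket, so this line must
    #     run on the host where osd.${ID} actually lives.
    ceph report | grep osdmap_first_committed
    ceph daemon osd.${ID} status | grep cluster_osdmap_trim_lower_bound

    # [4] Only after confirming the OSD is up/in and the two values match:
    # ceph config rm osd.${ID} osd_skip_check_past_interval_bounds
done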