Description of problem:
In our bare metal OCP 4.7 cluster (running OCS 4.6), attempting to put a Node into maintenance mode results in the following error:

  Warning alert: The Ceph storage cluster is not in a healthy state. Maintenance should not be started until the health of the storage cluster is restored.

However, we are not able to identify any problems with the Ceph cluster:

- The Storage indicator on the overview page is green
- The status of the OCS operator is Up-to-date/Succeeded
- The status of the storagecluster is Ready
- Running 'ceph status' shows 'HEALTH_OK'

Version-Release number of selected component (if applicable):
OCP 4.7.0
OCS 4.6.4

Additional info:
I think there are two problems here:

- What is causing this error?
- The presentation of this error is very poor UX. It should link to more detailed information about the problem so the operator has some idea where to look or what to fix.

I've opened the bug against the "bare metal hardware provisioning" component, since the error is presented only by the Node management screen. That may not be the appropriate component.
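For reference, a minimal sketch of the health checks listed above as run from the CLI (this assumes the default openshift-storage namespace and that the rook-ceph toolbox pod is deployed; resource names may differ in other layouts):

$ oc get storagecluster -n openshift-storage
$ oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].status.ceph.health}{"\n"}'
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage "$TOOLS_POD" ceph status

All of these report healthy: the storagecluster is Ready, the cephcluster health field reads HEALTH_OK, and 'ceph status' agrees.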
There are recent must-gather logs (both general and ocs-specific) in the linked customer case. They are too large to attach to the bz. I'm happy to place them somewhere else if there is a supported Red Hat resource for hosting them.
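If fresher copies of the logs are needed, they can be regenerated with something like the following (the OCS must-gather image tag below is my best recollection for 4.6 and may need adjusting):

$ oc adm must-gather
$ oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6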
Perhaps of interest, there are two pods in the openshift-storage namespace stuck in the Pending state:

rook-ceph-osd-prepare-ocs-deviceset-1-data-0-b4cfv-dzgjx   0/1   Pending   0   26m
rook-ceph-osd-prepare-ocs-deviceset-1-data-1-mk9sb-vx8b8   0/1   Pending   0   26m

If I delete these pods, they are simply re-created. All the PVCs seem up and healthy:

$ oc get pvc | grep deviceset
ocs-deviceset-0-data-0-p95pd   Bound   local-pv-39e6c5b5   558Gi   RWO   localblock   7d1h
ocs-deviceset-0-data-1-8q8cs   Bound   local-pv-7ce94287   558Gi   RWO   localblock   7d1h
ocs-deviceset-0-data-2-vbs7j   Bound   local-pv-fce13c73   558Gi   RWO   localblock   7d1h
ocs-deviceset-1-data-0-fdfpz   Bound   local-pv-81638871   558Gi   RWO   localblock   21h
ocs-deviceset-1-data-1-s22xn   Bound   local-pv-c69a8c5e   558Gi   RWO   localblock   21h
ocs-deviceset-1-data-2-rvnbt   Bound   local-pv-8ca27948   558Gi   RWO   localblock   21h
ocs-deviceset-2-data-0-sdggv   Bound   local-pv-10b202a8   558Gi   RWO   localblock   7d1h
ocs-deviceset-2-data-1-khz48   Bound   local-pv-7f42d565   558Gi   RWO   localblock   7d1h
ocs-deviceset-2-data-2-8npbj   Bound   local-pv-41de1fd4   558Gi   RWO   localblock   7d1h
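In case it helps, this is roughly how the scheduling events for one of the stuck pods can be pulled (pod name taken from the listing above; adjust the name and namespace as needed):

$ oc -n openshift-storage describe pod rook-ceph-osd-prepare-ocs-deviceset-1-data-0-b4cfv-dzgjx
$ oc -n openshift-storage get events --field-selector involvedObject.kind=Pod,involvedObject.name=rook-ceph-osd-prepare-ocs-deviceset-1-data-0-b4cfv-dzgjx

The Events section of the describe output should show why the scheduler is unable to place the pod.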
@lars, hi, could you help check whether this bug is fixed on OCP 4.8? I don't have a suitable bare metal cluster with enough storage space to create a storagecluster successfully. Thanks!
Unfortunately, I don't have access to a cluster on which I can deploy 4.8.
I have checked on a bare metal OCP 4.8 cluster. After installing NMO and OCS 4.6.4 successfully on the cluster, I could start and stop maintenance without any error or warning.
Checked on a bare metal OCP 4.8 cluster with payload 4.8.0-0.nightly-2021-04-22-182303.
Installed NMO and OCS 4.6.4 successfully on the cluster.

When the created storagecluster is not in a healthy state and "Start Maintenance" is clicked, the warning appears on the modal:

  Warning alert: The Ceph storage cluster is not in a healthy state. Maintenance should not be started until the health of the storage cluster is restored.

When the cephcluster is not available or is in a normal status, clicking "Start Maintenance" shows no Ceph-related warning on the modal.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438