Bug 1939606
| Field | Value |
|---|---|
| Summary | Attempting to put a host into maintenance mode warns about Ceph cluster health, but no storage cluster problems are apparent |
| Product | OpenShift Container Platform |
| Component | Console Metal3 Plugin |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Version | 4.6 |
| Target Milestone | --- |
| Target Release | 4.8.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Lars Kellogg-Stedman <lars> |
| Assignee | Sanjal Katiyar <skatiyar> |
| QA Contact | Yanping Zhang <yanpzhan> |
| CC | afrahman, aos-bugs, bniver, madam, mmanjuna, muagarwa, nthomas, ocs-bugs, sostapov |
| Flags | mmanjuna: needinfo? |
| Doc Type | If docs needed, set a value |
| Cloned as | 1941624 (view as bug list) |
| Type | Bug |
| Last Closed | 2021-07-27 22:53:48 UTC |
Description
Lars Kellogg-Stedman, 2021-03-16 17:05:34 UTC
There are recent must-gather logs (both general and OCS-specific) in the linked customer case. They are too large to attach to the BZ; I'm happy to place them somewhere else if there is a supported Red Hat resource for hosting them.

Perhaps of interest, there are two pods in the openshift-storage namespace stuck in the Pending state:

    rook-ceph-osd-prepare-ocs-deviceset-1-data-0-b4cfv-dzgjx   0/1   Pending   0   26m
    rook-ceph-osd-prepare-ocs-deviceset-1-data-1-mk9sb-vx8b8   0/1   Pending   0   26m

If I delete these pods, they are simply re-created. All the PVCs seem up and healthy:

    $ oc get pvc | grep deviceset
    ocs-deviceset-0-data-0-p95pd   Bound   local-pv-39e6c5b5   558Gi   RWO   localblock   7d1h
    ocs-deviceset-0-data-1-8q8cs   Bound   local-pv-7ce94287   558Gi   RWO   localblock   7d1h
    ocs-deviceset-0-data-2-vbs7j   Bound   local-pv-fce13c73   558Gi   RWO   localblock   7d1h
    ocs-deviceset-1-data-0-fdfpz   Bound   local-pv-81638871   558Gi   RWO   localblock   21h
    ocs-deviceset-1-data-1-s22xn   Bound   local-pv-c69a8c5e   558Gi   RWO   localblock   21h
    ocs-deviceset-1-data-2-rvnbt   Bound   local-pv-8ca27948   558Gi   RWO   localblock   21h
    ocs-deviceset-2-data-0-sdggv   Bound   local-pv-10b202a8   558Gi   RWO   localblock   7d1h
    ocs-deviceset-2-data-1-khz48   Bound   local-pv-7f42d565   558Gi   RWO   localblock   7d1h
    ocs-deviceset-2-data-2-8npbj   Bound   local-pv-41de1fd4   558Gi   RWO   localblock   7d1h

@lars, hi, could you help check whether the bug is fixed on OCP 4.8? I don't have a suitable bare metal cluster with enough storage space to create a storagecluster successfully. Thanks!

Unfortunately, I don't have access to a cluster on which I can deploy 4.8.

I have checked on a bare metal OCP 4.8 cluster: after installing NMO and OCS 4.6.4 successfully, I could start and stop maintenance without any error or warning.

Checked on a bare metal OCP 4.8 cluster with payload 4.8.0-0.nightly-2021-04-22-182303, with NMO and OCS 4.6.4 installed successfully. When the created storagecluster is not in a healthy state and "Start Maintenance" is clicked, the modal shows the warning: "Warning alert: The Ceph storage cluster is not in a healthy state. Maintenance should not be started until the health of the storage cluster is restored." When the cephcluster is unavailable or in a normal state, clicking "Start Maintenance" shows no Ceph-related warning on the modal.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
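For reference, below is a minimal sketch of commands one might use to see the Ceph health signal that the console warning keys off, and to dig into why the osd-prepare pods stay Pending. These commands are not taken from this bug report; they assume the Rook CephCluster resource in the openshift-storage namespace exposes health under .status.ceph.health (as upstream Rook does) and that the optional rook-ceph-tools toolbox deployment has been created.

```shell
# Show each CephCluster and the health Rook last observed
# (typically HEALTH_OK, HEALTH_WARN, or HEALTH_ERR).
oc get cephcluster -n openshift-storage \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.ceph.health}{"\n"}{end}'

# List pods stuck in Pending and check the Events section for the
# scheduling constraint that cannot be satisfied.
oc get pods -n openshift-storage --field-selector=status.phase=Pending
oc describe pod -n openshift-storage <pending-pod-name>

# If the rook-ceph-tools toolbox deployment exists (it is optional),
# ask Ceph directly for its own view of cluster health.
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
```

If the first command reports HEALTH_WARN or HEALTH_ERR while all the device-set PVCs are Bound, that would match the mismatch the reporter describes: the maintenance-mode modal warns about storage cluster health even though nothing looks wrong from the PVC side.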