Description of problem
======================
During installation of an ODF StorageSystem via the OCP Console web UI, the status of the storage system is reported as a bare "-" instead of a clear "Progressing" or "Installing". This is a regression compared to the StorageCluster CRD behaviour of OCS 4.8.

Version-Release number of selected component
============================================
OCP 4.9.0-0.nightly-2021-09-14-200602
LSO 4.9.0-202109132154
ODF 4.9.0-139.ci

How reproducible
================
2/2

Steps to Reproduce
==================
1. Install OCP cluster.
2. Install OCS/ODF (OpenShift Data Foundation) operator.
3. Install LSO operator.
4. Start the "Create a StorageSystem" wizard in the OCP Console web UI and complete the process.
5. Observe the status of the new storage system in the OCP Console.

Actual results
==============
While the cluster is being installed, the status of the new storage system is reported as "-"; only when the installation finishes does the status change to "Ready". See screenshot #1.

Expected results
================
The status is reported clearly as "Progressing" (this is the behaviour in <= OCS 4.8) or "Installing". That way it is immediately clear that the installation is still in progress without any problem, and this state can be distinguished from a possible problem with fetching the status or other issues.
Created attachment 1823589 [details] screenshot #2: clear status of the StorageCluster CR in OCS 4.8 during installation (for comparison)
There is no phase even in the backend in the StorageSystem CR. We have conditions only, and based on those conditions plus the StorageCluster phase we can set the state in the UI. I don't know how it is done today. Looping in @afrahman, who can help us understand this better.
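As a rough sketch of what the comment above suggests, the console could derive a displayed status from the StorageSystem conditions combined with the StorageCluster phase. The condition types (`Available`, `Progressing`), the phase strings, and the precedence order below are assumptions for illustration only, not the shipped implementation:

```typescript
// Hypothetical helper: map StorageSystem conditions plus the underlying
// StorageCluster phase to a human-readable status. Condition type names
// and the precedence order are illustrative assumptions.
type Condition = { type: string; status: 'True' | 'False' | 'Unknown' };

const isTrue = (conds: Condition[], type: string): boolean =>
  conds.some((c) => c.type === type && c.status === 'True');

const getDisplayStatus = (
  conditions: Condition[],
  storageClusterPhase?: string,
): string => {
  // Surface an explicit error from the StorageCluster first, so that a
  // failed installation is never rendered as a bare "-".
  if (storageClusterPhase === 'Error') return 'Error';
  if (isTrue(conditions, 'Available')) return 'Ready';
  if (isTrue(conditions, 'Progressing') || storageClusterPhase === 'Progressing')
    return 'Progressing';
  // Anything else is surfaced as Unknown rather than a dash.
  return 'Unknown';
};
```

With such a mapping, an in-progress install would show "Progressing" and a failed StorageCluster would show "Error", which addresses both the installation and the failure cases described in this bug.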
Additional information
======================
When installation fails and the StorageCluster ends up in Error state, the StorageSystem and its representation in the UI still report "-", and it is not directly possible to see that the installation failed (you need to know that there is a StorageCluster CR to check).
@badhikar Can you please change the component accordingly?
I think the component is correct. Isn't the metrics exporter part of odf-operator? @amohan could you please take a look?
The same problem applies to uninstallation. Based on discussion during the triage meeting yesterday, I'm expanding the scope of this bug to cover both installation and uninstallation. As Nitin notes, the StorageSystem CR doesn't communicate status on purpose, and using metrics for this purpose instead (as currently implemented in the UI) is wrong, because metrics can communicate status only after ODF is successfully installed, and so can't provide status during installation, especially when something goes wrong, which is exactly when such status is most important to communicate. See also a negative use case mentioned in comment 7 (to reproduce, one can mislabel machines or miss any other necessary step before creation of the StorageCluster CR ...).
The console relies on metrics being provided to it. If the backend is not exporting metrics then we will show a `-`. The metrics must be reported as per the agreed upon standardization scheme. Moving it back to ODF Operator as this is an issue with metrics exporter.
(In reply to Bipul Adhikari from comment #10) > I think the component is correct. Isn't metrics exporter part of > odf-operator. @amohan could you please take a look? Yes, `metric-exporter` is a part of odf-operator.
As Martin noted in comment 14, this is not something we should just blindly punt to odf-operator. It feels kind of ridiculous to expect operand state from a metric! Time spent in a state is a metric, but not the state itself, especially if the metric reporting lags the actual state change. So, sorry Bipul, but I'm throwing this back at you. Why on earth are we using metrics for this? The StorageSystem should already be accurately reporting its state in the Status Conditions. My understanding is that the page in question is part of the console plugin, so we should still be able to fix this. Please don't punt this back until we actually figure this out. :P
The console team has always been an advocate of not standardizing metrics, as every storage system is unique and can have its own set of steps required to report information. We were in favor of each storage system vendor pushing its own logic (via extensions in the UI) to determine state or other information. However, this idea was rejected and we are working with standardized metrics. Hence the UI cannot accommodate such logic at this point. The metrics exporter will have to figure out a way to convey this message. So the reason we are using the metrics exporter is that the dashboard and list page were designed with standardized metrics in mind. We cannot expect the dashboard to accommodate custom logic for each storage provider.
I think the issue here is that using the metrics to determine status is incorrect. They are meant for a different purpose.
This was brought up in the orchestration forum today. There's a rough plan, but first step is to talk to QE to get more details. I'll follow up with this offline and report back to this BZ with the results.
*** Bug 2008143 has been marked as a duplicate of this bug. ***
Verifying based on the attachment above.
When deleting it to try installing it again, the same status was observed.
*** Bug 2019652 has been marked as a duplicate of this bug. ***
Fixing this issue fixes https://bugzilla.redhat.com/show_bug.cgi?id=2019652 https://bugzilla.redhat.com/show_bug.cgi?id=2005014 The backport of this BZ to 4.9 should fix all the issues in 4.9 as well. Although from the description these bugs look unrelated, the source of the problem for the aforementioned bugs is solved with this fix, hence marking them as duplicates.
I'm not sure I understand how this bug would be a duplicate of bz 2019652.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056