Bug 2003651

Summary:	ODF4.9+LSO4.8 installation via UI, StorageCluster moves to Error state
Product:	OpenShift Container Platform	Reporter:	Oded <oviner>
Component:	Console Storage Plugin	Assignee:	Afreen <afrahman>
Status:	CLOSED ERRATA	QA Contact:	Oded <oviner>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	4.9	CC:	afrahman, amagrawa, aos-bugs, ebenahar, jarrpa, jijoy, madam, mbukatov, muagarwa, nthomas, ocs-bugs, sabose, sostapov, srozen
Target Milestone:	---	Keywords:	Regression, TestBlocker
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-10 16:09:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2004241

Description Oded 2021-09-13 11:49:21 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
StorageCluster moves to Error state after ODF+LSO installation via UI

Version of all relevant components (if applicable):
OpenShift version: 4.9.0-0.nightly-2021-09-10-170926
LSO Version:4.8.0-202107291502
ODF Version:4.9.0-132.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.Install OCP cluster on VMware platform:
OpenShift version: 4.9.0-0.nightly-2021-09-10-170926

2.Install Local storage Operator:
LSO Version:4.8.0-202107291502

3.Install ODF Operator:
ODF Version:4.9.0-132.ci

4.Add Disks to worker nodes via vCenter.

5.Create StorageSystem

6.Check StorageCluster status:

Get StorageCluster status:
$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   3m23s   Error              2021-09-13T11:26:51Z 

$ oc describe storagecluster
Status:
  Images:
  Phase:  Error
Events:
  Type     Reason            Age    From                       Message
  ----     ------            ----   ----                       -------
  Warning  FailedValidation  3m51s  controller_storagecluster  failed to validate StorageDeviceSet 0: no StorageClass specified


7.Check pods status [on openshift-storage project]
$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
noobaa-operator-7859f67cbc-2crmr                   1/1     Running   0          17m
ocs-metrics-exporter-787686dbfd-mlqzw              1/1     Running   0          17m
ocs-operator-fd5fd568f-lxzft                       1/1     Running   0          17m
odf-console-75f8bb874d-k7jsp                       2/2     Running   0          17m
odf-operator-controller-manager-5c6f854875-6v72t   2/2     Running   0          17m
rook-ceph-operator-bd8ffff7c-56qpb                 1/1     Running   0          17m

8.Check pv status:
$ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
local-pv-33c8520a   100Gi      RWO            Delete           Available           localblock              8m47s
local-pv-bcd51894   100Gi      RWO            Delete           Available           localblock              8m47s
local-pv-f1262717   100Gi      RWO            Delete           Available           localblock              8m46s

9.Check pvc status [on openshift-storage project]:
$ oc get pvc
No resources found in openshift-storage namespace.
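
The PV and PVC checks above show three Available PVs in the localblock storage class and no PVCs claiming them. A minimal sketch of summarizing that state from JSON output rather than from the table, assuming the input is the parsed result of `oc get pv -o json`. The function name `summarize_pvs` and the sample data are hypothetical; the sample is hand-written to mirror the three PVs listed above and is not real cluster output.

```python
from collections import Counter


def summarize_pvs(pv_list: dict) -> Counter:
    """Count PVs by (storageClassName, phase) from a PersistentVolumeList dict."""
    counts = Counter()
    for pv in pv_list.get("items", []):
        storage_class = pv.get("spec", {}).get("storageClassName", "")
        phase = pv.get("status", {}).get("phase", "Unknown")
        counts[(storage_class, phase)] += 1
    return counts


# Hand-written sample mirroring the three PVs in this report (assumption, not real output).
SAMPLE = {
    "items": [
        {
            "metadata": {"name": name},
            "spec": {"storageClassName": "localblock"},
            "status": {"phase": "Available"},
        }
        for name in ("local-pv-33c8520a", "local-pv-bcd51894", "local-pv-f1262717")
    ]
}

if __name__ == "__main__":
    for (storage_class, phase), n in summarize_pvs(SAMPLE).items():
        print(f"{storage_class or '<none>'}: {n} {phase}")
```

PVs that stay Available while no PVC exists suggest the failure happens before any claim is created, which is consistent with the FailedValidation event in step 6.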


Actual results:
StorageCluster moves to Error state

Expected results:
StorageCluster moves to Ready state


Additional info:
https://docs.google.com/document/d/1Fo2qtBbYNaLzYSkw1qYbjGEBWfN44FZzz_EZ4sceWlg/edit

Comment 2 Nitin Goyal 2021-09-13 12:14:30 UTC

Moving it to OCS.

Comment 3 Oded 2021-09-13 12:26:22 UTC

Must-gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2003651/

Comment 4 Shay Rozen 2021-09-13 13:06:53 UTC

This also happens in internal mode, so it may not be related to LSO. Installed ODF version: 4.9.0-132.ci

oc get storageclusters.ocs.openshift.io 
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   4m51s   Error              2021-09-13T12:58:52Z   4.9.0

Status:
  Conditions:
    Last Heartbeat Time:   2021-09-13T12:59:34Z
    Last Transition Time:  2021-09-13T12:58:52Z
    Message:               Error while reconciling: some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd,ocs-storagecluster-ceph-rbd-thick] were skipped while waiting for pre-requisites to be met

Comment 6 Shay Rozen 2021-09-13 13:12:57 UTC

Internal mode recovered after a few minutes. Removing feature_blocker and urgent.
I think I'll open a separate BZ.
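
A transient Error phase like the one described here can be told apart from a persistent one by polling the phase for a while. A minimal sketch, assuming a caller-supplied `get_phase` callable (hypothetical; it could, for instance, wrap `oc get storagecluster ocs-storagecluster -o jsonpath='{.status.phase}'`). The timeout and interval defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_phase(
    get_phase: Callable[[], str],
    wanted: str = "Ready",
    timeout_s: float = 600.0,   # assumed value, tune as needed
    interval_s: float = 15.0,   # assumed value, tune as needed
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_phase() until it returns `wanted` or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_phase() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Canned phases standing in for a real cluster query.
    canned = iter(["Error", "Error", "Ready"])
    recovered = wait_for_phase(lambda: next(canned), sleep=lambda _s: None)
    print("recovered" if recovered else "still not Ready")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without waiting in real time.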

Comment 7 Sahina Bose 2021-09-13 13:15:04 UTC

@pjiandan Priyanka, do you know if anyone is looking at this?

Comment 8 Oded 2021-09-13 18:37:01 UTC

This issue was reproduced with LSO 4.9.

OCP Version:4.9.0-0.nightly-2021-09-10-170926
ODF Version:4.9.0-132.ci
LSO Version:4.9.0-202109101110

for more information:
https://docs.google.com/document/d/156nnw0XDoZnIHkalo5mycLEAaNH9RP6NZbiUhBlU9es/edit

Comment 9 Priyanka 2021-09-14 09:05:46 UTC

This is a UI issue. 
The StorageClassName is not populated correctly: 

storageDeviceSets:
    - config: {}
      count: 3
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: '1'
          storageClassName: ''
          volumeMode: Block

@Afreen will be looking into it.
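
The empty storageClassName in the device set above matches the "no StorageClass specified" event from the description. A minimal sketch of the same kind of check applied to a spec before it is submitted. This is not the operator's validation code; the function name `find_missing_storage_classes` is hypothetical and the message text is only modelled on the event shown in this bug.

```python
from typing import List


def find_missing_storage_classes(spec: dict) -> List[str]:
    """Return one message per StorageDeviceSet whose storageClassName is empty or absent."""
    problems = []
    for index, device_set in enumerate(spec.get("storageDeviceSets", [])):
        pvc_spec = device_set.get("dataPVCTemplate", {}).get("spec", {})
        if not pvc_spec.get("storageClassName"):
            problems.append(f"StorageDeviceSet {index}: no StorageClass specified")
    return problems


# The device set from this comment, transcribed as a Python dict.
BROKEN_SPEC = {
    "storageDeviceSets": [
        {
            "config": {},
            "count": 3,
            "dataPVCTemplate": {
                "metadata": {},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "1"}},
                    "storageClassName": "",
                    "volumeMode": "Block",
                },
            },
        }
    ]
}

if __name__ == "__main__":
    for message in find_missing_storage_classes(BROKEN_SPEC):
        print(message)
```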

Comment 12 Martin Bukatovic 2021-09-15 10:08:34 UTC

(In reply to afrahman from comment #10)
> I looked into the issue. The issue is with the create new sc step.
> Until the fix is merged, the workarounds are:
> 
> 1) type out storage class name along with volume set in input field

Isn't this the default behaviour in the UI? Or are you saying that the
workaround is not to use UI at all, but come up with StorageCluster yaml
file yourself and deploy it into openshift-storage namespace?
 
> 2) existing storage class option can be used if you have a lvset created
> already

There is a bug which makes this no longer possible: bz 2004185
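
If the workaround is read as writing the StorageCluster manifest by hand, a minimal sketch of generating one with the storage class filled in could look like the following. Assumptions: the apiVersion `ocs.openshift.io/v1` is inferred from the `storageclusters.ocs.openshift.io` resource name in comment 4, the `localblock` class is taken from the PV listing in the description, and the helper `build_storage_cluster` is hypothetical. Only the fields shown in comment 9 are reproduced, so a real manifest likely needs further fields that this report does not show.

```python
import json


def build_storage_cluster(storage_class_name: str, count: int = 3) -> dict:
    """Build a StorageCluster dict with a non-empty storageClassName."""
    if not storage_class_name:
        raise ValueError("storage_class_name must be non-empty")
    return {
        "apiVersion": "ocs.openshift.io/v1",  # assumed API version
        "kind": "StorageCluster",
        "metadata": {"name": "ocs-storagecluster", "namespace": "openshift-storage"},
        "spec": {
            "storageDeviceSets": [
                {
                    "config": {},
                    "count": count,
                    "dataPVCTemplate": {
                        "metadata": {},
                        "spec": {
                            "accessModes": ["ReadWriteOnce"],
                            "resources": {"requests": {"storage": "1"}},
                            "storageClassName": storage_class_name,
                            "volumeMode": "Block",
                        },
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    # JSON manifests are accepted wherever YAML ones are.
    print(json.dumps(build_storage_cluster("localblock"), indent=2))
```

Rejecting an empty name in the helper prevents regenerating the exact spec that triggered this bug.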

Comment 15 Oded 2021-09-30 15:12:52 UTC

LSO deployment [Full deployment] via UI passes on OCP4.10 

Setup:
Provider:VMware
OCP version:4.10.0-0.nightly-2021-09-30-041351
ODF Version:4.9.0-164.ci
LSO Version:4.9.0-202109210853

Test Procedure:
1.Deploy OCP4.10 cluster on VMware platform:
OCP Version:4.10.0-0.nightly-2021-09-30-041351

2.Install LSO operator:
LSO Version:4.9.0-202109210853
$ oc create -f https://raw.githubusercontent.com/red-hat-storage/ocs-ci/master/ocs_ci/templates/ocs-deployment/local-storage-optional-operators.yaml
imagecontentsourcepolicy.operator.openshift.io/olmcontentsourcepolicy created
catalogsource.operators.coreos.com/optional-operators created

3.Install ODF operator:
ODF Version: 4.9.0-164.ci

4.Add Disks [100G] to Worker nodes via vCenter:

5.Create Storage System

6.Get Ceph status:
sh-4.4$ ceph status
  cluster:
    id:     574cedec-3e55-4985-9f0b-5bc1e3eec9ec
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 8m)
    mgr: a(active, since 8m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 8m), 3 in (since 8m)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 331 objects, 128 MiB
    usage:   322 MiB used, 300 GiB / 300 GiB avail
    pgs:     177 active+clean
 
  io:
    client:   852 B/s rd, 10 KiB/s wr, 1 op/s rd, 0 op/s wr


for more details:
https://docs.google.com/document/d/19xeFCYcERckWasC2fo_cIhgBcgeq4ElGXgZTHS-onFg/edit
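
The verification above rests on two facts in the `ceph status` output: health is HEALTH_OK and all three OSDs are up and in. A minimal sketch of checking both from the plain-text output, assuming the line layout shown in this comment. The plain-text format is not a stable interface, so the regular expressions are an assumption tied to this sample; the function names are hypothetical.

```python
import re


def parse_ceph_status(text: str) -> dict:
    """Extract the health string and OSD counts from plain-text `ceph status` output."""
    health = re.search(r"health:\s+(\S+)", text)
    osd = re.search(r"osd:\s+(\d+) osds:\s+(\d+) up[^,]*,\s+(\d+) in", text)
    total, up, in_ = (int(g) for g in osd.groups()) if osd else (None, None, None)
    return {
        "health": health.group(1) if health else None,
        "osds_total": total,
        "osds_up": up,
        "osds_in": in_,
    }


def is_healthy(status: dict) -> bool:
    """True when health is HEALTH_OK and every OSD is both up and in."""
    return (
        status["health"] == "HEALTH_OK"
        and status["osds_total"] is not None
        and status["osds_total"] == status["osds_up"] == status["osds_in"]
    )


# Excerpt of the output quoted in this comment.
SAMPLE = """\
  cluster:
    id:     574cedec-3e55-4985-9f0b-5bc1e3eec9ec
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 8m)
    osd: 3 osds: 3 up (since 8m), 3 in (since 8m)
"""

if __name__ == "__main__":
    status = parse_ceph_status(SAMPLE)
    print(status, "healthy" if is_healthy(status) else "not healthy")
```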

Comment 19 errata-xmlrpc 2022-03-10 16:09:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056