Bug 2089296
| Summary: | [MS v2] Storage cluster in error phase and 'ocs-provider-qe' addon installation failed with ODF 4.10.2 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Jilju Joy <jijoy> |
| Component: | ocs-operator | Assignee: | Kaustav Majumder <kmajumde> |
| Status: | CLOSED ERRATA | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Priority: | unspecified |
| Version: | 4.10 | Target Release: | ODF 4.11.0 |
| Target Milestone: | --- | Keywords: | Automation, Regression |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | No Doc Update | Fixed In Version: | |
| CC: | ebenahar, kmajumde, madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, omitrani, owasserm, sostapov | | |
| Cloned As: | 2096302 (view as bug list) | Bug Blocks: | 2096302 |
| Last Closed: | 2022-08-24 13:53:39 UTC | Type: | Bug |
Description (Jilju Joy, 2022-05-23 11:26:43 UTC)
Adding the Regression keyword because the installation was working with the previous version, Deployer 2.0.1 with ODF 4.10.0 GA.

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-m23-pr/jijoy-m23-pr_20220523T080402/logs/testcases_1653306906/

Since the PR attached is already merged for 4.11, should the status on the BZ be ON_QA?

Looks like this issue was fixed in the deployer and nothing was required in the product. According to the chat (https://bugzilla.redhat.com/show_bug.cgi?id=2089296#c3), Jilju mentions that the issue is not even reproducible in 4.10.3, which means we don't require a bug in 4.10, and the BZ targeted for 4.10 can be closed.

The attached PR is not relevant for this fix and should be removed. The attached PR is for the perf BZ #2068398. For 4.10 we had a different PR/bug, BZ #2078715. IMO, we should do this:

1. Remove the BZ link from the PR.
2. Move the current BZ to managed service and mark it ON_QA.
3. Close the 4.10 BZ #2096302.

Ohad - FYI - let me know if this makes sense.

It does, with a very small correction. The bug was not fixed in the deployer; it was fixed in the product as part of the fix for the perf bug. Because the perf bug had a completely different fix for 4.10 and 4.11, the entire thing got confusing.

OK, so there is no need to move this bug to MS; it can be verified along with the perf bug, and a 4.10 clone is not needed.

Verified in version:
ODF 4.11.0-104
OCP 4.10.18

```
$ oc -n openshift-storage get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.0                      NooBaa Operator               4.11.0            mcg-operator.v4.10.4                      Succeeded
ocs-operator.v4.11.0                      OpenShift Container Storage   4.11.0            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.2                   OCS OSD Deployer              2.0.2             ocs-osd-deployer.v2.0.1                   Succeeded
odf-csi-addons-operator.v4.11.0           CSI Addons                    4.11.0            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.11.0                      OpenShift Data Foundation     4.11.0            odf-operator.v4.10.2                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.422-151be96   Route Monitor Operator        0.1.422-151be96   route-monitor-operator.v0.1.420-b65f47e   Succeeded

$ rosa list addon -c fbalak-prov27 | grep ocs-provider-qe
ocs-provider-qe   Red Hat OpenShift Data Foundation Managed Service Provider (QE)   ready

$ ocm list clusters | grep fbalak-prov27
1t3h55itvjj6p8cm5hvmg9v7mjo1lceg   fbalak-prov27   https://api.fbalak-prov27.be5a.s1.devshift.org:6443   4.10.18   rosa   aws   us-east-1   ready

$ oc get deployment ocs-osd-controller-manager
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
ocs-osd-controller-manager   1/1     1            1           26h

$ oc get pods -o wide | grep ocs-osd-controller-manager
ocs-osd-controller-manager-6cbb8889fc-k9bm6   3/3   Running   1 (21h ago)   21h   10.129.2.36   ip-10-0-171-213.ec2.internal   <none>   <none>

$ oc get managedocs managedocs -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: ManagedOCS
metadata:
  creationTimestamp: "2022-06-27T08:18:43Z"
  finalizers:
  - managedocs.ocs.openshift.io
  generation: 1
  name: managedocs
  namespace: openshift-storage
  resourceVersion: "340704"
  uid: 34529f17-0e61-43a9-bceb-fbae15fdbf93
spec: {}
status:
  components:
    alertmanager:
      state: Ready
    prometheus:
      state: Ready
    storageCluster:
      state: Ready
  reconcileStrategy: strict
```
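The component states above live under `.status.components` of the ManagedOCS CR, so the same check can be done non-interactively instead of reading the full YAML. A minimal sketch, with the field paths taken from the CR output above and the resource/namespace names from this run:

```bash
# Print each ManagedOCS component state; the .status.components.*.state
# paths follow the CR output shown above.
oc -n openshift-storage get managedocs managedocs \
  -o jsonpath='alertmanager={.status.components.alertmanager.state}{"\n"}prometheus={.status.components.prometheus.state}{"\n"}storageCluster={.status.components.storageCluster.state}{"\n"}'
```

On this cluster, all three components report Ready.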
```
$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   26h   Ready              2022-06-27T08:19:01Z

$ oc get cephblockpool
NAME                                                                  PHASE
cephblockpool-storageconsumer-7c25e752-8ce3-4470-bc36-391d2404417e    Ready

$ oc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   26h
gp2-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   26h
gp3-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   26h
```

This fix is only required in 4.11, since a different fix for 4.10.z is addressed in https://bugzilla.redhat.com/show_bug.cgi?id=2078715. Hence removing the 4.10.z? flag.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
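As an addendum to the verification steps above: the individual readiness checks can be combined into a single scripted pass. This is only a minimal sketch, not part of the product or the QE automation; the resource names (openshift-storage, ocs-storagecluster) are the ones from this run, and the loop bounds and timeout are arbitrary assumptions:

```bash
#!/usr/bin/env bash
# Sketch of a scripted version of the manual checks above.
set -euo pipefail
ns=openshift-storage

# Every CSV in the namespace should report the Succeeded phase.
oc -n "$ns" get csv \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
  awk '$2 != "Succeeded" {print "CSV not ready: " $1; bad = 1} END {exit bad}'

# Poll until the storage cluster reports the Ready phase (arbitrary ~10 min cap).
phase=""
for _ in $(seq 1 60); do
  phase=$(oc -n "$ns" get storagecluster ocs-storagecluster \
    -o jsonpath='{.status.phase}')
  [ "$phase" = "Ready" ] && break
  sleep 10
done
echo "storagecluster phase: $phase"
[ "$phase" = "Ready" ]
```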