Bug 2000027
| Summary: | [AWS]: [odf-operator.v4.9.0-120.ci] storagecluster is in Progressing state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Vijay Avuthu <vavuthu> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Jacky Albo <jalbo> |
| Status: | CLOSED NOTABUG | QA Contact: | Raz Tamir <ratamir> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | ebenahar, etamir, kramdoss, madam, muagarwa, nbecker, ocs-bugs, odf-bz-bot, pbalogh, sostapov |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-05 07:13:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Vijay Avuthu
2021-09-01 08:33:51 UTC
Vijay, is this reproducible in the latest build? I thought that I had another reproduce here, as I saw the cluster was stuck in the Progressing state, but after looking at the logs it looks like a different issue, so I opened a new bug: https://bugzilla.redhat.com/show_bug.cgi?id=2002220

Update:
=======

> Tried with the latest build, 4.9.0-129.ci, and it failed in the same state:

```
$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   84m   Progressing              2021-09-08T10:34:17Z   4.9.0
```

```
$ oc describe storagecluster ocs-storagecluster
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  storagesystem.odf.openshift.io/watched-by: storagesystem-odf
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Status:
  Conditions:
    Last Heartbeat Time:   2021-09-08T11:58:28Z
    Last Transition Time:  2021-09-08T10:38:48Z
    Message:               Reconcile completed successfully
    Reason:                ReconcileCompleted
    Status:                True
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2021-09-08T10:34:17Z
    Last Transition Time:  2021-09-08T10:34:17Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2021-09-08T11:58:28Z
    Last Transition Time:  2021-09-08T10:34:17Z
    Message:               Waiting on Nooba instance to finish initialization
    Reason:                NoobaaInitializing
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2021-09-08T10:34:17Z
    Last Transition Time:  2021-09-08T10:34:17Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2021-09-08T10:34:17Z
    Last Transition Time:  2021-09-08T10:34:17Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
```

> describe of the noobaa-endpoint pod:

```
$ oc describe pod noobaa-endpoint-dcc9c5d9d-tm8wj
Name:         noobaa-endpoint-dcc9c5d9d-tm8wj
Namespace:    openshift-storage
Priority:     0
Node:         <none>
Labels:       app=noobaa
              noobaa-s3=noobaa
              pod-template-hash=dcc9c5d9d
Annotations:  openshift.io/scc: noobaa-endpoint
Status:       Pending
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  80m   default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  78m   default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
```

> The deployment is done with lowered resource requirements; the StorageCluster passed by ocs-ci:

```
10:34:12 - MainThread - ocs_ci.utility.templating - INFO -
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mds:
      Limits: null
      Requests: null
    mgr:
      Limits: null
      Requests: null
    mon:
      Limits: null
      Requests: null
    noobaa-core:
      Limits: null
      Requests: null
    noobaa-db:
      Limits: null
      Requests: null
    noobaa-endpoint:
      limits:
        cpu: 1
        memory: 500Mi
      requests:
        cpu: 1
        memory: 500Mi
    rgw:
      Limits: null
      Requests: null
  storageDeviceSets:
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: gp2
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: true
    replica: 3
    resources:
      Limits: null
      Requests: null
```

> As part of bug https://bugzilla.redhat.com/show_bug.cgi?id=1885313, we have explicit values for noobaa-endpoint instead of empty objects.

> Not sure what changed recently, but with the above StorageCluster values the deployment used to be successful.
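For reference, one way to confirm the "Insufficient cpu" part of the scheduling failure is to compare the workers' allocatable CPU with what is already requested on them. This is a minimal diagnostic sketch with plain oc commands, not something from the original report; it assumes the workers carry the default node-role.kubernetes.io/worker label, and the pod name is the one from the describe output above:

```
# Allocatable CPU on each worker (assumes the default worker label).
oc get nodes -l node-role.kubernetes.io/worker \
  -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu

# Per-worker "Allocated resources" summary: how much CPU is already requested
# by scheduled pods. The gap to allocatable is what is left for the 1-CPU
# noobaa-endpoint request from the StorageCluster spec above.
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== $node"
  oc describe "$node" | grep -A 8 'Allocated resources:'
done

# The pending pod's own requests, for comparison.
oc -n openshift-storage get pod noobaa-endpoint-dcc9c5d9d-tm8wj \
  -o jsonpath='{.spec.containers[*].resources.requests}{"\n"}'
```

If no worker has 1 CPU of headroom left, the endpoint pod stays Pending and the StorageCluster keeps reporting the NoobaaInitializing/Progressing condition shown above.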
Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/5852/console

Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu-bz027/vavuthu-bz027_20210908T091752/logs/failed_testcase_ocs_logs_1631093767/test_deployment_ocs_logs/