Bug 1839117 - Storagecluster in Progressing state with "NoobaaInitializing" message in OCS 4.3 installed from OCS 4.4 registry
Summary: Storagecluster in Progressing state with "NoobaaInitializing" message in OCS 4.3 installed from OCS 4.4 registry
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Jacky Albo
QA Contact: aberner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-22 14:14 UTC by Neha Berry
Modified: 2020-09-15 10:17 UTC (History)
8 users

Fixed In Version: 4.5.0-477.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:17:07 UTC
Embargoed:


Attachments
noobaa-core log - repro (4.75 MB, text/plain)
2020-06-03 10:47 UTC, Danny
no flags


Links
System ID Private Priority Status Summary Last Updated
Github noobaa noobaa-core pull 6042 0 None closed added log message when tiering_policy is deleted and bucket is not 2021-02-09 01:48:01 UTC
Github noobaa noobaa-core pull 6064 0 None closed Fixing issue with removing tiering policies of wrong buckets 2021-02-09 01:48:02 UTC
Red Hat Product Errata RHBA-2020:3754 0 None None None 2020-09-15 10:17:34 UTC

Description Neha Berry 2020-05-22 14:14:06 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
----------------------------------------------------------------------
OCS 4.3 was installed using the OCS 4.4 registry for 4.4-rc6. Created some OBCs and app pods for RBD and CephFS volumes, and IO was ongoing. Also performed a couple of MGR pod restarts along with PVC creation.

The storagecluster CR was seen to be in Progressing state with "NoobaaInitializing" messages. NooBaa bucket class and backing store related error messages were also seen in the noobaa-operator logs.

Performed an upgrade on the same cluster as "upgradeable" was true, and since then the ocs-operator pod has been in 0/1 state and the CSV in Installing. Details of the activities performed on the cluster are added in the Steps to Reproduce section.

{"level":"info","ts":"2020-05-22T12:57:27.383Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-05-22T12:57:27.443Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}

>> oc get pods -o wide --snip---

noobaa-operator-65f857844c-9df47                                  1/1     Running     0          89m   10.128.2.99    compute-0   <none>           <none>
ocs-operator-699cdb89d4-rq99x                                     0/1     Running     0          89m   10.128.2.100   compute-0   <none>           <none>
rook-ceph-operator-56c448f887-w6mt9                               1/1     Running     0          89m   10.129.2.57    compute-2   <none>           <none>
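
A possible next step for the 0/1 ocs-operator pod is to check its readiness probe and recent events (a sketch, using the pod name from the output above):

$ oc describe pod ocs-operator-699cdb89d4-rq99x -n openshift-storage | grep -A5 Readiness
$ oc get events -n openshift-storage --field-selector involvedObject.name=ocs-operator-699cdb89d4-rq99x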


>> oc get csv (after upgrade)

$ oc get csv
NAME                                         DISPLAY                       VERSION               REPLACES              PHASE
elasticsearch-operator.4.2.29-202004140532   Elasticsearch Operator        4.2.29-202004140532                         Succeeded
lib-bucket-provisioner.v1.0.0                lib-bucket-provisioner        1.0.0                                       Succeeded
ocs-operator.v4.4.0-428.ci                   OpenShift Container Storage   4.4.0-428.ci          ocs-operator.v4.3.0   Installing
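
One way to see why the CSV stays stuck in Installing is to inspect its conditions and the install plan (a sketch; the CSV name is taken from the output above):

$ oc describe csv ocs-operator.v4.4.0-428.ci -n openshift-storage | grep -A10 Conditions
$ oc get installplan -n openshift-storage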


>> oc get storagecluster -o yaml --snip---
______________________________________________________________________

    - lastHeartbeatTime: "2020-05-22T13:47:50Z"
      lastTransitionTime: "2020-05-21T17:37:01Z"
      message: Waiting on Nooba instance to finish initialization
      reason: NoobaaInitializing
      status: "True"
      type: Progressing
______________________________________________________________________
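
The same condition can also be pulled directly with jsonpath (a sketch, assuming the default storagecluster name shown above):

$ oc get storagecluster ocs-storagecluster -n openshift-storage \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}{"\n"}'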

>> Output from noobaa status (added screenshots in the bug directory)


#------------------#
#- Backing Stores -#
#------------------#

NAME                           TYPE            TARGET-BUCKET                                         PHASE        AGE       
noobaa-default-backing-store   s3-compatible   nb.1590068145657.apps.nberry-dc28-m21.qe.rh-ocs.com   Connecting   24h6m9s   

#------------------#
#- Bucket Classes -#
#------------------#

NAME                          PLACEMENT                                                             PHASE       AGE        
noobaa-default-bucket-class   {Tiers:[{Placement: BackingStores:[noobaa-default-backing-store]}]}   Verifying   24h6m10s   

#-----------------#
#- Bucket Claims -#
#-----------------#

NAMESPACE           NAME       BUCKET-NAME                                     STORAGE-CLASS                 BUCKET-CLASS                  PHASE   
openshift-storage   deleteme   deleteme-eec3c618-c19e-4444-bd8a-d7d9b8186c73   openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound   
openshift-storage   nbio1      nbio1-6ee690c2-c669-493a-8da2-d800e4408ea3      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound   
openshift-storage   nbio2      nbio2-c2aa8d8d-fa4b-4dcf-9ef0-c0aeaf136003      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound   
openshift-storage   nbio3      nbio3-95e137dc-c0d9-45cd-a845-474694e140d0      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound   
openshift-storage   nbio4      nbio4-6af4da26-2deb-4925-85fb-2845ab275a8d      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound   
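
The phases above come from the noobaa CLI; the same information can be read from the CRs directly (a sketch, using the default resource names shown above):

$ oc get backingstore noobaa-default-backing-store -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
$ oc get bucketclass noobaa-default-bucket-class -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'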

  
Version of all relevant components (if applicable):
----------------------------------------------------------------------
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-05-21-042450   True        False         9h      Cluster version is 4.4.0-0.nightly-2020-05-21-042450

$ oc get catsrc/ocs-catalogsource -n openshift-marketplace  -o yaml|grep image:
  image: quay.io/rhceph-dev/ocs-olm-operator:4.4.0-428.ci

OCS version before upgrade = ocs-operator.v4.3.0
OCS version after upgrade = ocs-operator.v4.4.0-428.ci

Container Image : quay.io/ocs-dev/ocs-operator:4.4.0

$ noobaa version
INFO[0000] CLI version: 2.0.10                          
INFO[0000] noobaa-image: noobaa/noobaa-core:5.2.13      
INFO[0000] operator-image: noobaa/noobaa-operator:2.0.10 



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
----------------------------------------------------------------------
RBD and CephFS IO is intact, but not sure about NooBaa IO.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
Tested once

Can this issue be reproduced from the UI?
----------------------------------------------------------------------
No.

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
Not sure


Steps to Reproduce:
----------------------------------------------------------------------
>> The following are some of the activities performed on the cluster.

P.S.: Not sure of the exact timeline when the NooBaa-related issues started showing up (was it after the OBC creations?).

1. Installed OCS 4.3 from the OCS 4.4 registry; all pods were in Running and 1/1 state (even ocs-operator and noobaa-operator).
2. Created some PVCs and app pods and started FIO and PGSQL IO.
3. With simultaneous PVC creation, restarted the MGR (this activity was repeated thrice).
4. Even before the upgrade was started, observed the storagecluster in Progressing state due to the NoobaaInitializing state.

5. Started the upgrade from OCS 4.3 to OCS 4.4:
   - Edited the subscription to change the channel to stable-4.4 (see the sketch after these steps) and scaled down mon-a.
   - The upgrade succeeded (after the mon was recovered automatically by the rook-operator).

6. The ocs-operator pod has been in 0/1 state since the upgrade completed, and the storagecluster is still in Progressing state.
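
Sketch of the subscription channel change from step 5 (the actual subscription name is not recorded here, so it is shown as a placeholder; check it first with the get command below):

$ oc get subscription -n openshift-storage
$ oc patch subscription <ocs-subscription-name> -n openshift-storage \
    --type merge -p '{"spec":{"channel":"stable-4.4"}}'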


Actual results:
----------------------------------------------------------------------
The storagecluster CR is in Progressing state, with the NooBaa resources not in a good state. After the upgrade, the ocs-operator pod is in 0/1 state because the storagecluster is not reconciled properly.

Expected results:
----------------------------------------------------------------------
The storagecluster should be in Succeeded state.


Additional info:
----------------------------------------------------------------------
>> The CephCluster is healthy

$ oc get cephcluster 
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   STATE     HEALTH
ocs-storagecluster-cephcluster   /var/lib/rook     3          24h   Created   HEALTH_OK


>> Log snippet from the noobaa-operator

time="2020-05-22T10:48:07Z" level=info msg="✈️  RPC: system.read_system() Request: <nil>"
time="2020-05-22T10:48:07Z" level=error msg="⚠️  RPC: system.read_system() Response Error: Code=INTERNAL Message=bucket.tiering.tiers is not iterable"
time="2020-05-22T10:48:07Z" level=error msg="failed to read system info: bucket.tiering.tiers is not iterable" sys=openshift-storage/noobaa
time="2020-05-22T10:48:07Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2020-05-22T10:48:07Z" level=warning msg="⏳ Temporary Error: bucket.tiering.tiers is not iterable" sys=openshift-storage/noobaa
time="2020-05-22T10:48:07Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
time="2020-05-22T10:48:08Z" level=info msg="Start ..." bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=info msg="✅ Exists: BucketClass \"noobaa-default-bucket-class\"\n"
time="2020-05-22T10:48:08Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2020-05-22T10:48:08Z" level=info msg="SetPhase: Verifying" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2020-05-22T10:48:08Z" level=info msg="SetPhase: temporary error during phase \"Verifying\"" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=warning msg="⏳ Temporary Error: NooBaa BackingStore \"noobaa-default-backing-store\" is not yet ready" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=info msg="UpdateStatus: Done" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:09Z" level=info msg="Start ..." backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Secret \"rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user\"\n"
time="2020-05-22T10:48:09Z" level=info msg="SetPhase: Verifying" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="SetPhase: Connecting" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Service \"noobaa-mgmt\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Secret \"noobaa-operator\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Secret \"noobaa-admin\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✈️  RPC: system.read_system() Request: <nil>"
time="2020-05-22T10:48:09Z" level=error msg="⚠️  RPC: system.read_system() Response Error: Code=INTERNAL Message=bucket.tiering.tiers is not iterable"
time="2020-05-22T10:48:09Z" level=info msg="SetPhase: temporary error during phase \"Connecting\"" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=warning msg="⏳ Temporary Error: bucket.tiering.tiers is not iterable" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="UpdateStatus: Done" backingstore=openshift-storage/noobaa-default-backing-store

Comment 8 Danny 2020-06-03 10:47:54 UTC
Created attachment 1694802 [details]
noobaa-core log - repro

Comment 10 Nimrod Becker 2020-06-09 08:32:05 UTC
Logs have been added, waiting for a repro

Comment 15 Elad 2020-07-01 07:53:40 UTC
Based on the last comment from Ben, and since we already have a few repros and this bug is about service unavailability, proposing this as a blocker for 4.5.

Comment 18 Michael Adam 2020-07-06 09:22:24 UTC
Patches referenced here have been backported with the following backport PRs:

https://github.com/noobaa/noobaa-core/pull/6048

https://github.com/noobaa/noobaa-core/pull/6072

Comment 20 aberner 2020-07-21 12:45:42 UTC
The bug cannot be verified directly, so we have to rely on our regression testing to verify it.

As can be seen in test run https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9875/, tests that once led to the issue are now passing successfully.

Verified.

Comment 22 errata-xmlrpc 2020-09-15 10:17:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

