Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
OCS 4.3 was installed using the OCS 4.4 registry for 4.4-rc6. Created some OBCs and app pods for RBD and CephFS volumes, and IO was ongoing. Also performed a couple of MGR pod restarts along with PVC creation.

The storagecluster CR was seen to be in Progressing state with "NoobaaInitializing" messages. Bucket class and backingstore related error messages were also seen in the noobaa-operator logs.

Performed an upgrade on the same cluster as "upgradeable" was true, and since then the ocs-operator pod is in 0/1 state and the CSV is stuck in Installing. Details of the activities performed on the cluster are added in the Steps to Reproduce section.

{"level":"info","ts":"2020-05-22T12:57:27.383Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-05-22T12:57:27.443Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}

>> oc get pods -o wide
--snip---
noobaa-operator-65f857844c-9df47      1/1   Running   0   89m   10.128.2.99    compute-0   <none>   <none>
ocs-operator-699cdb89d4-rq99x         0/1   Running   0   89m   10.128.2.100   compute-0   <none>   <none>
rook-ceph-operator-56c448f887-w6mt9   1/1   Running   0   89m   10.129.2.57    compute-2   <none>   <none>

>> oc get csv (after upgrade)
$ oc get csv
NAME                                         DISPLAY                       VERSION               REPLACES              PHASE
elasticsearch-operator.4.2.29-202004140532   Elasticsearch Operator        4.2.29-202004140532                         Succeeded
lib-bucket-provisioner.v1.0.0                lib-bucket-provisioner        1.0.0                                       Succeeded
ocs-operator.v4.4.0-428.ci                   OpenShift Container Storage   4.4.0-428.ci          ocs-operator.v4.3.0   Installing

>> oc get storagecluster -o yaml
--snip---
______________________________________________________________________
  - lastHeartbeatTime: "2020-05-22T13:47:50Z"
    lastTransitionTime: "2020-05-21T17:37:01Z"
    message: Waiting on Nooba instance to finish initialization
    reason: NoobaaInitializing
    status: "True"
    type: Progressing
______________________________________________________________________

>> Output from noobaa status (screenshots added in the bug directory)
#------------------#
#- Backing Stores -#
#------------------#
NAME                           TYPE            TARGET-BUCKET                                         PHASE        AGE
noobaa-default-backing-store   s3-compatible   nb.1590068145657.apps.nberry-dc28-m21.qe.rh-ocs.com   Connecting   24h6m9s

#------------------#
#- Bucket Classes -#
#------------------#
NAME                          PLACEMENT                                                              PHASE       AGE
noobaa-default-bucket-class   {Tiers:[{Placement: BackingStores:[noobaa-default-backing-store]}]}   Verifying   24h6m10s

#-----------------#
#- Bucket Claims -#
#-----------------#
NAMESPACE           NAME       BUCKET-NAME                                     STORAGE-CLASS                 BUCKET-CLASS                  PHASE
openshift-storage   deleteme   deleteme-eec3c618-c19e-4444-bd8a-d7d9b8186c73   openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound
openshift-storage   nbio1      nbio1-6ee690c2-c669-493a-8da2-d800e4408ea3      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound
openshift-storage   nbio2      nbio2-c2aa8d8d-fa4b-4dcf-9ef0-c0aeaf136003      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound
openshift-storage   nbio3      nbio3-95e137dc-c0d9-45cd-a845-474694e140d0      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound
openshift-storage   nbio4      nbio4-6af4da26-2deb-4925-85fb-2845ab275a8d      openshift-storage.noobaa.io   noobaa-default-bucket-class   Bound
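For anyone triaging a similar state, the blocking condition can also be read directly from the StorageCluster, NooBaa, and backing-store CRs. A minimal sketch, assuming the default resource names shown in the outputs above:

$ oc get storagecluster ocs-storagecluster -n openshift-storage \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'
$ oc get noobaa noobaa -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
$ oc get backingstore,bucketclass -n openshift-storage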
Version of all relevant components (if applicable):
----------------------------------------------------------------------
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-05-21-042450   True        False         9h      Cluster version is 4.4.0-0.nightly-2020-05-21-042450

$ oc get catsrc/ocs-catalogsource -n openshift-marketplace -o yaml | grep image:
  image: quay.io/rhceph-dev/ocs-olm-operator:4.4.0-428.ci

OCS version before upgrade = ocs-operator.v4.3.0
OCS version after upgrade  = ocs-operator.v4.4.0-428.ci
Container Image: quay.io/ocs-dev/ocs-operator:4.4.0

$ noobaa version
INFO[0000] CLI version: 2.0.10
INFO[0000] noobaa-image: noobaa/noobaa-core:5.2.13
INFO[0000] operator-image: noobaa/noobaa-operator:2.0.10

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
RBD and CephFS are intact, but not sure about NooBaa IO.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
Tested once

Can this issue be reproduced from the UI?
----------------------------------------------------------------------
No

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
Not sure

Steps to Reproduce:
----------------------------------------------------------------------
>> The following are some of the activities performed on the cluster.
P.S.: Not sure of the exact timeline when the NooBaa related issues started showing up (was it after the OBC creations?)
1. Installed OCS 4.3 from the OCS 4.4 registry; all pods were Running and in 1/1 state (including ocs-operator and noobaa-operator).
2. Created some PVCs and app pods and started FIO and PGSQL IO.
3. With simultaneous PVC creation, restarted the MGR (this activity was repeated three times).
4. Even before the upgrade was started, observed the storagecluster in Progressing state due to NooBaa being in Initializing state.
5. Started the upgrade from OCS 4.3 to OCS 4.4:
   - Edited the subscription to change the channel to stable-4.4 (a sketch of the command is included after this section) and scaled down mon-a.
   - The upgrade succeeded (after the mon was recovered automatically by the rook-operator).
6. The ocs-operator pod has been in 0/1 state since the upgrade completed, and the storagecluster is still in Progressing state.

Actual results:
----------------------------------------------------------------------
The storagecluster CR is in Progressing state, with the NooBaa resources not in a good state. After the upgrade, the ocs-operator pod is in 0/1 state because the storagecluster is not reconciled properly.

Expected results:
----------------------------------------------------------------------
The storagecluster should be in Succeeded state.
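For reference, the channel change in step 5 was done by editing the Subscription. A roughly equivalent command is sketched below; the Subscription name "ocs-operator" is an assumption and should be confirmed with `oc get subscription -n openshift-storage`:

$ oc patch subscription ocs-operator -n openshift-storage \
    --type merge -p '{"spec":{"channel":"stable-4.4"}}'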
Additional info:
----------------------------------------------------------------------
>> Cephcluster is good
$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   STATE     HEALTH
ocs-storagecluster-cephcluster   /var/lib/rook     3          24h   Created   HEALTH_OK

>> Log snip from noobaa-operator
time="2020-05-22T10:48:07Z" level=info msg="✈️  RPC: system.read_system() Request: <nil>"
time="2020-05-22T10:48:07Z" level=error msg="⚠️ RPC: system.read_system() Response Error: Code=INTERNAL Message=bucket.tiering.tiers is not iterable"
time="2020-05-22T10:48:07Z" level=error msg="failed to read system info: bucket.tiering.tiers is not iterable" sys=openshift-storage/noobaa
time="2020-05-22T10:48:07Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2020-05-22T10:48:07Z" level=warning msg="⏳ Temporary Error: bucket.tiering.tiers is not iterable" sys=openshift-storage/noobaa
time="2020-05-22T10:48:07Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
time="2020-05-22T10:48:08Z" level=info msg="Start ..." bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=info msg="✅ Exists: BucketClass \"noobaa-default-bucket-class\"\n"
time="2020-05-22T10:48:08Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2020-05-22T10:48:08Z" level=info msg="SetPhase: Verifying" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2020-05-22T10:48:08Z" level=info msg="SetPhase: temporary error during phase \"Verifying\"" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=warning msg="⏳ Temporary Error: NooBaa BackingStore \"noobaa-default-backing-store\" is not yet ready" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:08Z" level=info msg="UpdateStatus: Done" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2020-05-22T10:48:09Z" level=info msg="Start ..." backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Secret \"rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user\"\n"
time="2020-05-22T10:48:09Z" level=info msg="SetPhase: Verifying" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="SetPhase: Connecting" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Service \"noobaa-mgmt\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Secret \"noobaa-operator\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✅ Exists: Secret \"noobaa-admin\"\n"
time="2020-05-22T10:48:09Z" level=info msg="✈️  RPC: system.read_system() Request: <nil>"
time="2020-05-22T10:48:09Z" level=error msg="⚠️ RPC: system.read_system() Response Error: Code=INTERNAL Message=bucket.tiering.tiers is not iterable"
time="2020-05-22T10:48:09Z" level=info msg="SetPhase: temporary error during phase \"Connecting\"" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=warning msg="⏳ Temporary Error: bucket.tiering.tiers is not iterable" backingstore=openshift-storage/noobaa-default-backing-store
time="2020-05-22T10:48:09Z" level=info msg="UpdateStatus: Done" backingstore=openshift-storage/noobaa-default-backing-store
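For completeness, this kind of data can be collected with commands along the following lines; a minimal sketch, assuming the default deployment and resource names visible in the outputs above:

$ oc logs -n openshift-storage deploy/noobaa-operator --since=1h | grep -Ei 'error|warn'
$ oc describe backingstore noobaa-default-backing-store -n openshift-storage
$ noobaa status -n openshift-storage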
Created attachment 1694802 [details] noobaa-core log - repro
Logs have been added, waiting for a repro
Based on the last comment from Ben, and since we already have a few repros and this bug is about service unavailability, proposing this as a blocker for 4.5.
Patches referenced here have been backported with the following backport PRs: https://github.com/noobaa/noobaa-core/pull/6048 https://github.com/noobaa/noobaa-core/pull/6072
The bug cannot be verified directly, so we have to rely on our regression testing to verify it. As can be seen in test run https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9875/, tests that previously led to the issue are now passing successfully. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754