Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------------
The OCS build was ocs-operator.v4.5.0-515.ci, so the default noobaa-backingstore was stuck in the Connecting state indefinitely - Bug 1866781. Created a few backingstores using PV-pool with different StorageClasses, and all reached the Ready state.

Before OCP upgrade
++++++++++++++++++++++++
OCP version = 4.5.0-0.nightly-2020-08-06-062632

======= backingstore ==========
NAME                           TYPE            PHASE        AGE
neha-cephfs                    pv-pool         Ready        14h
neha-cli                       pv-pool         Ready        14h
neha-test                      pv-pool         Ready        14h
noobaa-default-backing-store   s3-compatible   Connecting   18h

After OCP upgrade
+++++++++++++++++++++++
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-07-024812   True        False         44h     Cluster version is 4.5.0-0.nightly-2020-08-07-024812

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.5.0-515.ci   OpenShift Container Storage   4.5.0-515.ci              Succeeded

$ oc get backingstore
NAME                           TYPE            PHASE      AGE
neha-cephfs                    pv-pool         Rejected   2d22h
neha-cli                       pv-pool         Rejected   2d22h
neha-test                      pv-pool         Rejected   2d22h
noobaa-default-backing-store   s3-compatible   Ready      3d2h

Observations:
a) The noobaa-default-backing-store automatically transitioned from Connecting -> Ready
b) All 3 pv-pool based backingstores transitioned from Ready -> Rejected

From one of the rejected backingstores, "neha-test", which used the thin SC:
--------------------------------
status:
  conditions:
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: Unknown
    type: Available
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: "True"
    type: Degraded
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: Unknown
    type: Upgradeable
  mode:
    modeCode: ALL_NODES_OFFLINE
    timeStamp: 2020-08-07 19:15:41.350622572 +0000 UTC m=+107989.300670396
  phase: Rejected

Version of all relevant components (if applicable):
------------------------------------------------------
OCS version = ocs-operator.v4.5.0-515.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
-------------------------------------------
Yes

Is there any workaround available to the best of your knowledge?
--------------------------------------------
Not sure

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
------------------------------------
4

Can this issue be reproduced?
------------------------------
Need to test

Can this issue be reproduced from the UI?
---------------------------------
The backingstores were created from the UI as well

If this is a regression, please provide more details to justify this:
-------------------------------------------------
PV-pool is tested from OCS 4.5 onwards

Steps to Reproduce:
-------------------------
1. Install OCS ocs-operator.v4.5.0-515.ci, which does not have the fix for Bug 1866781, on a vSphere cluster.
   a) The default backingstore will be stuck in the Connecting state.
2. Create 2-3 new backingstores with Provider = PVC in the following combinations:
   a) Using noobaa-cli: noobaa-cli backingstore create pv-pool neha-cli -n openshift-storage (NooBaa uses the ocs-storagecluster-ceph-rbd SC by default)
   b) Create a BS from the UI by selecting StorageClass = thin
   c) Create a BS from the UI by selecting StorageClass = ocs-storagecluster-cephfs
3. Check the states of the backingstores. They are all in the Ready state.
4. Upgrade OCP from one 4.5 build to another:
   oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-07-024812 --force
5. Check the status of all the existing backingstores. For me, the default backingstore transitioned from Connecting -> Ready, and all the other backingstores transitioned from Ready -> Rejected.

Actual results:
-------------------------
After the OCP upgrade, the default backingstore transitioned from Connecting -> Ready (even though the 4.5.0-515 build does not have the fix for the "Connecting" state of the BS), and all the other backingstores transitioned from Ready -> Rejected.

Expected results:
----------------------
The backingstores should not have gone to the Rejected state after the OCP upgrade.

Additional info:
---------------------
$ oc get backingstore
NAME                           TYPE            PHASE        AGE
neha-test                      pv-pool         Ready        4m27s
noobaa-default-backing-store   s3-compatible   Connecting   4h11m

$ oc get pvc | grep noobaa
neha-test-noobaa-pvc-d404b1c0   Bound   pvc-5cab3441-a580-4098-857a-f9cc251926c2   50Gi   RWO   thin   4m49s

$ oc get pod | grep noobaa-pod
neha-test-noobaa-pod-d404b1c0   1/1   Running   0   5m45s

So the thin SC also works.

$ oc describe pvc neha-test-noobaa-pvc-d404b1c0
Name:          neha-test-noobaa-pvc-d404b1c0
Namespace:     openshift-storage
StorageClass:  thin
Status:        Bound
Volume:        pvc-5cab3441-a580-4098-857a-f9cc251926c2
Labels:        pool=neha-test
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      50Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    neha-test-noobaa-pod-d404b1c0
Events:
  Type    Reason                 Age    From                         Message
  ----    ------                 ----   ----                         -------
  Normal  ProvisioningSucceeded  6m35s  persistentvolume-controller  Successfully provisioned volume pvc-5cab3441-a580-4098-857a-f9cc251926c2 using kubernetes.io/vsphere-volume

For the neha-cli backingstore:
---------------------
./noobaa-cli backingstore create pv-pool neha-cli -n openshift-storage

$ oc get pvc | grep cli
neha-cli-noobaa-pvc-ba4e693f   Bound   pvc-dc0c8e9f-6d45-4c9d-83cb-e4e19e2ac8ea   30Gi   RWO   ocs-storagecluster-ceph-rbd   4m4s

[nberry@localhost sid-aug6-45-515]$ oc get pods | grep cli
neha-cli-noobaa-pod-ba4e693f   1/1   Running   0   4m27s
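The phase checks above can be spot-checked with a quick filter over the `oc get backingstore` output. This is an illustrative sketch only: the here-doc replays the post-upgrade sample output from this report; on a live cluster you would feed it `oc get backingstore -n openshift-storage --no-headers` instead.

```shell
# Flag every backingstore whose PHASE column (field 3) is not Ready.
# The here-doc replays the captured post-upgrade output; on a real
# cluster, replace it with: oc get backingstore -n openshift-storage --no-headers
not_ready=$(awk '$3 != "Ready" {print $1 ": " $3}' <<'EOF'
neha-cephfs                    pv-pool         Rejected   2d22h
neha-cli                       pv-pool         Rejected   2d22h
neha-test                      pv-pool         Rejected   2d22h
noobaa-default-backing-store   s3-compatible   Ready      3d2h
EOF
)
echo "$not_ready"
```

With this sample data it lists the three pv-pool backingstores as Rejected and omits the default backingstore, which is Ready.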
Fixed another issue where pods were not being deleted during node drain because they had finalizers.
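For reference, the manual workaround for a pod stuck behind finalizers is to clear `metadata.finalizers` with a merge patch; the fix above handles this in the operator itself. This is a hedged sketch: the patch payload is built locally, the pod name is taken from this report, and the actual `oc` call is left commented out because it needs a live cluster.

```shell
# Merge-patch payload that clears metadata.finalizers on a stuck pod.
PATCH='{"metadata":{"finalizers":null}}'
echo "$PATCH"

# Against a live cluster (pod name taken from this report):
# oc -n openshift-storage patch pod neha-test-noobaa-pod-d404b1c0 \
#     --type=merge -p "$PATCH"
```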
While troubleshooting is in progress, moving the BZ to the Assigned state. Let me know if, instead of marking this BZ FAILEDQA, you want me to raise a separate BZ.
This last issue is not really relevant to the upgrade issue. It seems we are running out of memory in the pods due to high load on 2 of the backingstores. These pods being restarted every once in a while causes the backingstore and bucketclass to go in and out of the Rejected phase, as data can't be written to those pods. We fixed this by doubling the amount of memory required by the pod and lowering the cache used by it.
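The fix described here (more pod memory, smaller cache) would take roughly the shape of the following resource stanza on the pv-pool pod spec. This is a hypothetical sketch only; the actual field names and values shipped in the fix may differ.

```yaml
# Hypothetical sketch -- values are illustrative, not the shipped ones.
resources:
  requests:
    memory: 800Mi   # doubled from an assumed 400Mi baseline
  limits:
    memory: 800Mi   # limit raised in step so the pod is not OOM-killed under load
```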
Hi Mudit, hopefully the fix is there in the 4.5.0-rc3 build, right? If yes, can this BZ be moved to ON_QA?
Yes, it's there.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754