Bug 1867762 - After OCP upgrade, user created backingstores with provider: PVC went to rejected state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Jacky Albo
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-10 17:38 UTC by Neha Berry
Modified: 2020-09-23 09:05 UTC
CC List: 7 users

Fixed In Version: 4.5.0-67.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:18:38 UTC
Embargoed:




Links
GitHub noobaa/noobaa-core PR 6148 (closed): Disabling block_verifer and limiting cache to 100MB (last updated 2021-02-10)
GitHub noobaa/noobaa-core PR 6149 (closed): Backport to 5.5 (last updated 2021-02-10)
GitHub noobaa/noobaa-operator PR 390 (closed): Will reconcile on mode change for backingstores (last updated 2021-02-10)
GitHub noobaa/noobaa-operator PR 391 (closed): Backport to 2.3 (last updated 2021-02-10)
GitHub noobaa/noobaa-operator PR 393 (closed): Change pod delete handling for pvpool (last updated 2021-02-10)
GitHub noobaa/noobaa-operator PR 396 (closed): change with pod delete handling for pvpool (last updated 2021-02-10)
GitHub noobaa/noobaa-operator PR 402 (closed): changing default mem req in pvpool pod to 400M (last updated 2021-02-10)
GitHub noobaa/noobaa-operator PR 403 (closed): changing default mem req in pvpool pod to 400M (last updated 2021-02-10)
Red Hat Product Errata RHBA-2020:3754 (last updated 2020-09-15)

Description Neha Berry 2020-08-10 17:38:43 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
----------------------------------------------------------------------------

The OCS build was ocs-operator.v4.5.0-515.ci, hence the default noobaa backingstore was stuck in the Connecting state indefinitely (Bug 1866781).

Created a few backingstores using PV-pool with different SCs, and all of them came up in the Ready state.

Before OCP upgrade
++++++++++++++++++++++++

OCP version = 4.5.0-0.nightly-2020-08-06-062632

======= backingstore ==========
NAME                           TYPE            PHASE        AGE
neha-cephfs                    pv-pool         Ready        14h
neha-cli                       pv-pool         Ready        14h
neha-test                      pv-pool         Ready        14h
noobaa-default-backing-store   s3-compatible   Connecting   18h



After OCP Upgrade
+++++++++++++++++++++++
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-07-024812   True        False         44h     Cluster version is 4.5.0-0.nightly-2020-08-07-024812

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.5.0-515.ci   OpenShift Container Storage   4.5.0-515.ci              Succeeded

$ oc get backingstore
NAME                           TYPE            PHASE      AGE
neha-cephfs                    pv-pool         Rejected   2d22h
neha-cli                       pv-pool         Rejected   2d22h
neha-test                      pv-pool         Rejected   2d22h
noobaa-default-backing-store   s3-compatible   Ready      3d2h

Observations:

a) The noobaa-default-backing-store automatically transitioned from Connecting -> Ready
b) All 3 pv-pool based backingstores transitioned from Ready -> Rejected

Status from one of the rejected backingstores, "neha-test", which used the thin SC
--------------------------------

status:
  conditions:
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: Unknown
    type: Available
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: "True"
    type: Degraded
  - lastHeartbeatTime: "2020-08-06T18:00:09Z"
    lastTransitionTime: "2020-08-10T02:10:04Z"
    message: BackingStorePhaseRejected
    reason: 'Backing store mode: ALL_NODES_OFFLINE'
    status: Unknown
    type: Upgradeable
  mode:
    modeCode: ALL_NODES_OFFLINE
    timeStamp: 2020-08-07 19:15:41.350622572 +0000 UTC m=+107989.300670396
  phase: Rejected
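
For reference, the mode and phase shown above can be pulled straight from the CR; a minimal sketch using the backingstore name and namespace from this report (the jsonpath fields match the status block pasted above):

```
$ oc get backingstore neha-test -n openshift-storage -o yaml
$ oc get backingstore neha-test -n openshift-storage \
    -o jsonpath='{.status.mode.modeCode}{"\n"}{.status.phase}{"\n"}'
```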




Version of all relevant components (if applicable):
------------------------------------------------------
OCS version = ocs-operator.v4.5.0-515.ci



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
-------------------------------------------
yes

Is there any workaround available to the best of your knowledge?
--------------------------------------------
Not sure


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
------------------------------------
4


Is this issue reproducible?
------------------------------
need to test

Can this issue be reproduced from the UI?
---------------------------------
Backingstores were created from the UI as well

If this is a regression, please provide more details to justify this:
-------------------------------------------------

PV-pool is tested from OCS 4.5 onwards

Steps to Reproduce:
-------------------------
1. Install OCS ocs-operator.v4.5.0-515.ci, which does not have the fix for Bug 1866781, on a vSphere cluster
   a) The default backingstore will be stuck in the Connecting state
2. Create 2-3 new backingstores with Provider = PVC in the following combinations (a sample pv-pool BackingStore CR is sketched after these steps):
   a) Using the noobaa CLI: noobaa backingstore create pv-pool neha-cli -n openshift-storage (NooBaa uses the ocs-storagecluster-ceph-rbd SC by default)
   b) Create a BS from the UI, selecting StorageClass = thin
   c) Create a BS from the UI, selecting StorageClass = ocs-storagecluster-cephfs
3. Check the states of the backingstores. They are all in the Ready state
4. Upgrade OCP from one 4.5 build to the other
     oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-07-024812 --force

5. Check the status of all the existing backingstores. For me, the default backingstore transitioned from Connecting -> Ready and all other backingstores transitioned from Ready -> Rejected
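
For step 2, a pv-pool BackingStore CR that is roughly equivalent to the CLI/UI creation is sketched below. This is only an illustration: the name, size, and storage class are taken from this report, and the field layout is my reading of the noobaa.io/v1alpha1 BackingStore schema, so verify it against the CRD in your build.

```
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: neha-test
  namespace: openshift-storage
spec:
  type: pv-pool
  pvPool:
    numVolumes: 1                 # number of backing PVs/pods
    storageClass: thin            # or ocs-storagecluster-cephfs / ocs-storagecluster-ceph-rbd
    resources:
      requests:
        storage: 50Gi             # size of each PV, matching the PVC seen below
```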


Actual results:
-------------------------
After the OCP upgrade, the default backingstore transitioned from Connecting -> Ready (even though the 4.5.0-515 build does not have the fix for the "Connecting" state of the BS) and all other backingstores transitioned from Ready -> Rejected

Expected results:
----------------------
The backingstores should not have gone to the Rejected state after the OCP upgrade

Additional info:
---------------------

```
$ oc get backingstore
NAME                           TYPE            PHASE        AGE
neha-test                      pv-pool         Ready        4m27s
noobaa-default-backing-store   s3-compatible   Connecting   4h11m
```
```
$ oc get pvc|grep noobaa
neha-test-noobaa-pvc-d404b1c0   Bound    pvc-5cab3441-a580-4098-857a-f9cc251926c2   50Gi       RWO            thin                          4m49s
```
```
$ oc get pod|grep noobaa-pod
neha-test-noobaa-pod-d404b1c0                                     1/1     Running     0          5m45s
```
*So thin SC also works*

```
$ oc describe pvc neha-test-noobaa-pvc-d404b1c0
Name:          neha-test-noobaa-pvc-d404b1c0
Namespace:     openshift-storage
StorageClass:  thin
Status:        Bound
Volume:        pvc-5cab3441-a580-4098-857a-f9cc251926c2
Labels:        pool=neha-test
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      50Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    neha-test-noobaa-pod-d404b1c0
Events:
  Type    Reason                 Age    From                         Message
  ----    ------                 ----   ----                         -------
  Normal  ProvisioningSucceeded  6m35s  persistentvolume-controller  Successfully provisioned volume pvc-5cab3441-a580-4098-857a-f9cc251926c2 using kubernetes.io/vsphere-volume
```

For the neha-cli backingstore
---------------------

./noobaa-cli backingstore create pv-pool neha-cli -n openshift-storage


$ oc get pvc |grep cli
neha-cli-noobaa-pvc-ba4e693f      Bound    pvc-dc0c8e9f-6d45-4c9d-83cb-e4e19e2ac8ea   30Gi       RWO            ocs-storagecluster-ceph-rbd   4m4s
[nberry@localhost sid-aug6-45-515]$ oc get pods |grep cli
neha-cli-noobaa-pod-ba4e693f                                      1/1     Running     0          4m27s

Comment 7 Jacky Albo 2020-08-20 09:03:41 UTC
Fixed another issue where pods were not being deleted during node drain because they had finalizers.
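
For anyone who hits the drain problem on a build without this fix, the pod's finalizers can be inspected and, as a manual last resort, cleared. This is only a sketch: the pod name is taken from this report, and the finalizer value is whatever the operator set on your pod.

```
$ oc get pod neha-test-noobaa-pod-d404b1c0 -n openshift-storage \
    -o jsonpath='{.metadata.finalizers}{"\n"}'
$ oc patch pod neha-test-noobaa-pod-d404b1c0 -n openshift-storage \
    --type=merge -p '{"metadata":{"finalizers":null}}'
```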

Comment 10 Neha Berry 2020-08-25 15:17:12 UTC
While troubleshooting is in progress, moving the BZ to the Assigned state.

Let me know if, instead of marking this BZ FailedQA, you want me to raise a separate BZ.

Comment 11 Jacky Albo 2020-08-26 07:24:08 UTC
This last issue is not really related to the upgrade issue. It seems the pods are running out of memory due to high load on 2 of the backingstores. These pods being restarted every once in a while causes the backingstore and bucketclass to go in and out of the Rejected phase, since data can't be written to those pods.
We fixed this by doubling the amount of memory requested by the pod and lowering the cache it uses.
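
To confirm the new settings on a build that has the fix, the pv-pool pod's resource requests can be checked directly; a sketch, with the pod name taken from this report (the 400M request figure comes from the linked noobaa-operator PRs):

```
$ oc get pod neha-test-noobaa-pod-d404b1c0 -n openshift-storage \
    -o jsonpath='{.spec.containers[*].resources}{"\n"}'
```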

Comment 12 Neha Berry 2020-08-27 17:33:28 UTC
Hi Mudit,

Hopefully the fix is in the 4.5.0-rc3 build, right? If yes, can this BZ be moved to ON_QA?

Comment 13 Mudit Agarwal 2020-08-28 02:16:29 UTC
Yes, it's there.

Comment 18 errata-xmlrpc 2020-09-15 10:18:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

