Description of problem (please be as detailed as possible and provide log snippets):
-----------------------------------------------------------------------
In recent times, whenever we tested node drains or an OCP upgrade (which involves a drain followed by a node reboot), all the backingstores go to Rejected state. This happens because the noobaa-core pod may have been restarted on another node. It would be helpful to have HA for these important noobaa pods so that services are not disrupted.

This is especially visible during an OCP upgrade, where nodes are maintained in a rolling fashion during the MCO upgrade and the noobaa-core pod may get restarted more than once. As a result, the backingstores (BS) stay in Rejected state for quite some time, disrupting the IOs running against their OBCs.

For more details on the issue and the logs, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1867762#c14

>> During OCP upgrade

Fri Aug 28 10:17:51 UTC 2020

NAME                           TYPE            PHASE      AGE
bs-cephfs                      pv-pool         Rejected   15h
bs-pool-rbd2                   pv-pool         Rejected   44m
bs-rbd                         pv-pool         Rejected   15h
bs-thin                        pv-pool         Rejected   15h
noobaa-default-backing-store   s3-compatible   Rejected   20h

NAME                          PLACEMENT                                                          PHASE      AGE
bclass-pool-rbd2              map[tiers:[map[backingStores:[bs-pool-rbd2] placement:Spread]]]    Rejected   43m
noobaa-default-bucket-class   map[tiers:[map[backingStores:[noobaa-default-backing-store]]]]     Rejected   20h
pv-bclass-cephfs              map[tiers:[map[backingStores:[bs-cephfs] placement:Spread]]]       Rejected   15h
pv-bucket-rbd                 map[tiers:[map[backingStores:[bs-rbd] placement:Spread]]]          Rejected   15h

NAME                               READY   STATUS              RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
bs-cephfs-noobaa-pod-ec2e00ce      0/1     ContainerCreating   0          17s    <none>         compute-2   <none>           <none>
bs-pool-rbd2-noobaa-pod-b138c371   0/1     ContainerCreating   0          17s    <none>         compute-2   <none>           <none>
bs-rbd-noobaa-pod-6d9dc069         0/1     ContainerCreating   0          18s    <none>         compute-2   <none>           <none>
bs-thin-noobaa-pod-6161aa95        0/1     ContainerCreating   0          18s    <none>         compute-2   <none>           <none>
nbio-cephfs-6484c755dc-tfspk       1/1     Running             3          128m   10.131.0.108   compute-1   <none>           <none>
nbio-default-bs-cc86ccd6c-xpbmg    1/1     Running             3          127m   10.131.0.109   compute-1   <none>           <none>
nbio-rbd-c87c44f8c-hnh6t           1/1     Running             20         14h    10.131.0.103   compute-1   <none>           <none>
nbio-rbd2-5c55b48b5-8slmc          1/1     Running             1          38m    10.131.0.113   compute-1   <none>           <none>
nbio-rgw-74c7b4d576-c2hz4          1/1     Running             1          47m    10.131.0.111   compute-1   <none>           <none>
nbio-thin-644bf9fcfb-zkmbm         1/1     Running             13         130m   10.131.0.106   compute-1   <none>           <none>
noobaa-core-0                      1/1     Running             0          41s    10.129.2.111   compute-2   <none>           <none>
noobaa-db-0                        1/1     Running             0          15h    10.129.2.96    compute-2   <none>           <none>
noobaa-endpoint-59c5769c8-8bdrq    1/1     Running             0          74s    10.129.2.103   compute-2   <none>           <none>
noobaa-operator-fc85888b7-2dlh4    1/1     Running             0          79s    10.131.0.122   compute-1   <none>           <none>

The backingstores may also go to Rejected state during an OCS upgrade - https://bugzilla.redhat.com/show_bug.cgi?id=1868646#c13

Version of all relevant components (if applicable):
----------------------------------------------------
Seen since OCS 4.2 builds.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
------------------------------------------------
Yes. For the OCP upgrade, because of the larger data size, my MCO upgrade took more than 25 minutes. During that timeframe the backingstores were in Rejected state almost the whole time, or were continuously transitioning from Ready to Rejected.

Is there any workaround available to the best of your knowledge?
------------------------------------------------------------------
No.
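Not a workaround as such, but a minimal observation aid we can run alongside the drain/upgrade to correlate noobaa-core rescheduling with the BackingStore phase flapping. This is only a sketch; it assumes the openshift-storage namespace and the pod/resource names shown in the output above.

while true; do
  ts=$(date --utc +%FT%TZ)
  # Record which node currently hosts noobaa-core-0 (empty while it is being rescheduled)
  node=$(oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.nodeName}' 2>/dev/null)
  echo "$ts noobaa-core-0 on node: ${node:-<not scheduled>}"
  # Capture the BackingStore phases at the same timestamp
  oc -n openshift-storage get backingstore
  echo ""
  sleep 10
done | tee noobaa-core-vs-backingstore.txt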
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
-------------------------------------------------------------
3

Is this issue reproducible?
-----------------------------------
Always

Can this issue be reproduced from the UI?
----------------------------------------
NA

If this is a regression, please provide more details to justify this:
------------------------------------------------------------
No

Steps to Reproduce:
------------------------------------
1. Created some backingstores, bucketclasses, OBCs, and nbio pods using the pv-pool backingstores as well as the default backingstore.
2. Performed an OCP upgrade from 4.5.0-0.nightly-2020-08-24-135141 to 4.5.0-0.nightly-2020-08-28-063337.
3. Continuously checked the backingstore state using a loop (the sketch under "Additional info" below also records timestamps so the Rejected duration can be measured):

$ oc project openshift-storage
$ while true; do
    date --utc
    echo "====Backingstore====="; oc get backingstore; echo ""
    echo "====bucketclass====="; oc get bucketclass; echo ""
    echo "====noobaa pods====="; oc get pods -o wide | egrep 'noobaa|nbio'; echo ""
    echo "====obcs====="; oc get obc
    sleep 10
  done | tee noobaa-resources-in-loop.txt

4. Checked the state of the backingstores after successful completion of the upgrade; they did not transition to Rejected state.

Actual results:
-------------------------
While nodes (which may be hosting the noobaa-core pod) are drained/rebooted, the backingstores go to Rejected state for a long time, which disrupts IOs.

Expected results:
------------------------------
If we had HA for the noobaa pods (core/db), we would not have faced IO disruptions during upgrades, node drains, etc.

Additional info:
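A variant of the loop from step 3 that quantifies the downtime reported above. This is only a sketch, not part of the original run: it records a timestamped phase history per BackingStore, assuming the openshift-storage namespace and that the BackingStore CRs expose .status.phase (Ready/Rejected), as shown in the outputs above.

while true; do
  ts=$(date --utc +%FT%TZ)
  # One line per BackingStore: <timestamp> <name> <phase>
  oc -n openshift-storage get backingstore \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
    | awk -v ts="$ts" '{print ts, $1, $2}'
  sleep 10
done | tee backingstore-phase-history.txt

# Afterwards, the number of non-Ready samples for a backingstore multiplied by the
# 10-second interval approximates its downtime, e.g. for bs-rbd:
grep ' bs-rbd ' backingstore-phase-history.txt | grep -vc 'Ready$'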
Maybe my understanding is wrong, but if we are adding the request to the backlog, why can't we keep the bug open as an RFE and target it for a future release where we may plan to take it up, e.g. 4.7? We can always re-target if it gets delayed. I understand the current behavior is by design, but we are requesting a design change because we run into issues due to the lack of HA. That's why we raised it as an RFE. Closing the BZ as NOTABUG doesn't, at a glance, show that it is being tracked in the backlog, and it may also cause us to lose track of it, since we do not refer to the backlog regularly to monitor the plan. Even if we have to close an RFE, why not use DEFERRED? Correct me if my understanding of the backlog is wrong.
Maybe NOTABUG was the wrong close here, but we don't want to manage RFEs in BZs.
HA is much needed. I agree that we should track RFEs in a single place, so having it in Jira only is OK. However, NOTABUG is not the correct reason. Also, I believe this BZ should not be marked as an RFE but as a regular high-severity bug, and it should capture the fact that there is downtime while performing admin operations.
After performing a set of OCP and OCS upgrades, node drains, node shutdowns, etc., it is seen that NooBaa resources and IO are severely impacted, as noobaa-core may be drained multiple times, each time resulting in downtime for the backingstores. At times the downtime exceeds 5-10 minutes. IMO we should consider this a must-fix. If I recall correctly, even the provisioner pods used to be StatefulSets in the earlier days, but because of the same downtime problem they were converted to Deployments with replica 2 (see the sketch after this comment for why blocking eviction alone is not enough). I would request that this be considered for a fix in the next release. Hence, based on Comment#7 and Comment#0, re-opening the bug to start a discussion. Thanks in advance.
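To illustrate why replication (and not just eviction control) is being requested: a PodDisruptionBudget can stop `oc adm drain` from evicting the single noobaa-core replica, but with only one replica that merely stalls the node drain/upgrade instead of keeping the service available. This is a sketch only, not a supported configuration; the label selector below is an assumption and must be verified against the actual pod labels.

oc -n openshift-storage apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: noobaa-core-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      noobaa-core: noobaa   # assumed label; check with: oc get pod noobaa-core-0 --show-labels
EOF

With a single replica behind such a PDB, the drain blocks until the PDB is removed, which is exactly the trade-off that replica-2 Deployments (as with the provisioner pods mentioned above) avoid.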
We discussed this in the triage meeting. HA and non-disruptive upgrade are not the same, and the ask here is for the latter. It is not strictly an HA issue, since non-disruptive upgrade can be achieved without HA. I would change the title.
RFE, won't be in time for 4.6, probably want to push to 4.7
*** Bug 1913771 has been marked as a duplicate of this bug. ***
Also related to https://bugzilla.redhat.com/show_bug.cgi?id=1889616
As a feature, this should be planned as part of a release version.