Description of problem (please be as detailed as possible and provide log snippets):
-----------------------------------------------------------------------
In recent times, whenever we tested node drains or an OCP upgrade (which involves a drain followed by a node reboot), all the backingstores go to Rejected state. This happens because the noobaa-core pod may have been restarted on another node. It would be helpful to have HA for these important noobaa pods so that services are not disrupted.

This is especially visible during an OCP upgrade, where nodes are maintained in a rolling fashion during the MCO upgrade and the noobaa-core pod may get restarted more than once. As a result, the backingstores (BS) stay in Rejected state for quite some time, disrupting the IOs running against their OBCs.

For more details on the issue and the logs, please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1867762#c14

>> During OCP upgrade

Fri Aug 28 10:17:51 UTC 2020

NAME                           TYPE            PHASE      AGE
bs-cephfs                      pv-pool         Rejected   15h
bs-pool-rbd2                   pv-pool         Rejected   44m
bs-rbd                         pv-pool         Rejected   15h
bs-thin                        pv-pool         Rejected   15h
noobaa-default-backing-store   s3-compatible   Rejected   20h

NAME                          PLACEMENT                                                          PHASE      AGE
bclass-pool-rbd2              map[tiers:[map[backingStores:[bs-pool-rbd2] placement:Spread]]]    Rejected   43m
noobaa-default-bucket-class   map[tiers:[map[backingStores:[noobaa-default-backing-store]]]]     Rejected   20h
pv-bclass-cephfs              map[tiers:[map[backingStores:[bs-cephfs] placement:Spread]]]       Rejected   15h
pv-bucket-rbd                 map[tiers:[map[backingStores:[bs-rbd] placement:Spread]]]          Rejected   15h

NAME                               READY   STATUS              RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
bs-cephfs-noobaa-pod-ec2e00ce      0/1     ContainerCreating   0          17s    <none>         compute-2   <none>           <none>
bs-pool-rbd2-noobaa-pod-b138c371   0/1     ContainerCreating   0          17s    <none>         compute-2   <none>           <none>
bs-rbd-noobaa-pod-6d9dc069         0/1     ContainerCreating   0          18s    <none>         compute-2   <none>           <none>
bs-thin-noobaa-pod-6161aa95        0/1     ContainerCreating   0          18s    <none>         compute-2   <none>           <none>
nbio-cephfs-6484c755dc-tfspk       1/1     Running             3          128m   10.131.0.108   compute-1   <none>           <none>
nbio-default-bs-cc86ccd6c-xpbmg    1/1     Running             3          127m   10.131.0.109   compute-1   <none>           <none>
nbio-rbd-c87c44f8c-hnh6t           1/1     Running             20         14h    10.131.0.103   compute-1   <none>           <none>
nbio-rbd2-5c55b48b5-8slmc          1/1     Running             1          38m    10.131.0.113   compute-1   <none>           <none>
nbio-rgw-74c7b4d576-c2hz4          1/1     Running             1          47m    10.131.0.111   compute-1   <none>           <none>
nbio-thin-644bf9fcfb-zkmbm         1/1     Running             13         130m   10.131.0.106   compute-1   <none>           <none>
noobaa-core-0                      1/1     Running             0          41s    10.129.2.111   compute-2   <none>           <none>
noobaa-db-0                        1/1     Running             0          15h    10.129.2.96    compute-2   <none>           <none>
noobaa-endpoint-59c5769c8-8bdrq    1/1     Running             0          74s    10.129.2.103   compute-2   <none>           <none>
noobaa-operator-fc85888b7-2dlh4    1/1     Running             0          79s    10.131.0.122   compute-1   <none>           <none>

The backingstores may also go to Rejected state during an OCS upgrade - https://bugzilla.redhat.com/show_bug.cgi?id=1868646#c13

Version of all relevant components (if applicable):
----------------------------------------------------
Seen since OCS 4.2 builds.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
------------------------------------------------
Yes. For the OCP upgrade, because of the larger data size, my MCO upgrade took more than 25 minutes. During that timeframe the backingstores were in Rejected state almost the whole time, or were continuously transitioning from Ready to Rejected.

Is there any workaround available to the best of your knowledge?
------------------------------------------------------------------
No.
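Not a workaround as such, but a minimal observation aid we can run alongside the drain/upgrade to correlate noobaa-core rescheduling with the BackingStore phase flapping. This is only a sketch; it assumes the openshift-storage namespace and the pod/resource names shown in the output above.

while true; do
  ts=$(date --utc +%FT%TZ)
  # Record which node currently hosts noobaa-core-0 (empty while it is being rescheduled)
  node=$(oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.nodeName}' 2>/dev/null)
  echo "$ts noobaa-core-0 on node: ${node:-<not scheduled>}"
  # Capture the BackingStore phases at the same timestamp
  oc -n openshift-storage get backingstore
  echo ""
  sleep 10
done | tee noobaa-core-vs-backingstore.txt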
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
-------------------------------------------------------------
3

Is this issue reproducible?
-----------------------------------
Always

Can this issue be reproduced from the UI?
----------------------------------------
NA

If this is a regression, please provide more details to justify this:
------------------------------------------------------------
No

Steps to Reproduce:
------------------------------------
1. Created some backingstores, bucketclasses, OBCs, and nbio pods using the pv-pool backingstores as well as the default backingstore.
2. Performed an OCP upgrade from 4.5.0-0.nightly-2020-08-24-135141 to 4.5.0-0.nightly-2020-08-28-063337.
3. Continuously checked the backingstore state using a loop (the sketch under "Additional info" below also records timestamps so the Rejected duration can be measured):

$ oc project openshift-storage
$ while true; do
    date --utc
    echo "====Backingstore====="; oc get backingstore; echo ""
    echo "====bucketclass====="; oc get bucketclass; echo ""
    echo "====noobaa pods====="; oc get pods -o wide | egrep 'noobaa|nbio'; echo ""
    echo "====obcs====="; oc get obc
    sleep 10
  done | tee noobaa-resources-in-loop.txt

4. Checked the state of the backingstores after successful completion of the upgrade; they did not transition to Rejected state.

Actual results:
-------------------------
While nodes (which may be hosting the noobaa-core pod) are drained/rebooted, the backingstores go to Rejected state for a long time, which disrupts IOs.

Expected results:
------------------------------
If we had HA for the noobaa pods (core/db), we would not have faced IO disruptions during upgrades, node drains, etc.

Additional info:
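A variant of the loop from step 3 that quantifies the downtime reported above. This is only a sketch, not part of the original run: it records a timestamped phase history per BackingStore, assuming the openshift-storage namespace and that the BackingStore CRs expose .status.phase (Ready/Rejected), as shown in the outputs above.

while true; do
  ts=$(date --utc +%FT%TZ)
  # One line per BackingStore: <timestamp> <name> <phase>
  oc -n openshift-storage get backingstore \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
    | awk -v ts="$ts" '{print ts, $1, $2}'
  sleep 10
done | tee backingstore-phase-history.txt

# Afterwards, the number of non-Ready samples for a backingstore multiplied by the
# 10-second interval approximates its downtime, e.g. for bs-rbd:
grep ' bs-rbd ' backingstore-phase-history.txt | grep -vc 'Ready$'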
Maybe my understanding is wrong, but if we are adding the request to the backlog, why can't we keep the bug open as an RFE and target it for a future release where we may plan to take it up, e.g. 4.7? We can always re-target if it gets delayed. I understand the current behavior is by design, but we are requesting a design change because we run into issues due to the lack of HA. That's why we raised it as an RFE. Closing the BZ as NOTABUG doesn't, at a glance, show that it is being tracked in the backlog, and it may also cause us to lose track of it, since we do not refer to the backlog regularly to monitor the plan. Even if we have to close an RFE, why not use DEFERRED? Correct me if my understanding of the backlog is wrong.
Maybe NOTABUG was the wrong close here, but we don't want to manage RFEs in BZs.
HA is much needed. I agree that we should track RFEs in a single place, so having it in Jira only is OK. However, NOTABUG is not the correct reason. Also, I believe this BZ should not be marked as an RFE but as a regular high-severity bug, and it should capture the fact that there is downtime while performing admin operations.
After performing a set of OCP and OCS upgrades, node drains, node shutdowns, etc., it is seen that NooBaa resources and IO are severely impacted, as noobaa-core may be drained multiple times, each time resulting in downtime for the backingstores. At times the downtime exceeds 5-10 minutes. IMO we should consider this a must-fix. If I recall correctly, even the provisioner pods used to be StatefulSets in the earlier days, but because of the same downtime problem they were converted to Deployments with replica 2 (see the sketch after this comment for why blocking eviction alone is not enough). I would request that this be considered for a fix in the next release. Hence, based on Comment#7 and Comment#0, re-opening the bug to start a discussion. Thanks in advance.
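To illustrate why replication (and not just eviction control) is being requested: a PodDisruptionBudget can stop `oc adm drain` from evicting the single noobaa-core replica, but with only one replica that merely stalls the node drain/upgrade instead of keeping the service available. This is a sketch only, not a supported configuration; the label selector below is an assumption and must be verified against the actual pod labels.

oc -n openshift-storage apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: noobaa-core-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      noobaa-core: noobaa   # assumed label; check with: oc get pod noobaa-core-0 --show-labels
EOF

With a single replica behind such a PDB, the drain blocks until the PDB is removed, which is exactly the trade-off that replica-2 Deployments (as with the provisioner pods mentioned above) avoid.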
We discussed this in the triage meeting. HA and non-disruptive upgrade are not the same, and the ask here is for the latter. It is not strictly an HA issue, since non-disruptive upgrade can be achieved without HA. I would change the title.
RFE, won't be in time for 4.6, probably want to push to 4.7
*** Bug 1913771 has been marked as a duplicate of this bug. ***
Also related to https://bugzilla.redhat.com/show_bug.cgi?id=1889616
As a feature, this should be planned as part of a release version.