Description of problem (please be as detailed as possible and provide log snippets):
Upgrade from 4.3 -> 4.4 failed with 'etcd server timeouts'.

Version of all relevant components (if applicable):
4.3.0-0.nightly-2020-04-13-190424
CSV: ocs-operator.v4.4.0-411.ci
Upgrade from 4.3 -> 4.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The OCS-CI upgrade test fails.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Tried only once.

Can this issue be reproduced from the UI?
Not sure.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run the upgrade test from OCS-CI.

Actual results:
Upgrade fails with:

      ignore_error=ignore_error, **kwargs
  File "/home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/utils.py", line 430, in run_cmd
    f"Error during execution of command: {masked_cmd}."
ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh rook-ceph-tools-6f59b98f4f-n6w96 ceph health detail.

The error is: Error from server: etcdserver: request timed out

==========================================
>           f"Resource: {self.resource_name} is not in expected phase: "
            f"{phase}"
        )
E       ocs_ci.ocs.exceptions.ResourceInUnexpectedState: Resource: ocs-operator.v4.4.0-411.ci is not in expected phase: Succeeded
==========================================

Expected results:
Upgrade should work.

Additional info:

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cv33-ua/jnk-vu1cv33-ua_20200414T140529/logs/failed_testcase_ocs_logs_1586876888/test_upgrade_ocs_logs/

Complete logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cv33-ua/jnk-vu1cv33-ua_20200414T140529/logs/

Jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6663/
Another run for upgrade:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6670/console

In this case there was no etcd server request timeout error; instead we hit 'ResourceInUnexpectedState', which was also the case in the earlier run. I am not sure if the root cause is the same across both runs.

========================================================
        if not sampler.wait_for_func_status(True):
            raise ResourceInUnexpectedState(
>               f"Resource: {self.resource_name} is not in expected phase: "
                f"{phase}"
            )
E           ocs_ci.ocs.exceptions.ResourceInUnexpectedState: Resource: ocs-operator.v4.4.0-411.ci is not in expected phase: Succeeded

ocs_ci/ocs/ocp.py:733: ResourceInUnexpectedState
========================================================
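For context, the ocs-ci check that raises this exception repeatedly samples the CSV phase until a timeout. A minimal shell sketch of that polling logic, assuming the loop shape and names are illustrative (not the actual ocs-ci code); on a live cluster each sample would come from `oc get csv -n openshift-storage <csv-name> -o jsonpath='{.status.phase}'`, while here the sampled phases are simulated so the loop runs standalone:

```shell
# Sketch of the phase check behind ResourceInUnexpectedState (simplified).
# The sampled phases are simulated; a real run would query the cluster.
expected="Succeeded"
sampled_phases="Pending Installing Installing"   # simulated: never reaches Succeeded
result="ResourceInUnexpectedState"
for phase in $sampled_phases; do
    if [ "$phase" = "$expected" ]; then
        # The CSV reached the expected phase within the sampling window.
        result="ok"
        break
    fi
done
echo "$result"
```

With the simulated samples above the loop exhausts without seeing `Succeeded`, mirroring the failure in both runs.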
(In reply to shylesh from comment #3)
> Another run for upgrade
> https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6670/console .
> In this case there was no etcd server request timeout error, instead
> 'ResourceInUnexpectedState' which was also case in the earlier run. I am not
> sure if root cause is same across both the runs.
>
> ========================================================
>         if not sampler.wait_for_func_status(True):
>             raise ResourceInUnexpectedState(
> >               f"Resource: {self.resource_name} is not in expected phase: "
>                 f"{phase}"
>             )
> E           ocs_ci.ocs.exceptions.ResourceInUnexpectedState: Resource:
> ocs-operator.v4.4.0-411.ci is not in expected phase: Succeeded
>
> ocs_ci/ocs/ocp.py:733: ResourceInUnexpectedState
> =======================================================

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cv33-ua/jnk-vu1cv33-ua_20200414T201453/logs/failed_testcase_ocs_logs_1586899472/test_upgrade_ocs_logs/
I am building the cluster once more to reproduce and will pause before teardown:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6680/console

So we will have a cluster available for investigation.
We had an issue with one of the repositories (its certificate expired), so the previous job I linked failed. Here is the new one:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6682/console

Is the described approach just a workaround that will later be fixed in the operator, or will it become part of the documentation so that the user has to do it manually? I will try the mentioned steps and see if the upgrade succeeds.
After step 2 of the workaround:

$ oc delete subscriptions -n openshift-storage lib-bucket-provisioner-alpha-community-operators-openshift-marketplace

nothing happened. I also had to remove the CSV, as it was still there:

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION   REPLACES   PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                Succeeded
ocs-operator.v4.3.0             OpenShift Container Storage   4.3.0                Succeeded

$ oc delete csv -n openshift-storage lib-bucket-provisioner.v1.0.0

After this, the upgrade started rolling:

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.3.0          OpenShift Container Storage   4.3.0                                Succeeded
ocs-operator.v4.4.0-411.ci   OpenShift Container Storage   4.4.0-411.ci   ocs-operator.v4.3.0   Pending

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.3.0          OpenShift Container Storage   4.3.0                                Replacing
ocs-operator.v4.4.0-411.ci   OpenShift Container Storage   4.4.0-411.ci   ocs-operator.v4.3.0   Installing

Umanga, do you know if the lib-bucket-provisioner.v1.0.0 subscription will stay the same for 4.3? Are we depending on this specific version, or can it change? If this is something we need to do as a workaround, we will need to delete this subscription; from the subscription's status we can get the currently installed CSV name:

installedCSV: lib-bucket-provisioner.v1.0.0

and then we need to delete that CSV as well. Is this something we will describe to customers in the documentation, or is there any other idea how to solve this without user intervention? We also need to make sure that nothing in the product is broken by this from the NooBaa point of view.
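To avoid hard-coding the lib-bucket-provisioner CSV version, the installed CSV name can be read from the subscription status before deleting it. A minimal sketch of that idea; the `awk` extraction is illustrative and runs here on a sample status line instead of live `oc` output, and the subscription name matches the one used above:

```shell
# Workaround sketch: delete the lib-bucket-provisioner subscription and its
# installed CSV, reading the CSV name from the subscription status so the
# version (v1.0.0 today) is not hard-coded.
#
# On a live cluster the status line would come from:
#   oc get subscription -n openshift-storage \
#     lib-bucket-provisioner-alpha-community-operators-openshift-marketplace \
#     -o yaml
# Here we use a sample line so the extraction can be shown standalone.
sub_status="  installedCSV: lib-bucket-provisioner.v1.0.0"
csv_name=$(echo "$sub_status" | awk '/installedCSV:/ {print $2}')
echo "$csv_name"
# Then, against the cluster:
#   oc delete subscription -n openshift-storage \
#     lib-bucket-provisioner-alpha-community-operators-openshift-marketplace
#   oc delete csv -n openshift-storage "$csv_name"
```

Reading the name from `installedCSV` keeps the workaround valid even if the lib-bucket-provisioner version changes in the catalog.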
4.2 and 4.3 are the same. We should find a way to remove this dependency without manual intervention, which is not acceptable. Please allow us to investigate and also consult the OLM team; the problem might also be in that mechanism.
As Nimrod wanted to reproduce on OCP 4.2, here are the data:

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200421T193753/logs/failed_testcase_ocs_logs_1587501073/test_upgrade_ocs_logs/

Jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6869/

OCP version: v4.2.29

As I don't see the OCP version details for the job on which we reported this BZ, I am adding them here: OCP 4.3.0-0.nightly-2020-04-13-190424.
To summarize the various discussions: we intend to do a code fix for 4.4, therefore giving devel_ack. The component and assignee might still change.
Updating and moving to ON_QE. The fix will be in the community operators, but testing is still going on.

We chose to go with "option #2": publishing a new version of lib-bucket-provisioner (v2) to the community operators (same channel as v1). This version will not require and will not own the two CRDs (OB/OBC), thus untangling the collision we had when we switched from "require" in OCS 4.3 to "own" in OCS 4.4.

Since this change is in the community operators and pushing a new version would affect existing deployments, we are using a private catalog to simulate the same flow. The current status is:

1. Deploying a new OCS 4.3 on top of OCP 4.3 when both v1 and v2 exist in the catalog works - v1 is installed and then upgraded to v2. Upgrade to OCS 4.4 works smoothly.
2. Deploying OCS 4.3 on top of OCP 4.3 with v1 in the catalog (existing customer) and then pushing v2 to the catalog - lib-bucket is upgraded and then the upgrade to OCS 4.4 works.
3. Deploying OCS 4.2 on OCP 4.3 with v1 in the catalog, pushing v2 to the catalog and then upgrading OCS to 4.3 works as well.

Testing details will be added in the following comment.
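For reference, the "require" vs "own" distinction lives in the CSV's customresourcedefinitions section. A hypothetical, abbreviated sketch of the two shapes (field names follow the OLM CSV schema; the exact contents of the OCS and lib-bucket-provisioner CSVs may differ):

```yaml
# OCS 4.3 style: the OB/OBC CRDs are listed as required, i.e. provided by
# another operator (lib-bucket-provisioner v1, which owned them).
spec:
  customresourcedefinitions:
    required:
      - name: objectbucketclaims.objectbucket.io
        kind: ObjectBucketClaim
        version: v1alpha1
      - name: objectbuckets.objectbucket.io
        kind: ObjectBucket
        version: v1alpha1

# OCS 4.4 style: the same CRDs moved under owned, which collided with
# lib-bucket-provisioner v1 also owning them:
#   owned:
#     - name: objectbucketclaims.objectbucket.io
#       ...
```

Publishing lib-bucket-provisioner v2 with neither entry for OB/OBC leaves OCS 4.4 as the sole owner, which is the untangling described above.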
It seems there is some new issue. From the OLM logs I see:

$ oc logs -n openshift-operator-lifecycle-manager catalog-operator-5b59684b6d-nx5zp

time="2020-05-12T12:49:43Z" level=info msg="error updating subscription status" channel=alpha error="Operation cannot be fulfilled on subscriptions.operators.coreos.com \"lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace\": the object has been modified; please apply your changes to the latest version and try again" id=lAVx6 namespace=openshift-storage pkg=lib-bucket-provisioner source=lib-bucket-catalog sub=lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace
E0512 12:49:43.957056       1 queueinformer_operator.go:290] sync "openshift-storage" failed: error updating Subscription status: Operation cannot be fulfilled on subscriptions.operators.coreos.com "lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace": the object has been modified; please apply your changes to the latest version and try again
time="2020-05-12T12:49:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=/apis/operators.coreos.com/v1alpha1/namespaces/openshift-storage/subscriptions/lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace
E0512 12:49:46.016558       1 queueinformer_operator.go:290] sync "openshift-storage" failed: error calculating generation changes due to new bundle: ceph.rook.io/v1/CephObjectStoreUser (cephobjectstoreusers) already provided by ocs-operator.v4.3.0
E0512 12:49:46.803863       1 queueinformer_operator.go:290] sync "openshift-storage" failed: error calculating generation changes due to new bundle: noobaa.io/v1alpha1/NooBaa (noobaas) already provided by ocs-operator.v4.3.0
time="2020-05-12T12:50:03Z" level=warning msg="no installplan found with matching generation, creating new one" id=40xu4 namespace=openshift-storage
time="2020-05-12T12:50:03Z" level=info msg=syncing id=lEdNi ip=install-fdv8h namespace=openshift-storage phase=

The CSVs look like:

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                          Failed
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Pending
ocs-operator.v4.3.0             OpenShift Container Storage   4.3.0                                          Replacing
ocs-operator.v4.4.0-420.ci      OpenShift Container Storage   4.4.0-420.ci   ocs-operator.v4.3.0             Failed

Tried from 4.3 live content to the latest internal 4.4 RC5 build: 4.4.0-420.ci.
OCP version: Server Version: 4.4.0-0.nightly-2020-05-08-224132

The Jenkins job is still running:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7451/console

Once the job fails and collects must gather, the logs should be available here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200512T082600/logs/ocs-ci-logs-1589282568/

So it seems we have some issue with this approach, at least on the latest OCP 4.4.
The must gather logs will actually be collected here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200512T082600/logs/failed_testcase_ocs_logs_1589282568/test_upgrade_ocs_logs/

I see the collection is now in progress.
The weird thing is that this approach worked fine last week on OCP 4.3, as you can see from the console output mentioned in this mail thread:
http://post-office.corp.redhat.com/archives/rhocs-eng/2020-May/msg00012.html
Trying to reproduce once more on OCP 4.3 here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7469/console
It looks like the upgrade started rolling fine on OCP 4.3, but then we hit another NooBaa issue, so I will report a BZ for that tomorrow morning.

Based on a conversation with Vu, this can be caused by this bug fix:
https://github.com/operator-framework/operator-lifecycle-manager/pull/1484
which was merged to OCP 4.4 but not to 4.3; that is probably why the approach works on OCP 4.3 but not on OCP 4.4.
As decided by the stakeholders, we are falling back to reverting the CRD-owning change in ocs-operator. https://github.com/openshift/ocs-operator/pull/518 has been merged to the release branch.
Removing the needinfo as everything is clear now
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2393
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days