Description of problem (please be as detailed as possible and provide log snippets):
Upgrade from 4.3 -> 4.4 failed with 'etcd server timeouts'.

Version of all relevant components (if applicable):
4.3.0-0.nightly-2020-04-13-190424
CSV: ocs-operator.v4.4.0-411.ci
Upgrade from 4.3 -> 4.4

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The OCS-CI upgrade test fails.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Tried only once.

Can this issue be reproduced from the UI?
Not sure.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run the upgrade test from OCS-CI.

Actual results:
Upgrade fails with:

      ignore_error=ignore_error, **kwargs
  File "/home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/utils.py", line 430, in run_cmd
    f"Error during execution of command: {masked_cmd}."
ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh rook-ceph-tools-6f59b98f4f-n6w96 ceph health detail.

The error is: Error from server: etcdserver: request timed out

==========================================
>           f"Resource: {self.resource_name} is not in expected phase: "
            f"{phase}"
        )
E       ocs_ci.ocs.exceptions.ResourceInUnexpectedState: Resource: ocs-operator.v4.4.0-411.ci is not in expected phase: Succeeded
==========================================

Expected results:
Upgrade should work.

Additional info:

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cv33-ua/jnk-vu1cv33-ua_20200414T140529/logs/failed_testcase_ocs_logs_1586876888/test_upgrade_ocs_logs/

Complete logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cv33-ua/jnk-vu1cv33-ua_20200414T140529/logs/

Jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6663/
Another run for upgrade:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6670/console

In this case there was no etcd server request timeout error; instead we hit 'ResourceInUnexpectedState', which was also the case in the earlier run. I am not sure if the root cause is the same across both runs.

========================================================
        if not sampler.wait_for_func_status(True):
            raise ResourceInUnexpectedState(
>               f"Resource: {self.resource_name} is not in expected phase: "
                f"{phase}"
            )
E           ocs_ci.ocs.exceptions.ResourceInUnexpectedState: Resource: ocs-operator.v4.4.0-411.ci is not in expected phase: Succeeded

ocs_ci/ocs/ocp.py:733: ResourceInUnexpectedState
========================================================
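For context, the ocs-ci check that raises this exception repeatedly samples the CSV phase until a timeout. A minimal shell sketch of that polling logic, assuming the loop shape and names are illustrative (not the actual ocs-ci code); on a live cluster each sample would come from `oc get csv -n openshift-storage <csv-name> -o jsonpath='{.status.phase}'`, while here the sampled phases are simulated so the loop runs standalone:

```shell
# Sketch of the phase check behind ResourceInUnexpectedState (simplified).
# The sampled phases are simulated; a real run would query the cluster.
expected="Succeeded"
sampled_phases="Pending Installing Installing"   # simulated: never reaches Succeeded
result="ResourceInUnexpectedState"
for phase in $sampled_phases; do
    if [ "$phase" = "$expected" ]; then
        # The CSV reached the expected phase within the sampling window.
        result="ok"
        break
    fi
done
echo "$result"
```

With the simulated samples above the loop exhausts without seeing `Succeeded`, mirroring the failure in both runs.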
(In reply to shylesh from comment #3)
> Another run for upgrade
> https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6670/console .
> In this case there was no etcd server request timeout error, instead
> 'ResourceInUnexpectedState' which was also case in the earlier run. I am not
> sure if root cause is same across both the runs.
>
> ========================================================
>         if not sampler.wait_for_func_status(True):
>             raise ResourceInUnexpectedState(
> >               f"Resource: {self.resource_name} is not in expected phase: "
>                 f"{phase}"
>             )
> E           ocs_ci.ocs.exceptions.ResourceInUnexpectedState: Resource:
> ocs-operator.v4.4.0-411.ci is not in expected phase: Succeeded
>
> ocs_ci/ocs/ocp.py:733: ResourceInUnexpectedState
> =======================================================

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cv33-ua/jnk-vu1cv33-ua_20200414T201453/logs/failed_testcase_ocs_logs_1586899472/test_upgrade_ocs_logs/
I am building the cluster once more to reproduce and will pause before teardown:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6680/console

So we will have a cluster available for investigation.
We had an issue with one of the repositories (its certificate expired), so the previous job I linked failed. Here is the new one:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6682/console

Is the described approach just a workaround that will later be fixed in the operator, or will it become part of the documentation so that the user has to do it manually? I will try the mentioned steps and see if the upgrade succeeds.
After step 2 of the workaround:

$ oc delete subscriptions -n openshift-storage lib-bucket-provisioner-alpha-community-operators-openshift-marketplace

nothing happened. I also had to remove the CSV, as it was still there:

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION   REPLACES   PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                Succeeded
ocs-operator.v4.3.0             OpenShift Container Storage   4.3.0                Succeeded

$ oc delete csv -n openshift-storage lib-bucket-provisioner.v1.0.0

After this, the upgrade started rolling:

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.3.0          OpenShift Container Storage   4.3.0                                Succeeded
ocs-operator.v4.4.0-411.ci   OpenShift Container Storage   4.4.0-411.ci   ocs-operator.v4.3.0   Pending

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
ocs-operator.v4.3.0          OpenShift Container Storage   4.3.0                                Replacing
ocs-operator.v4.4.0-411.ci   OpenShift Container Storage   4.4.0-411.ci   ocs-operator.v4.3.0   Installing

Umanga, do you know if the lib-bucket-provisioner.v1.0.0 subscription will stay the same for 4.3? Are we depending on this specific version, or can it change? If this is something we need to do as a workaround, we will need to delete this subscription; from the subscription's status we can get the currently installed CSV name:

installedCSV: lib-bucket-provisioner.v1.0.0

and then we need to delete that CSV as well. Is this something we will describe to customers in the documentation, or is there any other idea how to solve this without user intervention? We also need to make sure that nothing in the product is broken by this from the NooBaa point of view.
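To avoid hard-coding the lib-bucket-provisioner CSV version, the installed CSV name can be read from the subscription status before deleting it. A minimal sketch of that idea; the `awk` extraction is illustrative and runs here on a sample status line instead of live `oc` output, and the subscription name matches the one used above:

```shell
# Workaround sketch: delete the lib-bucket-provisioner subscription and its
# installed CSV, reading the CSV name from the subscription status so the
# version (v1.0.0 today) is not hard-coded.
#
# On a live cluster the status line would come from:
#   oc get subscription -n openshift-storage \
#     lib-bucket-provisioner-alpha-community-operators-openshift-marketplace \
#     -o yaml
# Here we use a sample line so the extraction can be shown standalone.
sub_status="  installedCSV: lib-bucket-provisioner.v1.0.0"
csv_name=$(echo "$sub_status" | awk '/installedCSV:/ {print $2}')
echo "$csv_name"
# Then, against the cluster:
#   oc delete subscription -n openshift-storage \
#     lib-bucket-provisioner-alpha-community-operators-openshift-marketplace
#   oc delete csv -n openshift-storage "$csv_name"
```

Reading the name from `installedCSV` keeps the workaround valid even if the lib-bucket-provisioner version changes in the catalog.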
4.2 and 4.3 are the same. We should find a way to remove this dependency without manual intervention, which is not acceptable. Please allow us to investigate and also consult the OLM team; the problem might also be in that mechanism.
As Nimrod wanted to reproduce on OCP 4.2, here are the data:

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200421T193753/logs/failed_testcase_ocs_logs_1587501073/test_upgrade_ocs_logs/

Jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/6869/

OCP version: v4.2.29

As I don't see the OCP version details for the job on which we reported this BZ, I am adding them here: OCP 4.3.0-0.nightly-2020-04-13-190424.
To summarize the various discussions: we intend to do a code fix for 4.4, therefore giving devel_ack. The component and assignee might still change.
Updating and moving to ON_QE. The fix will be in the community operators, but testing is still going on.

We chose to go with "option #2": publishing a new version of lib-bucket-provisioner (v2) to the community operators (same channel as v1). This version will not require and will not own the two CRDs (OB/OBC), thus untangling the collision we had when we switched from "require" in OCS 4.3 to "own" in OCS 4.4.

Since this change is in the community operators and pushing a new version would affect existing deployments, we are using a private catalog to simulate the same flow. The current status is:

1. Deploying a new OCS 4.3 on top of OCP 4.3 when both v1 and v2 exist in the catalog works - v1 is installed and then upgraded to v2. Upgrade to OCS 4.4 works smoothly.
2. Deploying OCS 4.3 on top of OCP 4.3 with v1 in the catalog (existing customer) and then pushing v2 to the catalog - lib-bucket is upgraded and then the upgrade to OCS 4.4 works.
3. Deploying OCS 4.2 on OCP 4.3 with v1 in the catalog, pushing v2 to the catalog and then upgrading OCS to 4.3 works as well.

Testing details will be added in the following comment.
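For reference, the "require" vs "own" distinction lives in the CSV's customresourcedefinitions section. A hypothetical, abbreviated sketch of the two shapes (field names follow the OLM CSV schema; the exact contents of the OCS and lib-bucket-provisioner CSVs may differ):

```yaml
# OCS 4.3 style: the OB/OBC CRDs are listed as required, i.e. provided by
# another operator (lib-bucket-provisioner v1, which owned them).
spec:
  customresourcedefinitions:
    required:
      - name: objectbucketclaims.objectbucket.io
        kind: ObjectBucketClaim
        version: v1alpha1
      - name: objectbuckets.objectbucket.io
        kind: ObjectBucket
        version: v1alpha1

# OCS 4.4 style: the same CRDs moved under owned, which collided with
# lib-bucket-provisioner v1 also owning them:
#   owned:
#     - name: objectbucketclaims.objectbucket.io
#       ...
```

Publishing lib-bucket-provisioner v2 with neither entry for OB/OBC leaves OCS 4.4 as the sole owner, which is the untangling described above.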
It seems there is some new issue. From the OLM logs I see:

$ oc logs -n openshift-operator-lifecycle-manager catalog-operator-5b59684b6d-nx5zp

time="2020-05-12T12:49:43Z" level=info msg="error updating subscription status" channel=alpha error="Operation cannot be fulfilled on subscriptions.operators.coreos.com \"lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace\": the object has been modified; please apply your changes to the latest version and try again" id=lAVx6 namespace=openshift-storage pkg=lib-bucket-provisioner source=lib-bucket-catalog sub=lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace
E0512 12:49:43.957056       1 queueinformer_operator.go:290] sync "openshift-storage" failed: error updating Subscription status: Operation cannot be fulfilled on subscriptions.operators.coreos.com "lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace": the object has been modified; please apply your changes to the latest version and try again
time="2020-05-12T12:49:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=/apis/operators.coreos.com/v1alpha1/namespaces/openshift-storage/subscriptions/lib-bucket-provisioner-alpha-lib-bucket-catalog-openshift-marketplace
E0512 12:49:46.016558       1 queueinformer_operator.go:290] sync "openshift-storage" failed: error calculating generation changes due to new bundle: ceph.rook.io/v1/CephObjectStoreUser (cephobjectstoreusers) already provided by ocs-operator.v4.3.0
E0512 12:49:46.803863       1 queueinformer_operator.go:290] sync "openshift-storage" failed: error calculating generation changes due to new bundle: noobaa.io/v1alpha1/NooBaa (noobaas) already provided by ocs-operator.v4.3.0
time="2020-05-12T12:50:03Z" level=warning msg="no installplan found with matching generation, creating new one" id=40xu4 namespace=openshift-storage
time="2020-05-12T12:50:03Z" level=info msg=syncing id=lEdNi ip=install-fdv8h namespace=openshift-storage phase=

The CSVs look like:

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                          Failed
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Pending
ocs-operator.v4.3.0             OpenShift Container Storage   4.3.0                                          Replacing
ocs-operator.v4.4.0-420.ci      OpenShift Container Storage   4.4.0-420.ci   ocs-operator.v4.3.0             Failed

Tried from 4.3 live content to the latest internal 4.4 RC5 build: 4.4.0-420.ci.
OCP version: Server Version: 4.4.0-0.nightly-2020-05-08-224132

The Jenkins job is still running:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7451/console

Once the job fails and collects must gather, the logs should be available here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200512T082600/logs/ocs-ci-logs-1589282568/

So it seems we have some issue with this approach, at least on the latest OCP 4.4.
The must gather logs will actually be collected here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200512T082600/logs/failed_testcase_ocs_logs_1589282568/test_upgrade_ocs_logs/

I see the collection is now in progress.
The weird thing is that this approach worked fine last week on OCP 4.3, as you can see from the console output mentioned in this mail thread:
http://post-office.corp.redhat.com/archives/rhocs-eng/2020-May/msg00012.html
Trying to reproduce once more on OCP 4.3 here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/7469/console
It looks like the upgrade started rolling fine on OCP 4.3, but then we hit another NooBaa issue, so I will report a BZ for that tomorrow morning.

Based on a conversation with Vu, this can be caused by this bug fix:
https://github.com/operator-framework/operator-lifecycle-manager/pull/1484
which was merged to OCP 4.4 but not to 4.3; that is probably why the approach works on OCP 4.3 but not on OCP 4.4.
As decided by the stakeholders, we are falling back to reverting the CRD-owning change in ocs-operator. https://github.com/openshift/ocs-operator/pull/518 has been merged to the release branch.
Removing the needinfo as everything is clear now
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2393
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days