Description of problem (please be as detailed as possible and provide log snippets):

When upgrading from a 4.6 internal build to a 4.7 build, the CSV ends up in a Failed state:

NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci                                Replacing
ocs-operator.v4.7.0-158.ci   OpenShift Container Storage   4.7.0-158.ci   ocs-operator.v4.6.0-156.ci   Failed

In the operator logs I see a lot of these errors:

2020-11-10T02:28:25.853965298Z {"level":"error","ts":"2020-11-10T02:28:25.853Z","logger":"controller_storagecluster","msg":"prometheus rules file not found","error":"'/ocs-prometheus-rules/prometheus-ocs-rules.yaml' not found","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.1/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).enablePrometheusRules\n\t/remote-source/app/pkg/controller/storagecluster/prometheus.go:29\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/remote-source/app/pkg/controller/storagecluster/reconcile.go:359\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:88"}

2020-11-10T02:28:25.854009133Z {"level":"error","ts":"2020-11-10T02:28:25.853Z","logger":"controller_storagecluster","msg":"unable to deploy Prometheus rules","error":"failed while creating PrometheusRule: expected pointer, but got nil","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.1/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).enablePrometheusRules\n\t/remote-source/app/pkg/controller/storagecluster/prometheus.go:33\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/remote-source/app/pkg/controller/storagecluster/reconcile.go:359\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:88"}

Version of all relevant components (if applicable):
OCS: 4.6.0-156.ci, upgrading to 4.7.0-158.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, it blocks the upgrade to the new Y-stream version.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Haven't tried yet.

Can this issue be reproduced from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:
Yes, this worked before.

Steps to Reproduce:
1. Install the OCS 4.6 internal build mentioned above
2. Upgrade to the 4.7 internal build
3. The upgrade does not complete and fails

Actual results:
Upgrade failed

Expected results:
Successful upgrade

Additional info:
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-uan/j006vu1cs33-uan_20201109T221607/logs/failed_testcase_ocs_logs_1604963919/test_upgrade_ocs_logs/
Job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14430/consoleFull
This was a known problem in ocs-operator master that should already have been resolved. Jenkins seems to be unavailable right now, so I can't determine which commit the ocs-operator build was taken from; hopefully that clears up soon.
Looks like an intermittent issue; moving it to 4.6.z in the meantime.
Can you try again?
Job triggered here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14878/console
The mentioned job failed with another issue recently introduced by an ocs-ci change. After a re-trigger we hit a noobaa-related bug, reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1900722

So I cannot confirm that the upgrade is working because of that other BZ, but we see a different error, not this one.
(In reply to Petr Balogh from comment #7)
> The mentioned job failed with another issue recently introduced by an ocs-ci change. After a re-trigger we hit a noobaa-related bug, reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1900722
>
> So I cannot confirm that the upgrade is working because of that other BZ, but we see a different error, not this one.

Any updates?
I don't see any update here: https://bugzilla.redhat.com/show_bug.cgi?id=1900722 so I guess it's still blocked. If I see some update in that BZ, I can give it another try.

Anyway, giving it a try now: https://ocs4-jenkins-csb-ocsqe.cloud.paas.psi.redhat.com/view/Nightly/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto-nightly/1/console

Let's see.
Yaniv, I see that we are still affected by: https://bugzilla.redhat.com/show_bug.cgi?id=1900722

noobaa-core-0   0/1   CrashLoopBackOff   10   29m
(In reply to Petr Balogh from comment #10)
> Yaniv, I see that we are still affected by:
> https://bugzilla.redhat.com/show_bug.cgi?id=1900722
>
> noobaa-core-0   0/1   CrashLoopBackOff   10   29m

Petr, Bug #1900722 is ON_QA now.
(In reply to Mudit Agarwal from comment #11)
> (In reply to Petr Balogh from comment #10)
> > Yaniv, I see that we are still affected by:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1900722
> >
> > noobaa-core-0   0/1   CrashLoopBackOff   10   29m
>
> Petr, Bug #1900722 is ON_QA now.

Please re-test.
Ran a verification job here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15583/

Upgrade from 4.6 RC 7 to 4.7.0-192.ci, which I see should include the fix for bug #1900722.
I commented in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1900722 - we are still blocked here.
Petr, can you please retry, as the blocker BZs (in the comments above) are already ON_QA?
Sorry, Mudit, for the late response; I was on PTO for the last 3 weeks.

Triggered a job here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-nightly/2/console
Running a new verification job here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16688/console
I can move the bug to VERIFIED based on the execution above, but the BZ is still in the NEW state; I think it should first go to ON_QA. @muagarwa?
Thanks Petr
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041