Bug 1896338 - OCS upgrade from 4.6 to 4.7 build failed
Summary: OCS upgrade from 4.6 to 4.7 build failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Jose A. Rivera
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-10 10:09 UTC by Petr Balogh
Modified: 2021-05-19 09:16 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:16:13 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:16:51 UTC)

Description Petr Balogh 2020-11-10 10:09:22 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When upgrading from a 4.6 internal build to a 4.7 build, we see the CSV in a Failed state:
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci                                Replacing
ocs-operator.v4.7.0-158.ci   OpenShift Container Storage   4.7.0-158.ci   ocs-operator.v4.6.0-156.ci   Failed
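
(For reference, CSV phases like the above can be inspected with the commands below; the openshift-storage namespace is an assumption based on a default OCS install.)

$ oc get csv -n openshift-storage
$ oc describe csv ocs-operator.v4.7.0-158.ci -n openshift-storage   # status conditions show why the install failed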

In the operator logs I see a lot of these errors:
2020-11-10T02:28:25.853965298Z {"level":"error","ts":"2020-11-10T02:28:25.853Z","logger":"controller_storagecluster","msg":"prometheus rules file not found","error":"'/ocs-prometheus-rules/prometheus-ocs-rules.yaml' not found","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.1/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).enablePrometheusRules\n\t/remote-source/app/pkg/controller/storagecluster/prometheus.go:29\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/remote-source/app/pkg/controller/storagecluster/reconcile.go:359\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:88"}
2020-11-10T02:28:25.854009133Z {"level":"error","ts":"2020-11-10T02:28:25.853Z","logger":"controller_storagecluster","msg":"unable to deploy Prometheus rules","error":"failed while creating PrometheusRule: expected pointer, but got nil","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.1/zapr.go:128\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).enablePrometheusRules\n\t/remote-source/app/pkg/controller/storagecluster/prometheus.go:33\ngithub.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile\n\t/remote-source/app/pkg/controller/storagecluster/reconcile.go:359\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.4/pkg/util/wait/wait.go:88"}
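
The second error ("expected pointer, but got nil") suggests that when the rules file is missing, the operator ends up handing a nil object to the create call. Logs and the mounted rules file can be checked along these lines (the deployment name is an assumption based on a default install; the path comes from the error above):

$ oc logs -n openshift-storage deployment/ocs-operator
$ oc exec -n openshift-storage deployment/ocs-operator -- ls /ocs-prometheus-rules/   # is prometheus-ocs-rules.yaml mounted?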




Version of all relevant components (if applicable):
OCS: 4.6.0-156.ci, upgrading to 4.7.0-158.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it blocks the upgrade to the new Y-stream version.


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Haven't tried yet


Can this issue be reproduced from the UI?
Haven't tried.


If this is a regression, please provide more details to justify this:
Yes, this worked before.


Steps to Reproduce:
1. Install the OCS 4.6 internal build mentioned above
2. Upgrade to the 4.7 internal build (see the sketch after these steps)
3. The upgrade does not complete and fails
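
A minimal sketch of how the Y-stream upgrade is typically triggered, assuming a default OCS install (the subscription name, namespace, and channel are assumptions):

$ oc patch subscription ocs-operator -n openshift-storage \
    --type merge -p '{"spec":{"channel":"stable-4.7"}}'
$ oc get csv -n openshift-storage -w   # watch the new CSV move toward Succeeded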


Actual results:
Upgrade failed


Expected results:
The upgrade completes successfully


Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-uan/j006vu1cs33-uan_20201109T221607/logs/failed_testcase_ocs_logs_1604963919/test_upgrade_ocs_logs/

Job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14430/consoleFull
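
(Logs of this kind are typically collected with the OCS must-gather image, roughly as below; the exact image and tag are assumptions and should match the cluster version.)

$ oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6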

Comment 3 Jose A. Rivera 2020-11-10 17:29:50 UTC
This was a known problem in ocs-operator master that should have already been resolved. It seems that Jenkins is unavailable right now, so I can't determine which commit the ocs-operator build was taken from; hopefully that clears up soon.

Comment 4 Mudit Agarwal 2020-11-11 04:47:15 UTC
Looks like an intermittent issue; meanwhile, moving it to 4.6z.

Comment 5 Yaniv Kaul 2020-11-11 07:32:18 UTC
Can you try again?

Comment 7 Petr Balogh 2020-11-23 16:42:49 UTC
The mentioned job recently failed with a different issue introduced by an ocs-ci change. But after a re-trigger we hit another NooBaa-related bug, reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1900722

So I cannot confirm whether the upgrade is working, because of the other BZ; we see a different error, not this one.

Comment 8 Yaniv Kaul 2020-12-04 17:39:58 UTC
(In reply to Petr Balogh from comment #7)
> The mentioned job recently failed with a different issue introduced by an
> ocs-ci change. But after a re-trigger we hit another NooBaa-related bug,
> reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1900722
> 
> So I cannot confirm whether the upgrade is working, because of the other
> BZ; we see a different error, not this one.

Any updates?

Comment 9 Petr Balogh 2020-12-04 19:10:42 UTC
I don't see any update here:
https://bugzilla.redhat.com/show_bug.cgi?id=1900722

So I guess it's still blocked. If I see an update in that BZ, I can give it another try.

Anyway, giving it a try now:
https://ocs4-jenkins-csb-ocsqe.cloud.paas.psi.redhat.com/view/Nightly/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto-nightly/1/console

Let's see.

Comment 10 Petr Balogh 2020-12-04 22:19:33 UTC
Yaniv, I see that we are still affected by: https://bugzilla.redhat.com/show_bug.cgi?id=1900722

noobaa-core-0                                                     0/1     CrashLoopBackOff   10         29m
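
(To dig into the crash loop, something like the following should show why the pod is restarting; the namespace is an assumption based on a default install.)

$ oc describe pod noobaa-core-0 -n openshift-storage
$ oc logs noobaa-core-0 -n openshift-storage --previous   # logs from the last crashed container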

Comment 11 Mudit Agarwal 2020-12-10 04:28:55 UTC
(In reply to Petr Balogh from comment #10)
> Yaniv, I see that we are still affected by:
> https://bugzilla.redhat.com/show_bug.cgi?id=1900722
> 
> noobaa-core-0                                                     0/1    
> CrashLoopBackOff   10         29m

Petr, Bug #1900722 is ON_QA now

Comment 12 Yaniv Kaul 2020-12-10 11:35:57 UTC
(In reply to Mudit Agarwal from comment #11)
> (In reply to Petr Balogh from comment #10)
> > Yaniv, I see that we are still affected by:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1900722
> > 
> > noobaa-core-0                                                     0/1    
> > CrashLoopBackOff   10         29m
> 
> Petr, Bug #1900722 is ON_QA now

Please re-test.

Comment 13 Petr Balogh 2020-12-14 14:23:31 UTC
Ran verification job here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15583/

Upgrading from 4.6 RC 7 to 4.7.0-192.ci, which I see should have the fix for bug #1900722.

Comment 14 Petr Balogh 2020-12-15 09:10:15 UTC
I commented in this BZ:  https://bugzilla.redhat.com/show_bug.cgi?id=1900722 - we are still blocked here.

Comment 16 Mudit Agarwal 2021-01-05 14:53:24 UTC
Petr, can you please retry? The blocker BZs (in the comments above) are already ON_QA.

Comment 17 Petr Balogh 2021-01-11 10:41:19 UTC
Sorry, Mudit, for the late response; I was on PTO for the last 3 weeks.
Triggered a job here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-nightly/2/console

Comment 18 Petr Balogh 2021-01-25 15:53:54 UTC
Running a new verification job here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16688/console

Comment 19 Petr Balogh 2021-01-27 12:42:40 UTC
I can move the bug to VERIFIED based on the execution above. But the BZ is in the NEW state; I think it should first go to ON_QA, @muagarwa?

Comment 20 Mudit Agarwal 2021-01-28 05:45:16 UTC
Thanks Petr

Comment 25 errata-xmlrpc 2021-05-19 09:16:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

