Bug 2010232

Summary: ODF installation is stuck with odf-operator.v4.9.0 CSV in installing phase
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Petr Balogh <pbalogh>
Component: odf-operator
Assignee: Nitin Goyal <nigoyal>
Status: CLOSED DUPLICATE
QA Contact: Petr Balogh <pbalogh>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.9
CC: jrivera, muagarwa, ocs-bugs, odf-bz-bot, ratamir
Target Milestone: ---
Keywords: Automation, AutomationBlocker, Regression, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-10-04 14:31:58 UTC
Type: Bug

Description Petr Balogh 2021-10-04 09:49:23 UTC
Description of problem (please be as detailed as possible and provide log snippets):
Based on email from Boris:
https://mailman-int.corp.redhat.com/archives/ocs-qe/2021-September/msg00342.html

I tried to make changes in ocs-ci:
https://github.com/red-hat-storage/ocs-ci/pull/4898

And ran a verification here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6342/consoleFull

The installation is stuck with the CSV in the Installing phase:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-odf/pbalogh-odf_20211001T144817/logs/failed_testcase_ocs_logs_1633100193/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/namespaces/openshift-storage/oc_output/csv

NAME                  DISPLAY                     VERSION   REPLACES   PHASE
odf-operator.v4.9.0   OpenShift Data Foundation   4.9.0                Installing


Pods:
NAME                                               READY   STATUS             RESTARTS       AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
ip-10-0-159-235us-east-2computeinternal-debug      1/1     Running            0              64s   10.0.159.235   ip-10-0-159-235.us-east-2.compute.internal   <none>           <none>
ip-10-0-161-155us-east-2computeinternal-debug      1/1     Running            0              64s   10.0.161.155   ip-10-0-161-155.us-east-2.compute.internal   <none>           <none>
ip-10-0-221-16us-east-2computeinternal-debug       1/1     Running            0              64s   10.0.221.16    ip-10-0-221-16.us-east-2.compute.internal    <none>           <none>
odf-console-58d5cf48f4-v96p4                       1/1     Running            0              59m   10.131.0.25    ip-10-0-221-16.us-east-2.compute.internal    <none>           <none>
odf-operator-controller-manager-6767dd85bb-ltfpl   1/2     CrashLoopBackOff   15 (74s ago)   59m   10.129.2.13    ip-10-0-161-155.us-east-2.compute.internal   <none>           <none>


You can see that odf-operator-controller-manager-6767dd85bb-ltfpl is in CrashLoopBackOff.

From its log I see:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-odf/pbalogh-odf_20211001T144817/logs/failed_testcase_ocs_logs_1633100193/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/namespaces/openshift-storage/pods/odf-operator-controller-manager-6767dd85bb-ltfpl/manager/manager/logs/current.log

2021-10-01T16:31:59.356957454Z I1001 16:31:59.356916       1 request.go:655] Throttling request took 1.044371111s, request: GET:https://172.30.0.1:443/apis/apps.openshift.io/v1?timeout=32s
2021-10-01T16:32:00.608858577Z 2021-10-01T16:32:00.608Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "StorageCluster.ocs.openshift.io", "error": "no matches for kind \"StorageCluster\" in version \"ocs.openshift.io/v1\""}
2021-10-01T16:32:00.608858577Z github.com/go-logr/zapr.(*zapLogger).Error
2021-10-01T16:32:00.608858577Z 	/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132
2021-10-01T16:32:00.608858577Z sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
2021-10-01T16:32:00.608858577Z 	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/log/deleg.go:144
2021-10-01T16:32:00.608858577Z sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start
2021-10-01T16:32:00.608858577Z 	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/source/source.go:117
2021-10-01T16:32:00.608858577Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
2021-10-01T16:32:00.608858577Z 	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:167
2021-10-01T16:32:00.608858577Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
2021-10-01T16:32:00.608858577Z 	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:223
2021-10-01T16:32:00.608858577Z sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1
2021-10-01T16:32:00.608858577Z 	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/manager/internal.go:681
2021-10-01T16:32:00.609243516Z 2021-10-01T16:32:00.609Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "Timeout: failed waiting for *v1alpha1.StorageSystem Informer to sync"}
2021-10-01T16:32:00.609243516Z github.com/go-logr/zapr.(*zapLogger).Error


So I tried to move the creation of the storage system in ocs-ci a little bit earlier, right after the subscription, but it failed here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6363/console

with:
08:24:54 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc apply -f /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/templates/ocs-deployment/storagesystem_odf.yaml
08:24:55 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: error: unable to recognize "/home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/templates/ocs-deployment/storagesystem_odf.yaml": no matches for kind "StorageSystem" in version "odf.openshift.io/v1alpha1"

08:24:55 - MainThread - ocs_ci.deployment.deployment - ERROR - Error during execution of command: oc apply -f /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/templates/ocs-deployment/storagesystem_odf.yaml.
Error is error: unable to recognize "/home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/templates/ocs-deployment/storagesystem_odf.yaml": no matches for kind "StorageSystem" in version "odf.openshift.io/v1alpha1"
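
The same "no matches for kind" error will hit any StorageSystem manifest as long as the odf.openshift.io/v1alpha1 CRD is not yet registered, so applying the resource earlier cannot help by itself. For reference, a minimal StorageSystem along the lines of what storagesystem_odf.yaml applies would look like this (the resource names below are assumptions based on a default ODF deployment, not copied from the ocs-ci template):

$ cat <<EOF | oc apply -f -
apiVersion: odf.openshift.io/v1alpha1
kind: StorageSystem
metadata:
  name: ocs-storagecluster-storagesystem
  namespace: openshift-storage
spec:
  kind: storagecluster.ocs.openshift.io/v1
  name: ocs-storagecluster
  namespace: openshift-storage
EOF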

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-odf/pbalogh-odf_20211004T074805/logs/failed_testcase_ocs_logs_1633334185/test_deployment_ocs_logs/

As the installation failed in the early stages, I connected via the UI, uninstalled the operator, and started the whole procedure from the UI.

After I started installing the operator, it displayed a message that I need to create a StorageSystem, but the button for that was grayed out, so I couldn't create the StorageSystem from the UI.

Then, after about 20-30 seconds, it asked me to refresh the page, but then it reported that I am already subscribed to the odf operator and cannot install it again.

So I went to Installed Operators, where there was a button to create the StorageSystem. I did, and it failed with:

An error has occurred:
Not Found .

When I clicked Create a second time, it reported that the StorageSystem already exists.

I will attach those screenshots right after.

After the attempt from the UI I see:
pbalogh@pbalogh-mac catalog-source $ oc get csv -n openshift-storage
NAME                  DISPLAY                     VERSION   REPLACES   PHASE
odf-operator.v4.9.0   OpenShift Data Foundation   4.9.0                Installing
pbalogh@pbalogh-mac catalog-source $ oc get pod -n openshift-storage
NAME                                               READY   STATUS             RESTARTS        AGE
odf-console-58d5cf48f4-5pz64                       1/1     Running            0               17m
odf-operator-controller-manager-6767dd85bb-8cfxg   1/2     CrashLoopBackOff   7 (4m11s ago)   17m

$ oc logs -n openshift-storage odf-operator-controller-manager-6767dd85bb-8cfxg -c manager
I1004 09:41:49.099822       1 request.go:655] Throttling request took 1.047414435s, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2021-10-04T09:41:50.305Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2021-10-04T09:41:50.305Z	INFO	setup	starting console
2021-10-04T09:41:50.305Z	INFO	setup	starting manager
I1004 09:41:50.306218       1 leaderelection.go:243] attempting to acquire leader lease openshift-storage/4fd470de.openshift.io...
2021-10-04T09:41:50.306Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I1004 09:42:07.767788       1 leaderelection.go:253] successfully acquired lease openshift-storage/4fd470de.openshift.io
2021-10-04T09:42:07.768Z	INFO	controller-runtime.manager.controller.storagecluster	Starting EventSource	{"reconciler group": "ocs.openshift.io", "reconciler kind": "StorageCluster", "source": "kind source: /, Kind="}
2021-10-04T09:42:07.768Z	INFO	controller-runtime.manager.controller.storagesystem	Starting EventSource	{"reconciler group": "odf.openshift.io", "reconciler kind": "StorageSystem", "source": "kind source: /, Kind="}
2021-10-04T09:42:07.767Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"openshift-storage","name":"4fd470de.openshift.io","uid":"206762ea-758e-4ebc-8cdb-b2d4c36df130","apiVersion":"v1","resourceVersion":"49395"}, "reason": "LeaderElection", "message": "odf-operator-controller-manager-6767dd85bb-8cfxg_4b010bd0-74ba-4e66-b2de-48c2fd90f54f became leader"}
2021-10-04T09:42:07.768Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Lease","namespace":"openshift-storage","name":"4fd470de.openshift.io","uid":"08ba0d51-338c-4bc7-a1b0-3aa74d4ff834","apiVersion":"coordination.k8s.io/v1","resourceVersion":"49396"}, "reason": "LeaderElection", "message": "odf-operator-controller-manager-6767dd85bb-8cfxg_4b010bd0-74ba-4e66-b2de-48c2fd90f54f became leader"}
2021-10-04T09:42:07.939Z	INFO	controller-runtime.manager.controller.storagesystem	Starting EventSource	{"reconciler group": "odf.openshift.io", "reconciler kind": "StorageSystem", "source": "kind source: /, Kind="}
I1004 09:42:08.818951       1 request.go:655] Throttling request took 1.044025849s, request: GET:https://172.30.0.1:443/apis/node.k8s.io/v1beta1?timeout=32s
2021-10-04T09:42:10.069Z	ERROR	controller-runtime.source	if kind is a CRD, it should be installed before calling Start	{"kind": "StorageCluster.ocs.openshift.io", "error": "no matches for kind \"StorageCluster\" in version \"ocs.openshift.io/v1\""}
github.com/go-logr/zapr.(*zapLogger).Error
	/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/log/deleg.go:144
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/source/source.go:117
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:167
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:223
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/manager/internal.go:681
2021-10-04T09:42:10.069Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "Timeout: failed waiting for *v1alpha1.Subscription Informer to sync"}
github.com/go-logr/zapr.(*zapLogger).Error
	/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/manager/internal.go:530
2021-10-04T09:42:10.114Z	ERROR	setup	problem running manager	{"error": "no matches for kind \"StorageCluster\" in version \"ocs.openshift.io/v1\""}
github.com/go-logr/zapr.(*zapLogger).Error
	/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/log/deleg.go:144
main.main
	/remote-source/app/main.go:150
runtime.main
	/usr/lib/golang/src/runtime/proc.go:225

From the UI, since the StorageSystem already exists, I cannot create the StorageCluster separately, or I don't know how.
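
To see what actually exists at this point, one could check from the CLI (suggested commands, not from the original run; the StorageCluster query is expected to fail with the same "no matches for kind" error while its CRD is missing):

$ oc get storagesystem -n openshift-storage
$ oc get storagecluster -n openshift-storage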

Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-registry:4.9.0-166.ci
OCP: 4.9.0-0.nightly-2021-10-01-202059


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1
Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:
yes

Steps to Reproduce:
1. Install 4.9.0-166.ci build
2.
3.


Actual results:
There are several problems installing the StorageSystem, and the StorageCluster is also not installed.

Expected results:


Additional info:

Comment 5 Mudit Agarwal 2021-10-04 14:31:58 UTC

*** This bug has been marked as a duplicate of bug 2009531 ***