1867024 – [ocs-operator] operator v4.6.0-519.ci is in Installing state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	OCS 4.6.0
Assignee:	Travis Nielsen
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-08-07 07:43 UTC by Persona non grata
Modified:	2020-12-17 06:24 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-17 06:23:13 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift rook pull 103	None	closed	Sync from upstream release-1.4 to downstream release-4.6	2020-10-29 06:33:48 UTC
Github	rook rook pull 6046	None	closed	ceph: re-add missing RBAC to CSV	2020-10-29 06:33:48 UTC
Red Hat Product Errata	RHSA-2020:5605	None	None	None	2020-12-17 06:24:28 UTC

Description Persona non grata 2020-08-07 07:43:35 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
The ocs-operator CSV v4.6.0-519.ci is stuck in the Installing phase:
$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-519.ci   OpenShift Container Storage   4.6.0-519.ci              Installing
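
A check like the one above can be scripted. The following is a minimal sketch, assuming the default `oc get csv` table layout shown here with a populated PHASE column; the function name `unfinished_csvs` is hypothetical and not part of any tool.

```python
def unfinished_csvs(table: str) -> list[tuple[str, str]]:
    """Return (name, phase) for every CSV row whose phase is not 'Succeeded'."""
    rows = [
        line.split()
        for line in table.strip().splitlines()[1:]  # skip the header row
        if line.strip()
    ]
    # NAME is the first column and PHASE the last one. REPLACES may be empty,
    # so the columns in between are deliberately not addressed by position.
    return [(cols[0], cols[-1]) for cols in rows if cols[-1] != "Succeeded"]


# Sample input copied from the `oc get csv` output in this report.
sample = """\
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-519.ci   OpenShift Container Storage   4.6.0-519.ci              Installing
"""

print(unfinished_csvs(sample))
```

An empty result would mean every listed CSV has reached the Succeeded phase.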


logs: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10647/

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Cluster deployment fails because the ocs-operator CSV never leaves the Installing phase.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
1/1


Steps to Reproduce:
1. Install OCS 4.6 on OCP 4.6 using ocs-operator



Actual results:
ocs-operator failed to move to the 'Succeeded' phase

Expected results:
ocs-operator should reach the 'Succeeded' phase

Comment 3 Vijay Avuthu 2020-08-07 07:58:20 UTC

> Events for operator

$ oc describe csv ocs-operator.v4.6.0-519.ci


Events:
  Type     Reason               Age                   From                        Message
  ----     ------               ----                  ----                        -------
  Normal   RequirementsUnknown  50m (x3 over 50m)     operator-lifecycle-manager  requirements not yet checked
  Normal   RequirementsNotMet   50m (x2 over 50m)     operator-lifecycle-manager  one or more requirements couldn't be found
  Normal   InstallWaiting       50m                   operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   InstallSucceeded     49m (x2 over 49m)     operator-lifecycle-manager  install strategy completed with no errors
  Warning  ComponentUnhealthy   49m (x2 over 49m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   AllRequirementsMet   49m (x4 over 50m)     operator-lifecycle-manager  all requirements found, attempting install
  Normal   InstallSucceeded     49m (x4 over 50m)     operator-lifecycle-manager  waiting for install components to report healthy
  Normal   InstallWaiting       49m (x3 over 50m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   NeedsReinstall       44m (x3 over 49m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Warning  InstallCheckFailed   4m18s (x17 over 44m)  operator-lifecycle-manager  install timeout
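
The events name the deployments that the operator-lifecycle-manager is waiting on. As a rough sketch (the helper name `awaited_deployments` is hypothetical, and the pattern is only assumed from the message wording shown above), those names can be pulled out of the event text like this:

```python
import re

# Pattern taken from the wording of the event messages in this report.
WAITING = re.compile(r"waiting for deployment (\S+) to become ready")


def awaited_deployments(events: str) -> dict[str, int]:
    """Count the event lines that mention each deployment being waited on."""
    counts: dict[str, int] = {}
    for line in events.splitlines():
        match = WAITING.search(line)
        if match:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts


# Three event lines copied verbatim from the `oc describe csv` output above.
events = """\
  Normal   InstallWaiting       50m                   operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Warning  ComponentUnhealthy   49m (x2 over 49m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Warning  InstallCheckFailed   4m18s (x17 over 44m)  operator-lifecycle-manager  install timeout
"""

print(awaited_deployments(events))
```

The counts are per event line and ignore the "(xN over ...)" repetition markers, so they indicate which deployments were mentioned rather than how often the event fired.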


> pods

$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
noobaa-operator-8bbdb49b9-jfglj      1/1     Running   0          47m
ocs-operator-577696f445-s7tl6        0/1     Running   0          47m
rook-ceph-operator-8ff886855-htz6t   1/1     Running   0          47m
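
The pod that is Running but not ready can be picked out of this listing programmatically. This is a small sketch that assumes the default `oc get pods` column order (NAME, READY, ...); `not_ready_pods` is a hypothetical helper name.

```python
def not_ready_pods(table: str) -> list[str]:
    """Return names of pods whose READY column shows fewer ready containers than total."""
    names = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 2:
            continue
        ready, total = cols[1].split("/")
        if int(ready) < int(total):
            names.append(cols[0])
    return names


# Sample input copied from the `oc get pods` output above.
pods = """\
NAME                                 READY   STATUS    RESTARTS   AGE
noobaa-operator-8bbdb49b9-jfglj      1/1     Running   0          47m
ocs-operator-577696f445-s7tl6        0/1     Running   0          47m
rook-ceph-operator-8ff886855-htz6t   1/1     Running   0          47m
"""

print(not_ready_pods(pods))
```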

Comment 4 Mudit Agarwal 2020-08-07 08:57:10 UTC

Should this be a proposed blocker for 4.5? We have hit this in 4.5

Comment 5 Vijay Avuthu 2020-08-07 09:13:05 UTC

(In reply to Mudit Agarwal from comment #4)
> Should this be a proposed blocker for 4.5? We have hit this in 4.5

Deployment passed with OCP 4.5 + OCS 4.5 ( https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10648/consoleFull )

Deployment failed with OCP 4.5 + OCS 4.6 ( eng job: https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ocs-ci/545/consoleFull )

Deployment failed with OCP 4.6 + OCS 4.6 ( https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10647/ )

From the above, the issue appears to be specific to OCS 4.6.

Comment 6 Mudit Agarwal 2020-08-07 09:16:41 UTC

Thanks, moved it to 4.6

Comment 7 Jose A. Rivera 2020-08-10 13:37:11 UTC

The StorageCluster is reporting "CephCluster not reporting status". Looking at the Rook-Ceph logs, we seem to have a problem with ServiceAccount permissions: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sshreeka-aws02/sshreeka-aws02_20200807T054732/logs/failed_testcase_ocs_logs_1596779805/deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-052faa341918e6f7f6543c26c9dd820aee27b1139f51e772302e978652e2b2a3/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-8ff886855-htz6t/rook-ceph-operator/rook-ceph-operator/logs/current.log

Travis, can you provide more insight? I can't seem to find information about ServiceAccounts in the must-gather...

Comment 8 Travis Nielsen 2020-08-10 16:21:08 UTC

@Jose It seems none of the RBAC was applied for the operator to have access to the CRDs. Nothing seems to be working in the operator. 
The OCS Operator is picking up the latest rook v1.4.0 now, right? It smells like it could be related to the change of RBAC to remove the aggregate rules:
https://github.com/rook/rook/pull/5970/commits/5b4d2c8cbc8832d40db7802bf3043fe798166131

Did something change in the CSV generation since that change?

Comment 9 Sébastien Han 2020-08-10 16:43:54 UTC

OCS-op just merged the rebase on Rook-Ceph 1.4 today, so this issue might go away with today's build.
Let's retry with a newer build soon.

Thanks.

Comment 10 Travis Nielsen 2020-08-11 15:06:42 UTC

@Shreekar Do you have the full must gather for the failed cluster? Or a cluster that is still running with this issue? I'd like to look at the ClusterRoles and other RBAC that were generated in the 4.6 cluster.

Comment 11 Travis Nielsen 2020-08-11 17:16:32 UTC

Found the issue: the service account names in the CSV were not being properly generated since the aggregated rules were removed.
https://github.com/rook/rook/pull/6046
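
For anyone wanting to check a generated CSV for this class of problem, here is a minimal sketch. It is not the fix from the pull request above. It assumes the CSV has been loaded into a Python dict (for example from JSON) and that permissions live under `spec.install.spec.permissions` and `spec.install.spec.clusterPermissions`, each entry carrying a `serviceAccountName`. The helper names and the expected service account list are illustrative assumptions, not taken from the Rook repository.

```python
def csv_service_accounts(csv: dict) -> set[str]:
    """Collect every serviceAccountName referenced by the CSV's install permissions."""
    install = csv.get("spec", {}).get("install", {}).get("spec", {})
    accounts = set()
    for key in ("permissions", "clusterPermissions"):
        for entry in install.get(key) or []:
            name = entry.get("serviceAccountName")
            if name:
                accounts.add(name)
    return accounts


def missing_service_accounts(csv: dict, expected: list[str]) -> list[str]:
    """Return the expected service accounts that the CSV grants no RBAC to."""
    return sorted(set(expected) - csv_service_accounts(csv))


# Hypothetical, heavily trimmed CSV used only to exercise the helpers.
example_csv = {
    "spec": {
        "install": {
            "spec": {
                "permissions": [
                    {"serviceAccountName": "rook-ceph-system", "rules": []},
                ],
                "clusterPermissions": [],
            }
        }
    }
}

# Illustrative expectation list; the real set depends on the Rook release.
expected = ["rook-ceph-system", "rook-ceph-osd"]
print(missing_service_accounts(example_csv, expected))
```

A non-empty result would suggest that some service accounts received no rules in the CSV, which matches the symptom described in this comment.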

Comment 12 Travis Nielsen 2020-08-11 17:28:32 UTC

The fix has been merged to the downstream branch; it will be picked up in the next 4.6 build.
https://github.com/openshift/rook/pull/103

Comment 15 Elad 2020-09-24 13:48:33 UTC

OCS 4.6 deployment works well (v4.6.0-97.ci)

Comment 18 errata-xmlrpc 2020-12-17 06:23:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605
