Bug 1867024

Summary: [ocs-operator] operator v4.6.0-519.ci is in Installing state
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Persona non grata <nobody+410372>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA
QA Contact: Elad <ebenahar>
Severity: urgent
Priority: unspecified
Version: 4.6
CC: ebenahar, madam, muagarwa, ocs-bugs, owasserm, ratamir, shan, sostapov, tnielsen, vavuthu
Keywords: Automation, AutomationBlocker, Regression
Target Milestone: ---
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Regression: ---
Last Closed: 2020-12-17 06:23:13 UTC

Description Persona non grata 2020-08-07 07:43:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
operator v4.6.0-519.ci is in Installing state
$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-519.ci   OpenShift Container Storage   4.6.0-519.ci              Installing
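
For reference, the CSV phase can be watched directly while the install is in flight; a minimal sketch, assuming the default openshift-storage namespace:

$ oc get csv -n openshift-storage -w
# or pull out just the phase field:
$ oc get csv ocs-operator.v4.6.0-519.ci -n openshift-storage -o jsonpath='{.status.phase}'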


logs: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10647/

Version of all relevant components (if applicable):
OCS 4.6.0-519.ci on OCP 4.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Cluster deployment failed due to ocs-operator


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
1/1


Steps to Reproduce:
1. Install OCP 4.6 and OCS 4.6 via the ocs-operator



Actual results:
ocs-operator failed to move to the 'Succeeded' phase

Expected results:
The ocs-operator CSV should reach the 'Succeeded' phase

Comment 3 Vijay Avuthu 2020-08-07 07:58:20 UTC
> Events for operator

$ oc describe csv ocs-operator.v4.6.0-519.ci


Events:
  Type     Reason               Age                   From                        Message
  ----     ------               ----                  ----                        -------
  Normal   RequirementsUnknown  50m (x3 over 50m)     operator-lifecycle-manager  requirements not yet checked
  Normal   RequirementsNotMet   50m (x2 over 50m)     operator-lifecycle-manager  one or more requirements couldn't be found
  Normal   InstallWaiting       50m                   operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   InstallSucceeded     49m (x2 over 49m)     operator-lifecycle-manager  install strategy completed with no errors
  Warning  ComponentUnhealthy   49m (x2 over 49m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   AllRequirementsMet   49m (x4 over 50m)     operator-lifecycle-manager  all requirements found, attempting install
  Normal   InstallSucceeded     49m (x4 over 50m)     operator-lifecycle-manager  waiting for install components to report healthy
  Normal   InstallWaiting       49m (x3 over 50m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   NeedsReinstall       44m (x3 over 49m)     operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Warning  InstallCheckFailed   4m18s (x17 over 44m)  operator-lifecycle-manager  install timeout
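
The events show OLM repeatedly timing out while waiting for the ocs-operator deployment rollout. A sketch of how the stuck rollout can be followed directly (assuming the openshift-storage namespace):

$ oc rollout status deployment/ocs-operator -n openshift-storage
# compare desired vs. available replicas across the operator deployments:
$ oc get deployments -n openshift-storage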


> pods

$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
noobaa-operator-8bbdb49b9-jfglj      1/1     Running   0          47m
ocs-operator-577696f445-s7tl6        0/1     Running   0          47m
rook-ceph-operator-8ff886855-htz6t   1/1     Running   0          47m
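
The ocs-operator pod is Running but never becomes Ready (0/1), which points at a failing readiness check rather than a crash loop. A minimal sketch for digging into that, using the pod name from the listing above:

$ oc describe pod ocs-operator-577696f445-s7tl6 -n openshift-storage
# check the Conditions and Events sections for probe failures, then:
$ oc logs ocs-operator-577696f445-s7tl6 -n openshift-storage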

Comment 4 Mudit Agarwal 2020-08-07 08:57:10 UTC
Should this be a proposed blocker for 4.5? We have hit this in 4.5 as well.

Comment 5 Vijay Avuthu 2020-08-07 09:13:05 UTC
(In reply to Mudit Agarwal from comment #4)
> Should this be a proposed blocker for 4.5? We have hit this in 4.5

Deployment passed with OCP 4.5 + OCS 4.5 ( https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10648/consoleFull )

Deployment failed with OCP 4.5 + OCS 4.6 ( eng job: https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ocs-ci/545/consoleFull )

Deployment failed with OCP 4.6 + OCS 4.6 ( https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10647/ )

From the above, we can see the issue is with OCS 4.6, not with OCP.

Comment 6 Mudit Agarwal 2020-08-07 09:16:41 UTC
Thanks, moved it to 4.6

Comment 7 Jose A. Rivera 2020-08-10 13:37:11 UTC
The StorageCluster is reporting "CephCluster not reporting status". Looking at the Rook-Ceph logs, we seem to have a problem with ServiceAccount permissions: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sshreeka-aws02/sshreeka-aws02_20200807T054732/logs/failed_testcase_ocs_logs_1596779805/deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-052faa341918e6f7f6543c26c9dd820aee27b1139f51e772302e978652e2b2a3/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-8ff886855-htz6t/rook-ceph-operator/rook-ceph-operator/logs/current.log

Travis, can you provide more insight? I can't seem to find information about ServiceAccounts in the must-gather...
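
On a live cluster, a ServiceAccount permissions problem like this can be confirmed by impersonating the operator's service account with oc auth can-i; a sketch, assuming Rook's stock rook-ceph-system account name:

$ oc get sa -n openshift-storage
# probe a CRD the operator needs while impersonating its service account:
$ oc auth can-i get cephclusters -n openshift-storage --as=system:serviceaccount:openshift-storage:rook-ceph-system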

Comment 8 Travis Nielsen 2020-08-10 16:21:08 UTC
@Jose It seems none of the RBAC was applied for the operator to have access to the CRDs. Nothing seems to be working in the operator.
The OCS Operator is picking up the latest Rook v1.4.0 now, right? It smells like it could be related to the RBAC change that removed the aggregated rules:
https://github.com/rook/rook/pull/5970/commits/5b4d2c8cbc8832d40db7802bf3043fe798166131

Did something change in the CSV generation since that commit?
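
For context, an aggregated ClusterRole carries no inline rules of its own; the controller fills them in from other roles matching its label selectors, so dropping aggregation means every rule has to be listed inline or it simply disappears. One illustrative way to spot which ClusterRoles still rely on aggregation (not taken from the must-gather):

$ oc get clusterroles -o custom-columns=NAME:.metadata.name,AGGREGATED:.aggregationRule.clusterRoleSelectors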

Comment 9 Sébastien Han 2020-08-10 16:43:54 UTC
OCS-op just merged the rebase onto Rook-Ceph 1.4 today, so this issue might go away with today's build.
Let's retest with a newer build soon.

Thanks.

Comment 10 Travis Nielsen 2020-08-11 15:06:42 UTC
@Shreekar Do you have the full must gather for the failed cluster? Or a cluster that is still running with this issue? I'd like to look at the ClusterRoles and other RBAC that were generated in the 4.6 cluster.

Comment 11 Travis Nielsen 2020-08-11 17:16:32 UTC
Found the issue: the service account names in the CSV were not being properly generated after the aggregated rules were removed.
https://github.com/rook/rook/pull/6046
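
A quick way to check whether a generated CSV carries the expected service account names is to read them out of the install strategy (these are standard CSV fields; the jsonpath is illustrative):

$ oc get csv ocs-operator.v4.6.0-519.ci -n openshift-storage -o jsonpath='{.spec.install.spec.clusterPermissions[*].serviceAccountName}'
# and the namespaced permissions:
$ oc get csv ocs-operator.v4.6.0-519.ci -n openshift-storage -o jsonpath='{.spec.install.spec.permissions[*].serviceAccountName}'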

Comment 12 Travis Nielsen 2020-08-11 17:28:32 UTC
The fix has been merged to the downstream branch; it will be picked up in the next 4.6 build.
https://github.com/openshift/rook/pull/103

Comment 15 Elad 2020-09-24 13:48:33 UTC
OCS 4.6 deployment works well (v4.6.0-97.ci)

Comment 18 errata-xmlrpc 2020-12-17 06:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605