Description of problem (please be as detailed as possible and provide log snippets):
The ocs-operator CSV v4.6.0-519.ci is stuck in the Installing phase.

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-519.ci   OpenShift Container Storage   4.6.0-519.ci              Installing

Logs: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10647/

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Cluster deployment failed due to ocs-operator.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
1/1

Steps to Reproduce:
1. Install OCS 4.6 on OCP 4.6 with ocs-operator

Actual results:
ocs-operator failed to move to the 'Succeeded' phase.

Expected results:
ocs-operator should be installed, with the CSV reaching the 'Succeeded' phase.
> Events for operator

$ oc describe csv ocs-operator.v4.6.0-519.ci
Events:
  Type     Reason               Age                    From                        Message
  ----     ------               ----                   ----                        -------
  Normal   RequirementsUnknown  50m (x3 over 50m)      operator-lifecycle-manager  requirements not yet checked
  Normal   RequirementsNotMet   50m (x2 over 50m)      operator-lifecycle-manager  one or more requirements couldn't be found
  Normal   InstallWaiting       50m                    operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   InstallSucceeded     49m (x2 over 49m)      operator-lifecycle-manager  install strategy completed with no errors
  Warning  ComponentUnhealthy   49m (x2 over 49m)      operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   AllRequirementsMet   49m (x4 over 50m)      operator-lifecycle-manager  all requirements found, attempting install
  Normal   InstallSucceeded     49m (x4 over 50m)      operator-lifecycle-manager  waiting for install components to report healthy
  Normal   InstallWaiting       49m (x3 over 50m)      operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Normal   NeedsReinstall       44m (x3 over 49m)      operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
  Warning  InstallCheckFailed   4m18s (x17 over 44m)   operator-lifecycle-manager  install timeout

> pods

$ oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
noobaa-operator-8bbdb49b9-jfglj      1/1     Running   0          47m
ocs-operator-577696f445-s7tl6        0/1     Running   0          47m
rook-ceph-operator-8ff886855-htz6t   1/1     Running   0          47m
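In the pod listing above, the stuck pod is the one whose READY column shows fewer ready containers than total (ocs-operator at 0/1), even though its STATUS is Running. A minimal sketch of that check, assuming the default column layout of `oc get pods`; the function name `not_ready_pods` is made up for illustration and is not part of any tool:

```python
from typing import List


def not_ready_pods(oc_get_pods_output: str) -> List[str]:
    """Return names of pods whose READY column is below its total (e.g. 0/1)."""
    lines = oc_get_pods_output.strip().splitlines()
    not_ready = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed rows
        name, ready = fields[0], fields[1]
        ready_count, total = (int(part) for part in ready.split("/"))
        if ready_count < total:
            not_ready.append(name)
    return not_ready


# Pod listing copied from this report.
SAMPLE = """\
NAME                                 READY   STATUS    RESTARTS   AGE
noobaa-operator-8bbdb49b9-jfglj      1/1     Running   0          47m
ocs-operator-577696f445-s7tl6        0/1     Running   0          47m
rook-ceph-operator-8ff886855-htz6t   1/1     Running   0          47m
"""

print(not_ready_pods(SAMPLE))  # ['ocs-operator-577696f445-s7tl6']
```

A pod that is Running but not Ready matches the "0 of 1 updated replicas are available" message in the CSV events: the deployment exists, but its readiness check never passes.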
Should this be a proposed blocker for 4.5? We have hit this in 4.5
(In reply to Mudit Agarwal from comment #4)
> Should this be a proposed blocker for 4.5? We have hit this in 4.5

Deployment passed with OCP 4.5 + OCS 4.5 ( https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10648/consoleFull )
Deployment failed with OCP 4.5 + OCS 4.6 ( eng job: https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ocs-ci/545/consoleFull )
Deployment failed with OCP 4.6 + OCS 4.6 ( https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10647/ )

From the above, we can see that the issue is with OCS 4.6.
Thanks, moved it to 4.6
The StorageCluster is reporting "CephCluster not reporting status". Looking at the Rook-Ceph logs, we seem to have a problem with ServiceAccount permissions: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sshreeka-aws02/sshreeka-aws02_20200807T054732/logs/failed_testcase_ocs_logs_1596779805/deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-052faa341918e6f7f6543c26c9dd820aee27b1139f51e772302e978652e2b2a3/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-8ff886855-htz6t/rook-ceph-operator/rook-ceph-operator/logs/current.log Travis, can you provide more insight? I can't seem to find information about ServiceAccounts in the must-gather...
@Jose It seems none of the RBAC was applied for the operator to have access to the CRDs. Nothing seems to be working in the operator. The OCS Operator is picking up the latest rook v1.4.0 now, right? It smells like it could be related to the change of RBAC to remove the aggregate rules: https://github.com/rook/rook/pull/5970/commits/5b4d2c8cbc8832d40db7802bf3043fe798166131 Is there something in the CSV generation since that change?
OCS-op just merged the rebase on Rook-Ceph 1.4 today, so this issue might go away with today's build. Let's retry with a newer build soon. Thanks.
@Shreekar Do you have the full must gather for the failed cluster? Or a cluster that is still running with this issue? I'd like to look at the ClusterRoles and other RBAC that were generated in the 4.6 cluster.
Found the issue: the service account names in the CSV were not being generated properly since the aggregated rules were removed. https://github.com/rook/rook/pull/6046
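For anyone checking a generated CSV for this class of problem, here is a rough sketch. In a ClusterServiceVersion, OLM reads RBAC from `spec.install.spec.permissions` and `spec.install.spec.clusterPermissions`, and each entry carries a `serviceAccountName` alongside its `rules`. The thread does not show what the malformed names actually looked like, so this assumes one possible symptom (a missing or empty name). The function name and the example data are hypothetical.

```python
from typing import Any, Dict, List


def permissions_missing_service_account(csv: Dict[str, Any]) -> List[str]:
    """List permission entries in a CSV dict that lack a serviceAccountName."""
    install_spec = csv.get("spec", {}).get("install", {}).get("spec", {})
    problems = []
    for section in ("permissions", "clusterPermissions"):
        for index, entry in enumerate(install_spec.get(section) or []):
            # Assumed symptom: the name is absent or blank after generation.
            if not (entry.get("serviceAccountName") or "").strip():
                problems.append(f"{section}[{index}]")
    return problems


# Hypothetical CSV fragment for illustration only, not taken from the failing build.
example_csv = {
    "spec": {
        "install": {
            "spec": {
                "clusterPermissions": [
                    {"serviceAccountName": "rook-ceph-system", "rules": []},
                    {"serviceAccountName": "", "rules": []},
                ],
                "permissions": [
                    {"serviceAccountName": "rook-ceph-system", "rules": []},
                ],
            }
        }
    }
}

print(permissions_missing_service_account(example_csv))  # ['clusterPermissions[1]']
```

The CSV could be loaded into a dict from the JSON output of `oc get csv <name> -o json` before being passed to the function.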
The fix has been merged to the downstream branch and will be picked up in the next 4.6 build. https://github.com/openshift/rook/pull/103
OCS 4.6 deployment works well (v4.6.0-97.ci)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605