Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
OCS ocs-operator.v4.5.0-449.ci + Independent Mode install: the OCS operator CSV is stuck in the Installing state. Some observations:

1. The storagecluster is in Ready state and the cephcluster is in HEALTH_OK state.
2. The CSV could be stuck in Installing due to the known problem of a missing default backingstore (Bug 1847875), but since the ocs-operator pod logs do not explicitly point at this, I raised a separate BZ to ascertain the root cause.

Reason for a separate BZ:
********************************
With older OCS builds, even in the absence of a backingstore (and with the bucketclass in Rejected state), the CSV for the OCS operator reached the Succeeded state. Hence I want to reconfirm whether the issue is indeed due to the absence of the noobaa backingstore or whether there is some other problem as well.

3. Some outputs for reference are added in the Additional info.

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES              PHASE
awss3operator.1.0.1          AWS S3 Operator               1.0.1          awss3operator.1.0.0   Succeeded
ocs-operator.v4.5.0-449.ci   OpenShift Container Storage   4.5.0-449.ci                         Installing

$ oc describe csv ocs-operator.v4.5.0-449.ci -n openshift-storage
Type     Reason               Age                  From                        Message
----     ------               ----                 ----                        -------
Normal   RequirementsUnknown  109m (x2 over 109m)  operator-lifecycle-manager  requirements not yet checked
Normal   RequirementsNotMet   109m (x2 over 109m)  operator-lifecycle-manager  one or more requirements couldn't be found
Normal   InstallWaiting       108m (x2 over 108m)  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
Normal   InstallSucceeded     107m                 operator-lifecycle-manager  install strategy completed with no errors
Warning  ComponentUnhealthy   20m (x2 over 20m)    operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
Normal   InstallWaiting       15m (x5 over 108m)   operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
Normal   NeedsReinstall       10m (x6 over 20m)    operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: Waiting for rollout to finish: 0 of 1 updated replicas are available...
Normal   AllRequirementsMet   10m (x8 over 108m)   operator-lifecycle-manager  all requirements found, attempting install
Normal   InstallSucceeded     10m (x6 over 108m)   operator-lifecycle-manager  waiting for install components to report healthy
Warning  InstallCheckFailed   6s (x7 over 15m)     operator-lifecycle-manager  install timeout

Version of all relevant components (if applicable):
----------------------------------------------------------------------
OCS = ocs-operator.v4.5.0-449.ci
OCP = 4.5.0-0.nightly-2020-06-17-001505
External Cluster (RHCS) = RHCS 4.1 = 14.2.8-59.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
Yes

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Is this issue reproducible?
----------------------------------------------------------------------
CSV stuck in Installing phase: tried once with v4.5.0-449.
No default backingstore: yes, reproduced on all independent mode clusters.

Can this issue reproduce from the UI?
----------------------------------------------------------------------
The OCS operator was installed via the UI.

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
No. Independent mode is a new feature of OCS 4.5.

Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCP 4.5 cluster on BM.
2. Create an RHCS cluster with the latest 4.1 and at least 3 nodes.
3. From the UI, install OCS in Independent mode. Official docs are not ready yet, but the steps can be found here [1].
4. Check the status of the CSV for the OCS operator.
5. Check the noobaa backingstore and bucketclass. Commands are added in the Additional info (example commands are also sketched below).

Actual results:
----------------------------------------------------------------------
The CSV is stuck in the Installing phase with no clear indication of what the error is (at least none that I could find). It could be due to noobaa, but it is better to have a confirmation.

Expected results:
----------------------------------------------------------------------
The CSV should be in the Succeeded state if the install completes.
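For step 5, these are the kinds of commands typically used (just a sketch, assuming the default openshift-storage namespace; adjust if OCS was installed elsewhere):

$ # List the noobaa backingstores and bucketclasses along with their phases
$ oc get backingstore -n openshift-storage
$ oc get bucketclass -n openshift-storage

$ # A bucketclass stuck in Rejected phase usually points at a missing or broken backingstore
$ oc describe bucketclass -n openshift-storage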
The status on the CephCluster CR shows that everything is healthy. @Jose, what determines the CSV state?

status:
  ceph:
    health: HEALTH_OK
    lastChanged: "2020-06-17T15:19:30Z"
    lastChecked: "2020-06-17T15:36:41Z"
    previousHealth: HEALTH_WARN
  conditions:
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Failure
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Ignored
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Upgrading
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2020-06-17T15:07:21Z"
    lastTransitionTime: "2020-06-17T15:07:21Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected
  message: Cluster connected successfully
  phase: Connected
  state: Connected
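For reference, a minimal way to pull the same CephCluster status from a live cluster (a sketch, assuming the CR lives in openshift-storage as in this deployment):

$ oc get cephcluster -n openshift-storage
$ oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].status.phase}{"\n"}'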
Among other things, the StorageCluster conditions must all be healthy:

conditions:
- lastHeartbeatTime: "2020-06-17T15:37:28Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: Reconcile completed successfully
  reason: ReconcileCompleted
  status: "True"
  type: ReconcileComplete
- lastHeartbeatTime: "2020-06-17T15:07:19Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: CephCluster resource is not reporting status
  reason: CephClusterStatus
  status: "False"
  type: Available
- lastHeartbeatTime: "2020-06-17T15:14:36Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: Waiting on Nooba instance to finish initialization
  reason: NoobaaInitializing
  status: "True"
  type: Progressing
- lastHeartbeatTime: "2020-06-17T15:37:28Z"
  lastTransitionTime: "2020-06-17T15:07:22Z"
  message: 'External CephCluster Unknown Condition: Cluster connected successfully'
  reason: ExternalClusterStateUnknownCondition
  status: "True"
  type: Degraded
- lastHeartbeatTime: "2020-06-17T15:07:19Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: CephCluster resource is not reporting status
  reason: CephClusterStatus
  status: "False"
  type: Upgradeable
- lastHeartbeatTime: "2020-06-17T15:07:21Z"
  lastTransitionTime: "2020-06-17T15:07:20Z"
  message: 'External CephCluster is trying to connect: Cluster is connecting'
  reason: ExternalClusterStateConnecting
  status: "True"
  type: ExternalClusterConnecting
- lastHeartbeatTime: "2020-06-17T15:07:21Z"
  lastTransitionTime: "2020-06-17T15:07:20Z"
  message: 'External CephCluster is trying to connect: Cluster is connecting'
  reason: ExternalClusterStateConnecting
  status: "False"
  type: ExternalClusterConnected

For some reason, it is still in ExternalClusterStateConnecting. Indeed, that's what the ocs-operator logs show:

2020-06-17T15:07:21.149289186Z {"level":"info","ts":"2020-06-17T15:07:21.149Z","logger":"controller_storagecluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-independent-storagecluster"}

So I'm not sure what's going on here.
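To re-check both pieces above on a live cluster, something like this should do (a sketch; the grep pattern only matches the log message quoted above):

$ # Dump the StorageCluster conditions
$ oc describe storagecluster -n openshift-storage

$ # Look for the "waiting for the external ceph cluster" message in the ocs-operator logs
$ oc logs -n openshift-storage deployment/ocs-operator | grep -i "external ceph cluster"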
@Seb, could you tell what's going on here?
Acking for 4.5. I think we need to address this, one way or another. Changing the component to unclassified, since we don't know yet where the problem lies. Assigning to Seb for better visibility and further analysis.
As far as I can see, the CephCluster has the correct status since it's "Connected", but the operator shows otherwise... Digging into the logs, I can also verify that rook-ceph is happy:

op-config: CephCluster "openshift-storage" status: "Connected". "Cluster connected successfully"

Could it be that noobaa has been deployed but is failing?

Things are clear on the Rook-Ceph side, so someone with better knowledge of ocs-op and noobaa should look into it.
Thanks.
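For completeness, this is roughly how the Rook side can be re-checked (a sketch, assuming the operator deployment is named rook-ceph-operator as in the CSV events above):

$ oc logs -n openshift-storage deployment/rook-ceph-operator | grep -i "connected successfully"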
(In reply to leseb from comment #10)
> As far as I can see, the CephCluster has the correct status since it's
> "Connected" but the operator shows otherwise...
> Digging into the logs, I can also verify that the rook-ceph is happy:
>
> op-config: CephCluster "openshift-storage" status: "Connected". "Cluster
> connected successfully"
>
> Could it be that noobaa has been deployed but is failing?
>
> Things are clear on the Rook-Ceph side, so someone with a better knowledge
> on ocs-op and noobaa should look into it.
> Thanks.

@Nimrod, could you (or your team) check if there's something going on on the noobaa side? The last comment from QE indicates that there's a problem with the bucketclass and backingstore.
Default BackingStore has a problem, see https://bugzilla.redhat.com/show_bug.cgi?id=1854768
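To confirm that on a cluster hitting this, checking the noobaa CR and the default backingstore phase should show it (a sketch; noobaa-default-backing-store is assumed to be the name of the default backingstore, adjust if it differs):

$ oc get noobaa -n openshift-storage
$ oc get backingstore noobaa-default-backing-store -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'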
(In reply to Nimrod Becker from comment #13)
> Default BackingStore has a problem, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1854768

What does this mean for this BZ? Is it a duplicate?
I think it is; we can wait a couple of hours to verify that the deployment is passing.
RFC PR: https://github.com/openshift/ocs-operator/pull/627
Backport PR: https://github.com/openshift/ocs-operator/pull/628
Backport PR has merged.
4.5.0-484.ci has the fix: https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OCS%20Build%20Pipeline%204.5/62/
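For verification against that build, checking the CSV phase and the noobaa resources should be enough (a sketch; the CSV name is assumed to follow the build number):

$ oc get csv ocs-operator.v4.5.0-484.ci -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
$ oc get backingstore,bucketclass -n openshift-storage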
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754
Removing the AutomationBackLog keyword. This will be covered in the installation phase of all automated tier runs.