Bug 1848387
| Summary: | Independent mode OCS 4.5(v4.5.0-449): CSV is stuck in Installing state and never reached Succeeded state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
| Component: | ocs-operator | Assignee: | Michael Adam <madam> |
| Status: | CLOSED ERRATA | QA Contact: | Sidhant Agrawal <sagrawal> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | bniver, gmeno, jarrpa, jijoy, kramdoss, madam, nbecker, ocs-bugs, rgeorge, sagrawal, shan, sostapov |
| Target Milestone: | --- | Keywords: | AutomationTriaged |
| Target Release: | OCS 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.5.0-484.ci | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-15 10:17:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Neha Berry
2020-06-18 09:32:47 UTC
The status on the CephCluster CR shows that everything is healthy.
@Jose, what determines the CSV state?
```yaml
status:
  ceph:
    health: HEALTH_OK
    lastChanged: "2020-06-17T15:19:30Z"
    lastChecked: "2020-06-17T15:36:41Z"
    previousHealth: HEALTH_WARN
  conditions:
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Failure
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Ignored
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    status: "False"
    type: Upgrading
  - lastHeartbeatTime: "2020-06-17T15:07:18Z"
    lastTransitionTime: "2020-06-17T15:07:18Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2020-06-17T15:07:21Z"
    lastTransitionTime: "2020-06-17T15:07:21Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected
  message: Cluster connected successfully
  phase: Connected
  state: Connected
```
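For reference, what the operator has to decide from a status like this comes down to looking up specific condition types. Below is a minimal sketch of that lookup with simplified local types standing in for the real rook-ceph/ocs-operator API structs; it is an illustration, not the actual operator code.

```go
package main

import "fmt"

// Condition mirrors the shape of the entries under status.conditions above
// (simplified; the real types live in the rook-ceph and ocs-operator APIs).
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// findCondition returns the condition with the given type, or nil if absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

func main() {
	// Conditions copied (abridged) from the CephCluster status pasted above.
	conds := []Condition{
		{Type: "Connecting", Status: "True", Reason: "ClusterConnecting", Message: "Cluster is connecting"},
		{Type: "Connected", Status: "True", Reason: "ClusterConnected", Message: "Cluster connected successfully"},
	}

	if c := findCondition(conds, "Connected"); c != nil && c.Status == "True" {
		fmt.Println("external CephCluster reports Connected:", c.Message)
	} else {
		fmt.Println("external CephCluster not connected yet")
	}
}
```

Run against the status above, this reports the cluster as Connected, which is why the CephCluster side looks healthy.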
Among other things, the StorageCluster conditions must all be healthy:
```yaml
conditions:
- lastHeartbeatTime: "2020-06-17T15:37:28Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: Reconcile completed successfully
  reason: ReconcileCompleted
  status: "True"
  type: ReconcileComplete
- lastHeartbeatTime: "2020-06-17T15:07:19Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: CephCluster resource is not reporting status
  reason: CephClusterStatus
  status: "False"
  type: Available
- lastHeartbeatTime: "2020-06-17T15:14:36Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: Waiting on Nooba instance to finish initialization
  reason: NoobaaInitializing
  status: "True"
  type: Progressing
- lastHeartbeatTime: "2020-06-17T15:37:28Z"
  lastTransitionTime: "2020-06-17T15:07:22Z"
  message: 'External CephCluster Unknown Condition: Cluster connected successfully'
  reason: ExternalClusterStateUnknownCondition
  status: "True"
  type: Degraded
- lastHeartbeatTime: "2020-06-17T15:07:19Z"
  lastTransitionTime: "2020-06-17T15:07:19Z"
  message: CephCluster resource is not reporting status
  reason: CephClusterStatus
  status: "False"
  type: Upgradeable
- lastHeartbeatTime: "2020-06-17T15:07:21Z"
  lastTransitionTime: "2020-06-17T15:07:20Z"
  message: 'External CephCluster is trying to connect: Cluster is connecting'
  reason: ExternalClusterStateConnecting
  status: "True"
  type: ExternalClusterConnecting
- lastHeartbeatTime: "2020-06-17T15:07:21Z"
  lastTransitionTime: "2020-06-17T15:07:20Z"
  message: 'External CephCluster is trying to connect: Cluster is connecting'
  reason: ExternalClusterStateConnecting
  status: "False"
  type: ExternalClusterConnected
```
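In this dump, Available and Upgradeable are False while Progressing and Degraded are True, so the StorageCluster is not "all healthy". Below is a hedged sketch of that rule, assuming this approximation of the gating logic (it is not the operator's exact readiness code, and the types are simplified).

```go
package main

import "fmt"

// Condition mirrors the entries under the StorageCluster status.conditions above.
type Condition struct {
	Type   string
	Status string
}

// storageClusterHealthy is an assumed approximation of the rule "the
// StorageCluster conditions must all be healthy": every reported condition
// must be in its "good" state before the install is considered complete.
func storageClusterHealthy(conds []Condition) bool {
	want := map[string]string{
		"ReconcileComplete": "True",
		"Available":         "True",
		"Upgradeable":       "True",
		"Progressing":       "False",
		"Degraded":          "False",
	}
	for _, c := range conds {
		if expected, ok := want[c.Type]; ok && c.Status != expected {
			return false
		}
	}
	return true
}

func main() {
	// Values copied from the StorageCluster status pasted above.
	conds := []Condition{
		{Type: "ReconcileComplete", Status: "True"},
		{Type: "Available", Status: "False"},
		{Type: "Progressing", Status: "True"},
		{Type: "Degraded", Status: "True"},
		{Type: "Upgradeable", Status: "False"},
	}
	fmt.Println("healthy:", storageClusterHealthy(conds)) // prints: healthy: false
}
```

With the values above, any rule of this shape keeps the CSV stuck in Installing.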
For some reason, it is still in ExternalClusterStateConnecting. Indeed, that's what the ocs-operator logs show:
```
2020-06-17T15:07:21.149289186Z {"level":"info","ts":"2020-06-17T15:07:21.149Z","logger":"controller_storagecluster","msg":"Waiting for the external ceph cluster to be connected before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-independent-storagecluster"}
```
So I'm not sure what's going on here.
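That log line comes from the StorageCluster reconcile: NooBaa is only set up once the external cluster is considered connected, so the whole install stalls while that check keeps failing. Here is a minimal sketch of the gating step, with a hypothetical reconcileNoobaa helper standing in for the real reconcile code.

```go
package main

import (
	"errors"
	"fmt"
)

// reconcileNoobaa sketches the gating step the log message refers to: the
// NooBaa system is only created once the external cluster counts as
// connected; otherwise the reconcile returns early and is retried later.
func reconcileNoobaa(externalClusterConnected bool) error {
	if !externalClusterConnected {
		return errors.New("waiting for the external ceph cluster to be connected before starting noobaa")
	}
	fmt.Println("external cluster connected; creating the NooBaa system")
	return nil
}

func main() {
	// With the ExternalClusterConnected condition stuck at "False" (as in the
	// StorageCluster status above), every pass ends here and the install
	// never progresses.
	if err := reconcileNoobaa(false); err != nil {
		fmt.Println("requeue:", err)
	}
}
```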
@Seb, could you tell what's going on here?

Acking for 4.5. I think we need to address this, one way or another. Changing component to unclassified, since we don't know yet where the problem lies. Assigning to Seb for better visibility, for doing more analysis.

As far as I can see, the CephCluster has the correct status since it's "Connected" but the operator shows otherwise... Digging into the logs, I can also verify that the rook-ceph is happy:

op-config: CephCluster "openshift-storage" status: "Connected". "Cluster connected successfully"

Could it be that noobaa has been deployed but is failing? Things are clear on the Rook-Ceph side, so someone with a better knowledge on ocs-op and noobaa should look into it. Thanks.

(In reply to leseb from comment #10)
> As far as I can see, the CephCluster has the correct status since it's
> "Connected" but the operator shows otherwise...
> Digging into the logs, I can also verify that the rook-ceph is happy:
>
> op-config: CephCluster "openshift-storage" status: "Connected". "Cluster
> connected successfully"
>
> Could it be that noobaa has been deployed but is failing?
>
> Things are clear on the Rook-Ceph side, so someone with a better knowledge
> on ocs-op and noobaa should look into it.
> Thanks.

@Nimrod, could you(r team) check if there's something going on on the noobaa side? The last comment from QE indicates that there's a problem with the bucketclass and backingstore.

Default BackingStore has a problem, see https://bugzilla.redhat.com/show_bug.cgi?id=1854768

(In reply to Nimrod Becker from comment #13)
> Default BackingStore has a problem, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1854768

What does this mean for this BZ? Is it a duplicate?

I think it is; we can wait for a couple of hours to verify deployment is passing.

Backport PR: https://github.com/openshift/ocs-operator/pull/628

Backport PR has merged.

https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OCS%20Build%20Pipeline%204.5/62/

4.5.0-484.ci has the fix.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

Removing the AutomationBackLog keyword. This will be covered in the installation phase of all automated tier runs.
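When verifying a build with the fix, the pieces discussed above (the external CephCluster connection and the NooBaa default BackingStore/BucketClass from bug 1854768) can be checked directly from their CR statuses. The sketch below shells out to `oc`; the object names are the usual OCS defaults and are assumptions here, so adjust them for the actual cluster.

```go
package main

import (
	"fmt"
	"os/exec"
)

// checkPhase shells out to `oc` and prints the .status.phase of a resource in
// the openshift-storage namespace.
func checkPhase(resource, name string) {
	out, err := exec.Command("oc", "get", resource, name,
		"-n", "openshift-storage", "-o", "jsonpath={.status.phase}").CombinedOutput()
	if err != nil {
		fmt.Printf("%s/%s: %v (%s)\n", resource, name, err, out)
		return
	}
	fmt.Printf("%s/%s phase: %s\n", resource, name, out)
}

func main() {
	// Object names are assumed defaults and may differ in a given cluster.
	checkPhase("cephcluster", "ocs-independent-storagecluster-cephcluster")
	checkPhase("backingstore", "noobaa-default-backing-store")
	checkPhase("bucketclass", "noobaa-default-bucket-class")
}
```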