Bug 1925369
| Summary: | [External Mode] Fails if Ceph Monitors are not listening on v1 port 6789; ensure both v1 and v2 are present | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Randy Martinez <r.martinez> |
| Component: | rook | Assignee: | arun kumar mohan <amohan> |
| Status: | VERIFIED --- | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.6 | CC: | amohan, muagarwa, shan, sostapov, tdesala, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.8.0 | Flags: | shmohan: needinfo? (amohan) |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.8.0-416.ci | Doc Type: | Bug Fix |
| Doc Text: |
Cause: OCS external-mode cluster deployment does not progress when the external RHCS cluster's MON leader is configured with only the `v2` port (3300) and the `v1` port (6789) is *not* enabled.
Consequence: The external cluster connection from OCP does not progress and remains stalled.
Fix: As a workaround, the external Python script that generates the JSON output raises an exception when only the `v2` mon port is enabled and tells the user to enable the `v1` port as well (see the sketch after the metadata table below).
Result: Because the script will not complete until the client/user has enabled the `v1` port, the user is forced to enable it before the deployment can proceed.
| Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
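
For illustration of the check the Doc Text describes, here is a minimal sketch in Python. It is not the code from the Rook PR; it assumes the `ceph mon dump --format json` layout, in which each monitor's `public_addrs.addrvec` entries carry a `type` of `v1` or `v2`, and the helper name `assert_v1_port_enabled` is made up for this sketch:

```python
import json
import subprocess


def assert_v1_port_enabled(mon_dump_json: str) -> None:
    """Raise if any monitor advertises only a v2 (msgr2, port 3300) address.

    Hypothetical helper for illustration; per the Doc Text, the real
    external-cluster script performs a similar check before emitting its JSON.
    """
    mon_dump = json.loads(mon_dump_json)
    for mon in mon_dump.get("mons", []):
        # Collect the messenger types ("v1"/"v2") advertised by this monitor.
        addr_types = {a["type"] for a in mon["public_addrs"]["addrvec"]}
        if "v1" not in addr_types:
            raise RuntimeError(
                f"monitor {mon['name']!r} listens only on {sorted(addr_types)}; "
                "enable the v1 messenger port (6789) on all monitors before "
                "running the external-cluster script"
            )


if __name__ == "__main__":
    # Query the external RHCS cluster for its monitor map and validate it.
    out = subprocess.check_output(["ceph", "mon", "dump", "--format", "json"])
    assert_v1_port_enabled(out.decode())
    print("all monitors advertise a v1 address; safe to proceed")
```

Running the sketch against a v2-only cluster raises the exception instead of printing the confirmation, which mirrors the behaviour the workaround relies on.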
Sorry, I missed that the must-gather was attached to the case. Looking into it, the only other thing I was able to glean was a lot of spam in the rook-ceph-operator logs from these lines:

    2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | ceph-object-controller: failed to delete object from bucket. RequestError: send request failed
    2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->172.30.18.19:80: read: connection reset by peer

So it looks like we may just have connectivity issues to the external Ceph cluster. Still not quite sure how to proceed or troubleshoot.

Randy, is the external cluster a standard RHCS deployment with ceph-ansible? What bugs me is this sentence: "Specifically, they mentioned that the MONs were _not_ listening on v1(:6789), and only v2(:3300) messenger protocol. Their assumption being that v1 was not needed." Could you clarify? Thanks.

I'm still not sure whether this is a valid bug, but it seems like a misconfiguration issue and not a hard blocker for external mode. Moving this to OCS 4.8.

Hi Randy, that sounds good. I'm moving this to Rook; Arun will take care of this. Thanks!

Arun, can you take a look?

Arun, any update?

Sorry Travis, not working on this. Will take this up next week... Arun

Can we get this in the next day to be in time for dev freeze? It looks small, thanks.

PR raised: https://github.com/rook/rook/pull/8083. Travis / Sebastian, please take a look.

Providing the doc text. Please take a look.
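
For context on the v1/v2 discussion above, this is roughly what the monitor map looks like in the two configurations; the addresses and monitor name below are invented for illustration, not taken from the case data:

```console
# v2-only monitor (the configuration this bug is about):
$ ceph mon dump | grep 'mon\.'
0: [v2:192.168.100.11:3300/0] mon.a

# both messenger versions enabled (what external mode expects):
$ ceph mon dump | grep 'mon\.'
0: [v2:192.168.100.11:3300/0,v1:192.168.100.11:6789/0] mon.a
```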
We'll need more information, preferably a full ocs-must-gather on the affected cluster. With what's available right now, I see the following weirdness in the CephCluster CR:

    status:
      ceph:
        details:
          MANY_OBJECTS_PER_PG:
            message: 1 pools have many more objects per pg than average
            severity: HEALTH_WARN
          POOL_TOO_MANY_PGS:
            message: 1 pools have too many placement groups
            severity: HEALTH_WARN
        health: HEALTH_WARN
        lastChecked: "2021-01-29T19:12:34Z"
      conditions:
      - lastHeartbeatTime: "2021-01-27T18:23:17Z"
        lastTransitionTime: "2021-01-27T18:23:17Z"
        message: Cluster is connecting
        reason: ClusterConnecting
        status: "True"
        type: Connecting
      - lastHeartbeatTime: "2021-01-27T18:23:23Z"
        lastTransitionTime: "2021-01-27T18:23:23Z"
        message: Cluster connected successfully
        reason: ClusterConnected
        status: "True"
        type: Connected

First, we have the Ceph errors, which I'm not sure how to interpret but may be a problem. Second, both the Connecting and Connected conditions are True, which I'm not sure is valid. Travis or Seb, can you weigh in?
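
As a side note on inspecting that condition oddity, the conditions can be read straight off the CR with a jsonpath query; the CephCluster name below is an assumption (a typical external-mode name), not taken from the must-gather:

```console
$ oc -n openshift-storage get cephcluster ocs-external-storagecluster-cephcluster \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
Connecting=True
Connected=True
```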