Bug 1925369 - [External Mode] Fails if Ceph Monitors are not listening to v1 type port 6789 and make sure both v1 and v2 are present [NEEDINFO]
Summary: [External Mode] Fails if Ceph Monitors are not listening to v1 type port 6789...
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: arun kumar mohan
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-05 02:37 UTC by Randy Martinez
Modified: 2023-08-03 08:31 UTC (History)
6 users

Fixed In Version: 4.8.0-416.ci
Doc Type: Bug Fix
Doc Text:
Cause: OCS external mode cluster deployment does not progress if the external RHCS cluster's MON leader is configured with only the `v2` port (port 3300) and the `v1` port (port 6789) is *not* enabled. Consequence: The external cluster connection from OCP does not progress and remains stalled. Fix: As a workaround, the external Python script that generates the JSON output now raises an exception if only the `v2` mon port is enabled, and tells the user to enable the `v1` type port as well. Result: Because the script will not complete until the client/user has enabled the `v1` type port, the user is required to enable the `v1` port for the script to finish successfully.
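The check described in the Doc Text can be sketched as follows. This is a minimal illustration, not the actual script: the function name, error message, and endpoint-string format are assumptions based on Ceph's usual `v1:`/`v2:` address prefixes.

```python
# Hypothetical sketch of the v2-only detection the external script performs:
# inspect a monitor's advertised endpoints and fail fast if a v2 (port 3300)
# address is present but no v1 (port 6789) address is.

def assert_v1_port_enabled(mon_addrs):
    """mon_addrs: list of endpoint strings, e.g.
    ["v2:10.0.0.1:3300", "v1:10.0.0.1:6789"]."""
    has_v1 = any(a.startswith("v1:") for a in mon_addrs)
    has_v2 = any(a.startswith("v2:") for a in mon_addrs)
    if has_v2 and not has_v1:
        raise RuntimeError(
            "Mon is listening only on the v2 port (3300); "
            "please enable the v1 port (6789) as well and rerun the script")
    return True
```

Raising an exception (rather than silently generating JSON) is what forces the user to fix the mon configuration before the deployment can proceed.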
Clone Of:
Environment:
Last Closed:
Embargoed:
shmohan: needinfo? (amohan)


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 249 0 None open Bug 1925369: ceph: mons should enable v1 address type when enabling v2 2021-06-09 16:44:07 UTC

Comment 6 Jose A. Rivera 2021-02-08 15:15:02 UTC
We'll need more information, preferably a full ocs-must-gather from the affected cluster.

With what's available right now, I see the following weirdness in the CephCluster CR:

status:
  ceph:
    details:
      MANY_OBJECTS_PER_PG:
        message: 1 pools have many more objects per pg than average
        severity: HEALTH_WARN
      POOL_TOO_MANY_PGS:
        message: 1 pools have too many placement groups
        severity: HEALTH_WARN
    health: HEALTH_WARN
    lastChecked: "2021-01-29T19:12:34Z"
  conditions:
  - lastHeartbeatTime: "2021-01-27T18:23:17Z"
    lastTransitionTime: "2021-01-27T18:23:17Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2021-01-27T18:23:23Z"
    lastTransitionTime: "2021-01-27T18:23:23Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected

First, we have the Ceph health warnings, which I'm not sure how to interpret but may be a problem. Second, both the Connecting and Connected conditions are True, which I'm not sure is valid. Travis or Seb, can you weigh in?

Comment 7 Jose A. Rivera 2021-02-08 17:26:34 UTC
Sorry, I missed that the must-gather was attached to the case.

Looking into it, the only other thing I was able to glean was a lot of spam of these lines in the rook-ceph-operator logs:

2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | ceph-object-controller: failed to delete object from bucket. RequestError: send request failed
2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->172.30.18.19:80: read: connection reset by peer

So it looks like we may just have connectivity issues to the external Ceph cluster. Still not quite sure how to proceed or troubleshoot.
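One way to narrow down whether this is plain network reachability would be a raw TCP probe of the service endpoint from the log above. A minimal, generic sketch (the host/port values are placeholders, not taken from the cluster):

```python
import socket

def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder endpoint, would be run from inside the cluster):
# tcp_probe("rook-ceph-rgw-....svc.cluster.local", 80)
```

A probe that connects but a client that later gets "connection reset by peer" would point at the RGW side (or something in between) dropping established connections rather than the endpoint being unreachable.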

Comment 9 Sébastien Han 2021-02-10 17:45:27 UTC
Randy, is the external cluster a standard RHCS deployment with ceph-ansible?

What bugs me is this sentence "Specifically, they mentioned that the MONs were _not_ listening on v1(:6789), and only v2(:3300) messenger protocol. Their assumption being that v1 was not needed", so if you could clarify.
Thanks
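To verify what the MONs actually advertise, one could inspect the addrvec lines in `ceph mon dump` output. A small parsing sketch (the sample lines below are illustrative, not from the affected cluster):

```python
import re

# Sketch: given a line from `ceph mon dump` output, e.g.
#   0: [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] mon.a
# report which messenger protocol versions the mon advertises.

def mon_protocols(dump_line):
    return set(re.findall(r"\b(v[12]):", dump_line))

line_both = "0: [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] mon.a"
line_v2_only = "0: [v2:10.0.0.1:3300/0] mon.a"
```

A mon showing only `v2` here would match the reporter's description of v1(:6789) not being enabled.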

Comment 11 Jose A. Rivera 2021-02-15 15:58:06 UTC
I'm still not sure if this is a valid bug or not, but it seems like a misconfiguration issue and not a hard blocker for external mode. Moving this to OCS 4.8.

Comment 14 Sébastien Han 2021-03-16 09:57:54 UTC
Hi Randy, that sounds good. I'm moving this to Rook, Arun will take care of this.
Thanks!

Comment 15 Travis Nielsen 2021-05-11 15:09:35 UTC
Arun can you take a look?

Comment 16 Travis Nielsen 2021-05-17 15:53:37 UTC
Arun any update?

Comment 17 arun kumar mohan 2021-05-21 11:02:11 UTC
Sorry Travis, I haven't been working on this. Will take it up next week...

Comment 18 Travis Nielsen 2021-06-07 16:02:29 UTC
Arun, can we get this in the next day to be in time for dev freeze? It looks small, thanks.

Comment 19 arun kumar mohan 2021-06-08 18:33:18 UTC
PR raised: https://github.com/rook/rook/pull/8083

Travis / Sebastian please take a look.

Comment 23 arun kumar mohan 2021-06-21 10:57:41 UTC
Providing the doc text. Please take a look.

