Bug 1925369

Summary: [External Mode] Deployment fails if Ceph monitors are not listening on the v1 port (6789); ensure both v1 and v2 ports are enabled
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Randy Martinez <r.martinez>
Component: rook
Assignee: arun kumar mohan <amohan>
Status: VERIFIED
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: amohan, muagarwa, shan, sostapov, tdesala, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.8.0
Flags: shmohan: needinfo? (amohan)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.8.0-416.ci
Doc Type: Bug Fix
Doc Text:
Cause: OCS external-mode cluster deployment does not progress when the external RHCS cluster's MON leader is configured with only the `v2` port (3300) and the `v1` port (6789) is *not* enabled.
Consequence: The external cluster connection from OCP stalls and never completes.
Fix: As a workaround, the external Python script that generates the JSON output now raises an exception when only the 'v2' mon port is enabled, and tells the user to enable the 'v1' port as well.
Result: Because the script will not complete until the 'v1' port is enabled, the user must enable the 'v1' port for the script to finish successfully.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
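The workaround described in the Doc Text above can be modeled as a validation step in the external script. This is a minimal sketch, not the actual rook script: the function name `validate_mon_v1_port` and the endpoint-string format are assumptions for illustration.

```python
def validate_mon_v1_port(mon_addrs):
    """Check one monitor's advertised endpoints for a v1 (msgr1) address.

    mon_addrs: list of endpoint strings for a single monitor, e.g.
    ["v2:10.0.0.1:3300/0", "v1:10.0.0.1:6789/0"] (illustrative format).
    Raises RuntimeError if the mon advertises only the v2 (3300) port,
    mirroring the behavior the bug fix describes.
    """
    has_v1 = any(a.startswith("v1:") for a in mon_addrs)
    has_v2 = any(a.startswith("v2:") for a in mon_addrs)
    if has_v2 and not has_v1:
        raise RuntimeError(
            "Monitor is listening only on the v2 (3300) port; "
            "enable the v1 (6789) port as well and re-run the script")
    return True
```

A dual-protocol monitor passes the check; a v2-only monitor causes the script to abort with an actionable message instead of letting the OCP-side connection stall silently.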

Comment 6 Jose A. Rivera 2021-02-08 15:15:02 UTC
We'll need more information, preferably a full ocs-must-gather from the affected cluster.

With what's available right now, I see the following weirdness in the CephCluster CR:

status:
  ceph:
    details:
      MANY_OBJECTS_PER_PG:
        message: 1 pools have many more objects per pg than average
        severity: HEALTH_WARN
      POOL_TOO_MANY_PGS:
        message: 1 pools have too many placement groups
        severity: HEALTH_WARN
    health: HEALTH_WARN
    lastChecked: "2021-01-29T19:12:34Z"
  conditions:
  - lastHeartbeatTime: "2021-01-27T18:23:17Z"
    lastTransitionTime: "2021-01-27T18:23:17Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2021-01-27T18:23:23Z"
    lastTransitionTime: "2021-01-27T18:23:23Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected

First, we have the Ceph health warnings, which I'm not sure how to interpret but may be a problem. Second, both the Connecting and Connected conditions are True, which I'm not sure is valid. Travis or Seb, can you weigh in?

Comment 7 Jose A. Rivera 2021-02-08 17:26:34 UTC
Sorry, I missed that the must-gather was attached to the case.

Looking into it, the only other thing I was able to glean was a lot of spam of these lines in the rook-ceph-operator logs:

2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | ceph-object-controller: failed to delete object from bucket. RequestError: send request failed
2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->172.30.18.19:80: read: connection reset by peer

So it looks like we may just have connectivity issues to the external Ceph cluster. Still not quite sure how to proceed or troubleshoot.

Comment 9 Sébastien Han 2021-02-10 17:45:27 UTC
Randy, is the external cluster a standard RHCS deployment with ceph-ansible?

What bugs me is this sentence: "Specifically, they mentioned that the MONs were _not_ listening on v1(:6789), and only v2(:3300) messenger protocol. Their assumption being that v1 was not needed" — could you clarify?
Thanks
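The v2-only situation asked about above can be confirmed from `ceph mon dump` output, where each monitor line lists its advertised address vector. Below is a hedged illustration: the function `mons_missing_v1` and the sample lines are hypothetical, written against the address-vector format that `ceph mon dump` prints (e.g. `[v2:<ip>:3300/0,v1:<ip>:6789/0] mon.<id>`).

```python
import re

def mons_missing_v1(mon_dump_lines):
    """Return the names of monitors whose address vector has no v1 endpoint.

    Each input line is expected to look like a `ceph mon dump` entry:
    "0: [v2:192.168.1.10:3300/0,v1:192.168.1.10:6789/0] mon.a".
    """
    missing = []
    for line in mon_dump_lines:
        m = re.search(r"\[([^\]]+)\]\s+(mon\.\S+)", line)
        if not m:
            continue  # skip header/epoch lines that carry no address vector
        addrs, name = m.group(1), m.group(2)
        if "v1:" not in addrs:
            missing.append(name)
    return missing

# Illustrative sample, not data from this case:
sample = [
    "0: [v2:192.168.1.10:3300/0,v1:192.168.1.10:6789/0] mon.a",
    "1: [v2:192.168.1.11:3300/0] mon.b",
]
```

Here `mon.b` would be flagged as v2-only. On the Ceph side, one documented way to add a v1 address to an existing monitor is `ceph mon set-addrs`, though the exact procedure depends on how the RHCS cluster was deployed.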

Comment 11 Jose A. Rivera 2021-02-15 15:58:06 UTC
I'm still not sure if this is a valid bug or not, but it seems like a misconfiguration issue and not a hard blocker for external mode. Moving this to OCS 4.8.

Comment 14 Sébastien Han 2021-03-16 09:57:54 UTC
Hi Randy, that sounds good. I'm moving this to Rook, Arun will take care of this.
Thanks!

Comment 15 Travis Nielsen 2021-05-11 15:09:35 UTC
Arun can you take a look?

Comment 16 Travis Nielsen 2021-05-17 15:53:37 UTC
Arun any update?

Comment 17 arun kumar mohan 2021-05-21 11:02:11 UTC
Sorry Travis, I haven't started on this yet. I will take it up next week...

Comment 18 Travis Nielsen 2021-06-07 16:02:29 UTC
Arun, can we get this in within the next day to be in time for dev freeze? It looks small, thanks.

Comment 19 arun kumar mohan 2021-06-08 18:33:18 UTC
PR raised: https://github.com/rook/rook/pull/8083

Travis / Sébastien, please take a look.

Comment 23 arun kumar mohan 2021-06-21 10:57:41 UTC
Providing the doc text. Please take a look.