Bug 1925369

Summary: [External Mode] Deployment fails if Ceph monitors are not listening on the v1 port (6789); ensure both v1 and v2 ports are enabled
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Randy Martinez <r.martinez>
Component: rook
Assignee: arun kumar mohan <amohan>
Status: VERIFIED
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: amohan, muagarwa, shan, sostapov, tdesala, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.8.0
Flags: shmohan: needinfo? (amohan)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.8.0-416.ci
Doc Type: Bug Fix
Doc Text:
Cause: OCS external-mode cluster deployment does not progress when the external RHCS cluster's MON leader is configured with only the `v2` port (3300) and the `v1` port (6789) is *not* enabled.
Consequence: The external cluster connection from OCP stalls and never completes.
Fix: As a workaround, the external Python script that generates the JSON output now raises an exception when only the 'v2' mon port is enabled, and tells the user to enable the 'v1' port as well.
Result: Because the script will not complete until the 'v1' port is enabled, the user must enable the 'v1' port for the script to finish successfully.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
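The workaround described in the Doc Text above can be modeled as a validation step in the external script. This is a minimal sketch, not the actual rook script: the function name `validate_mon_v1_port` and the endpoint-string format are assumptions for illustration.

```python
def validate_mon_v1_port(mon_addrs):
    """Check one monitor's advertised endpoints for a v1 (msgr1) address.

    mon_addrs: list of endpoint strings for a single monitor, e.g.
    ["v2:10.0.0.1:3300/0", "v1:10.0.0.1:6789/0"] (illustrative format).
    Raises RuntimeError if the mon advertises only the v2 (3300) port,
    mirroring the behavior the bug fix describes.
    """
    has_v1 = any(a.startswith("v1:") for a in mon_addrs)
    has_v2 = any(a.startswith("v2:") for a in mon_addrs)
    if has_v2 and not has_v1:
        raise RuntimeError(
            "Monitor is listening only on the v2 (3300) port; "
            "enable the v1 (6789) port as well and re-run the script")
    return True
```

A dual-protocol monitor passes the check; a v2-only monitor causes the script to abort with an actionable message instead of letting the OCP-side connection stall silently.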

Comment 6 Jose A. Rivera 2021-02-08 15:15:02 UTC
We'll need more information, preferably a full ocs-must-gather from the affected cluster.

With what's available right now, I see the following weirdness in the CephCluster CR:

status:
  ceph:
    details:
      MANY_OBJECTS_PER_PG:
        message: 1 pools have many more objects per pg than average
        severity: HEALTH_WARN
      POOL_TOO_MANY_PGS:
        message: 1 pools have too many placement groups
        severity: HEALTH_WARN
    health: HEALTH_WARN
    lastChecked: "2021-01-29T19:12:34Z"
  conditions:
  - lastHeartbeatTime: "2021-01-27T18:23:17Z"
    lastTransitionTime: "2021-01-27T18:23:17Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2021-01-27T18:23:23Z"
    lastTransitionTime: "2021-01-27T18:23:23Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected

First, we have the Ceph health warnings, which I'm not sure how to interpret but may be a problem. Second, both the Connecting and Connected conditions are True, which I'm not sure is valid. Travis or Seb, can you weigh in?

Comment 7 Jose A. Rivera 2021-02-08 17:26:34 UTC
Sorry, I missed that the must-gather was attached to the case.

Looking into it, the only other thing I was able to glean was a lot of spam of these lines in the rook-ceph-operator logs:

2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | ceph-object-controller: failed to delete object from bucket. RequestError: send request failed
2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->172.30.18.19:80: read: connection reset by peer

So it looks like we may just have connectivity issues to the external Ceph cluster. Still not quite sure how to proceed or troubleshoot.

Comment 9 Sébastien Han 2021-02-10 17:45:27 UTC
Randy, is the external cluster a standard RHCS deployment with ceph-ansible?

What bugs me is this sentence: "Specifically, they mentioned that the MONs were _not_ listening on v1(:6789), and only v2(:3300) messenger protocol. Their assumption being that v1 was not needed" — could you clarify?
Thanks
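The v2-only situation asked about above can be confirmed from `ceph mon dump` output, where each monitor line lists its advertised address vector. Below is a hedged illustration: the function `mons_missing_v1` and the sample lines are hypothetical, written against the address-vector format that `ceph mon dump` prints (e.g. `[v2:<ip>:3300/0,v1:<ip>:6789/0] mon.<id>`).

```python
import re

def mons_missing_v1(mon_dump_lines):
    """Return the names of monitors whose address vector has no v1 endpoint.

    Each input line is expected to look like a `ceph mon dump` entry:
    "0: [v2:192.168.1.10:3300/0,v1:192.168.1.10:6789/0] mon.a".
    """
    missing = []
    for line in mon_dump_lines:
        m = re.search(r"\[([^\]]+)\]\s+(mon\.\S+)", line)
        if not m:
            continue  # skip header/epoch lines that carry no address vector
        addrs, name = m.group(1), m.group(2)
        if "v1:" not in addrs:
            missing.append(name)
    return missing

# Illustrative sample, not data from this case:
sample = [
    "0: [v2:192.168.1.10:3300/0,v1:192.168.1.10:6789/0] mon.a",
    "1: [v2:192.168.1.11:3300/0] mon.b",
]
```

Here `mon.b` would be flagged as v2-only. On the Ceph side, one documented way to add a v1 address to an existing monitor is `ceph mon set-addrs`, though the exact procedure depends on how the RHCS cluster was deployed.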

Comment 11 Jose A. Rivera 2021-02-15 15:58:06 UTC
I'm still not sure if this is a valid bug or not, but it seems like a misconfiguration issue and not a hard blocker for external mode. Moving this to OCS 4.8.

Comment 14 Sébastien Han 2021-03-16 09:57:54 UTC
Hi Randy, that sounds good. I'm moving this to Rook, Arun will take care of this.
Thanks!

Comment 15 Travis Nielsen 2021-05-11 15:09:35 UTC
Arun can you take a look?

Comment 16 Travis Nielsen 2021-05-17 15:53:37 UTC
Arun any update?

Comment 17 arun kumar mohan 2021-05-21 11:02:11 UTC
Sorry Travis, I haven't started on this yet. I will take it up next week...

Comment 18 Travis Nielsen 2021-06-07 16:02:29 UTC
Arun, can we get this in within the next day to be in time for dev freeze? It looks small, thanks.

Comment 19 arun kumar mohan 2021-06-08 18:33:18 UTC
PR raised: https://github.com/rook/rook/pull/8083

Travis / Sébastien, please take a look.

Comment 23 arun kumar mohan 2021-06-21 10:57:41 UTC
Providing the doc text. Please take a look.