Bug 1925369 - [External Mode] Fails if Ceph Monitors are not listening to v1 type port 6789 and make sure both v1 and v2 are present [NEEDINFO]
Summary: [External Mode] Fails if Ceph Monitors are not listening to v1 type port 6789...
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: arun kumar mohan
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-05 02:37 UTC by Randy Martinez
Modified: 2023-08-03 08:31 UTC (History)
6 users

Fixed In Version: 4.8.0-416.ci
Doc Type: Bug Fix
Doc Text:
Cause: OCS external mode cluster deployment does not progress if the external RHCS cluster's MON leader is configured with only the `v2` port (port 3300) and the `v1` port (port 6789) is *not* enabled. Consequence: The external cluster connection from OCP does not progress and remains stalled. Fix: As a workaround, the external Python script that generates the JSON output now raises an exception if only the `v2` mon port is enabled, and tells the user to enable the `v1` type port as well. Result: Because the script will not complete until the client/user has enabled the `v1` type port, the user is required to enable the `v1` port for the script to finish successfully.
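The check described in the Doc Text can be sketched as follows. This is a minimal illustration, not the actual script: the function name, error message, and endpoint-string format are assumptions based on Ceph's usual `v1:`/`v2:` address prefixes.

```python
# Hypothetical sketch of the v2-only detection the external script performs:
# inspect a monitor's advertised endpoints and fail fast if a v2 (port 3300)
# address is present but no v1 (port 6789) address is.

def assert_v1_port_enabled(mon_addrs):
    """mon_addrs: list of endpoint strings, e.g.
    ["v2:10.0.0.1:3300", "v1:10.0.0.1:6789"]."""
    has_v1 = any(a.startswith("v1:") for a in mon_addrs)
    has_v2 = any(a.startswith("v2:") for a in mon_addrs)
    if has_v2 and not has_v1:
        raise RuntimeError(
            "Mon is listening only on the v2 port (3300); "
            "please enable the v1 port (6789) as well and rerun the script")
    return True
```

Raising an exception (rather than silently generating JSON) is what forces the user to fix the mon configuration before the deployment can proceed.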
Clone Of:
Environment:
Last Closed:
Embargoed:
shmohan: needinfo? (amohan)


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 249 0 None open Bug 1925369: ceph: mons should enable v1 address type when enabling v2 2021-06-09 16:44:07 UTC

Comment 6 Jose A. Rivera 2021-02-08 15:15:02 UTC
We'll need more information, preferably a full ocs-must-gather from the affected cluster.

With what's available right now, I see the following weirdness in the CephCluster CR:

status:
  ceph:
    details:
      MANY_OBJECTS_PER_PG:
        message: 1 pools have many more objects per pg than average
        severity: HEALTH_WARN
      POOL_TOO_MANY_PGS:
        message: 1 pools have too many placement groups
        severity: HEALTH_WARN
    health: HEALTH_WARN
    lastChecked: "2021-01-29T19:12:34Z"
  conditions:
  - lastHeartbeatTime: "2021-01-27T18:23:17Z"
    lastTransitionTime: "2021-01-27T18:23:17Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2021-01-27T18:23:23Z"
    lastTransitionTime: "2021-01-27T18:23:23Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected

First, we have the Ceph health warnings, which I'm not sure how to interpret but may be a problem. Second, both the Connecting and Connected conditions are True, which I'm not sure is valid. Travis or Seb, can you weigh in?

Comment 7 Jose A. Rivera 2021-02-08 17:26:34 UTC
Sorry, I missed that the must-gather was attached to the case.

Looking into it, the only other thing I was able to glean was a lot of spam of these lines in the rook-ceph-operator logs:

2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | ceph-object-controller: failed to delete object from bucket. RequestError: send request failed
2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->172.30.18.19:80: read: connection reset by peer

So it looks like we may just have connectivity issues to the external Ceph cluster. Still not quite sure how to proceed or troubleshoot.
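One way to narrow down whether this is plain network reachability would be a raw TCP probe of the service endpoint from the log above. A minimal, generic sketch (the host/port values are placeholders, not taken from the cluster):

```python
import socket

def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder endpoint, would be run from inside the cluster):
# tcp_probe("rook-ceph-rgw-....svc.cluster.local", 80)
```

A probe that connects but a client that later gets "connection reset by peer" would point at the RGW side (or something in between) dropping established connections rather than the endpoint being unreachable.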

Comment 9 Sébastien Han 2021-02-10 17:45:27 UTC
Randy, is the external cluster a standard RHCS deployment with ceph-ansible?

What bugs me is this sentence "Specifically, they mentioned that the MONs were _not_ listening on v1(:6789), and only v2(:3300) messenger protocol. Their assumption being that v1 was not needed", so if you could clarify.
Thanks
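To verify what the MONs actually advertise, one could inspect the addrvec lines in `ceph mon dump` output. A small parsing sketch (the sample lines below are illustrative, not from the affected cluster):

```python
import re

# Sketch: given a line from `ceph mon dump` output, e.g.
#   0: [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] mon.a
# report which messenger protocol versions the mon advertises.

def mon_protocols(dump_line):
    return set(re.findall(r"\b(v[12]):", dump_line))

line_both = "0: [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] mon.a"
line_v2_only = "0: [v2:10.0.0.1:3300/0] mon.a"
```

A mon showing only `v2` here would match the reporter's description of v1(:6789) not being enabled.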

Comment 11 Jose A. Rivera 2021-02-15 15:58:06 UTC
I'm still not sure if this is a valid bug or not, but it seems like a misconfiguration issue and not a hard blocker for external mode. Moving this to OCS 4.8.

Comment 14 Sébastien Han 2021-03-16 09:57:54 UTC
Hi Randy, that sounds good. I'm moving this to Rook, Arun will take care of this.
Thanks!

Comment 15 Travis Nielsen 2021-05-11 15:09:35 UTC
Arun can you take a look?

Comment 16 Travis Nielsen 2021-05-17 15:53:37 UTC
Arun any update?

Comment 17 arun kumar mohan 2021-05-21 11:02:11 UTC
Sorry Travis, I haven't been working on this. Will take it up next week...

Comment 18 Travis Nielsen 2021-06-07 16:02:29 UTC
Arun, can we get this in the next day to be in time for dev freeze? It looks small, thanks.

Comment 19 arun kumar mohan 2021-06-08 18:33:18 UTC
PR raised: https://github.com/rook/rook/pull/8083

Travis / Sebastian please take a look.

Comment 23 arun kumar mohan 2021-06-21 10:57:41 UTC
Providing the doc text. Please take a look.

