We'll need more information, preferably a full ocs-must-gather on the affected cluster. With what's available right now, I see the following weirdness in the CephCluster CR:

status:
  ceph:
    details:
      MANY_OBJECTS_PER_PG:
        message: 1 pools have many more objects per pg than average
        severity: HEALTH_WARN
      POOL_TOO_MANY_PGS:
        message: 1 pools have too many placement groups
        severity: HEALTH_WARN
    health: HEALTH_WARN
    lastChecked: "2021-01-29T19:12:34Z"
  conditions:
  - lastHeartbeatTime: "2021-01-27T18:23:17Z"
    lastTransitionTime: "2021-01-27T18:23:17Z"
    message: Cluster is connecting
    reason: ClusterConnecting
    status: "True"
    type: Connecting
  - lastHeartbeatTime: "2021-01-27T18:23:23Z"
    lastTransitionTime: "2021-01-27T18:23:23Z"
    message: Cluster connected successfully
    reason: ClusterConnected
    status: "True"
    type: Connected

First, we have the Ceph health warnings, which I'm not sure how to interpret but may be a problem. Second, both the Connecting and Connected conditions are True, which I'm not sure is valid. Travis or Seb, can you weigh in?
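For anyone who wants to spot the overlapping conditions mechanically, here is a minimal sketch. It assumes the CR status has already been loaded into a Python dict (how it gets loaded is out of scope here). The helper name `active_conditions` is made up for illustration, and the dict only mirrors the two conditions quoted above.

```python
# Sketch: list CephCluster conditions whose status is the string "True".
# The dict below mirrors the status.conditions excerpt from the CR quoted in
# this comment; loading it from a real cluster is assumed, not shown.

def active_conditions(status):
    """Return the types of all conditions whose status is "True"."""
    return [
        cond["type"]
        for cond in status.get("conditions", [])
        if cond.get("status") == "True"
    ]


status = {
    "conditions": [
        {"type": "Connecting", "reason": "ClusterConnecting", "status": "True"},
        {"type": "Connected", "reason": "ClusterConnected", "status": "True"},
    ]
}

active = active_conditions(status)
# Flag the case discussed here: both conditions reported as True at once.
if {"Connecting", "Connected"} <= set(active):
    print("both Connecting and Connected are True:", active)
```

Whether having both conditions True is actually invalid is still the open question; the sketch only makes the situation easy to detect.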
Sorry, I missed that the must-gather was attached to the case. Looking into it, the only other thing I was able to glean was a lot of spam in the rook-ceph-operator logs of these lines:

2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | ceph-object-controller: failed to delete object from bucket. RequestError: send request failed
2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->172.30.18.19:80: read: connection reset by peer

So it looks like we may just have connectivity issues to the external Ceph cluster. Still not quite sure how to proceed or troubleshoot.
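To put a number on the log spam, a rough sketch like the following could count error lines per component in the operator log. The line format is inferred only from the two log lines quoted in this comment, and treating the `E` marker as the error level is an assumption. `count_errors` is a hypothetical helper, not part of any Rook tooling.

```python
import re
from collections import Counter

# Format inferred from the quoted lines (an assumption):
# <container timestamp> <date> <time> <level> | <component>: <message>
LINE_RE = re.compile(
    r"^\S+ \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (?P<level>[A-Z]) \| "
    r"(?P<component>[\w-]+): (?P<message>.*)$"
)


def count_errors(lines):
    """Count lines at level "E" per component; non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "E":
            counts[match.group("component")] += 1
    return counts


# The two lines quoted in this comment, used as sample input.
sample = [
    "2021-02-03T16:52:49.759915754Z 2021-02-03 16:52:49.759851 E | "
    "ceph-object-controller: failed to delete object from bucket. "
    "RequestError: send request failed",
    '2021-02-03T16:52:49.759915754Z caused by: Delete "http://rook-ceph-rgw-'
    "ocs-external-storagecluster-cephobjectstore.openshift-storage.svc."
    "cluster.local:80/rook-ceph-bucket-checker-8e04981c-ca71-4ec0-ac38-"
    'ac183454e508/rookHealthCheckTestObject": read tcp 10.129.2.21:46518->'
    "172.30.18.19:80: read: connection reset by peer",
]

print(count_errors(sample))
```

The "caused by" continuation line does not match the pattern, so each failure is counted once.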
Randy, is the external cluster a standard RHCS deployment with ceph-ansible? What bugs me is this sentence: "Specifically, they mentioned that the MONs were _not_ listening on v1(:6789), and only v2(:3300) messenger protocol. Their assumption being that v1 was not needed". Could you clarify? Thanks
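One way to confirm which messenger ports a MON accepts connections on is a plain TCP probe from a host that can reach the external cluster. This is only a sketch: the port numbers 6789 (v1) and 3300 (v2) come from the sentence quoted above, the helper names are made up, and the MON host has to be supplied by whoever runs it. A successful connect shows only that something is listening, not that it speaks the expected protocol.

```python
import socket
import sys

# Messenger ports as quoted in this comment: v1 on 6789, v2 on 3300.
MON_PORTS = {"v1": 6789, "v2": 3300}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_mon_messengers(host, timeout=3.0):
    """Probe both messenger ports on one MON host."""
    return {
        name: port_open(host, port, timeout)
        for name, port in MON_PORTS.items()
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_mon.py <mon-host>   (script name is hypothetical)
    print(check_mon_messengers(sys.argv[1]))
```

A result of `{"v1": False, "v2": True}` would match what the customer described, with the MONs reachable on v2 only.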
I'm still not sure if this is a valid bug or not, but it seems like a misconfiguration issue and not a hard blocker for external mode. Moving this to OCS 4.8.
Hi Randy, that sounds good. I'm moving this to Rook, Arun will take care of this. Thanks!
Arun, can you take a look?
Arun, any update?
Sorry Travis, not working on this. Will take this up next week...
Arun, can we get this in within the next day, in time for dev freeze? It looks small, thanks.
PR raised: https://github.com/rook/rook/pull/8083. Travis / Sebastian, please take a look.
Providing the doc text. Please take a look.