Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
As part of the Independent Mode install in the OCP UI, one gets the option to download ceph-external-cluster-details-exporter.py and run it on the external RHCS cluster to collect the config details.

Command:
python3 ceph-external-cluster-details-exporter.py --rgw-endpoint <RGW endpoint:8080> --rbd-data-pool-name <pool name>

Observation:
1. Even if one provides an invalid RGW endpoint IP, the JSON is generated, and on uploading it in the UI, the StorageCluster gets created.
2. If one provides an incorrect pool name, the script throws an error (as expected):
>> Excecution Failed: The provided 'rbd-data-pool-name': cbp, don't exists

Issue:
1. Unlike the handling of an incorrect pool name, the script does not throw an error when an incorrect RGW endpoint is provided. This should be handled as well.

AFAIK, the information for the RGW endpoint can also be validated from the ceph cluster, similar to the block pool name.

Version of all relevant components (if applicable):
----------------------------------------------------------------------
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-17-001505   True        False         7d6h    Cluster version is 4.5.0-0.nightly-2020-06-17-001505

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES              PHASE
awss3operator.1.0.1             AWS S3 Operator               1.0.1          awss3operator.1.0.0   Succeeded
lib-bucket-provisioner.v1.0.0   lib-bucket-provisioner        1.0.0                                Succeeded
ocs-operator.v4.5.0-460.ci      OpenShift Container Storage   4.5.0-460.ci                         Installing

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
The RGW SC has an incorrect IP address.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
Not known

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Can this issue reproducible?
----------------------------------------------------------------------
Yes

Can this issue reproduce from the UI?
----------------------------------------------------------------------
Yes

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
No. Independent Mode is a new feature.

Steps to Reproduce:
----------------------------------------------------------------------
1. Install OCP 4.5.
2. Using deploy-with-olm.yaml, create the ocs-catalogsource with the latest OCS 4.5 build.
3. Install the Subscription from OperatorHub -> RHOCS Operator -> Install Subscription.
4. In the openshift-storage namespace, navigate to Installed Operators -> OCS Operator -> StorageCluster and click on Create OCS Cluster Service.
5. Select Independent Mode.
6. Download the script from the hyperlink shown in the UI:
     Connect to external cluster
     Download ceph-external-cluster-details-exporter.py script and run on the RHCS cluster, then upload the results(JSON) in the External cluster metadata field.
     Download Script
7. On an external RHCS cluster, execute the script, but provide an incorrect RGW endpoint IP address, e.g. 1.2.3.4:
   python3 ceph-external-cluster-details-exporter.py --rgw-endpoint 1.2.3.4:8080 --rbd-data-pool-name cbp-bm17
8. The JSON is generated. Upload the incorrect JSON in the UI and click on Create.
The StorageCluster gets created with an incorrect RGW endpoint in the RGW StorageClass.

Actual results:
----------------------------------------------------------------------
The StorageCluster gets created and the RGW SC is created with an incorrect RGW endpoint IP.

Expected results:
----------------------------------------------------------------------
Some validation should be in place to fail the execution of the script if an incorrect RGW endpoint IP address is provided.

Additional info:
----------------------------------------------------------------------
$ oc get sc ocs-independent-storagecluster-ceph-rgw -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2020-06-24T18:06:16Z"
  managedFields:
  - apiVersion: storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:parameters:
        .: {}
        f:endpoint: {}
        f:objectStoreNamespace: {}
        f:region: {}
      f:provisioner: {}
      f:reclaimPolicy: {}
      f:volumeBindingMode: {}
    manager: ocs-operator
    operation: Update
    time: "2020-06-24T18:06:16Z"
  name: ocs-independent-storagecluster-ceph-rgw
  resourceVersion: "7072876"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/ocs-independent-storagecluster-ceph-rgw
  uid: 387fe9c2-8dd7-491d-82d7-e465469c0dc3
parameters:
  endpoint: 1.2.3.4:8080
  objectStoreNamespace: openshift-storage
  region: us-east-1
provisioner: openshift-storage.ceph.rook.io/bucket
reclaimPolicy: Delete
volumeBindingMode: Immediate

_____________________________________________

Wed Jun 24 18:17:01 UTC 2020
  cluster:
    id:     fe01cf06-8c2b-4e5b-9fea-8a6a8e402b88
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dell-r730-031,dell-r730-037,dell-r730-044 (age 2w)
    mgr: dell-r730-037(active, since 2d)
    mds: cephfs:1 {0=dell-r730-037=up:active} 1 up:standby
    osd: 9 osds: 9 up (since 2w), 9 in (since 2w)
    rgw: 1 daemon active (dell-r730-031.rgw0)   <<<----- RGW endpoint's hostname

  task status:
    scrub status:
        mds.dell-r730-037: idle

  data:
    pools:   12 pools, 488 pgs
    objects: 146.81k objects, 533 GiB
    usage:   1.6 TiB used, 3.3 TiB / 4.9 TiB avail
    pgs:     488 active+clean

  io:
    client:   36 KiB/s rd, 4.4 MiB/s wr, 9 op/s rd, 153 op/s wr
This script is not in ocs-operator. Or in rook. Not sure where it is... :-o
Seb, I think you were involved in creating this script. Can you provide clarity?
Neha, we cannot use the DNS name that might be reported from the service map: it is unreliable and there is no guarantee that the containers can resolve it, hence using an IP. As for the validation, we can implement a validation beforehand or the UI can do it. As I said earlier, the more validation we do, the more bugs we might introduce. Arun, please have a look at implementing an HTTP check against the given endpoint; for now, assume HTTP only. Moving to 4.6.
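For illustration only, the HTTP check requested above could look roughly like the sketch below. This is not the code from the actual fix. The function names split_endpoint and check_rgw_endpoint are hypothetical, the endpoint is assumed to be given as <host>:<port>, and only plain HTTP is tried, as suggested in the comment. Any HTTP response, even an error status, is treated as proof that something is listening; the sketch does not verify that the server is really an RGW.

```python
import http.client
import urllib.error
import urllib.request


def split_endpoint(endpoint):
    """Split '<host>:<port>' into (host, port); raise ValueError if malformed."""
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError("endpoint '{}' is not in <host>:<port> form".format(endpoint))
    try:
        port_num = int(port)
    except ValueError:
        raise ValueError("endpoint '{}' has a non-numeric port".format(endpoint)) from None
    if not 0 < port_num < 65536:
        raise ValueError("endpoint '{}' has an out-of-range port".format(endpoint))
    return host, port_num


def check_rgw_endpoint(endpoint, timeout=5):
    """Raise RuntimeError unless an HTTP server answers at the endpoint.

    Hypothetical helper: plain HTTP only, no authentication, no RGW-specific probe.
    """
    host, port = split_endpoint(endpoint)
    url = "http://{}:{}".format(host, port)
    # An empty ProxyHandler makes the request go straight to the endpoint,
    # ignoring any *_proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            pass
    except urllib.error.HTTPError:
        # An HTTP error status still proves that an HTTP server answered.
        pass
    except (OSError, http.client.HTTPException) as err:
        # Covers refused connections, timeouts, DNS failures and non-HTTP replies.
        raise RuntimeError(
            "RGW endpoint '{}' is not reachable over HTTP: {}".format(endpoint, err)
        ) from err
```

With a helper like this, a call such as check_rgw_endpoint("1.2.3.4:8080") would be expected to fail with a RuntimeError when nothing answers at that address, so the script could abort before writing the JSON, in the same way it already aborts for an unknown pool name.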
Neha, I have a PR for this. If you make it a blocker we can proceed and include it in 4.5, otherwise it will be in 4.6.
(In reply to leseb from comment #7)
> Neha, I have a PR for this. If you make it a blocker we can proceed and
> include it in 4.5, otherwise it will be in 4.6.

+1. Thanks a lot, Sebastien. Proposing as a blocker for OCS 4.5, as this fix will help a lot in mitigating issues related to an incorrect RGW endpoint. See comment#7 as well.
4.5.0-508.ci contains the fix
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754
As the bug is fixed and doesn't require any other info...