Description of problem (please be as detailed as possible and provide log snippets):
While using both private link and non-private-link clusters, the ocs-provider-server service tries to come up on the non-private subnets of the VPC. This means the endpoint is exposed and can be reached from outside the subnets. The AWS ELB that gets created is of type Classic, which doesn't support private link clusters. So we need to move to a Network Load Balancer and use an internal-facing load balancer so that it is only accessible from within the VPC. We need to add annotations to the service, as the AWS controller looks at the annotations to reconcile the service.
More info: https://docs.google.com/document/d/10J-J8EuDm8Q-ZMtY0A3mtmHOx8Xvhn-i28faxfWZwts/edit?usp=sharing

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
The ocs-provider-server service is deployed on public subnets.

Expected results:
The ocs-provider-server service should be deployed on private subnets.

Additional info:
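For illustration only, this is a minimal sketch of the kind of Service annotations being referred to here. The annotation keys are the standard in-tree AWS cloud-provider ones for requesting an internal NLB; the port comes from the ocs-provider-server output shown later in this bug, and this is not the exact change that was eventually implemented (see the next comment):

apiVersion: v1
kind: Service
metadata:
  name: ocs-provider-server
  namespace: openshift-storage
  annotations:
    # Ask the AWS cloud provider for a Network Load Balancer instead of a Classic ELB
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # Make the load balancer internal so it is only reachable from within the VPC
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP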
The ocs-operator should not be responsible for adding annotations based on the cloud provider. Instead, the load balancer should be created externally based on the cloud provider and the type of network (private/public), hence a new field is added to the StorageCluster CR to toggle the service type.
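As an illustration of that approach, an abbreviated StorageCluster sketch (the field name is the one used in the verification steps below; the rest of the spec is omitted):

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  # New field: set to LoadBalancer to let the cloud provider expose the
  # provider API server, or set/leave NodePort (the default).
  providerAPIServerServiceType: LoadBalancer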
Giving devel ack on Rewant's request.
The PR solves the issue we were facing with managed service by adding a new field in the StorageCluster to toggle between the NodePort and LoadBalancer service types, the default being NodePort if the field is not set. We need to create provider clusters and verify that the provider server comes up with both LoadBalancer and NodePort, depending on the field set on the StorageCluster. For the backports, we also need to test that when the operator is updated from 4.10 to 4.11, the ocs provider service remains of the same type.
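One quick way to check the resulting service type before and after toggling the field or upgrading the operator (a generic oc query, not a prescribed test procedure):

$ oc -n openshift-storage get service ocs-provider-server -o jsonpath='{.spec.type}{"\n"}'

The expected output is either NodePort (the default) or LoadBalancer, matching the value of providerAPIServerServiceType in the StorageCluster.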
I tested the BZ with the following steps:

1. Deploy an AWS 4.14 cluster without ODF.

2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using the following commands (run on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config /home/ikave/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:4.14.0-67 --confirm --path /icsp.yaml:/home/ikave/IBMProjects/ocs-ci/iscp
$ oc apply -f ~/IBMProjects/ocs-ci/iscp/icsp.yaml

5. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2212773#c7.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.

8. Create the Storagecluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2212773#c8. (If there is an issue with Noobaa CRDs, we may also need to apply this Yaml file: https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml). You can see that there is a new field "providerAPIServerServiceType: LoadBalancer".

9. Check that we can see the correct service type of LoadBalancer:
$ oc get service
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.155.99    <none>                                                                    8080/TCP,8081/TCP   16m
ocs-provider-server    LoadBalancer   172.30.193.179   a3e18905048ed4a9587ec6f3e0975705-962669330.us-east-2.elb.amazonaws.com    50051:31659/TCP     16m
rook-ceph-mgr          ClusterIP      172.30.183.31    <none>                                                                    9283/TCP            4m38s

10. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service type again:
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.155.99    <none>        8080/TCP,8081/TCP   24m
ocs-provider-server    NodePort    172.30.193.179   <none>        50051:31659/TCP     24m
rook-ceph-mgr          ClusterIP   172.30.183.31    <none>        9283/TCP            12m

11. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "foo". Check the service type again:
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.155.99    <none>        8080/TCP,8081/TCP   24m
ocs-provider-server    NodePort    172.30.193.179   <none>        50051:31659/TCP     24m
rook-ceph-mgr          ClusterIP   172.30.183.31    <none>        9283/TCP            12m

12. Check the ocs-operator logs and see the expected error:
{"level":"error","ts":"2023-07-17T13:00:08Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"682f4750-9382-479f-8ebc-09a30152411d","error":"providerAPIServer only supports service of type NodePort and LoadBalancer"

13. The last check I performed was to verify that the default value is NodePort.
I deployed the Storagecluster Yaml file above without the field "providerAPIServerServiceType" and checked the service again:
$ oc get service
NAME                  TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
ocs-provider-server   NodePort   172.30.110.82   <none>        50051:31659/TCP   10s

Additional info:

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-07-17-215017
Kubernetes Version: v1.27.3+4aaeaec

OCS version:
ocs-operator.v4.14.0-67.stable   OpenShift Container Storage   4.14.0-67.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-07-17-215017   True        False         71m     Cluster version is 4.14.0-0.nightly-2023-07-17-215017

Link to the Jenkins slave: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27017/
According to the comment above, I am moving the BZ to Verified.
We found an issue with the new ocs 4.11.10-1 image. In the case of a NodePort service, the ocs-operator pod keeps restarting. We need to fix this from the ocs-operator side, backport the fix, and test it again.
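A quick, generic way to watch for this behavior while the service type is NodePort (plain oc commands, not a prescribed test step) is to keep an eye on the operator pod's RESTARTS column and its logs:

$ oc -n openshift-storage get pods | grep ocs-operator
$ oc -n openshift-storage logs deploy/ocs-operator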
I tested the BZ with the following steps:

1. Deploy an AWS 4.14 cluster without ODF.

2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using the following commands (run on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.14.0 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml

5. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2212773#c7 (I have just changed the ocs image to latest-stable-4.14.0).
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
ocs-metrics-exporter-8599f5b6bf-qnk2r   1/1     Running   0          70s
ocs-operator-76fcdcfcff-bwv29           1/1     Running   0          70s
rook-ceph-operator-5b4dc48dbb-f82rb     1/1     Running   0          70s

8. Create the Storagecluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2212773#c8.

9. Apply the Yaml files https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml and https://raw.githubusercontent.com/noobaa/noobaa-operator/5.12/deploy/obc/objectbucket.io_objectbucketclaims_crd.yaml (otherwise errors are raised in the ocs-operator logs and rook-ceph-operator logs).

10. Modify the AWS security groups according to the ODF to ODF deployment doc.

11. Check that all the pods are up and running.

12. Check that the storagecluster is Ready and the Cephcluster health is OK:
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   118m   Ready              2023-08-21T09:26:45Z   4.14.0
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   121m   Ready              2023-08-21T09:26:45Z   4.14.0

13. Check the ocs-operator logs and the rook-ceph-operator logs and see that there are no errors.
$ oc logs ocs-operator-76fcdcfcff-bwv29 | tail -n 2
{"level":"info","ts":"2023-08-21T09:34:30Z","logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"name":"ocs-metrics-exporter","namespace":"openshift-storage"}}
{"level":"info","ts":"2023-08-21T09:34:30Z","logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service monitor","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"name":"ocs-metrics-exporter","namespace":"openshift-storage"}}

$ oc logs rook-ceph-operator-5b4dc48dbb-5p9bm | tail -n 7
2023-08-21 11:12:47.690184 I | cephclient: setting pool property "pg_num" to "512" on pool "ocs-storagecluster-cephfilesystem-ssd"
2023-08-21 11:12:48.712565 I | cephclient: setting pool property "pgp_num" to "512" on pool "ocs-storagecluster-cephfilesystem-ssd"
2023-08-21 11:12:49.736640 I | cephclient: setting pool property "target_size_ratio" to "0.49" on pool "ocs-storagecluster-cephfilesystem-ssd"
2023-08-21 11:12:50.758865 I | cephclient: reconciling replicated pool ocs-storagecluster-cephfilesystem-ssd succeeded
2023-08-21 11:12:51.379599 I | cephclient: creating filesystem "ocs-storagecluster-cephfilesystem" with metadata pool "ocs-storagecluster-cephfilesystem-metadata" and data pools [ocs-storagecluster-cephfilesystem-ssd]
2023-08-21 11:12:53.065330 I | ceph-file-controller: created filesystem "ocs-storagecluster-cephfilesystem" on 1 data pool(s) and metadata pool "ocs-storagecluster-cephfilesystem-metadata"
2023-08-21 11:12:53.065348 I | cephclient: setting allow_standby_replay for filesystem "ocs-storagecluster-cephfilesystem"

14. Check the service and see that the provider server is NodePort, as this is the default value:
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.154.212   <none>        8080/TCP,8081/TCP   18s
ocs-provider-server    NodePort    172.30.40.20     <none>        50051:31659/TCP     29s

15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again and verify they are running. Check again that the storagecluster is Ready and the Cephcluster health is OK.

16. Check the service type again:
$ oc get service
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                                PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.154.212   <none>                                                                     8080/TCP,8081/TCP   111m
ocs-provider-server    LoadBalancer   172.30.40.20     a9c305336ab2b4e89aea9d30d58562d8-2092462897.us-east-2.elb.amazonaws.com    50051:31659/TCP     111m
rook-ceph-exporter     ClusterIP      172.30.41.217    <none>                                                                     9926/TCP            65m
rook-ceph-mgr          ClusterIP      172.30.194.243   <none>                                                                     9283/TCP            83m

17. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again (also check again that the storagecluster is Ready and the Cephcluster health is OK):
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.154.212   <none>        8080/TCP,8081/TCP   114m
ocs-provider-server    NodePort    172.30.40.20     <none>        50051:31659/TCP     114m
rook-ceph-exporter     ClusterIP   172.30.41.217    <none>        9926/TCP            67m
rook-ceph-mgr          ClusterIP   172.30.194.243   <none>        9283/TCP            86m

18. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "foo".
Check the ocs-operator logs and see the expected error:
$ oc logs ocs-operator-76fcdcfcff-bwv29 | tail -n 2
{"level":"error","ts":"2023-08-21T11:33:27Z","logger":"controllers.StorageCluster","msg":"Failed to create/update service, Requested ServiceType is","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","ServiceType":"foo","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"github.com/red-hat-storage/ocs-operator/v4/controllers/storagecluster.(*ocsProviderServer).createService\n\t/remote-source/app/controllers/storagecluster/provider_server.go:152\ngithub.com/red-hat-storage/ocs-operator/v4/controllers/storagecluster.(*ocsProviderServer).ensureCreated\n\t/remote-source/app/controllers/storagecluster/provider_server.go:54\ngithub.com/red-hat-storage/ocs-operator/v4/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:427\ngithub.com/red-hat-storage/ocs-operator/v4/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
{"level":"error","ts":"2023-08-21T11:33:27Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"4eb95453-5396-4eb2-9de2-2d09a1fafc89","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}

Also, I checked the StorageCluster and saw the Error status:
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   125m   Error              2023-08-21T09:26:45Z   4.14.0

19. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer.
Check the service again (also check again that the storagecluster is Ready, the Cephcluster health is OK, and the ocs-operator logs look fine):
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.154.212   <none>        8080/TCP,8081/TCP   119m
ocs-provider-server    NodePort    172.30.40.20     <none>        50051:31659/TCP     120m
rook-ceph-exporter     ClusterIP   172.30.41.217    <none>        9926/TCP            73m
rook-ceph-mgr          ClusterIP   172.30.194.243   <none>        9283/TCP            92m

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/28346/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-08-11-055332
Kubernetes Version: v1.27.4+deb2c60

OCS version:
ocs-operator.v4.14.0-111.stable   OpenShift Container Storage   4.14.0-111.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-08-11-055332   True        False         3h16m   Cluster version is 4.14.0-0.nightly-2023-08-11-055332

Rook version:
rook: v4.14.0-0.d91e8e4302ce15be8846b0017d9b55167f45cf6d
go: go1.20.5

Ceph version:
ceph version 17.2.6-107.el9cp (4079b48a400e4d23864de0da6d093e200038d7fb) quincy (stable)
There is one mistake in the comment above, in step 19. The output should be:
$ oc get service
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.154.212   <none>                                                                    8080/TCP,8081/TCP   121m
ocs-provider-server    LoadBalancer   172.30.40.20     a9c305336ab2b4e89aea9d30d58562d8-856144711.us-east-2.elb.amazonaws.com    50051:31659/TCP     121m
rook-ceph-exporter     ClusterIP      172.30.41.217    <none>                                                                    9926/TCP            74m
rook-ceph-mgr          ClusterIP      172.30.194.243   <none>                                                                    9283/TCP            93m
According to the last three comments above, I am moving the BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832