Bug 2212773

Summary: OCS Provider Server service comes up on public subnets
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Reporter: Rewant <resoni>
Assignee: Rewant <resoni>
QA Contact: Elad <ebenahar>
Status: MODIFIED
Severity: high
Priority: unspecified
Version: 4.13
CC: ikave, jijoy, muagarwa, nigoyal, odf-bz-bot
Keywords: AutomationBackLog
Target Milestone: ---
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.14.0-60
Doc Type: No Doc Update
Type: Bug
Clones: 2213114, 2213117, 2218863, 2218867
Bug Blocks: 2213114, 2213117, 2218863, 2218867

Description Rewant 2023-06-06 10:12:33 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

On both private-link and non-private-link clusters, the ocs-provider-server service tries to come up on the public (non-private) subnets of the VPC. This means the endpoint is exposed and can be reached from outside the subnets.

The AWS ELB created is of type Classic, which doesn't support private-link clusters.
We therefore need to move to a Network Load Balancer and use an internal-facing load balancer so that it is only accessible from within the VPC.

We need to add annotations to the service, since the AWS cloud controller looks at the annotations when reconciling the service.
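
For reference, a minimal sketch of the Service with the standard in-tree AWS cloud-provider annotations for an internal NLB (the exact annotation set chosen for ocs-operator is not spelled out in this bug, so treat this as illustrative):

apiVersion: v1
kind: Service
metadata:
  name: ocs-provider-server
  namespace: openshift-storage
  annotations:
    # Request a Network Load Balancer instead of the default Classic ELB.
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # Make the load balancer internal-facing, reachable only from within the VPC.
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP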

More info: https://docs.google.com/document/d/10J-J8EuDm8Q-ZMtY0A3mtmHOx8Xvhn-i28faxfWZwts/edit?usp=sharing

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:
The ocs provider server is deployed on public subnets.

Expected results:
The ocs provider server should be deployed on private subnets.

Additional info:

Comment 2 Rewant 2023-06-19 05:33:46 UTC
The ocs-operator should not be responsible for adding annotations based on the cloud provider. Instead, the load balancer should be created externally, based on the cloud provider and the type of network (private/public); hence we are adding a new field to the StorageCluster CR to toggle between the service types.
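
For illustration, the toggle on the StorageCluster CR looks roughly like this (field name as verified later in this BZ; NodePort and LoadBalancer are the supported values):

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  # Service type for the ocs-provider-server Service.
  # Defaults to NodePort when the field is not set.
  providerAPIServerServiceType: LoadBalancer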

Comment 3 Nitin Goyal 2023-07-03 10:55:40 UTC
Giving devel ack on Rewant's request.

Comment 6 Rewant 2023-07-04 05:00:29 UTC
The PR solves the issue we were facing with the managed service by adding a new field to the StorageCluster that toggles between the NodePort and LoadBalancer service types, the default being NodePort if the field is not set.

We need to create provider clusters and verify that the provider server comes up with both LoadBalancer and NodePort, depending on the field set on the StorageCluster.



For the backports, we also need to test that when the operator is updated from 4.10 to 4.11, the ocs provider service keeps the same type.
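
A quick way to compare the service type before and after the update (plain oc usage, nothing operator-specific):

$ oc get service ocs-provider-server -n openshift-storage -o jsonpath='{.spec.type}'

The type printed after the update should match the pre-update value.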

Comment 9 Itzhak 2023-07-18 09:40:02 UTC
I tested the BZ with the following steps:

1. Deploy an AWS 4.14 cluster without ODF.

2. Disable the default redhat-operators catalog source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using the following commands (paths are local to my machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config /home/ikave/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:4.14.0-67 --confirm --path /icsp.yaml:/home/ikave/IBMProjects/ocs-ci/iscp
$ oc apply -f ~/IBMProjects/ocs-ci/iscp/icsp.yaml

4. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker

5. Create the Namespace, CatalogSource, and Subscription using the YAML file from comment 7 (https://bugzilla.redhat.com/show_bug.cgi?id=2212773#c7):
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

6. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
7. Create the StorageCluster using the YAML file from comment 8 (https://bugzilla.redhat.com/show_bug.cgi?id=2212773#c8).
(If there is an issue with the NooBaa CRDs, we may also need to apply this YAML file: https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml.) Note the new field
"providerAPIServerServiceType: LoadBalancer".

8. Check that the service now has type LoadBalancer:
$ oc get service
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.155.99    <none>                                                                   8080/TCP,8081/TCP   16m
ocs-provider-server    LoadBalancer   172.30.193.179   a3e18905048ed4a9587ec6f3e0975705-962669330.us-east-2.elb.amazonaws.com   50051:31659/TCP     16m
rook-ceph-mgr          ClusterIP      172.30.183.31    <none>                                                                   9283/TCP            4m38s


9. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort (a non-interactive patch command is sketched after the output below). Check the service type again:
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.155.99    <none>        8080/TCP,8081/TCP   24m
ocs-provider-server    NodePort    172.30.193.179   <none>        50051:31659/TCP     24m
rook-ceph-mgr          ClusterIP   172.30.183.31    <none>        9283/TCP            12m 
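
The same change can be made non-interactively; a sketch, assuming the field lives directly under spec as shown in the StorageCluster YAML:

$ oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge -p '{"spec":{"providerAPIServerServiceType":"NodePort"}}'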

10. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value, "foo". Check the service type again (it stays NodePort):
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.155.99    <none>        8080/TCP,8081/TCP   24m
ocs-provider-server    NodePort    172.30.193.179   <none>        50051:31659/TCP     24m
rook-ceph-mgr          ClusterIP   172.30.183.31    <none>        9283/TCP            12m

11. Check the ocs-operator logs and see the expected validation error:
{"level":"error","ts":"2023-07-17T13:00:08Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"682f4750-9382-479f-8ebc-09a30152411d","error":"providerAPIServer only supports service of type NodePort and LoadBalancer"

12. The last check I performed was to verify that the default value is NodePort. I deployed the StorageCluster YAML file above without the "providerAPIServerServiceType" field and checked the service again:
$ oc get service
NAME                  TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
ocs-provider-server   NodePort   172.30.110.82   <none>        50051:31659/TCP   10s


Additional info:

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-07-17-215017
Kubernetes Version: v1.27.3+4aaeaec

OCS version:
ocs-operator.v4.14.0-67.stable   OpenShift Container Storage   4.14.0-67.stable              Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-07-17-215017   True        False         71m     Cluster version is 4.14.0-0.nightly-2023-07-17-215017

Link to the Jenkins slave: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27017/

Comment 10 Itzhak 2023-07-18 09:42:08 UTC
According to the comment above, I am moving the BZ to Verified.

Comment 12 Itzhak 2023-08-16 09:28:47 UTC
We found an issue with the new ocs 4.11.10-1 image: in the case of a NodePort Service, the ocs-operator pod keeps restarting.
We need to fix this on the ocs-operator side, backport the fix, and test it again.