Bug 2213117

Summary: [Backport to 4.11.z] OCS Provider Server service comes up on public subnets
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Jilju Joy <jijoy>
Component: ocs-operator Assignee: Rewant <resoni>
Status: ON_QA --- QA Contact: Itzhak <ikave>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.11 CC: ebenahar, ikave, nigoyal, odf-bz-bot, resoni, sheggodu
Target Milestone: --- Keywords: AutomationBackLog
Target Release: ODF 4.11.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.11.10-3 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2212773 Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2212773, 2218863, 2218867    
Bug Blocks: 2213114    

Description Jilju Joy 2023-06-07 06:58:27 UTC
+++ This bug was initially created as a clone of Bug #2212773 +++

Description of problem (please be as detailed as possible and provide log
snippets):

When both private link and non-private link clusters are in use, the ocs-provider-server service tries to come up on the non-private subnets of the VPC. This means the endpoint is exposed and can be pinged from outside the subnets.

The AWS ELB that gets created is of type Classic, which doesn't support private link clusters.
So we need to move to a Network Load Balancer and use an internal-facing load balancer so that it's only accessible from within the VPC.

We need to add annotations to the service, because the AWS controller looks at the annotations when it reconciles the service.
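
As a rough illustration of the annotations mentioned above, the sketch below prints a Service metadata fragment that asks the AWS cloud provider for an internal Network Load Balancer. The two annotation keys are the ones the Kubernetes AWS cloud provider recognizes; the service name and namespace come from this report. Whether the operator fix sets exactly these annotations is an assumption, not something this bug confirms.

```shell
#!/usr/bin/env bash
# Sketch only: prints a Service metadata fragment (no spec) showing the
# annotations that request an internal NLB on AWS. The exact set of
# annotations applied by the ocs-operator fix is an assumption.
render_provider_service_metadata() {
  cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ocs-provider-server
  namespace: openshift-storage
  annotations:
    # Request a Network Load Balancer instead of the default Classic ELB.
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # Keep the load balancer internal-facing (reachable only inside the VPC).
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
EOF
}

render_provider_service_metadata
```

Because the operator reconciles this service, annotations added by hand may be overwritten; the fragment is meant only to show what the controller looks for.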

More info: https://docs.google.com/document/d/10J-J8EuDm8Q-ZMtY0A3mtmHOx8Xvhn-i28faxfWZwts/edit?usp=sharing

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:
ocs provider server is deployed on public subnets

Expected results:
ocs provider server should be deployed on private subnets

Additional info:

--- Additional comment from RHEL Program Management on 2023-06-06 10:12:39 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

Comment 2 Nitin Goyal 2023-07-03 10:56:12 UTC
Giving devel ack at Rewant's request.

Comment 9 Itzhak 2023-07-31 12:45:48 UTC
I tested the BZ with the following steps:

1. Deploy an AWS 4.11 cluster without ODF.

2. Disable the default Red Hat operators source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using the commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.11.10 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml

5. Wait for the MachineConfigPool to be ready. 
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c7.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
8. Create the StorageCluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c8.
(If there is an issue with the NooBaa CRDs, we may also need to apply this Yaml file: https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml). The field 'providerAPIServer' is empty in this Yaml file.

9. Check the pods: 
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
noobaa-operator-76464cdd89-k6nst        1/1     Running   0          18m
ocs-metrics-exporter-77c684f4ff-w2mtv   1/1     Running   0          7m48s
ocs-operator-85b8d66b86-wpg49           1/1     Running   0          18m
rook-ceph-operator-764f4cbc78-66bm9     1/1     Running   0          18m


10. Check the service and see that the provider server service type is NodePort, which is the default value:
$ oc get service
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
noobaa-operator-service   ClusterIP   172.30.37.82     <none>        443/TCP           12m
ocs-provider-server       NodePort    172.30.137.131   <none>        50051:31659/TCP   4m41s

11. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again:
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
noobaa-operator-76464cdd89-k6nst        1/1     Running   0          21m
ocs-metrics-exporter-77c684f4ff-w2mtv   1/1     Running   0          10m
ocs-operator-85b8d66b86-wpg49           1/1     Running   0          21m
ocs-provider-server-5d7f659cf-t42ms     1/1     Running   0          12s
rook-ceph-operator-764f4cbc78-66bm9     1/1     Running   0          21m

12. Check the service type again:
$ oc get service
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)           AGE
noobaa-operator-service   ClusterIP      172.30.37.82     <none>                                                                    443/TCP           17m
ocs-provider-server       LoadBalancer   172.30.137.131   a8225434e09614eda83bfe20c964088b-2002374777.us-east-2.elb.amazonaws.com   50051:31659/TCP   9m16s

13. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again:
$ oc get service
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
noobaa-operator-service   ClusterIP   172.30.37.82     <none>        443/TCP           57m
ocs-provider-server       NodePort    172.30.137.131   <none>        50051:31659/TCP   49m

14. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "someValue". Check the ocs-operator logs and see the expected error: 
$ oc logs ocs-operator-85b8d66b86-wpg49 | tail -n 1
{"level":"error","ts":1690802919.280578,"logger":"controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again:
$ oc get service
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)           AGE
noobaa-operator-service   ClusterIP      172.30.37.82     <none>                                                                    443/TCP           131m
ocs-provider-server       LoadBalancer   172.30.137.131   a8225434e09614eda83bfe20c964088b-1866348329.us-east-2.elb.amazonaws.com   50051:31659/TCP   123m
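
The service type switches in steps 11 to 15 could also be done without an interactive edit. The sketch below is one way to do that; it assumes the field sits at spec.providerAPIServerServiceType in the StorageCluster (the field name is from the steps above, its location under spec is an assumption) and uses the resource name and namespace shown in the operator log. The helper function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: non-interactive alternative to editing the StorageCluster by hand.
# Assumes the field lives at spec.providerAPIServerServiceType.

# Accept only the two values named in the operator error message from step 14.
is_supported_service_type() {
  case "$1" in
    NodePort|LoadBalancer) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the merge patch for a given service type, or fail for anything else.
build_service_type_patch() {
  local svc_type="$1"
  if ! is_supported_service_type "$svc_type"; then
    echo "unsupported service type: $svc_type" >&2
    return 1
  fi
  printf '{"spec":{"providerAPIServerServiceType":"%s"}}\n' "$svc_type"
}

# Apply the patch with oc (requires a logged-in cluster session).
set_service_type() {
  local patch
  patch="$(build_service_type_patch "$1")" || return 1
  oc patch storagecluster ocs-storagecluster -n openshift-storage \
    --type=merge -p "$patch"
}
```

For example, `set_service_type LoadBalancer` followed by `oc get service ocs-provider-server -n openshift-storage` would correspond to steps 11 and 12. The client-side check rejects values such as "someValue", so reproducing the operator error from step 14 still requires editing the resource directly.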

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27592/.

Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.11.0-0.nightly-2023-07-29-013834
Kubernetes Version: v1.24.15+a9da4a8

OCS version:
ocs-operator.v4.11.10   OpenShift Container Storage   4.11.10   ocs-operator.v4.11.9   Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2023-07-29-013834   True        False         74m     Cluster version is 4.11.0-0.nightly-2023-07-29-013834

Comment 12 Itzhak 2023-08-16 09:31:34 UTC
We found an issue with the new ocs 4.11.10-1 image. In the case of NodePort Service, the ocs operator pod keeps requiring. 
We need to fix this from the ocs-operator side, backport the fix and test it again.