2213114 – [Backport to 4.12.z]OCS Provider Server service comes up on public subnets

Summary: [Backport to 4.12.z]OCS Provider Server service comes up on public subnets

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ocs-operator
Sub Component:
Version:	4.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	ODF 4.12.6
Assignee:	Rewant
QA Contact:	Itzhak
Docs Contact:
URL:
Whiteboard:
Depends On:	2212773 2213117 2218863 2218867
Blocks:

Reported:	2023-06-07 06:57 UTC by Jilju Joy
Modified:	2023-08-23 14:12 UTC
CC List:	6 users
Fixed In Version:	4.12.6-3
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2212773
Environment:
Last Closed:	2023-08-23 14:12:45 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage ocs-operator pull 2103	None	open	Bug 2213114:[release-4.12] Toggle between OCS Provider Server Service Type	2023-07-28 05:30:25 UTC
Github	red-hat-storage ocs-operator pull 2139	None	open	Bug 2213114:[release-4.12] fix provAPIServer when NodePort svc is stuck waiting for lb endpoint	2023-08-16 13:53:35 UTC
Red Hat Product Errata	RHBA-2023:4718	None	None	None	2023-08-23 14:12:54 UTC

Description Jilju Joy 2023-06-07 06:57:34 UTC

+++ This bug was initially created as a clone of Bug #2212773 +++

Description of problem (please be as detailed as possible and provide log
snippets):

When both private link and non private link clusters are in use, the ocs-provider-server service tries to come up on the non private (public) subnets of the VPC. The endpoint is therefore exposed and can be pinged from outside the subnets.

The AWS ELB that gets created is of type Classic, which doesn't support private link clusters.
So we need to move to a Network Load Balancer and use an internal-facing load balancer so that it is accessible only from within the VPC.

We need to add annotations to the service because the AWS controller reads the annotations when it reconciles the service.
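
As a rough sketch of what such annotations could look like: the Kubernetes AWS cloud provider recognizes `service.beta.kubernetes.io/aws-load-balancer-type` and `service.beta.kubernetes.io/aws-load-balancer-internal` on a Service of type LoadBalancer. Whether the operator sets exactly these two annotations is an assumption, and the function name and selector label below are hypothetical. The manifest is printed for inspection only, because the operator owns and reconciles the real service.

```shell
#!/bin/sh
# Print an illustrative Service manifest to stdout.
# The annotation keys are real AWS cloud-provider annotations; the
# selector label is an assumption made for this sketch.
provider_service_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ocs-provider-server
  namespace: openshift-storage
  annotations:
    # Request a Network Load Balancer instead of a Classic ELB.
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # Keep the load balancer internal to the VPC.
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: ocsProviderApiServer   # assumed label
  ports:
    - port: 50051
      protocol: TCP
EOF
}

provider_service_manifest
```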

More info: https://docs.google.com/document/d/10J-J8EuDm8Q-ZMtY0A3mtmHOx8Xvhn-i28faxfWZwts/edit?usp=sharing

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:
The ocs provider server is deployed on public subnets.

Expected results:
The ocs provider server should be deployed on private subnets.

Additional info:

--- Additional comment from RHEL Program Management on 2023-06-06 10:12:39 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

Comment 2 Nitin Goyal 2023-07-03 10:56:02 UTC

Giving devel ack at Rewant's request.

Comment 11 Itzhak 2023-08-01 11:20:07 UTC

I tested the BZ with the following steps:

1. Deploy an AWS 4.12 cluster without ODF.

2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using these commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.12 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml

5. Wait for the MachineConfigPool to be ready. 
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213114#c9.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
8. Create the Storagecluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213114#c10.
(If there is an issue with Noobaa CRDs, we may also need to apply this Yaml file https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml). The field 'providerAPIServer' is empty in this Yaml file. 

9. Check the pods: 
$ oc get pods
NAME                                    READY   STATUS    RESTARTS        AGE
ocs-metrics-exporter-7567744868-zs9tm   1/1     Running   0               20m
ocs-operator-7b866f884c-zftfx           1/1     Running   6 (4m58s ago)   20m
rook-ceph-operator-7f74bd847c-2mptm     1/1     Running   6 (4m37s ago)   20m

10. Check the service and see that the provider server is NodePort as this is the default value:
$ oc get service
NAME                  TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
ocs-provider-server   NodePort   172.30.45.3   <none>        50051:31659/TCP   32s

11. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again:
$ oc get pods
NAME                                    READY   STATUS             RESTARTS        AGE
ocs-metrics-exporter-7567744868-zs9tm   1/1     Running            0               38m
ocs-operator-7b866f884c-8j9b6           1/1     Running            4 (3m54s ago)   12m
ocs-provider-server-55cc5d648d-w7g7t    1/1     Running            0               2m31s
rook-ceph-operator-7f74bd847c-2mptm     0/1     CrashLoopBackOff   8 (3m25s ago)   38m

12. Check the service type again:
$ oc get service
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.39.247   <none>                                                                   8080/TCP,8081/TCP   24s
ocs-provider-server    LoadBalancer   172.30.45.3     afb0aa7144fa2419bb35319aeb119af9-558404978.us-east-2.elb.amazonaws.com   50051:31659/TCP     10m

13. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again:
$ oc get service
NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.39.247   <none>        8080/TCP,8081/TCP   2m4s
ocs-provider-server    NodePort    172.30.45.3     <none>        50051:31659/TCP     12m

14. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "bar". Check the ocs-operator logs and see the expected error: 
$ oc logs ocs-operator-7b866f884c-8j9b6 | tail -n 1
{"level":"error","ts":1690887618.9425237,"msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","storageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"778a3caf-9fef-4737-88cb-58ee8330d9fb","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}


15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again:
$ oc get service
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.39.247   <none>                                                                   8080/TCP,8081/TCP   29m
ocs-provider-server    LoadBalancer   172.30.45.3     afb0aa7144fa2419bb35319aeb119af9-181750681.us-east-2.elb.amazonaws.com   50051:31659/TCP     39m
rook-ceph-mgr          ClusterIP      172.30.184.49   <none>                                                                   9283/TCP            17m
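
The manual edits in the steps above could also be scripted. This is a minimal sketch, assuming the field sits directly under `spec` as `providerAPIServerServiceType`; the resource name and namespace are taken from the operator logs in this report, and the helper names are hypothetical.

```shell
#!/bin/sh
# Build a JSON merge patch for the requested service type.
# Assumption: the field path is spec.providerAPIServerServiceType.
build_service_type_patch() {
  printf '{"spec":{"providerAPIServerServiceType":"%s"}}' "$1"
}

# Apply the patch to the StorageCluster (needs a logged-in oc session).
set_service_type() {
  oc patch storageclusters.ocs.openshift.io ocs-storagecluster \
    -n openshift-storage --type=merge \
    -p "$(build_service_type_patch "$1")"
}

# Read back the type of the provider server service.
get_service_type() {
  oc get service ocs-provider-server -n openshift-storage \
    -o jsonpath='{.spec.type}'
}

# Only print the patch here; the oc helpers are left for a live cluster.
build_service_type_patch LoadBalancer
echo
```

Against a live cluster, `set_service_type LoadBalancer` followed by `get_service_type` would correspond to steps 11 and 12.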


Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27647/.

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.12.0-0.nightly-2023-07-31-091252
Kubernetes Version: v1.25.11+1485cc9

OCS version:
ocs-operator.v4.12.6-rhodf   OpenShift Container Storage   4.12.6-rhodf   ocs-operator.v4.12.5-rhodf   Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-07-31-091252   True        False         81m     Cluster version is 4.12.0-0.nightly-2023-07-31-091252

Rook version:
rook: v4.12.6-0.bc1e9806c3281090b58872e303e947ff5437c078
go: go1.18.10

Comment 14 Itzhak 2023-08-16 09:30:58 UTC

We found an issue with the new ocs 4.11.10-1 image. In the case of a NodePort Service, the ocs-operator pod keeps requeuing.
We need to fix this on the ocs-operator side, backport the fix, and test it again.

Comment 16 Itzhak 2023-08-22 09:44:55 UTC

I tested the BZ with the following steps:

1. Deploy an AWS 4.12 cluster without ODF.

2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using these commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.12 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml

5. Wait for the MachineConfigPool to be ready. 
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213114#c9.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
ocs-metrics-exporter-5b4b5d9f4b-t66lw   1/1     Running   0          36s
ocs-operator-858f54566c-hv9pj           1/1     Running   0          36s
rook-ceph-operator-5b8b94cb94-zsh5x     1/1     Running   0          36s

8. Modify the AWS security groups according to the ODF to ODF deployment doc.

9. Create the Storagecluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213114#c10.
10. Apply the Yaml files https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml, https://raw.githubusercontent.com/noobaa/noobaa-operator/5.12/deploy/obc/objectbucket.io_objectbucketclaims_crd.yaml, and https://bugzilla.redhat.com/show_bug.cgi?id=2213114#c15.

11. Check that all the pods are up and running. 
12. Check that the storagecluster is Ready and Cephcluster health is OK:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   13m   Ready              2023-08-22T08:38:18Z   4.12.0
$ oc get cephclusters.ceph.rook.io
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          12m   Ready   Cluster created successfully   HEALTH_OK   


13. Check the ocs-operator logs and the rook-ceph-operator logs and see that there are no errors (you may need to restart the rook-ceph-operator pod if you see errors in the rook-ceph-operator logs):
$ oc logs ocs-operator-85c94d545f-fb5s2  | tail -n 2
{"level":"info","ts":1692693602.5201685,"logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
{"level":"info","ts":1692693602.5277436,"logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service monitor","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
$ oc logs rook-ceph-operator-7755f57f79-7ngz7  | tail -n 5
2023-08-22 08:55:33.255091 I | op-osd: assigning osd 7 topology affinity to "topology.kubernetes.io/zone=us-east-2a"
2023-08-22 08:55:34.065612 I | cephclient: successfully disallowed pre-pacific osds and enabled all new pacific-only functionality
2023-08-22 08:55:34.065635 I | op-osd: finished running OSDs in namespace "openshift-storage"
2023-08-22 08:55:34.065641 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2023-08-22 08:55:34.095609 I | ceph-cluster-controller: reporting cluster telemetry


14. Check the service and see that the provider server is NodePort as this is the default value:
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.77.14     <none>        8080/TCP,8081/TCP   15m
ocs-provider-server    NodePort    172.30.216.248   <none>        50051:31659/TCP     15m
rook-ceph-mgr          ClusterIP   172.30.60.146    <none>        9283/TCP            10m


15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again and verify they are running.
Check again that the storagecluster is Ready and Cephcluster health is OK.

16. Check the service type again:
$ oc get service
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.77.14     <none>                                                                   8080/TCP,8081/TCP   16m
ocs-provider-server    LoadBalancer   172.30.216.248   a320e7cc2295847f8be6b80625533c90-891016187.us-east-2.elb.amazonaws.com   50051:31659/TCP     16m
rook-ceph-mgr          ClusterIP      172.30.60.146    <none>                                                                   9283/TCP            11m


17. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "NodeP". Check the ocs-operator logs and see the expected error: 
$ oc logs ocs-operator-85c94d545f-fb5s2  | tail -n 1
{"level":"error","ts":1692694739.9339497,"msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","storageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"d297a4ab-13dc-4eba-b0f9-be3c8a5174d6","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Also, I checked the StorageCluster and saw the error status:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   20m   Error 


18. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again (Also check again that the storagecluster is Ready and Cephcluster health is OK):
$ oc get service
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
ocs-metrics-exporter   ClusterIP   172.30.77.14     <none>        8080/TCP,8081/TCP   22m
ocs-provider-server    NodePort    172.30.216.248   <none>        50051:31659/TCP     22m
rook-ceph-mgr          ClusterIP   172.30.60.146    <none>        9283/TCP            17m


19. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again (Also check again that the storagecluster is Ready, Cephcluster health is OK, and ocs-operator logs look fine):
$ oc get service
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)             AGE
ocs-metrics-exporter   ClusterIP      172.30.77.14     <none>                                                                    8080/TCP,8081/TCP   23m
ocs-provider-server    LoadBalancer   172.30.216.248   a320e7cc2295847f8be6b80625533c90-1681517453.us-east-2.elb.amazonaws.com   50051:31659/TCP     23m
rook-ceph-mgr          ClusterIP      172.30.60.146    <none>                                                                    9283/TCP            18m

$ oc get storageclusters.ocs.openshift.io; oc get cephclusters.ceph.rook.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   25m   Ready              2023-08-22T08:38:18Z   4.12.0
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          24m   Ready   Cluster created successfully   HEALTH_OK   
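
The behaviour seen in step 17 suggests the operator accepts only two values for the service type. The sketch below mirrors that check in shell for illustration; it is not the operator's Go code, the function name is hypothetical, and only the error text is copied from the log above. The unset case, which the steps above show defaulting to NodePort, is not modelled.

```shell
#!/bin/sh
# Accept only the two service types named in the operator's error message.
validate_service_type() {
  case "$1" in
    NodePort|LoadBalancer)
      return 0
      ;;
    *)
      echo "providerAPIServer only supports service of type NodePort and LoadBalancer" >&2
      return 1
      ;;
  esac
}

# Try the values used in the test steps, including both dummy values.
for t in NodePort LoadBalancer NodeP bar; do
  if validate_service_type "$t" 2>/dev/null; then
    echo "$t: accepted"
  else
    echo "$t: rejected"
  fi
done
```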


Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/28413/.

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.12.0-0.nightly-2023-08-17-171705
Kubernetes Version: v1.25.11+1485cc9

OCS version:
ocs-operator.v4.12.6-rhodf   OpenShift Container Storage   4.12.6-rhodf   ocs-operator.v4.12.5-rhodf   Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-08-17-171705   True        False         99m     Cluster version is 4.12.0-0.nightly-2023-08-17-171705

Rook version:
rook: v4.12.6-0.bc1e9806c3281090b58872e303e947ff5437c078
go: go1.18.10

Ceph version:
ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)

Comment 20 errata-xmlrpc 2023-08-23 14:12:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.6 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4718
