+++ This bug was initially created as a clone of Bug #2212773 +++

Description of problem (please be as detailed as possible and provide log snippets):

When using both private link and non-private link clusters, the ocs-provider-server service tries to come up on the non-private subnets of the VPC. This means the endpoint is exposed and can be pinged from outside the subnets.

The AWS ELB that gets created is of type Classic, which doesn't support private link clusters. We therefore need to move to a Network Load Balancer and use an internal-facing load balancer so that it is only accessible from within the VPC.

We need to add annotations to the service, because the AWS controller looks at the annotations to reconcile the service.

More info: https://docs.google.com/document/d/10J-J8EuDm8Q-ZMtY0A3mtmHOx8Xvhn-i28faxfWZwts/edit?usp=sharing

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
The ocs provider server is deployed on public subnets.

Expected results:
The ocs provider server should be deployed on private subnets.

Additional info:

--- Additional comment from RHEL Program Management on 2023-06-06 10:12:39 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf-4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.
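For illustration, the annotation change requested in the description could look roughly like the minimal Python sketch below, which builds a Service manifest for an internal-facing Network Load Balancer. The two annotation keys are the ones the Kubernetes AWS cloud provider recognizes for this purpose. The selector label is hypothetical, the namespace and port are taken from output later in this thread, and the annotations the operator actually sets may differ.

```python
import json

# Annotation keys recognized by the Kubernetes AWS cloud provider.
NLB_TYPE_ANNOTATION = "service.beta.kubernetes.io/aws-load-balancer-type"
INTERNAL_ANNOTATION = "service.beta.kubernetes.io/aws-load-balancer-internal"


def internal_nlb_service(name, namespace, port, selector):
    """Return a Service manifest (as a dict) asking for an internal-facing NLB."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                NLB_TYPE_ANNOTATION: "nlb",   # Network Load Balancer instead of Classic
                INTERNAL_ANNOTATION: "true",  # reachable only from within the VPC
            },
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": selector,
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


# The selector label below is a hypothetical placeholder, not the operator's real label.
manifest = internal_nlb_service(
    "ocs-provider-server", "openshift-storage", 50051, {"app": "example-provider-server"}
)
print(json.dumps(manifest, indent=2))
```

Because JSON is also valid YAML, the printed manifest could be saved to a file and inspected or applied with `oc apply -f` on a test cluster.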
Giving devel ack on Rewant's request.
I tested the BZ with the following steps:

1. Deploy an AWS 4.11 cluster without ODF.

2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using the following commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.11.10 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml

5. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the YAML file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c7.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.

8. Create the StorageCluster using the YAML file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c8. (If there is an issue with the Noobaa CRDs, we may also need to apply this YAML file: https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml). The field 'providerAPIServer' is empty in this YAML file.

9. Check the pods:
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
noobaa-operator-76464cdd89-k6nst        1/1     Running   0          18m
ocs-metrics-exporter-77c684f4ff-w2mtv   1/1     Running   0          7m48s
ocs-operator-85b8d66b86-wpg49           1/1     Running   0          18m
rook-ceph-operator-764f4cbc78-66bm9     1/1     Running   0          18m

10. Check the service and see that the provider server is of type NodePort, as this is the default value:
$ oc get service
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
noobaa-operator-service   ClusterIP   172.30.37.82     <none>        443/TCP           12m
ocs-provider-server       NodePort    172.30.137.131   <none>        50051:31659/TCP   4m41s

11. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again:
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
noobaa-operator-76464cdd89-k6nst        1/1     Running   0          21m
ocs-metrics-exporter-77c684f4ff-w2mtv   1/1     Running   0          10m
ocs-operator-85b8d66b86-wpg49           1/1     Running   0          21m
ocs-provider-server-5d7f659cf-t42ms     1/1     Running   0          12s
rook-ceph-operator-764f4cbc78-66bm9     1/1     Running   0          21m

12. Check the service type again:
$ oc get service
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)           AGE
noobaa-operator-service   ClusterIP      172.30.37.82     <none>                                                                    443/TCP           17m
ocs-provider-server       LoadBalancer   172.30.137.131   a8225434e09614eda83bfe20c964088b-2002374777.us-east-2.elb.amazonaws.com   50051:31659/TCP   9m16s

13. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again:
$ oc get service
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
noobaa-operator-service   ClusterIP   172.30.37.82     <none>        443/TCP           57m
ocs-provider-server       NodePort    172.30.137.131   <none>        50051:31659/TCP   49m

14. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "someValue". Check the ocs-operator logs and see the expected error:
$ oc logs ocs-operator-85b8d66b86-wpg49 | tail -n 1
{"level":"error","ts":1690802919.280578,"logger":"controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again:
$ oc get service
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)           AGE
noobaa-operator-service   ClusterIP      172.30.37.82     <none>                                                                    443/TCP           131m
ocs-provider-server       LoadBalancer   172.30.137.131   a8225434e09614eda83bfe20c964088b-1866348329.us-east-2.elb.amazonaws.com   50051:31659/TCP   123m

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27592/.

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.0-0.nightly-2023-07-29-013834
Kubernetes Version: v1.24.15+a9da4a8

OCS version:
ocs-operator.v4.11.10   OpenShift Container Storage   4.11.10   ocs-operator.v4.11.9   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2023-07-29-013834   True        False         74m     Cluster version is 4.11.0-0.nightly-2023-07-29-013834
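The behaviour exercised in steps 10 to 15 can be summarised in a small Python sketch. This is not the operator's code (ocs-operator is written in Go). It only mirrors what the test observed: an empty value defaults to NodePort, NodePort and LoadBalancer are accepted, and any other value produces the error text seen in the log. The function name is hypothetical.

```python
SUPPORTED_SERVICE_TYPES = ("NodePort", "LoadBalancer")
DEFAULT_SERVICE_TYPE = "NodePort"


def resolve_provider_service_type(value):
    """Return the effective service type, or raise ValueError for unsupported values.

    Hypothetical helper mirroring the behaviour observed in the test steps.
    """
    if not value:
        # An unset field fell back to NodePort in step 10.
        return DEFAULT_SERVICE_TYPE
    if value not in SUPPORTED_SERVICE_TYPES:
        # Same message as the "Reconciler error" log entry in step 14.
        raise ValueError(
            "providerAPIServer only supports service of type NodePort and LoadBalancer"
        )
    return value


for candidate in ("", "LoadBalancer", "NodePort", "someValue"):
    try:
        print(repr(candidate), "->", resolve_provider_service_type(candidate))
    except ValueError as err:
        print(repr(candidate), "-> error:", err)
```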
We found an issue with the new ocs 4.11.10-1 image. In the case of NodePort Service, the ocs operator pod keeps requiring. We need to fix this from the ocs-operator side, backport the fix and test it again.
I tested the BZ with the following steps:

1. Deploy an AWS 4.11 cluster without ODF.

2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge

3. Get and apply the ICSP from the catalog image using the following commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.11.10 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml

5. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker

6. Create the Namespace, CatalogSource, and Subscription using the YAML file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c7.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml

7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
ocs-metrics-exporter-5b4b5d9f4b-t66lw   1/1     Running   0          36s
ocs-operator-858f54566c-hv9pj           1/1     Running   0          36s
rook-ceph-operator-5b8b94cb94-zsh5x     1/1     Running   0          36s

8. Modify the AWS security groups according to the ODF to ODF deployment doc.

9. Create the StorageCluster using the YAML file above: https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c8 (I only changed the field "name: default-ocs-storage-class" to "name: gp2", as we didn't have the storage class 'default-ocs-storage-class' in the cluster).

10. Apply the YAML files https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml, https://raw.githubusercontent.com/noobaa/noobaa-operator/5.12/deploy/obc/objectbucket.io_objectbucketclaims_crd.yaml, and https://bugzilla.redhat.com/show_bug.cgi?id=2213117#c14.

11. Check that all the pods are up and running.

12. Check that the StorageCluster is Ready and the CephCluster health is OK:
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   38m   Ready              2023-08-22T10:25:57Z   4.11.0

$ oc get storageclusters.ocs.openshift.io; oc get cephclusters.ceph.rook.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   68m   Ready              2023-08-22T10:25:57Z   4.11.0
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          34m   Ready   Cluster created successfully   HEALTH_OK

13. Check the ocs-operator logs and the rook-ceph-operator logs and see that there are no errors (you may need to restart the rook-ceph-operator pod if you see errors in the rook-ceph-operator logs):
$ oc logs ocs-operator-6f6cff7688-h4wpn | tail -n 3
{"level":"info","ts":1692702234.812269,"logger":"controllers.StorageCluster","msg":"No component operator reported negatively.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":1692702234.8209746,"logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
{"level":"info","ts":1692702234.8274703,"logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service monitor","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}

$ oc logs rook-ceph-operator-6bf77dcb78-pg6fx | tail -n 3
2023-08-22 11:04:19.600216 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-zone-us-east-2b" with maxUnavailable=0 for "zone" failure domain "us-east-2b"
2023-08-22 11:04:19.611403 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-zone-us-east-2c" with maxUnavailable=0 for "zone" failure domain "us-east-2c"
2023-08-22 11:04:38.844935 I | op-mon: checking if multiple mons are on the same node

14. Check the service and see that the provider server is of type NodePort, as this is the default value:
$ oc get service
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
noobaa-operator-service   ClusterIP   172.30.167.20   <none>        443/TCP             17m
ocs-metrics-exporter      ClusterIP   172.30.11.159   <none>        8080/TCP,8081/TCP   5m8s
ocs-provider-server       NodePort    172.30.74.46    <none>        50051:31659/TCP     38m
rook-ceph-mgr             ClusterIP   172.30.138.60   <none>        9283/TCP            89s

15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again and verify they are running. Check again that the StorageCluster is Ready and the CephCluster health is OK.

16. Check the service type again:
$ oc get service
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                                             PORT(S)             AGE
noobaa-operator-service   ClusterIP      172.30.167.20   <none>                                                                  443/TCP             52m
ocs-metrics-exporter      ClusterIP      172.30.11.159   <none>                                                                  8080/TCP,8081/TCP   40m
ocs-provider-server       LoadBalancer   172.30.74.46    a1be071203a4e48e89049a36bc349f4f-31343475.us-east-2.elb.amazonaws.com   50051:31659/TCP     73m
rook-ceph-mgr             ClusterIP      172.30.138.60   <none>                                                                  9283/TCP            36m

17. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "SomeValue". Check the ocs-operator logs and see the expected error:
$ oc logs ocs-operator-6f6cff7688-h4wpn | tail -n 1
{"level":"error","ts":1692704485.3816984,"logger":"controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Also, I checked the StorageCluster and saw the Error status:
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   75m   Error              2023-08-22T10:25:57Z   4.11.0

18. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again (also check again that the StorageCluster is Ready and the CephCluster health is OK):
$ oc get service
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
noobaa-operator-service   ClusterIP   172.30.167.20   <none>        443/TCP             55m
ocs-metrics-exporter      ClusterIP   172.30.11.159   <none>        8080/TCP,8081/TCP   42m
ocs-provider-server       NodePort    172.30.74.46    <none>        50051:31659/TCP     76m
rook-ceph-mgr             ClusterIP   172.30.138.60   <none>        9283/TCP            39m

19. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again (also check again that the StorageCluster is Ready, the CephCluster health is OK, and the ocs-operator logs look fine):
$ oc get service
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)             AGE
noobaa-operator-service   ClusterIP      172.30.167.20   <none>                                                                   443/TCP             56m
ocs-metrics-exporter      ClusterIP      172.30.11.159   <none>                                                                   8080/TCP,8081/TCP   44m
ocs-provider-server       LoadBalancer   172.30.74.46    a1be071203a4e48e89049a36bc349f4f-942302348.us-east-2.elb.amazonaws.com   50051:31659/TCP     78m
rook-ceph-mgr             ClusterIP      172.30.138.60   <none>                                                                   9283/TCP            40m

$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   78m   Ready              2023-08-22T10:25:57Z   4.11.0

$ oc get cephclusters.ceph.rook.io
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          44m   Ready   Cluster created successfully   HEALTH_OK

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/28414/.

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.0-0.nightly-2023-08-21-030257
Kubernetes Version: v1.24.15+a9da4a8

OCS version:
ocs-operator.v4.11.10   OpenShift Container Storage   4.11.10   ocs-operator.v4.11.9   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2023-08-21-030257   True        False         4h5m    Cluster version is 4.11.0-0.nightly-2023-08-21-030257

Rook version:
rook: v4.11.10-0.6934e4e22735898ae2286d4b4623b80966c1bd8c
go: go1.17.12

Ceph version:
ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)
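The manual log checks in the steps above ("see there are no errors" and "see the expected error") could be scripted along the lines of the following Python sketch. It assumes the ocs-operator log is JSON lines with a "level" field, as in the excerpts quoted in this comment. The helper name is hypothetical, and the sample entries are abbreviated copies of the quoted log lines, not full operator output.

```python
import json


def reconciler_errors(log_text):
    """Return the parsed JSON log entries whose level is "error"."""
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore anything that is not a JSON log line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore truncated or malformed lines
        if entry.get("level") == "error":
            errors.append(entry)
    return errors


# Abbreviated copies of two log lines quoted in this comment.
sample = "\n".join([
    '{"level":"info","ts":1692702234.812269,"logger":"controllers.StorageCluster",'
    '"msg":"No component operator reported negatively."}',
    '{"level":"error","ts":1692704485.3816984,"logger":"controller.storagecluster",'
    '"msg":"Reconciler error",'
    '"error":"providerAPIServer only supports service of type NodePort and LoadBalancer"}',
])

for entry in reconciler_errors(sample):
    print(entry["msg"], "->", entry.get("error"))
```

In practice the text would come from something like `oc logs <ocs-operator-pod>` saved to a file. An empty result corresponds to the "no errors" check, and a single entry with the service type message corresponds to the expected error in step 17.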
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.11.10 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:4775