Bug 2218863
| Summary: | [Backport to 4.13.z]OCS Provider Server service comes up on public subnets | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Jilju Joy <jijoy> |
| Component: | ocs-operator | Assignee: | Rewant <resoni> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.13 | CC: | ebenahar, ikave, muagarwa, nigoyal, odf-bz-bot, resoni, sheggodu |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.13.2 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.13.2-3 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 2212773 | Environment: | |
| Last Closed: | 2023-08-23 10:34:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2212773, 2218867 | ||
| Bug Blocks: | 2213114, 2213117 | ||
Description
Jilju Joy
2023-06-30 10:17:19 UTC
*** Bug 2218867 has been marked as a duplicate of this bug. ***
Giving devel ack on Rewant's request.
I tested the BZ with the following steps:
1. Deploy an AWS 4.13 cluster without ODF.
2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
3. Get and apply the ICSP from the catalog image using these commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.13 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml
5. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker
6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2218863#c9.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml
7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
8. Create the Storagecluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2218863#c10.
(If there is an issue with Noobaa CRDs, we may also need to apply this Yaml file https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml). The field 'providerAPIServer' is empty in this Yaml file.
9. Check the pods:
$ oc get pods
NAME READY STATUS RESTARTS AGE
ocs-metrics-exporter-c67ff7957-v8p5b 1/1 Running 0 21m
ocs-operator-5f569b66cf-2dkcz 1/1 Running 5 (9m33s ago) 22m
rook-ceph-operator-7757dffc4c-ccdb5 1/1 Running 3 (77s ago) 7m42s
10. Check the service and see that the provider server is NodePort as this is the default value:
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-provider-server NodePort 172.30.104.23 <none> 50051:31659/TCP 8m3s
11. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again:
$ oc get pods | grep ocs
ocs-metrics-exporter-c67ff7957-v8p5b 1/1 Running 0 25m
ocs-operator-5f569b66cf-2dkcz 1/1 Running 5 (13m ago) 26m
ocs-provider-server-599c77c4db-2jm6t 1/1 Running 0 99s
12. Check the service type again:
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.30.232 <none> 8080/TCP,8081/TCP 2m3s
ocs-provider-server LoadBalancer 172.30.104.23 ac72c186bbc7847669404e8b05d02e19-487159474.us-east-2.elb.amazonaws.com 50051:31659/TCP 12m
13. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again:
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.30.232 <none> 8080/TCP,8081/TCP 4m
ocs-provider-server NodePort 172.30.104.23 <none> 50051:31659/TCP 14m
14. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "someValue". Check the ocs-operator logs and see the expected error:
$ oc logs ocs-operator-5f569b66cf-2dkcz | tail -n 1
{"level":"error","ts":"2023-08-08T10:20:50Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"c02bb1f2-ddf9-46a0-a986-bef98c60a07f","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again:
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.30.232 <none> 8080/TCP,8081/TCP 36m
ocs-provider-server LoadBalancer 172.30.104.23 ac72c186bbc7847669404e8b05d02e19-1874304719.us-east-2.elb.amazonaws.com 50051:31659/TCP 46m
rook-ceph-exporter ClusterIP 172.30.107.181 <none> 9926/TCP 19m
rook-ceph-mgr ClusterIP 172.30.111.135 <none> 9283/TCP 19m
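Steps 11 to 15 above edit the StorageCluster by hand. A minimal sketch of the same toggle done non-interactively is below. It assumes that `providerAPIServerServiceType` sits directly under `.spec` and that the StorageCluster lives in the `openshift-storage` namespace (the namespace shown in the operator log). The function names are made up for illustration and are not part of any tool.

```shell
# Assumption: namespace taken from the operator log; override via NAMESPACE if needed.
NAMESPACE="${NAMESPACE:-openshift-storage}"

# Print a JSON merge patch that sets the provider server service type.
# Assumption: the field is located at .spec.providerAPIServerServiceType.
service_type_patch() {
  printf '{"spec":{"providerAPIServerServiceType":"%s"}}' "$1"
}

# Apply the patch to the ocs-storagecluster resource.
set_provider_service_type() {
  oc -n "$NAMESPACE" patch storageclusters.ocs.openshift.io ocs-storagecluster \
    --type=merge -p "$(service_type_patch "$1")"
}

# Read back the type of the ocs-provider-server service.
get_provider_service_type() {
  oc -n "$NAMESPACE" get service ocs-provider-server -o jsonpath='{.spec.type}'
}
```

For example, `set_provider_service_type LoadBalancer` followed by `get_provider_service_type` should mirror steps 11 and 12, allowing some time for the operator to reconcile the service.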
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27909/
OC version:
Client Version: 4.10.24
Server Version: 4.13.0-0.nightly-2023-08-07-165810
Kubernetes Version: v1.26.6+73ac561
OCS version:
ocs-operator.v4.13.2-rhodf OpenShift Container Storage 4.13.2-rhodf ocs-operator.v4.13.1-rhodf Installing
Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-08-07-165810 True False 111m Cluster version is 4.13.0-0.nightly-2023-08-07-165810
Rook version:
rook: v4.13.2-0.b57f0c7db8116e754fc77b55825d7fd75c6f1aa3
go: go1.19.10
Ceph version:
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
We found an issue with the new ocs 4.11.10-1 image. In the case of NodePort Service, the ocs operator pod keeps requiring. We need to fix this from the ocs-operator side, backport the fix, and test it again.
I tested the BZ with the following steps:
1. Deploy an AWS 4.13 cluster without ODF.
2. Disable the default Red Hat operator source:
$ oc patch operatorhub.config.openshift.io/cluster -p='{"spec":{"sources":[{"disabled":true,"name":"redhat-operators"}]}}' --type=merge
3. Get and apply the ICSP from the catalog image using these commands (on my local machine):
$ oc image extract --filter-by-os linux/amd64 --registry-config ~/IBMProjects/ocs-ci/data/pull-secret quay.io/rhceph-dev/ocs-registry:latest-stable-4.13 --confirm --path /icsp.yaml:~/IBMProjects/ocs-ci/icsp
$ oc apply -f ~/IBMProjects/ocs-ci/icsp/icsp.yaml
5. Wait for the MachineConfigPool to be ready.
$ oc get MachineConfigPool worker
6. Create the Namespace, CatalogSource, and Subscription using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2218863#c9.
$ oc apply -f ~/Downloads/deploy-with-olm.yaml
7. Wait until the ocs-operator pod is ready in the openshift-storage namespace.
$ oc get pods
NAME READY STATUS RESTARTS AGE
ocs-metrics-exporter-5b4b5d9f4b-t66lw 1/1 Running 0 36s
ocs-operator-858f54566c-hv9pj 1/1 Running 0 36s
rook-ceph-operator-5b8b94cb94-zsh5x 1/1 Running 0 36s
8. Create the Storagecluster using the Yaml file above: https://bugzilla.redhat.com/show_bug.cgi?id=2218863#c10.
9. Apply the Yaml files https://raw.githubusercontent.com/red-hat-storage/mcg-osd-deployer/1eec1147b1ae70e938fa42dabc60453b8cd9449b/shim/crds/noobaa.noobaa.io.yaml and https://raw.githubusercontent.com/noobaa/noobaa-operator/5.12/deploy/obc/objectbucket.io_objectbucketclaims_crd.yaml (in order to get errors raised in the ocs-operator logs and rook-ceph-operator logs).
10. Modify the AWS security groups according to the ODF to ODF deployment doc.
11. Check that all the pods are up and running.
12. Check that the storagecluster is Ready and Cephcluster health is OK:
$ oc get storageclusters.ocs.openshift.io
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 28m Ready 2023-08-21T14:22:02Z 4.13.2
$ oc get cephclusters.ceph.rook.io
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
ocs-storagecluster-cephcluster /var/lib/rook 3 19m Ready Cluster created successfully HEALTH_OK 80c2f59c-8f2d-4cab-9365-d5cc379228a5
13. Check the ocs-operator logs and the rook-ceph-operator logs and see that there are no errors (you may need to restart the rook-ceph-operator pod if you see errors in the rook-ceph-operator logs):
$ oc logs ocs-operator-858f54566c-5ghm8 | tail -n 2
{"level":"info","ts":"2023-08-21T15:15:10Z","logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
{"level":"info","ts":"2023-08-21T15:15:10Z","logger":"controllers.StorageCluster","msg":"Reconciling metrics exporter service monitor","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
$ oc logs rook-ceph-operator-5b8b94cb94-fxbbl | tail -n 6
2023-08-21 15:11:35.186965 I | op-osd: finished running OSDs in namespace "openshift-storage"
2023-08-21 15:11:35.186977 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2023-08-21 15:11:35.201234 I | ceph-cluster-controller: reporting cluster telemetry
2023-08-21 15:11:36.123245 I | cephclient: reconciling replicated pool ocs-storagecluster-cephfilesystem-ssd succeeded
2023-08-21 15:11:37.122258 I | cephclient: setting allow_standby_replay for filesystem "ocs-storagecluster-cephfilesystem"
2023-08-21 15:11:52.183588 I | op-mon: checking if multiple mons are on the same node
14. Check the service and see that the provider server is NodePort as this is the default value:
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.91.58 <none> 8080/TCP,8081/TCP 20m
ocs-provider-server NodePort 172.30.224.19 <none> 50051:31659/TCP 20m
rook-ceph-exporter ClusterIP 172.30.249.190 <none> 9926/TCP 16m
rook-ceph-mgr ClusterIP 172.30.14.168 <none> 9283/TCP 16m
15. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the pods again and verify they are running.
Check again that the storagecluster is Ready and Cephcluster health is OK.
16. Check the service type again:
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.91.58 <none> 8080/TCP,8081/TCP 46m
ocs-provider-server LoadBalancer 172.30.224.19 acb1009d640564952b4d03c1ac1ca8e8-1890944852.us-east-2.elb.amazonaws.com 50051:31659/TCP 46m
rook-ceph-exporter ClusterIP 172.30.249.190 <none> 9926/TCP 43m
rook-ceph-mgr ClusterIP 172.30.14.168 <none> 9283/TCP 42m
17. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to a dummy value "foo". Check the ocs-operator logs and see the expected error:
$ oc logs ocs-operator-858f54566c-5ghm8 | tail -n 1
{"level":"error","ts":"2023-08-21T15:22:21Z","msg":"Reconciler error","controller":"storagecluster","controllerGroup":"ocs.openshift.io","controllerKind":"StorageCluster","StorageCluster":{"name":"ocs-storagecluster","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"ocs-storagecluster","reconcileID":"3b9222e9-f8dd-4db8-ba2e-2b017f29be3b","error":"providerAPIServer only supports service of type NodePort and LoadBalancer","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235"}
Also, I checked the StorageCluster and saw the Error phase:
$ oc get storageclusters.ocs.openshift.io
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 60m Error 2023-08-21T14:22:02Z 4.13.2
18. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to NodePort. Check the service again (Also check again that the storagecluster is Ready and Cephcluster health is OK):
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.91.58 <none> 8080/TCP,8081/TCP 52m
ocs-provider-server NodePort 172.30.224.19 <none> 50051:31659/TCP 52m
rook-ceph-exporter ClusterIP 172.30.249.190 <none> 9926/TCP 48m
rook-ceph-mgr ClusterIP 172.30.14.168 <none> 9283/TCP 48m
19. Edit the ocs-storagecluster and change the value of "providerAPIServerServiceType" to LoadBalancer. Check the service again (also check again that the storagecluster is Ready, Cephcluster health is OK, and the ocs-operator logs look fine):
$ oc get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ocs-metrics-exporter ClusterIP 172.30.91.58 <none> 8080/TCP,8081/TCP 53m
ocs-provider-server LoadBalancer 172.30.224.19 acb1009d640564952b4d03c1ac1ca8e8-277880892.us-east-2.elb.amazonaws.com 50051:31659/TCP 53m
rook-ceph-exporter ClusterIP 172.30.249.190 <none> 9926/TCP 50m
rook-ceph-mgr ClusterIP 172.30.14.168 <none> 9283/TCP 49m
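The checks in steps 14 to 19 can be scripted against the command output instead of being read by eye. The helpers below are a rough sketch: they only parse text in the shape shown in this comment (the `oc get service` table and the JSON operator log lines), their names are hypothetical, and the accepted values are taken from the operator's error message rather than from the operator source. An empty or unset value, which the steps above show defaulting to NodePort, is not handled.

```shell
# Accept only the two values named in the operator's error message.
is_supported_service_type() {
  case "$1" in
    NodePort|LoadBalancer) return 0 ;;
    *)
      echo "providerAPIServer only supports service of type NodePort and LoadBalancer" >&2
      return 1
      ;;
  esac
}

# Read `oc get service` table output on stdin and print the TYPE column
# of the service whose NAME column equals the first argument.
service_type_from_table() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Read ocs-operator JSON log lines on stdin and print the value of the
# "error" field for every line that has one.
reconcile_errors() {
  sed -n 's/.*"error":"\([^"]*\)".*/\1/p'
}
```

Typical use would be `oc get service | service_type_from_table ocs-provider-server` and `oc logs <ocs-operator-pod> | reconcile_errors`, where the pod name is whatever `oc get pods` reports at the time.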
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/28348/.
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.13.0-0.nightly-2023-08-11-101506
Kubernetes Version: v1.26.6+6bf3f75
OCS version:
ocs-operator.v4.13.2-rhodf OpenShift Container Storage 4.13.2-rhodf ocs-operator.v4.13.1-rhodf Succeeded
Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-08-11-101506 True False 6h58m Cluster version is 4.13.0-0.nightly-2023-08-11-101506
Rook version:
rook: v4.13.2-0.b57f0c7db8116e754fc77b55825d7fd75c6f1aa3
go: go1.19.10
Ceph version:
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.2 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:4716