Description of problem (please be as detailed as possible and provide log snippets):

The storage cluster phase is showing as "Error" in a provider cluster. This is causing the 'ocs-provider-qe' add-on to remain in a Failed state. Tested on the managed service platform.

$ oc get storagecluster
NAME                 AGE    PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   160m   Error              2022-05-23T08:40:11Z

$ oc get cephblockpool
NAME                                                                 AGE
cephblockpool-storageconsumer-e9000440-68c8-4750-bbc5-5d942784ffc9   107m

$ oc get sc
NAME                        PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)               kubernetes.io/aws-ebs                   Delete          WaitForFirstConsumer   true                   178m
gp2-csi                     ebs.csi.aws.com                         Delete          WaitForFirstConsumer   true                   178m
gp3-csi                     ebs.csi.aws.com                         Delete          WaitForFirstConsumer   true                   178m
ocs-storagecluster-cephfs   openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   150m

$ oc logs ocs-operator-5985b8b5f4-g99mr --tail=20
{"level":"info","ts":1653303198.6508234,"logger":"controllers.StorageCluster","msg":"Platform is set to skip object store. Not creating a CephObjectStore.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Platform":"AWS"}
{"level":"info","ts":1653303198.6508474,"logger":"controllers.StorageCluster","msg":"Platform is set to skip object store. Not creating a CephObjectStoreUser.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Platform":"AWS"}
{"level":"info","ts":1653303198.6508522,"logger":"controllers.StorageCluster","msg":"Platform is set to skip Ceph RGW Route. Not creating a Ceph RGW Route.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","platform":"AWS"}
{"level":"info","ts":1653303198.6509066,"logger":"controllers.StorageCluster","msg":"Waiting for CephBlockPool to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"ocs-storagecluster-cephblockpool/openshift-storage","StorageClass":"ocs-storagecluster-ceph-rbd"}
{"level":"error","ts":1653303198.6609645,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"some StorageClasses [ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
{"level":"info","ts":1653303261.633716,"logger":"controllers.StorageCluster","msg":"Reconciling StorageCluster.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","StorageCluster":"openshift-storage/ocs-storagecluster"}
{"level":"info","ts":1653303261.633753,"logger":"controllers.StorageCluster","msg":"Spec.AllowRemoteStorageConsumers is enabled. Creating Provider API resources","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":1653303261.6396916,"logger":"controllers.StorageCluster","msg":"Service create/update succeeded","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":1653303261.6398952,"logger":"controllers.StorageCluster","msg":"status.storageProviderEndpoint is updated","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Endpoint":"10.0.128.9:31659"}
{"level":"info","ts":1653303261.646025,"logger":"controllers.StorageCluster","msg":"Deployment is running as desired","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":1653303261.6461484,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-142-154.us-east-2.compute.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-2a"}
{"level":"info","ts":1653303261.6461625,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-142-154.us-east-2.compute.internal","Label":"failure-domain.beta.kubernetes.io/region","Value":"us-east-2"}
{"level":"info","ts":1653303261.6461685,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-182-3.us-east-2.compute.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-2b"}
{"level":"info","ts":1653303261.646174,"logger":"controllers.StorageCluster","msg":"Adding topology label from Node.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Node":"ip-10-0-201-124.us-east-2.compute.internal","Label":"failure-domain.beta.kubernetes.io/zone","Value":"us-east-2c"}
{"level":"info","ts":1653303261.6464894,"logger":"controllers.StorageCluster","msg":"Restoring original CephFilesystem.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephFileSystem":"openshift-storage/ocs-storagecluster-cephfilesystem"}
{"level":"info","ts":1653303261.6528974,"logger":"controllers.StorageCluster","msg":"Platform is set to skip object store. Not creating a CephObjectStore.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Platform":"AWS"}
{"level":"info","ts":1653303261.6529174,"logger":"controllers.StorageCluster","msg":"Platform is set to skip object store. Not creating a CephObjectStoreUser.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","Platform":"AWS"}
{"level":"info","ts":1653303261.6529224,"logger":"controllers.StorageCluster","msg":"Platform is set to skip Ceph RGW Route. Not creating a Ceph RGW Route.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","platform":"AWS"}
{"level":"info","ts":1653303261.6529758,"logger":"controllers.StorageCluster","msg":"Waiting for CephBlockPool to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"ocs-storagecluster-cephblockpool/openshift-storage","StorageClass":"ocs-storagecluster-ceph-rbd"}
{"level":"error","ts":1653303261.6629755,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-storagecluster","namespace":"openshift-storage","error":"some StorageClasses [ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}

$ rosa list addon -c jijoy-m23-pr|grep ocs-provider-qe
ocs-provider-qe   Red Hat OpenShift Data Foundation Managed Service Provider (QE)   failed

$ ocm list clusters | grep jijoy-m23-pr
1scf5pngrsi8vmi0o4qikucri33sd6lh   jijoy-m23-pr   https://api.jijoy-m23-pr.41dj.s1.devshift.org:6443   4.10.13   rosa   aws   us-east-2   ready

$ oc get csv ocs-osd-deployer.v2.0.2
NAME                      DISPLAY            VERSION   REPLACES                  PHASE
ocs-osd-deployer.v2.0.2   OCS OSD Deployer   2.0.2     ocs-osd-deployer.v2.0.1   Failed

$ oc get deployment ocs-osd-controller-manager
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
ocs-osd-controller-manager   0/1     1            0           80m

$ oc get pods -o wide | grep ocs-osd-controller-manager
ocs-osd-controller-manager-6b74c4cc67-4xdjd   2/3   Running   0   80m   10.131.0.28   ip-10-0-201-124.us-east-2.compute.internal   <none>   <none>

$ oc get managedocs managedocs -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: ManagedOCS
metadata:
  creationTimestamp: "2022-05-23T08:39:32Z"
  finalizers:
  - managedocs.ocs.openshift.io
  generation: 1
  name: managedocs
  namespace: openshift-storage
  resourceVersion: "47837"
  uid: e0153a35-52d8-4433-bfdd-cabf3d2345de
spec: {}
status:
  components:
    alertmanager:
      state: Ready
    prometheus:
      state: Ready
    storageCluster:
      state: Pending
  reconcileStrategy: strict

$ oc describe pod ocs-osd-controller-manager-6b74c4cc67-4xdjd | grep Events -A 100
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       82m                    default-scheduler  Successfully assigned openshift-storage/ocs-osd-controller-manager-6b74c4cc67-4xdjd to ip-10-0-201-124.us-east-2.compute.internal
  Normal   AddedInterface  82m                    multus             Add eth0 [10.131.0.28/23] from openshift-sdn
  Normal   Pulled          82m                    kubelet            Container image "quay.io/openshift/origin-kube-rbac-proxy:4.10.0" already present on machine
  Normal   Created         82m                    kubelet            Created container kube-rbac-proxy
  Normal   Started         82m                    kubelet            Started container kube-rbac-proxy
  Normal   Pulling         82m                    kubelet            Pulling image "quay.io/osd-addons/ocs-osd-deployer:2.0.2-2"
  Normal   Pulled          82m                    kubelet            Container image "quay.io/osd-addons/ocs-osd-deployer:2.0.2-2" already present on machine
  Normal   Pulled          82m                    kubelet            Successfully pulled image "quay.io/osd-addons/ocs-osd-deployer:2.0.2-2" in 2.480468207s
  Normal   Created         82m                    kubelet            Created container manager
  Normal   Started         82m                    kubelet            Started container manager
  Normal   Created         82m                    kubelet            Created container readiness-server
  Normal   Started         82m                    kubelet            Started container readiness-server
  Warning  ProbeError      82m                    kubelet            Readiness probe error: HTTP probe failed with statuscode: 500 body:
  Warning  Unhealthy       82m                    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy       81m (x6 over 82m)      kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  ProbeError      2m25s (x543 over 82m)  kubelet            Readiness probe error: HTTP probe failed with statuscode: 503 body:

Version of all relevant components (if applicable):

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.2                      NooBaa Operator               4.10.2            mcg-operator.v4.10.1                      Succeeded
ocs-operator.v4.10.2                      OpenShift Container Storage   4.10.2            ocs-operator.v4.10.1                      Succeeded
ocs-osd-deployer.v2.0.2                   OCS OSD Deployer              2.0.2             ocs-osd-deployer.v2.0.1                   Failed
odf-csi-addons-operator.v4.10.2           CSI Addons                    4.10.2            odf-csi-addons-operator.v4.10.1           Succeeded
odf-operator.v4.10.2                      OpenShift Data Foundation     4.10.2            odf-operator.v4.10.1                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.408-c2256a2   Route Monitor Operator        0.1.408-c2256a2   route-monitor-operator.v0.1.406-54ff884   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.13   True        False         166m    Cluster version is 4.10.13

$ oc get csv odf-operator.v4.10.2 -o yaml| grep full_version
    full_version: 4.10.2-3

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Yes. The storage cluster is not reaching the Ready state.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a provider cluster

Actual results:
The storage cluster phase is showing as "Error" in the provider cluster. The 'ocs-provider-qe' add-on is in a Failed state.

Expected results:
The storage cluster should reach the Ready state, and the ocs-provider-qe add-on installation should succeed.

Additional info:
Adding the Regression keyword because the installation was working with the previous version (Deployer 2.0.1 with ODF 4.10.0 GA).
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-m23-pr/jijoy-m23-pr_20220523T080402/logs/testcases_1653306906/
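For anyone triaging a similar failure, the reconciler error above is the key symptom: the ocs-operator skips creating the ocs-storagecluster-ceph-rbd StorageClass because its prerequisite CephBlockPool (ocs-storagecluster-cephblockpool) never becomes Ready; note that the `oc get cephblockpool` output above lists only the per-consumer pool. A quick way to confirm the stuck prerequisite on a live cluster (a sketch, assuming the default resource names seen in this report):

$ oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool -o jsonpath='{.status.phase}{"\n"}'
$ oc -n openshift-storage get storagecluster ocs-storagecluster -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

If the first command reports NotFound, or prints a phase other than Ready, the StorageCluster will stay in Error until the pool is reconciled.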
Since the attached PR is already merged for 4.11, should the status of this BZ be ON_QA?
It looks like this issue was fixed in the deployer and nothing was required in the product. According to the chat (https://bugzilla.redhat.com/show_bug.cgi?id=2089296#c3), Jilju mentions that the issue is not even reproducible in 4.10.3, which means we don't need a bug for 4.10 and the BZ targeted for 4.10 can be closed.

The attached PR is not relevant for this fix and should be removed; it is for the perf BZ #2068398. For 4.10 we had a different PR/bug, BZ #2078715.

IMO, we should do the following:
1. Remove the BZ link from the PR.
2. Move the current BZ to managed service and mark it ON_QA.
3. Close the 4.10 BZ #2096302.

Ohad, FYI. Let me know if this makes sense.
It does, with a very small correction: the bug was not fixed in the deployer; it was fixed in the product as part of the fix for the perf bug. Because the perf bug had a completely different fix for 4.10 and 4.11, the entire thing got confusing.
OK, so there is no need to move this bug to MS; it can be verified along with the perf bug, and a 4.10 clone is not needed.
Verified in version:
ODF 4.11.0-104
OCP 4.10.18

$ oc -n openshift-storage get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.0                      NooBaa Operator               4.11.0            mcg-operator.v4.10.4                      Succeeded
ocs-operator.v4.11.0                      OpenShift Container Storage   4.11.0            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.2                   OCS OSD Deployer              2.0.2             ocs-osd-deployer.v2.0.1                   Succeeded
odf-csi-addons-operator.v4.11.0           CSI Addons                    4.11.0            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.11.0                      OpenShift Data Foundation     4.11.0            odf-operator.v4.10.2                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.422-151be96   Route Monitor Operator        0.1.422-151be96   route-monitor-operator.v0.1.420-b65f47e   Succeeded

$ rosa list addon -c fbalak-prov27|grep ocs-provider-qe
ocs-provider-qe   Red Hat OpenShift Data Foundation Managed Service Provider (QE)   ready

$ ocm list clusters | grep fbalak-prov27
1t3h55itvjj6p8cm5hvmg9v7mjo1lceg   fbalak-prov27   https://api.fbalak-prov27.be5a.s1.devshift.org:6443   4.10.18   rosa   aws   us-east-1   ready

$ oc get deployment ocs-osd-controller-manager
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
ocs-osd-controller-manager   1/1     1            1           26h

$ oc get pods -o wide | grep ocs-osd-controller-manager
ocs-osd-controller-manager-6cbb8889fc-k9bm6   3/3   Running   1 (21h ago)   21h   10.129.2.36   ip-10-0-171-213.ec2.internal   <none>   <none>

$ oc get managedocs managedocs -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: ManagedOCS
metadata:
  creationTimestamp: "2022-06-27T08:18:43Z"
  finalizers:
  - managedocs.ocs.openshift.io
  generation: 1
  name: managedocs
  namespace: openshift-storage
  resourceVersion: "340704"
  uid: 34529f17-0e61-43a9-bceb-fbae15fdbf93
spec: {}
status:
  components:
    alertmanager:
      state: Ready
    prometheus:
      state: Ready
    storageCluster:
      state: Ready
  reconcileStrategy: strict

$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   26h   Ready              2022-06-27T08:19:01Z

$ oc get cephblockpool
NAME                                                                 PHASE
cephblockpool-storageconsumer-7c25e752-8ce3-4470-bc36-391d2404417e   Ready

$ oc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   26h
gp2-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   26h
gp3-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   26h
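The same verification can be scripted for future runs; a minimal sketch, assuming the default resource names above and a recent oc client (the jsonpath form of `wait` requires the kubectl 1.23 level bundled with oc 4.10 and later):

$ oc -n openshift-storage wait storagecluster/ocs-storagecluster --for=jsonpath='{.status.phase}'=Ready --timeout=600s
$ rosa list addon -c fbalak-prov27 | grep ocs-provider-qe

The first command blocks until the StorageCluster reports phase Ready (or times out); the second should then print the add-on row in the 'ready' state.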
This fix is only required in 4.11, since a different fix for 4.10.z is addressed in https://bugzilla.redhat.com/show_bug.cgi?id=2078715. Hence, removing the 4.10.z? flag.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.