Description of problem (please be as detailed as possible and provide log snippets):

ODF 4.13 deployment on vSphere is failing because the rook-ceph-mon-* PVCs are stuck in Pending state.

Version of all relevant components (if applicable):
openshift installer (4.13.0-0.nightly-2023-02-13-235211)
ocs-registry:4.13.0-73

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
3/3

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF 4.13 on the vSphere platform
2. Check that the storagecluster is in Ready state and all PVCs are in Bound state

Actual results:

$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   37m   Progressing              2023-02-14T15:51:23Z   4.12.0

$ oc get pvc
NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rook-ceph-mon-a   Pending                                      thin-csi       36m
rook-ceph-mon-b   Pending                                      thin-csi       36m
rook-ceph-mon-c   Pending                                      thin-csi       36m

Expected results:

The storagecluster should be in Ready state.

Additional info:

From OCP 4.13, the default StorageClass on vSphere has changed to thin-csi, and we use the same class for storagecluster creation.

$ oc get sc
NAME                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ocs-storagecluster-ceph-rgw   openshift-storage.ceph.rook.io/bucket   Delete          Immediate              false                  39m
thin-csi (default)            csi.vsphere.vmware.com                  Delete          WaitForFirstConsumer   true                   55m

$ oc describe storagecluster ocs-storagecluster
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
.
.
.
Status:
  Conditions:
    Last Heartbeat Time:   2023-02-14T16:31:57Z
    Last Transition Time:  2023-02-14T15:51:24Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-02-14T15:51:24Z
    Last Transition Time:  2023-02-14T15:51:24Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-02-14T15:51:24Z
    Last Transition Time:  2023-02-14T15:51:24Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-02-14T15:51:24Z
    Last Transition Time:  2023-02-14T15:51:24Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-02-14T15:51:24Z
    Last Transition Time:  2023-02-14T15:51:24Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable

$ oc get pvc
NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rook-ceph-mon-a   Pending                                      thin-csi       41m
rook-ceph-mon-b   Pending                                      thin-csi       41m
rook-ceph-mon-c   Pending                                      thin-csi       41m

$ oc describe pvc rook-ceph-mon-a
Name:          rook-ceph-mon-a
Namespace:     openshift-storage
StorageClass:  thin-csi
Status:        Pending
Volume:
Labels:        app=rook-ceph-mon
               app.kubernetes.io/component=cephclusters.ceph.rook.io
               app.kubernetes.io/created-by=rook-ceph-operator
               app.kubernetes.io/instance=a
               app.kubernetes.io/managed-by=rook-ceph-operator
               app.kubernetes.io/name=ceph-mon
               app.kubernetes.io/part-of=ocs-storagecluster-cephcluster
               ceph-version=16.2.10-94
               ceph_daemon_id=a
               ceph_daemon_type=mon
               mon=a
               mon_canary=true
               mon_cluster=openshift-storage
               pvc_name=rook-ceph-mon-a
               pvc_size=50Gi
               rook-version=v4.13.0-0.d94a73188db5ba4deac2618d195ddadd60212d5f
               rook.io/operator-namespace=openshift-storage
               rook_cluster=openshift-storage
Annotations:   volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
               volume.kubernetes.io/selected-node: compute-1
               volume.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
Finalizers:    [kubernetes.io/pvc-protection]
.
.
.
Events:
  Type     Reason                Age                  From                                                                          Message
  ----     ------                ----                 ----                                                                          -------
  Normal   WaitForFirstConsumer  41m                  persistentvolume-controller                                                   waiting for first consumer to be created before binding
  Warning  ProvisioningFailed    15m (x15 over 41m)   csi.vsphere.vmware.com_control-plane-0_90f251a8-6a4d-4c55-91bb-870e17f1c925   failed to provision volume with StorageClass "thin-csi": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   Provisioning          5m1s (x18 over 41m)  csi.vsphere.vmware.com_control-plane-0_90f251a8-6a4d-4c55-91bb-870e17f1c925   External provisioner is provisioning volume for claim "openshift-storage/rook-ceph-mon-a"
  Normal   ExternalProvisioning  94s (x166 over 41m)  persistentvolume-controller                                                   waiting for a volume to be created, either by external provisioner "csi.vsphere.vmware.com" or manually created by system administrator

job link: https://url.corp.redhat.com/6781bd2
must gather: https://url.corp.redhat.com/542d982
The provisioner is not creating and binding the PV:

    waiting for a volume to be created, either by external provisioner "csi.vsphere.vmware.com" or manually created by system administrator

Can you create any other test pod that provisions a volume from the thin-csi storage class? The thin-csi provisioner does not appear to be working.
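To isolate the thin-csi provisioner from ODF, a minimal standalone check could look like the manifest below. This is only a sketch: the names (`thin-csi-test`, `thin-csi-test-pod`), the 1Gi size, and the UBI image are illustrative and not taken from this report.

```yaml
# Hypothetical PVC against the thin-csi StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thin-csi-test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: thin-csi
  resources:
    requests:
      storage: 1Gi
---
# A consumer pod is needed because thin-csi uses WaitForFirstConsumer
# volume binding; the PVC stays Pending until a pod schedules against it.
apiVersion: v1
kind: Pod
metadata:
  name: thin-csi-test-pod
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: thin-csi-test
```

If this PVC also stays Pending with the same DeadlineExceeded event, the problem would be in the vSphere CSI driver rather than in ODF.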
Can you try testing on the latest 4.13 builds, e.g. `4.13.0-86`? There was a fix included in `4.13.0-85`.
@pbalogh, is the cluster live?
Once Nitin changed the storageclass name, we were able to see the error mentioned in the logs, but those errors come from k8s/OCP because some security labels were missing on the namespace. After I applied some labels to the namespace, the errors were reduced to:

```
2023-02-22 07:53:22.763567 I | op-mon: waiting for canary pod creation rook-ceph-mon-b-canary
W0222 07:53:22.971775       1 warnings.go:70] would violate PodSecurity "baseline:latest": hostPath volumes (volumes "ceph-daemons-sock-dir", "rook-ceph-log", "rook-ceph-crash"), privileged (containers "mon", "log-collector" must not set securityContext.privileged=true)
```
I applied these labels:

```
kubectl label --overwrite ns openshift-storage \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=baseline \
  pod-security.kubernetes.io/audit=baseline
```

and the errors were limited to the single one mentioned above. Please check that.
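For reference, the same pod-security labels can be expressed declaratively; a minimal sketch of the namespace metadata (the namespace already exists, so this would go through `oc apply` or a patch, and only the labels matter here):

```yaml
# Illustrative only: the three pod-security labels applied above,
# shown as namespace metadata instead of an imperative kubectl command.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-storage
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/audit: baseline
```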
Can we close this BZ?
Moving to ON_QA to verify the resolution.
Everything is in place; removing my needinfo.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742