Created attachment 2043221 [details]
not enough nodes found

Description of problem:
----------
On a fresh ODF deployment installed in the 'odf-storage' namespace, the following node labels are present:

oc get nodes -l cluster.ocs.openshift.io/odf-storage=""
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-0-144.us-west-2.compute.internal   Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-181.us-west-2.compute.internal   Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-45.us-west-2.compute.internal    Ready    worker   21h   v1.29.6+aba1e8d
ip-10-0-0-70.us-west-2.compute.internal    Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-78.us-west-2.compute.internal    Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-95.us-west-2.compute.internal    Ready    worker   21h   v1.29.6+aba1e8d

This triggers a "Not enough nodes found" error on the StorageCluster (screenshot attached).

The StorageCluster is in Error state:

oc get storagecluster -A
NAMESPACE     NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
odf-storage   ocs-storagecluster   7m25s   Error              2024-07-31T15:53:39Z   4.16.0

[jenkins@temp-jagent-dosypenk-r217 terraform-vpc-example]$ oc describe storagecluster ocs-storagecluster -nodf-storage
Name:         ocs-storagecluster
Namespace:    odf-storage
Labels:       <none>
Annotations:  uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Metadata:
  Creation Timestamp:  2024-07-31T15:53:39Z
  Finalizers:
    storagecluster.ocs.openshift.io
  Generation:  2
  Owner References:
    API Version:     odf.openshift.io/v1alpha1
    Kind:            StorageSystem
    Name:            ocs-storagecluster-storagesystem
    UID:             2dee21a8-8039-4640-8fd1-9e7a669356b6
  Resource Version:  101564
  UID:               1d04f184-50c7-4f6f-9777-0f197a2fc1d1
Spec:
  Arbiter:
  Encryption:
    Key Rotation:
      Schedule:  @weekly
    Kms:
  External Storage:
  Managed Resources:
    Ceph Block Pools:
    Ceph Cluster:
    Ceph Config:
    Ceph Dashboard:
    Ceph Filesystems:
      Data Pool Spec:
        Application:
        Erasure Coded:
          Coding Chunks:  0
          Data Chunks:    0
        Mirroring:
        Quotas:
        Replicated:
          Size:  0
        Status Check:
          Mirror:
    Ceph Non Resilient Pools:
      Count:  1
      Resources:
      Volume Claim Template:
        Metadata:
        Spec:
          Resources:
        Status:
    Ceph Object Store Users:
    Ceph Object Stores:
    Ceph RBD Mirror:
      Daemon Count:  1
    Ceph Toolbox:
  Mirroring:
  Network:
    Connections:
      Encryption:
    Multi Cluster Service:
  Node Topologies:
  Resource Profile:  lean
  Storage Device Sets:
    Config:
    Count:  1
    Data PVC Template:
      Metadata:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         2Ti
        Storage Class Name:  gp3-csi
        Volume Mode:         Block
      Status:
    Name:  ocs-deviceset-gp3-csi
    Placement:
    Portable:  true
    Prepare Placement:
    Replica:   3
    Resources:
Status:
  Conditions:
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Version check successful
    Reason:                VersionMatched
    Status:                False
    Type:                  VersionMismatch
    Last Heartbeat Time:   2024-07-31T15:59:08Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Error while reconciling: Not enough nodes found: Expected 3, found 0
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  Images:
    Ceph:
      Desired Image:  registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:579e5358418e176194812eeab523289a0c65e366250688be3f465f1a633b026d
    Noobaa Core:
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel9@sha256:5f56419be1582bf7a0ee0b9d99efae7523fbf781a88f8fe603182757a315e871
    Noobaa DB:
      Desired Image:  registry.redhat.io/rhel9/postgresql-15@sha256:5c4cad6de1b8e2537c845ef43b588a11347a3297bfab5ea611c032f866a1cb4e
  Kms Server Connection:
  Phase:    Error
  Version:  4.16.0
Events:     <none>

[jenkins@temp-jagent-dosypenk-r217 terraform-vpc-example]$ oc get nodes -w
NAME                                       STATUS   ROLES    AGE     VERSION
ip-10-0-0-144.us-west-2.compute.internal   Ready    worker   39m     v1.29.6+aba1e8d
ip-10-0-0-181.us-west-2.compute.internal   Ready    worker   39m     v1.29.6+aba1e8d
ip-10-0-0-45.us-west-2.compute.internal    Ready    worker   3h30m   v1.29.6+aba1e8d
ip-10-0-0-70.us-west-2.compute.internal    Ready    worker   41m     v1.29.6+aba1e8d
ip-10-0-0-78.us-west-2.compute.internal    Ready    worker   43m     v1.29.6+aba1e8d
ip-10-0-0-95.us-west-2.compute.internal    Ready    worker   3h37m   v1.29.6+aba1e8d

---------
Workaround:
oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""

---------
Version-Release number of selected component (if applicable):
ODF full_version: 4.16.0-137

---------
How reproducible:
Install ODF on a ROSA HCP OCP 4.16 cluster.

Steps to Reproduce:
1. Install ODF 4.16 on a ROSA HCP OCP 4.16 cluster

---------
Actual results:
StorageCluster reports a "Not enough nodes found" error. ODF installation stalls; no cephfs or rbd storage classes are available.

Expected results:
No errors. ODF is available, the same as ODF on a regular AWS cluster.

---------
Additional info:
ODF installation screen recording - https://drive.google.com/file/d/1y84dNkaj68rov9nbJDAlhcnXwc3cJHs_/view?usp=drive_link
Storage System installation screen recording - https://drive.google.com/file/d/12KUnujZmTAAC1H0YqnhXsWjD2PtjRblW/view?usp=sharing
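A minimal sanity check after applying the workaround above, using the 'odf-storage' namespace from this report (the phase transition may take a few minutes while the operator reconciles again):

# nodes should now match the selector the ODF/OCS operators use
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=""
# watch the StorageCluster; the phase should eventually move out of Error
oc get storagecluster -n odf-storage -w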
There is a workaround (as discussed in the Slack thread: https://bugzilla.redhat.com/show_bug.cgi?id=2302235#c2); IMHO this should not be a test blocker...
(In reply to Sanjal Katiyar from comment #3)
> There is a workaround (as discussed in the Slack thread:
> https://bugzilla.redhat.com/show_bug.cgi?id=2302235#c2); IMHO this should
> not be a test blocker...

oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""
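Before running the command above, a quick way to confirm the label mismatch this bug describes (selectors taken from this report; adjust the namespace-based key if your StorageCluster namespace differs):

# nodes carrying the dynamic, namespace-based label the UI applied
oc get nodes -l cluster.ocs.openshift.io/odf-storage=""
# nodes carrying the static label the operators expect (empty until the workaround is applied)
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=""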
Cause: ODF is deployed in a namespace other than "openshift-storage" (the ROSA use case).

Consequence: The UI labels the nodes during StorageCluster deployment, and as per the refactoring done in 4.15 (part of the support for the ROSA and Multiple StorageClusters epics) it adds a dynamic label "cluster.ocs.openshift.io/<CLUSTER_NAMESPACE>: ''" (where "CLUSTER_NAMESPACE" is the namespace in which the StorageCluster is being created). This was done to accommodate possible multiple internal-mode StorageClusters in the future and to remove any dependency on the term "openshift-*" for ROSA clusters. The ODF/OCS operators, on the other hand, still expect the label to be static and always equal to "cluster.ocs.openshift.io/openshift-storage: ''", irrespective of where ODF is installed or where the StorageCluster is created.

Fix: The UI now always adds the static label "cluster.ocs.openshift.io/openshift-storage: ''" to the nodes, and will revert to the dynamic label once multiple internal-mode clusters are really/officially supported in the product.

Result: Installation should now proceed as expected.

Workaround: Manually (via CLI) label the worker nodes on which the StorageCluster-related workloads should be deployed (`oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""`).
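For completeness, a sketch of applying the workaround by hand on a cluster already hit by this, assuming the 'odf-storage' namespace from this report; removing the dynamic label is optional tidy-up and is not stated as required by the fix:

# optional: drop the dynamic, namespace-based label the UI added
oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/odf-storage-
# add the static label the ODF/OCS operators expect
oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""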