Bug 2302235 - [UI deployment][ODF on ROSA HCP] No correct labels on the worker nodes
Summary: [UI deployment][ODF on ROSA HCP] No correct labels on the worker nodes
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: management-console
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.17.0
Assignee: Sanjal Katiyar
QA Contact: Daniel Osypenko
URL:
Whiteboard:
Depends On:
Blocks: 2303083
 
Reported: 2024-08-01 11:20 UTC by Daniel Osypenko
Modified: 2024-10-16 16:57 UTC
6 users

Fixed In Version: 4.17.0-77
Doc Type: Bug Fix
Doc Text:
Cause: ODF is installed in a namespace other than "openshift-storage" (ROSA use case). Consequence: During StorageSystem deployment the UI labels the nodes with a dynamic label "cluster.ocs.openshift.io/<CLUSTER_NAMESPACE>: ''" (where "CLUSTER_NAMESPACE" is the namespace in which the StorageSystem is being created). The ODF/OCS operators, on the other hand, still expect the label to be static and always equal to "cluster.ocs.openshift.io/openshift-storage: ''", irrespective of where ODF is installed or where the StorageSystem is deployed. Fix: The UI now always adds the static label "cluster.ocs.openshift.io/openshift-storage: ''" to the nodes. Result: Installation should proceed as expected. Workaround: Manually label the nodes on which the StorageSystem related workloads should be deployed. Example: To label all the worker nodes: `oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""`. To label a specific node: `oc label node <NODE_NAME> cluster.ocs.openshift.io/openshift-storage=""`
Clone Of:
: 2303083
Environment:
Last Closed:
Embargoed:


Attachments (Terms of Use)
not enough nodes found (126.28 KB, image/png)
2024-08-01 11:20 UTC, Daniel Osypenko
no flags


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-console pull 1519 0 None Merged Fix nodes' label for ROSA clusters 2024-08-07 05:18:02 UTC
Github red-hat-storage odf-console pull 1521 0 None open Bug 2302235: [release-4.17] Fix nodes' label for ROSA clusters 2024-08-06 12:11:44 UTC
Github red-hat-storage odf-console pull 1522 0 None open Bug 2302235: [release-4.17-compatibility] Fix nodes' label for ROSA clusters 2024-08-06 12:11:55 UTC
Red Hat Issue Tracker OCSBZM-8788 0 None None None 2024-08-01 11:21:46 UTC

Description Daniel Osypenko 2024-08-01 11:20:47 UTC
Created attachment 2043221 [details]
not enough nodes found

Description of problem:

----------

On a fresh ODF deployment installed in the 'odf-storage' namespace, the nodes carry the namespace-based label instead of the static "openshift-storage" one:

oc get nodes -l cluster.ocs.openshift.io/odf-storage=""
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-0-144.us-west-2.compute.internal   Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-181.us-west-2.compute.internal   Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-45.us-west-2.compute.internal    Ready    worker   21h   v1.29.6+aba1e8d
ip-10-0-0-70.us-west-2.compute.internal    Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-78.us-west-2.compute.internal    Ready    worker   18h   v1.29.6+aba1e8d
ip-10-0-0-95.us-west-2.compute.internal    Ready    worker   21h   v1.29.6+aba1e8d


This triggers a "Not enough nodes found" error on the StorageCluster (screenshot attached).
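
The mismatch can be confirmed by querying with the static label that the ODF/OCS operators select on; on an affected cluster this selector is expected to match no nodes (output shown is illustrative):

oc get nodes -l cluster.ocs.openshift.io/openshift-storage=""
No resources found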

The StorageCluster is in an Error state:
oc get storagecluster -A
NAMESPACE     NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
odf-storage   ocs-storagecluster   7m25s   Error              2024-07-31T15:53:39Z   4.16.0
[jenkins@temp-jagent-dosypenk-r217 terraform-vpc-example]$ oc describe  storagecluster ocs-storagecluster -nodf-storage
Name:         ocs-storagecluster
Namespace:    odf-storage
Labels:       <none>
Annotations:  uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Metadata:
  Creation Timestamp:  2024-07-31T15:53:39Z
  Finalizers:
    storagecluster.ocs.openshift.io
  Generation:  2
  Owner References:
    API Version:     odf.openshift.io/v1alpha1
    Kind:            StorageSystem
    Name:            ocs-storagecluster-storagesystem
    UID:             2dee21a8-8039-4640-8fd1-9e7a669356b6
  Resource Version:  101564
  UID:               1d04f184-50c7-4f6f-9777-0f197a2fc1d1
Spec:
  Arbiter:
  Encryption:
    Key Rotation:
      Schedule:  @weekly
    Kms:
  External Storage:
  Managed Resources:
    Ceph Block Pools:
    Ceph Cluster:
    Ceph Config:
    Ceph Dashboard:
    Ceph Filesystems:
      Data Pool Spec:
        Application:
        Erasure Coded:
          Coding Chunks:  0
          Data Chunks:    0
        Mirroring:
        Quotas:
        Replicated:
          Size:  0
        Status Check:
          Mirror:
    Ceph Non Resilient Pools:
      Count:  1
      Resources:
      Volume Claim Template:
        Metadata:
        Spec:
          Resources:
        Status:
    Ceph Object Store Users:
    Ceph Object Stores:
    Ceph RBD Mirror:
      Daemon Count:  1
    Ceph Toolbox:
  Mirroring:
  Network:
    Connections:
      Encryption:
    Multi Cluster Service:
  Node Topologies:
  Resource Profile:  lean
  Storage Device Sets:
    Config:
    Count:  1
    Data PVC Template:
      Metadata:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         2Ti
        Storage Class Name:  gp3-csi
        Volume Mode:         Block
      Status:
    Name:  ocs-deviceset-gp3-csi
    Placement:
    Portable:  true
    Prepare Placement:
    Replica:  3
    Resources:
Status:
  Conditions:
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Version check successful
    Reason:                VersionMatched
    Status:                False
    Type:                  VersionMismatch
    Last Heartbeat Time:   2024-07-31T15:59:08Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Error while reconciling: Not enough nodes found: Expected 3, found 0
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2024-07-31T15:53:40Z
    Last Transition Time:  2024-07-31T15:53:40Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  Images:
    Ceph:
      Desired Image:  registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:579e5358418e176194812eeab523289a0c65e366250688be3f465f1a633b026d
    Noobaa Core:
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel9@sha256:5f56419be1582bf7a0ee0b9d99efae7523fbf781a88f8fe603182757a315e871
    Noobaa DB:
      Desired Image:  registry.redhat.io/rhel9/postgresql-15@sha256:5c4cad6de1b8e2537c845ef43b588a11347a3297bfab5ea611c032f866a1cb4e
  Kms Server Connection:
  Phase:    Error
  Version:  4.16.0
Events:     <none>
[jenkins@temp-jagent-dosypenk-r217 terraform-vpc-example]$ oc get nodes -w
NAME                                       STATUS   ROLES    AGE     VERSION
ip-10-0-0-144.us-west-2.compute.internal   Ready    worker   39m     v1.29.6+aba1e8d
ip-10-0-0-181.us-west-2.compute.internal   Ready    worker   39m     v1.29.6+aba1e8d
ip-10-0-0-45.us-west-2.compute.internal    Ready    worker   3h30m   v1.29.6+aba1e8d
ip-10-0-0-70.us-west-2.compute.internal    Ready    worker   41m     v1.29.6+aba1e8d
ip-10-0-0-78.us-west-2.compute.internal    Ready    worker   43m     v1.29.6+aba1e8d
ip-10-0-0-95.us-west-2.compute.internal    Ready    worker   3h37m   v1.29.6+aba1e8d

---------

Workaround:
oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""
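
After labeling, the static-label selector should list the worker nodes, and the StorageCluster should leave the Error phase once reconciliation retries (illustrative checks):

oc get nodes -l cluster.ocs.openshift.io/openshift-storage=""
oc get storagecluster -n odf-storage -w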

---------

Version-Release number of selected component (if applicable):
ODF full_version: 4.16.0-137

---------

How reproducible:
Install ODF on a ROSA HCP OCP 4.16 cluster.

Steps to Reproduce:
1. Install ODF 4.16 on a ROSA HCP OCP 4.16 cluster

---------

Actual results:
The StorageCluster reports a "Not enough nodes found" error. ODF installation stalls and no CephFS or RBD storage classes are available.

Expected results:
No errors. ODF becomes available, the same as ODF on a regular AWS cluster.

---------

Additional info:

ODF installation screen recording -  https://drive.google.com/file/d/1y84dNkaj68rov9nbJDAlhcnXwc3cJHs_/view?usp=drive_link

Storage System installation screen recording -  https://drive.google.com/file/d/12KUnujZmTAAC1H0YqnhXsWjD2PtjRblW/view?usp=sharing

Comment 3 Sanjal Katiyar 2024-08-05 04:05:03 UTC
there is a workaround (as discussed on slack thread: https://bugzilla.redhat.com/show_bug.cgi?id=2302235#c2), IMHO this should not be a test blocker...

Comment 4 Sanjal Katiyar 2024-08-05 04:53:57 UTC
(In reply to Sanjal Katiyar from comment #3)
> there is a workaround (as discussed on slack thread:
> https://bugzilla.redhat.com/show_bug.cgi?id=2302235#c2), IMHO this should
> not be a test blocker...

oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""

Comment 5 Sanjal Katiyar 2024-08-06 09:22:34 UTC
Cause:
ODF is deployed in a namespace other than "openshift-storage" (ROSA use case).

Consequence:
The UI labels the nodes during StorageCluster deployment, and as per the refactoring done in 4.15 (part of the support for the ROSA and Multiple StorageClusters epics) it adds a dynamic label "cluster.ocs.openshift.io/<CLUSTER_NAMESPACE>: ''" (where "CLUSTER_NAMESPACE" is the namespace in which the StorageCluster is being created). This was done to accommodate possible multiple internal mode StorageCluster cases in the future and to remove any dependency on the term "openshift-*" for ROSA clusters.

The ODF/OCS operators, on the other hand, still expect the label to be static and always equal to "cluster.ocs.openshift.io/openshift-storage: ''", irrespective of where ODF is installed or where the StorageCluster is created.

Fix:
The UI will now always add the static label "cluster.ocs.openshift.io/openshift-storage: ''" to the nodes, and will revert to the dynamic label once multiple internal mode clusters are officially supported in the product.

Result:
Installation should now proceed as expected.

Workaround:
Manually (CLI) label the worker nodes on which we want to deploy the StorageCluster related workloads (`oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""`).
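
For labeling only selected nodes instead of all workers, the per-node variant from the Doc Text can be used (<NODE_NAME> is a placeholder):

oc label node <NODE_NAME> cluster.ocs.openshift.io/openshift-storage=""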

